The fix also replicates biases if they exist (e.g. Qwen) (#328)

quic-morteza · web-flow · commit 763f9d72e04e · 2025-04-15T20:47:34.000+05:30
The fix replicates KV heads for models that have biases in addition to
weights (such as the Qwen family). KV replication doubles the throughput
of Qwen/Qwen2.5-1.5B, which has 2 KV heads, when compiled with TS4. The
script has been tested successfully on Qwen/Qwen2.5-1.5B and
meta-llama/Llama-3.2-1B-Instruct.
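
For reviewers, the head-wise layout that the bias replication has to preserve can be illustrated with a small torch-free sketch. It is not the script's code: `replicate_kv_bias` is a hypothetical helper that uses plain Python lists, and it assumes the bias is a flat vector of `orig_kv_heads * head_dim` entries laid out head by head, which is what the `view(orig_kv_heads, head_dim)` call in the patch implies.

```python
def replicate_kv_bias(bias, orig_kv_heads, repeat):
    """Repeat each KV head's slice of a flat bias vector `repeat` times.

    Hypothetical, list-based stand-in for
    torch.repeat_interleave(bias.view(orig_kv_heads, head_dim), repeat, 0),
    flattened back to one dimension.
    """
    if len(bias) % orig_kv_heads != 0:
        raise ValueError("bias length must be a multiple of orig_kv_heads")
    head_dim = len(bias) // orig_kv_heads

    replicated = []
    for head in range(orig_kv_heads):
        # Entries belonging to one head are contiguous in the flat vector.
        head_slice = bias[head * head_dim:(head + 1) * head_dim]
        # Copies of the same head stay adjacent: h0, h0, h1, h1, ...
        for _ in range(repeat):
            replicated.extend(head_slice)
    return replicated


# Two KV heads with head_dim 2, each head repeated twice.
bias = [0.0, 0.1, 1.0, 1.1]
print(replicate_kv_bias(bias, orig_kv_heads=2, repeat=2))
# [0.0, 0.1, 0.0, 0.1, 1.0, 1.1, 1.0, 1.1]
```

The copies of a head are kept adjacent (h0, h0, h1, h1) rather than tiled (h0, h1, h0, h1), so the bias rows stay aligned with the weight rows that the same function already replicates.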

Signed-off-by: quic-morteza <quic_morteza@quicinc.com>
diff --git a/scripts/replicate_kv_head/replicate_kv_heads.py b/scripts/replicate_kv_head/replicate_kv_heads.py
@@ -63,6 +63,10 @@ def duplicate_weights_for_linear_layer(
         layer.weight.data = torch.repeat_interleave(
             layer.weight.data.view(orig_kv_heads, head_dim, hidden_size), repeat, 0
         ).view(new_kv_heads * head_dim, hidden_size)
+        if layer.bias is not None:
+            layer.bias.data = torch.repeat_interleave(layer.bias.data.view(orig_kv_heads, head_dim), repeat, 0).view(
+                new_kv_heads * head_dim
+            )
 
 
 def main(args):