Summary
turbo2 decode throughput on DeepSeek-Coder-V2-Lite (MoE, head_k=192, head_v=128) fell from 1.76x of f16 (April) to 0.45x of f16 (May canonical). Upstream MoE optimizations sped up the MMA FA path used by f16 but did not carry over to the turbo code path.
Data
| Build | f16 tg128 | turbo2 tg128 | Ratio |
| --- | --- | --- | --- |
| April (~6545786) | 56.4 | 99.5 | 1.76x |
| May (69d8e4b) | 241.5 | 108 | 0.45x |
Between the two builds, f16 improved 4.28x while turbo2 improved only 9%.
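For anyone re-checking the numbers, the ratios and improvements follow directly from the four tg128 values in the table. A minimal sketch (the dictionary and variable names are only for illustration):

```python
# tg128 values copied from the table in this report.
april = {"f16": 56.4, "turbo2": 99.5}
may = {"f16": 241.5, "turbo2": 108.0}

# turbo2 relative to f16 within each build.
april_ratio = april["turbo2"] / april["f16"]
may_ratio = may["turbo2"] / may["f16"]

# Improvement of each path between the April and May builds.
f16_speedup = may["f16"] / april["f16"]
turbo2_gain_pct = (may["turbo2"] / april["turbo2"] - 1.0) * 100.0

print(f"April ratio: {april_ratio:.2f}x")
print(f"May ratio:   {may_ratio:.2f}x")
print(f"f16 speedup: {f16_speedup:.2f}x")
print(f"turbo2 gain: {turbo2_gain_pct:.1f}%")
```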
Root Cause
f16 path: D=576 → MMA FA with ncols2=16, full GQA optimization, upstream 16-byte load granularity. No dequant.
turbo2 path: D=640 (zero-padded) → MMA FA, but:
- Full KV dequant to f16 temp buffer every decode token (fattn-common.cuh ~line 1340)
- D=640 is capped at ncols1=2 by shared memory, whereas D=576 can use ncols1 up to 32
- About 11% wasted compute from zero-padding (640 vs. 576 elements)
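The 11% figure is the padded width measured against the useful width. A quick check, assuming the only difference between the two paths here is the 576 vs. 640 head dimension stated above (constant names are illustrative):

```python
D_F16 = 576      # head dimension on the f16 path
D_TURBO2 = 640   # zero-padded head dimension on the turbo2 path

padded_elems = D_TURBO2 - D_F16

# Extra work relative to the useful 576 elements (the "11%" quoted above).
overhead_vs_useful = padded_elems / D_F16

# Share of the padded 640-wide work that touches only zeros.
wasted_share = padded_elems / D_TURBO2

print(f"padded elements:     {padded_elems}")
print(f"overhead vs. useful: {overhead_vs_useful:.1%}")
print(f"share of padded:     {wasted_share:.1%}")
```

Depending on which denominator is used, the waste reads as roughly 11% (relative to 576) or 10% (relative to 640).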
Turbo2 was faster in April only because the pre-optimization MMA was slow enough that turbo2's 6.4x compression relieved bandwidth pressure on the 12GB card. Upstream removed the f16 bottleneck but not the turbo dequant bottleneck.
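To make the bandwidth argument concrete, here is a deliberately crude traffic model. It rests on assumptions that are not stated in this report: that the f16 temp buffer is written to and read back from global memory in full on every decode token, and that compute, caching, and padding can be ignored. It is a sketch of the reasoning, not a prediction of measured throughput.

```python
# All quantities are relative to reading the f16 KV cache once (= 1.0).
COMPRESSION = 6.4  # turbo2 compression ratio quoted above

# f16 path: the kernel reads the f16 KV cache directly.
f16_traffic = 1.0

# turbo2 with a temp buffer (assumed model):
#   read compressed KV + write f16 temp buffer + read f16 temp buffer
compressed_read = 1.0 / COMPRESSION
temp_buffer_traffic = compressed_read + 1.0 + 1.0

# turbo2 with inline dequant (no temp buffer): only the compressed read.
inline_traffic = compressed_read

print(f"f16 path:            {f16_traffic:.3f}")
print(f"turbo2, temp buffer: {temp_buffer_traffic:.3f}")
print(f"turbo2, inline:      {inline_traffic:.3f}")
```

Under these assumptions the temp-buffer path moves about twice the bytes of plain f16, while inline dequant would keep the compression advantage. That is the motivation for the fixes below.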
Potential Fixes
- Extend VEC FA to D=640 (~1-2 days): Inline dequant, no temp buffer. VEC template supports D=640 (640 % 64 == 0). Needs new dispatch entries.
- Fused dequant-MMA (~1-2 weeks): Read turbo2 directly in MMA kernel. Best perf, hardest.
- Strip zero-padding post-dequant (~2-4 days): Route to D=576 MMA path with full GQA.
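The strip-padding option relies on the padded lanes contributing nothing to the attention scores. The sketch below illustrates that in plain Python rather than CUDA. It assumes the 64 zero lanes sit at the tail of each K row, which this report does not confirm, and all names are hypothetical.

```python
import math
import random

random.seed(0)

D_USEFUL = 576  # width of the f16 MMA path
D_PADDED = 640  # zero-padded width of the turbo2 path

# Divisibility condition quoted for the VEC option above.
vec_divisible = D_PADDED % 64 == 0

# One query row and one dequantized K row; the K tail is zero padding (assumed layout).
q = [random.uniform(-1.0, 1.0) for _ in range(D_PADDED)]
k_useful = [random.uniform(-1.0, 1.0) for _ in range(D_USEFUL)]
k_padded = k_useful + [0.0] * (D_PADDED - D_USEFUL)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Score over the full padded width vs. score after stripping the padding.
score_padded = dot(q, k_padded)
score_stripped = dot(q[:D_USEFUL], k_padded[:D_USEFUL])

print(vec_divisible, math.isclose(score_padded, score_stripped))
```

If the padding layout matches this assumption, stripping to D=576 after dequant leaves the scores unchanged, so the only cost is the extra copy.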
Impact
Affects all MoE/MLA models with non-128-dim heads. Gap widens as upstream MMA path improves. Turbo value on MoE is now memory compression only, not decode speed.
Hardware: RTX 3080 Ti SM 8.6 | Model: DeepSeek-Coder-V2-Lite Q4_K_M