Summary
turbo2 decode throughput on DeepSeek-Coder-V2-Lite (MoE, head_k=192, head_v=128) has fallen from 1.76x f16 (April) to 0.45x f16 (May canonical). This is not a regression in the turbo code: upstream MoE optimizations benefit the MMA FA path (used by f16) but not the turbo code path.
Data
April (commit ~6545786, turbo-flash-cuda build):
| Config | pp512 | tg128 |
|---|---|---|
| f16/f16 | 1,463 | 56.4 |
| turbo2/turbo2 | 156 | 99.5 (1.76x f16) |
May (commit 69d8e4b, canonical tip):
| Config | pp512 | tg128 |
|---|---|---|
| f16/f16 | 5,328 | 241.5 |
| turbo2/turbo2 | 158 | 108 (0.45x f16) |
f16 decode (tg128) improved 4.28x between the two builds. turbo2 decode improved 9%. The gap will keep widening as upstream MoE/MLA optimizations continue.
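The ratios quoted here can be re-derived from the tg128 figures alone. The snippet below is a small arithmetic check, not part of any benchmark tooling; the dictionary layout and variable names are made up for illustration, and the numbers are assumed to be tokens per second.

```python
# tg128 decode figures copied from the April and May measurements above
# (assumed to be tokens/s; the dict layout is purely illustrative).
tg128 = {
    "april": {"f16": 56.4, "turbo2": 99.5},
    "may": {"f16": 241.5, "turbo2": 108.0},
}

# turbo2 decode speed relative to f16 within each build.
turbo_vs_f16_april = tg128["april"]["turbo2"] / tg128["april"]["f16"]
turbo_vs_f16_may = tg128["may"]["turbo2"] / tg128["may"]["f16"]

# How much each path gained between the April and May builds.
f16_gain = tg128["may"]["f16"] / tg128["april"]["f16"]
turbo2_gain = tg128["may"]["turbo2"] / tg128["april"]["turbo2"]

print(f"turbo2 vs f16, April: {turbo_vs_f16_april:.2f}x")
print(f"turbo2 vs f16, May:   {turbo_vs_f16_may:.2f}x")
print(f"f16 gain April->May:    {f16_gain:.2f}x")
print(f"turbo2 gain April->May: {(turbo2_gain - 1) * 100:.0f}%")
```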
Root Cause Analysis
f16 path (fast)
fattn.cu dispatch → D=576, V=512 → MMA FA kernel with ncols2=16 (full GQA optimization, gqa_ratio=16). No dequant needed. Benefits directly from upstream MMA data loading optimizations (4-byte → 16-byte granularity).
turbo2 path (slow)
fattn.cu dispatch → D=640 (zero-padded from 576) → MMA FA kernel, but:
- Full KV cache dequant every decode step (fattn-common.cuh:~1340): K_f16.alloc(ggml_nelements(K)) + to_fp16(). Converts the entire turbo2 K and V to f16 temp buffers for every single decode token.
- D=640 path capped at ncols1=2 due to shared memory limits (~line 208 in fattn-mma-f16.cuh), while D=576 gets ncols1 up to 32. Much less parallelism.
- 11% wasted compute from zero-padding (64 extra elements per head that are always zero).
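To make the first and last points concrete, here is a rough back-of-the-envelope model of K-side memory traffic per decode step. It is a sketch under stated assumptions, not a measurement: the context length is hypothetical, the 6.4x compression figure is applied to the padded D=640 layout, cache effects are ignored, and only K for a single KV head is counted (V adds a similar term). The function and constant names are invented for this example.

```python
D_NATIVE = 576      # K head size on the f16 path
D_PADDED = 640      # zero-padded K head size on the turbo2 path
F16_BYTES = 2       # bytes per f16 element
COMPRESSION = 6.4   # turbo2 KV size relative to f16, as quoted in this issue
N_CTX = 4096        # hypothetical context length, for illustration only

# Fraction of extra elements processed because of zero-padding.
padding_overhead = (D_PADDED - D_NATIVE) / D_NATIVE


def k_bytes_per_decode_step(n_ctx, path):
    """Approximate bytes moved for K in one decode step, per KV head."""
    if path == "f16":
        # f16 cache is read once by the attention kernel.
        return n_ctx * D_NATIVE * F16_BYTES
    padded_f16 = n_ctx * D_PADDED * F16_BYTES
    compressed = padded_f16 / COMPRESSION
    if path == "turbo2_temp_buffer":
        # Read compressed cache, write the f16 temp buffer, read it back.
        return compressed + 2 * padded_f16
    if path == "turbo2_inline":
        # Hypothetical inline dequant: only the compressed cache is read.
        return compressed
    raise ValueError(f"unknown path: {path}")


f16_traffic = k_bytes_per_decode_step(N_CTX, "f16")
temp_traffic = k_bytes_per_decode_step(N_CTX, "turbo2_temp_buffer")
inline_traffic = k_bytes_per_decode_step(N_CTX, "turbo2_inline")

print(f"padding overhead: {padding_overhead:.1%}")
print(f"temp-buffer path vs f16 traffic: {temp_traffic / f16_traffic:.2f}x")
print(f"inline-dequant path vs f16 traffic: {inline_traffic / f16_traffic:.2f}x")
```

Under these assumptions the temp-buffer path moves more bytes per token than plain f16 despite the smaller cache, while an inline-dequant path would move far fewer.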
Why turbo2 was FASTER in April
The pre-optimization MMA kernel was slow enough that turbo2's 6.4x smaller KV cache created bandwidth savings that outweighed the dequant overhead. On a 12GB card with a 9.7GB model, this was significant. The upstream optimizations removed the f16 bottleneck but not the turbo dequant bottleneck.
Potential Fixes
- Extend VEC FA to D=640 (1-2 days): VEC does inline dequant with zero temp buffer overhead. Currently limited to Q->ne[0] <= 256. The VEC template already supports D=640 (D % (2*WARP_SIZE) == 0 passes). Would need new dispatch entries and template instances.
- Fused dequant-MMA kernel (1-2 weeks): Read turbo2 blocks directly in the MMA kernel, dequant to f16 in shared memory. Avoids temp buffer entirely. Best possible perf but significant kernel work.
- Strip zero-padding post-dequant (2-4 days): Dequant turbo2 K to 576 f16 elements (strip padding), route to D=576 MMA path with full GQA optimization. Requires dequant-aware WHT inverse.
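The two VEC conditions quoted in the first option can be checked by hand. The sketch below only mirrors the arithmetic stated in this issue (the divisibility test and the current Q->ne[0] <= 256 dispatch limit); the function names are hypothetical and it does not reproduce any other constraint the real kernel or dispatch code may impose.

```python
WARP_SIZE = 32            # threads per warp on NVIDIA GPUs
VEC_DISPATCH_LIMIT = 256  # current Q->ne[0] upper bound quoted above


def vec_template_accepts(d):
    """Divisibility condition quoted above: D % (2*WARP_SIZE) == 0."""
    return d % (2 * WARP_SIZE) == 0


def vec_dispatch_reaches(d):
    """Whether the current dispatch limit would route head size d to VEC."""
    return d <= VEC_DISPATCH_LIMIT


for d in (128, 256, 576, 640):
    print(
        f"D={d}: template check={vec_template_accepts(d)}, "
        f"within dispatch limit={vec_dispatch_reaches(d)}"
    )
```

For D=640 the divisibility check passes while the dispatch limit does not, which matches the statement that only new dispatch entries and template instances are missing. D=576 also satisfies the divisibility check, though whether an unpadded VEC path is workable for turbo2 is not something this check can establish.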
Impact
This affects all MoE/MLA models with non-128-dim heads where zero-padding is needed. As upstream continues optimizing the f16 MMA path, the gap will widen further. turbo's value on MoE is currently limited to memory compression for longer contexts; it no longer provides a decode speed advantage.
Environment
- Hardware: RTX 3080 Ti (SM 8.6, 12GB VRAM)
- Build: 69d8e4be4, -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON
- Model: DeepSeek-Coder-V2-Lite-Instruct Q4_K_M