CUDA: turbo2 decode regression on MoE models — VEC/MMA dispatch gap widens with upstream optimizations

## Summary

turbo2 decode on DeepSeek-Coder-V2-Lite (MoE, head_k=192, head_v=128) has fallen from 1.76x f16 throughput (April) to 0.45x (May canonical). The turbo code itself has not regressed: upstream MoE optimizations benefit the MMA FA path (used by f16) but not the turbo code path.

## Data

**April (commit ~6545786, turbo-flash-cuda build):**

| Config | pp512 (t/s) | tg128 (t/s) |
|--------|-------------|-------------|
| f16/f16 | 1,463 | 56.4 |
| turbo2/turbo2 | 156 | **99.5 (1.76x f16)** |

**May (commit 69d8e4be4, canonical tip):**

| Config | pp512 (t/s) | tg128 (t/s) |
|--------|-------------|-------------|
| f16/f16 | 5,328 | **241.5** |
| turbo2/turbo2 | 158 | **108 (0.45x f16)** |

f16 decode improved 4.28x (56.4 → 241.5 t/s). turbo2 decode improved about 9% (99.5 → 108 t/s). The gap is likely to keep widening as upstream MoE/MLA optimizations continue.
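For anyone re-checking the ratios, here is a minimal sketch that recomputes them from the tg128 columns of the two tables. The dictionary and variable names are made up for illustration; only the four throughput numbers come from the benchmarks.

```python
# Decode throughput (tg128, tokens/s) copied from the April and May tables.
april = {"f16": 56.4, "turbo2": 99.5}
may = {"f16": 241.5, "turbo2": 108.0}

# turbo2 relative to f16 within each build.
turbo_vs_f16_april = april["turbo2"] / april["f16"]
turbo_vs_f16_may = may["turbo2"] / may["f16"]

# Improvement of each config between the two builds.
f16_gain = may["f16"] / april["f16"]
turbo2_gain = may["turbo2"] / april["turbo2"]

print(f"turbo2/f16 April: {turbo_vs_f16_april:.2f}x")
print(f"turbo2/f16 May:   {turbo_vs_f16_may:.2f}x")
print(f"f16 gain:         {f16_gain:.2f}x")
print(f"turbo2 gain:      {(turbo2_gain - 1) * 100:.1f}%")
```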

## Root Cause Analysis

### f16 path (fast)
`fattn.cu` dispatch → D=576, V=512 → MMA FA kernel with `ncols2=16` (full GQA optimization, `gqa_ratio=16`). No dequant needed. Benefits directly from upstream MMA data loading optimizations (4-byte → 16-byte granularity).

### turbo2 path (slow)
`fattn.cu` dispatch → D=640 (zero-padded from 576) → MMA FA kernel, but:

1. **Full KV cache dequant every decode step** (`fattn-common.cuh:~1340`): `K_f16.alloc(ggml_nelements(K))` + `to_fp16()`. The entire turbo2 K and V cache is converted to f16 temp buffers for every single decode token.
2. **D=640 path capped at ncols1=2** due to shared memory limits (~line 208 in `fattn-mma-f16.cuh`), while D=576 gets ncols1 up to 32, so there is much less parallelism.
3. **11% wasted compute** from zero-padding: 64 of the 640 elements per head are always zero, which is 64/576 ≈ 11% extra work over the native width.
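A rough sketch of the arithmetic behind items 1 and 3. The function names and the 4096-token cache length are hypothetical, and it assumes a single K row of width D per cached token; it is not taken from the kernel source.

```python
D_NATIVE = 576   # head size on the f16 path
D_PADDED = 640   # zero-padded head size on the turbo2 path
F16_BYTES = 2    # size of one f16 element

def padding_overhead(d_native, d_padded):
    """Extra elements processed per head, relative to the native width."""
    return (d_padded - d_native) / d_native

def temp_f16_bytes(n_elements):
    """Size of an f16 temp buffer holding n_elements dequantized values."""
    return n_elements * F16_BYTES

# Hypothetical cache: 4096 tokens, one padded K row per token (assumption).
n_tokens = 4096
k_elements = n_tokens * D_PADDED

print(f"padding overhead: {padding_overhead(D_NATIVE, D_PADDED):.1%}")
print(f"K temp buffer rewritten per decode step: {temp_f16_bytes(k_elements)} bytes")
```

Under these assumptions the buffer is rewritten once per layer per generated token, and it grows linearly with context length, which is why the cost does not amortize during decode.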

### Why turbo2 was FASTER in April
The pre-optimization MMA kernel was slow enough that turbo2's 6.4x smaller KV cache created bandwidth savings that outweighed the dequant overhead. On a 12GB card with a 9.7GB model, this was significant. The upstream optimizations removed the f16 bottleneck but not the turbo dequant bottleneck.
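As a back-of-the-envelope check on the bandwidth argument, the sketch below derives the effective storage cost implied by the 6.4x figure. The helper names and the one-million-element cache are illustrative assumptions, and the model ignores dequant cost entirely.

```python
F16_BITS = 16
COMPRESSION = 6.4  # turbo2 vs f16 KV cache size ratio quoted above

def effective_bits_per_element(compression=COMPRESSION):
    """Average bits per KV element implied by the compression ratio."""
    return F16_BITS / compression

def kv_bytes(n_elements, bits_per_element):
    """Bytes that must be read to stream n_elements from the KV cache."""
    return n_elements * bits_per_element / 8

n = 1_000_000  # hypothetical number of cached KV elements
f16_read = kv_bytes(n, F16_BITS)
turbo2_read = kv_bytes(n, effective_bits_per_element())

print(f"f16 read:    {f16_read:.0f} bytes")
print(f"turbo2 read: {turbo2_read:.0f} bytes")
```

The saving only helps while KV reads are the bottleneck; once the f16 kernel is no longer bandwidth-starved, the per-step dequant dominates instead.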

## Potential Fixes

1. **Extend VEC FA to D=640** (1-2 days): VEC does inline dequant with zero temp buffer overhead. Currently limited to `Q->ne[0] <= 256`. The VEC template already supports D=640 (`D % (2*WARP_SIZE) == 0` passes). Would need new dispatch entries and template instances.

2. **Fused dequant-MMA kernel** (1-2 weeks): Read turbo2 blocks directly in the MMA kernel, dequant to f16 in shared memory. Avoids temp buffer entirely. Best possible perf but significant kernel work.

3. **Strip zero-padding post-dequant** (2-4 days): Dequant turbo2 K to 576 f16 elements (strip padding), route to D=576 MMA path with full GQA optimization. Requires dequant-aware WHT inverse.
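The conditions in fixes 1 and 3 can be sketched on the host side as follows. This is an illustration only: the function names are hypothetical, the dispatch limit is modeled as a plain parameter, and `strip_padding` assumes the rows have already been through the inverse WHT so that the trailing 64 elements really are padding.

```python
WARP_SIZE = 32    # CUDA warp size
VEC_MAX_D = 256   # current dispatch limit from fix 1 (Q->ne[0] <= 256)

def vec_template_accepts(d):
    """Divisibility condition quoted in fix 1."""
    return d % (2 * WARP_SIZE) == 0

def vec_dispatch_allows(d, max_d=VEC_MAX_D):
    """True if a head size passes both the dispatch limit and the template check."""
    return d <= max_d and vec_template_accepts(d)

def strip_padding(rows, d_native=576):
    """Fix 3: keep only the first d_native elements of each dequantized row."""
    return [row[:d_native] for row in rows]

print(vec_template_accepts(640))            # template condition alone
print(vec_dispatch_allows(640))             # with the current limit
print(vec_dispatch_allows(640, max_d=640))  # with a raised limit

padded_row = [1.0] * 576 + [0.0] * 64
print(len(strip_padding([padded_row])[0]))
```

The first check shows why fix 1 is mostly dispatch and instantiation work: D=640 already satisfies the template condition and is only rejected by the size limit.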

## Impact

This affects all MoE/MLA models with non-128-dim heads where zero-padding is needed, and the gap is likely to widen as upstream continues optimizing the f16 MMA path. turbo's value on MoE is currently limited to memory compression for longer contexts; it no longer provides a decode speed advantage.

## Environment
- **Hardware:** RTX 3080 Ti (SM 8.6, 12GB VRAM)
- **Build:** `69d8e4be4`, `-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON`
- **Model:** DeepSeek-Coder-V2-Lite-Instruct Q4_K_M
