CUDA: turbo2 decode regression on MoE — VEC/MMA dispatch gap widens with upstream optimizations

## Summary

turbo2 decode throughput on DeepSeek-Coder-V2-Lite (MoE, head_k=192, head_v=128) fell from 1.76x of f16 (April) to 0.45x of f16 (May canonical). The regression is relative: turbo2 itself got slightly faster, but upstream MoE optimizations sped up the MMA FA path used by f16 and left the turbo2 code path largely unchanged.

## Data

| Build | f16 tg128 | turbo2 tg128 | Ratio |
|-------|-----------|--------------|-------|
| April (~6545786) | 56.4 | 99.5 | **1.76x** |
| May (69d8e4be4) | **241.5** | 108 | **0.45x** |

Between the two builds, f16 improved 4.28x (56.4 to 241.5). turbo2 improved about 9% (99.5 to 108).
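The ratios and build-over-build gains quoted here can be rechecked with a few lines of arithmetic. This is a minimal sketch: the `Build` struct and the helper names are illustrative only, the numbers are copied from the table, and tokens/s is assumed as the unit (as llama-bench reports for tg128).

```cpp
#include <cstdio>

// Throughput figures copied from the Data table (assumed unit: tokens/s).
struct Build {
    const char *name;
    double f16_tps;
    double turbo2_tps;
};

constexpr Build april{"April", 56.4, 99.5};
constexpr Build may{"May", 241.5, 108.0};

// turbo2 throughput relative to f16 within a single build.
constexpr double ratio(const Build &b) { return b.turbo2_tps / b.f16_tps; }

// Build-over-build speedup of one path (after / before).
constexpr double speedup(double before, double after) { return after / before; }

// Prints the derived figures for both builds.
void print_summary() {
    std::printf("%s: turbo2/f16 = %.2fx\n", april.name, ratio(april));
    std::printf("%s: turbo2/f16 = %.2fx\n", may.name, ratio(may));
    std::printf("f16 speedup:    %.2fx\n", speedup(april.f16_tps, may.f16_tps));
    std::printf("turbo2 speedup: %.3fx\n", speedup(april.turbo2_tps, may.turbo2_tps));
}
```

Calling `print_summary()` from a `main` reproduces the Ratio column. The turbo2 gain works out to roughly 8.5%, which the text rounds to 9%.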

## Root Cause

**f16 path**: D=576 → MMA FA with ncols2=16, full GQA optimization, upstream 16-byte load granularity. No dequant.

**turbo2 path**: D=640 (zero-padded) → MMA FA, but:
1. Full KV dequant to f16 temp buffer every decode token (fattn-common.cuh ~line 1340)
2. D=640 is capped at ncols1=2 by shared memory, whereas D=576 gets ncols1 up to 32
3. About 11% extra compute from zero-padding (640 vs. 576 columns)
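The padding overhead in item 3 depends on which baseline is used. The sketch below computes both readings, assuming per-token work scales linearly with the head dimension D (a simplification, not a measured property of the kernel). The constant and function names are illustrative.

```cpp
// Head sizes from the issue: 576 is the unpadded width, 640 the zero-padded width.
constexpr int D_NATIVE = 576;
constexpr int D_PADDED = 640;

// Number of zero columns added by padding.
constexpr int padded_columns(int d_native, int d_padded) {
    return d_padded - d_native;
}

// Extra work relative to the unpadded kernel (assumes work is linear in D).
constexpr double extra_vs_native(int d_native, int d_padded) {
    return static_cast<double>(d_padded - d_native) / d_native;
}

// Share of the padded kernel's work that is spent on zero columns.
constexpr double wasted_of_padded(int d_native, int d_padded) {
    return static_cast<double>(d_padded - d_native) / d_padded;
}

static_assert(padded_columns(D_NATIVE, D_PADDED) == 64, "64 zero columns per head");
```

Under that assumption, `extra_vs_native(576, 640)` is 64/576, about 11.1%, which matches the figure quoted in the list. `wasted_of_padded(576, 640)` is 64/640, exactly 10%.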

Turbo2 was faster in April only because the pre-optimization MMA was slow enough that turbo2's 6.4x compression relieved bandwidth pressure on the 12GB card. Upstream removed the f16 bottleneck but not the turbo dequant bottleneck.

## Potential Fixes

1. **Extend VEC FA to D=640** (~1-2 days): Inline dequant, no temp buffer. VEC template supports D=640 (640 % 64 == 0). Needs new dispatch entries.
2. **Fused dequant-MMA** (~1-2 weeks): Read turbo2 directly in MMA kernel. Best perf, hardest.
3. **Strip zero-padding post-dequant** (~2-4 days): Route to D=576 MMA path with full GQA.
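Fix 1 amounts to a dispatch change, which can be sketched on the host side. This is a hypothetical model, not the actual dispatch code in the repository: `FaKernel`, `FaRequest`, `pick_kernel_current`, and `pick_kernel_proposed` are invented names, and the "current" function only models the D=576/D=640 case described under Root Cause. The `head_dim % 64 == 0` check is the VEC precondition stated in fix 1.

```cpp
// Hypothetical kernel families for the case described in this issue.
enum class FaKernel {
    VEC,              // vector kernel with inline dequant, no temp buffer
    MMA_NATIVE,       // MMA FA reading f16 KV directly
    MMA_DEQUANT_TEMP  // MMA FA after a full KV dequant into an f16 temp buffer
};

// Hypothetical request descriptor; field names are illustrative.
struct FaRequest {
    int  head_dim;        // K head size seen by the kernel (576 or 640 here)
    int  n_query_tokens;  // 1 during single-token decode
    bool kv_quantized;    // true for turbo2, false for f16
};

// Precondition quoted in fix 1: the VEC template needs D divisible by 64.
constexpr bool vec_supports(int head_dim) { return head_dim % 64 == 0; }

// Simplified model of today's routing: quantized KV always takes the temp-buffer path.
constexpr FaKernel pick_kernel_current(const FaRequest &r) {
    return r.kv_quantized ? FaKernel::MMA_DEQUANT_TEMP : FaKernel::MMA_NATIVE;
}

// Proposed routing: send quantized single-token decode to VEC when the head size allows it.
constexpr FaKernel pick_kernel_proposed(const FaRequest &r) {
    return (r.kv_quantized && r.n_query_tokens == 1 && vec_supports(r.head_dim))
               ? FaKernel::VEC
               : pick_kernel_current(r);
}
```

In this sketch only quantized single-token decode is rerouted. f16 requests and multi-token batches keep their existing path, so the change would not touch the MMA path that upstream optimized.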

## Impact

Affects all MoE/MLA models with non-128-dim heads. The gap widens as the upstream MMA path improves. On these models, turbo2 currently offers memory compression only, not a decode speedup.

**Hardware:** RTX 3080 Ti SM 8.6 | **Model:** DeepSeek-Coder-V2-Lite Q4_K_M
