CUDA: turbo2 decode regression on MoE — VEC/MMA dispatch gap widens with upstream optimizations #132

@seanrasch

Summary

turbo2 decode throughput on DeepSeek-Coder-V2-Lite (MoE, head_k=192, head_v=128) fell from 1.76x the f16 throughput (April) to 0.45x (May canonical). Upstream MoE optimizations benefit the MMA FA path used by f16 but not the turbo code path.

Data

| Build | f16 tg128 | turbo2 tg128 | Ratio (turbo2 / f16) |
|---|---|---|---|
| April (~6545786) | 56.4 | 99.5 | 1.76x |
| May (69d8e4b) | 241.5 | 108 | 0.45x |

f16 improved 4.28x. turbo2 improved 9%.
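
The ratios and improvements quoted above follow directly from the four measurements. A minimal sketch that recomputes them; the dictionary layout and function names are illustrative, and the numbers are assumed to be tg128 throughput in tokens per second, which the issue does not state explicitly:

```python
# tg128 decode throughput copied from the Data table
# (unit assumed to be tokens/s; the issue does not state it).
builds = {
    "April": {"f16": 56.4, "turbo2": 99.5},
    "May": {"f16": 241.5, "turbo2": 108.0},
}

def ratio(build):
    """turbo2 throughput as a multiple of f16 throughput for one build."""
    return builds[build]["turbo2"] / builds[build]["f16"]

def speedup(kv_type):
    """May throughput divided by April throughput for one KV cache type."""
    return builds["May"][kv_type] / builds["April"][kv_type]

for name in builds:
    print(f"{name}: turbo2/f16 = {ratio(name):.2f}x")
print(f"f16 April -> May: {speedup('f16'):.2f}x")
print(f"turbo2 April -> May: {speedup('turbo2'):.2f}x")
```

The turbo2 speedup of roughly 1.09x is the "9%" figure: turbo2 did get faster, just far less than f16 did.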

Root Cause

f16 path: D=576 → MMA FA with ncols2=16, full GQA optimization, upstream 16-byte load granularity. No dequant.

turbo2 path: D=640 (zero-padded) → MMA FA, but:

  1. Full KV dequant to f16 temp buffer every decode token (fattn-common.cuh ~line 1340)
  2. D=640 capped at ncols1=2 (shared memory), vs D=576 gets ncols1 up to 32
  3. About 11% wasted compute from zero-padding (64 padded dims on top of the 576 real ones)
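
The padding cost can be checked from the two head sizes alone. A small sketch, assuming the 11% figure is measured relative to the unpadded D=576 (the issue does not say which baseline it uses):

```python
D_REAL = 576    # head size on the f16 path, from the issue
D_PADDED = 640  # zero-padded head size on the turbo2 path, from the issue

pad = D_PADDED - D_REAL               # dims that carry only zeros
overhead_vs_real = pad / D_REAL       # extra work relative to D=576
share_of_padded = pad / D_PADDED      # fraction of the padded head that is zeros

print(f"padding dims: {pad}")
print(f"overhead relative to D={D_REAL}: {overhead_vs_real:.1%}")
print(f"share of D={D_PADDED} that is padding: {share_of_padded:.1%}")
```

Relative to D=576 the overhead is about 11.1%; measured as a share of the padded head it is 10%.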

turbo2 was faster in April only because the pre-optimization MMA path was slow enough that turbo2's 6.4x KV compression relieved bandwidth pressure on the 12GB card. Upstream has since removed the f16 bottleneck but not the turbo2 dequant bottleneck.

Potential Fixes

  1. Extend VEC FA to D=640 (~1-2 days): Inline dequant, no temp buffer. VEC template supports D=640 (640 % 64 == 0). Needs new dispatch entries.
  2. Fused dequant-MMA (~1-2 weeks): Read turbo2 directly in MMA kernel. Best perf, hardest.
  3. Strip zero-padding post-dequant (~2-4 days): Route to D=576 MMA path with full GQA.
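
Fix 3 rests on the observation that zero padding contributes nothing to the attention score, so dropping it after dequant should leave Q·K unchanged. A toy sketch of that equivalence in plain Python; it assumes the padding occupies the trailing 64 dims of each K row, which the issue does not confirm, and it does not model the actual kernel or dispatch code:

```python
import math
import random

D_REAL, D_PADDED = 576, 640  # head sizes from the issue

def dot(a, b):
    """Plain dot product over two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
# Hypothetical query row; values in the padded tail are arbitrary on purpose.
q = [random.uniform(-1.0, 1.0) for _ in range(D_PADDED)]
# Hypothetical dequantized K row: 576 real dims followed by zero padding
# (trailing placement is an assumption made for this sketch).
k_real = [random.uniform(-1.0, 1.0) for _ in range(D_REAL)]
k_padded = k_real + [0.0] * (D_PADDED - D_REAL)

score_padded = dot(q, k_padded)            # what the D=640 path computes
score_stripped = dot(q[:D_REAL], k_real)   # what a D=576 path would compute

print(math.isclose(score_padded, score_stripped, rel_tol=0.0, abs_tol=1e-12))
```

If the real layout interleaves or prepends the padding, the same argument holds but the slice would need to select the real dims accordingly.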

Impact

Affects all MoE/MLA models with non-128-dim heads. The gap widens as the upstream MMA path improves. On MoE, turbo's value is now memory compression only, not decode speed.

Hardware: RTX 3080 Ti SM 8.6 | Model: DeepSeek-Coder-V2-Lite Q4_K_M
