
CUDA: turbo2 decode regression on MoE models — VEC/MMA dispatch gap widens with upstream optimizations #131

@seanrasch


Summary

turbo2 decode on DeepSeek-Coder-V2-Lite (MoE, head_k=192, head_v=128) has fallen from 1.76x of f16 (April) to 0.45x of f16 (May canonical). This is not a regression in the turbo code itself: upstream MoE optimizations benefit the MMA FA path (used by f16) but not the turbo code path.

Data

April (commit ~6545786, turbo-flash-cuda build):

Config            pp512    tg128
f16/f16           1,463     56.4
turbo2/turbo2       156     99.5   (1.76x f16)

May (commit 69d8e4b, canonical tip):

Config            pp512    tg128
f16/f16           5,328    241.5
turbo2/turbo2       158    108     (0.45x f16)

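On tg128 (decode), f16 improved 4.28x between the two builds, while turbo2 improved about 9%. On pp512, f16 improved from 1,463 to 5,328 and turbo2 stayed essentially flat (156 vs 158). The gap will keep widening as upstream MoE/MLA optimizations continue.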
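For anyone re-checking the ratios, here is a minimal C++ sketch of the arithmetic. The struct and helper names are made up for illustration; only the four tg128 figures come from the tables above.

```cpp
#include <cassert>

// tg128 figures copied from the April and May tables in this issue.
struct DecodeResult {
    double f16_tps;     // f16/f16 decode throughput
    double turbo2_tps;  // turbo2/turbo2 decode throughput
};

const DecodeResult kApril{56.4, 99.5};
const DecodeResult kMay{241.5, 108.0};

// turbo2 decode throughput expressed as a multiple of f16 within one build.
inline double relative_to_f16(const DecodeResult& r) {
    assert(r.f16_tps > 0.0);
    return r.turbo2_tps / r.f16_tps;
}

// How much one config sped up between two builds.
inline double speedup(double before, double after) {
    assert(before > 0.0);
    return after / before;
}
```

`relative_to_f16(kApril)` and `relative_to_f16(kMay)` give the 1.76x and 0.45x figures, and `speedup(kApril.f16_tps, kMay.f16_tps)` gives the 4.28x f16 improvement.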

Root Cause Analysis

f16 path (fast)

fattn.cu dispatch → D=576, V=512 → MMA FA kernel with ncols2=16 (full GQA optimization, gqa_ratio=16). No dequant needed. Benefits directly from upstream MMA data loading optimizations (4-byte → 16-byte granularity).

turbo2 path (slow)

fattn.cu dispatch → D=640 (zero-padded from 576) → MMA FA kernel, but:

  1. Full KV cache dequant every decode step (fattn-common.cuh:~1340): K_f16.alloc(ggml_nelements(K)) + to_fp16(). Converts entire turbo2 K and V to f16 temp buffers for every single decode token.
  2. D=640 path capped at ncols1=2 due to shared memory limits (~line 208 in fattn-mma-f16.cuh), while D=576 gets ncols1 up to 32, so the turbo2 path has much less parallelism.
  3. About 11% wasted compute from zero-padding: 64 extra elements per head that are always zero, on top of 576 real ones (64/576 ≈ 11%).
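To put rough numbers on items 1 and 3, here is a small host-side C++ sketch. The helper names and the `n_ctx` / `n_head_kv` parameters are hypothetical and only illustrate how the cost scales; the actual tensor shapes handled in fattn-common.cuh may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr int kHeadDim       = 576;  // D on the f16 path
constexpr int kPaddedHeadDim = 640;  // D on the turbo2 path after zero-padding

// Extra per-head work caused by padding, relative to the unpadded size.
inline double padding_overhead(int d, int d_padded) {
    assert(d > 0 && d_padded >= d);
    return static_cast<double>(d_padded - d) / d;
}

// Elements that must be dequantized for ONE cache tensor on ONE decode step,
// assuming the whole cache is converted (as described in item 1).
// n_ctx and n_head_kv are illustrative parameters, not names from the code.
inline std::int64_t dequant_elements_per_step(std::int64_t n_ctx,
                                              std::int64_t n_head_kv,
                                              std::int64_t d_padded) {
    assert(n_ctx >= 0 && n_head_kv >= 0 && d_padded >= 0);
    return n_ctx * n_head_kv * d_padded;
}

// Size in bytes of the f16 temp buffer for that many elements (2 bytes each).
inline std::size_t f16_temp_bytes(std::int64_t n_elements) {
    assert(n_elements >= 0);
    return static_cast<std::size_t>(n_elements) * 2;
}
```

The point of the sketch is that the dequant cost grows linearly with context length and is paid again for every generated token, whereas the f16 path pays nothing here.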

Why turbo2 was FASTER in April

The pre-optimization MMA kernel was slow enough that turbo2's 6.4x smaller KV cache created bandwidth savings that outweighed the dequant overhead. On a 12GB card with a 9.7GB model, this was significant. The upstream optimizations removed the f16 bottleneck but not the turbo dequant bottleneck.

Potential Fixes

  1. Extend VEC FA to D=640 (1-2 days): VEC does inline dequant with zero temp buffer overhead. Currently limited to Q->ne[0] <= 256. The VEC template already supports D=640 (D % (2*WARP_SIZE) == 0 passes). Would need new dispatch entries and template instances.

  2. Fused dequant-MMA kernel (1-2 weeks): Read turbo2 blocks directly in the MMA kernel, dequant to f16 in shared memory. Avoids temp buffer entirely. Best possible perf but significant kernel work.

  3. Strip zero-padding post-dequant (2-4 days): Dequant turbo2 K to 576 f16 elements (strip padding), route to D=576 MMA path with full GQA optimization. Requires dequant-aware WHT inverse.
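Two pieces of the fixes above can be sketched on the host in plain C++. This is only an illustration under assumptions: `vec_head_dim_ok` restates the divisibility condition quoted in fix 1, and `strip_padding` is a hypothetical helper for fix 3 that assumes the padding sits at the end of each row and that the data has already been taken back out of the WHT domain (the part fix 3 flags as extra work). Neither function exists in the codebase, and `float` stands in for f16.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr int WARP_SIZE = 32;  // warp size on NVIDIA GPUs

// Fix 1: the template-side condition quoted above, D % (2*WARP_SIZE) == 0.
constexpr bool vec_head_dim_ok(int D) { return D % (2 * WARP_SIZE) == 0; }

static_assert(vec_head_dim_ok(640), "padded turbo2 head dim passes the check");
static_assert(vec_head_dim_ok(576), "unpadded head dim passes the check too");

// Fix 3 (hypothetical helper): copy rows of width d_padded into rows of
// width d, dropping the trailing d_padded - d elements of every row.
template <typename T>
std::vector<T> strip_padding(const std::vector<T>& src, std::size_t n_rows,
                             std::size_t d_padded, std::size_t d) {
    assert(d <= d_padded);
    assert(src.size() == n_rows * d_padded);
    std::vector<T> dst(n_rows * d);
    for (std::size_t r = 0; r < n_rows; ++r) {
        const T* row = src.data() + r * d_padded;
        std::copy(row, row + d, dst.data() + r * d);
    }
    return dst;
}
```

Passing the divisibility check only shows the template would instantiate; the dispatch limit `Q->ne[0] <= 256` and the missing template instances mentioned in fix 1 still have to be addressed separately.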

Impact

This affects all MoE/MLA models with non-128-dim heads where zero-padding is needed. As upstream continues optimizing the f16 MMA path, the gap will widen further. turbo's value on MoE is currently limited to memory compression for longer contexts — it no longer provides a decode speed advantage.

Environment

  • Hardware: RTX 3080 Ti (SM 8.6, 12GB VRAM)
  • Build: 69d8e4be4, -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA_ALL_QUANTS=ON
  • Model: DeepSeek-Coder-V2-Lite-Instruct Q4_K_M
