feat(dflash): reduce feature mirror memory with dtype policy#309

Open
weicj wants to merge 1 commit into Luce-Org:main from weicj:experiment-dflash-feature-dtype
weicj (Collaborator) commented May 29, 2026

Summary

This PR adds a dtype policy for the DFlash draft-side feature mirror to reduce DFlash OOM risk while keeping the existing draft graph contract unchanged.

DFlash mirrors captured target hidden states on the draft side so the draft model can propose tokens from real target features. That mirror was always F32, which makes it a direct memory pressure point for DFlash, including target layer-split DFlash. This PR keeps F32 as the default and adds optional DFLASH_FEATURE_DTYPE=f16|bf16|q8_0 storage for the mirror. Rows are converted back to F32 before feeding target_hidden_cat, so the draft graph input contract remains unchanged.
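
As a rough illustration of the opt-in policy described above, the environment variable could be mapped to a storage type as in the sketch below. The enum and both function names are hypothetical and are not taken from the PR. Falling back to F32 on an unrecognized value is also an assumption; the description only states that F32 is the default.

```cpp
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical enum for illustration; the PR itself uses ggml types.
enum class FeatureDtype { F32, F16, BF16, Q8_0 };

// Hypothetical helper: map a DFLASH_FEATURE_DTYPE value to a storage type.
// An unset or empty value selects the F32 default. An unrecognized value
// yields std::nullopt so the caller decides how to react.
std::optional<FeatureDtype> parse_feature_dtype(const char * value) {
    if (value == nullptr || *value == '\0') {
        return FeatureDtype::F32;
    }
    const std::string s(value);
    if (s == "f32")  return FeatureDtype::F32;
    if (s == "f16")  return FeatureDtype::F16;
    if (s == "bf16") return FeatureDtype::BF16;
    if (s == "q8_0") return FeatureDtype::Q8_0;
    return std::nullopt;
}

// Assumption: unrecognized values fall back to F32 rather than aborting.
FeatureDtype feature_dtype_from_env() {
    return parse_feature_dtype(std::getenv("DFLASH_FEATURE_DTYPE"))
        .value_or(FeatureDtype::F32);
}
```

Keeping the parser separate from the getenv call makes the mapping testable without touching the process environment.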

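For the Qwen3.6-27B path, the mirror allocation is cap * captured_feature_width * bytes_per_element, so the saving scales directly with the element size. Q8_0 stores each 32-element block in 34 bytes, which is 1.0625 bytes per element. At the full 256K context cap (cap=262144) and captured_feature_width=25600: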

  • F32 mirror: about 25.0 GiB
  • F16/BF16 mirror: about 12.5 GiB, saving about 12.5 GiB
  • Q8_0 mirror: about 6.6 GiB, saving about 18.4 GiB versus F32
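
The figures above can be reproduced with a small calculation. This is a sketch with hypothetical names (DtypeLayout, mirror_bytes, to_gib), not code from the PR, and it assumes the feature width is a whole multiple of the block size, which holds for 25600 and the 32-element Q8_0 block.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical descriptor: elements per block and bytes per block.
struct DtypeLayout {
    std::size_t block_elems;
    std::size_t block_bytes;
};

constexpr DtypeLayout kF32 {1, 4};
constexpr DtypeLayout kF16 {1, 2};
constexpr DtypeLayout kBF16{1, 2};
// Q8_0: 32 int8 values plus one 2-byte scale per block.
constexpr DtypeLayout kQ8_0{32, 34};

// Mirror size in bytes for `cap` rows of `width` features each.
// Assumes `width` is a multiple of `block_elems`.
constexpr std::uint64_t mirror_bytes(std::uint64_t cap, std::uint64_t width,
                                     DtypeLayout t) {
    return cap * (width / t.block_elems) * t.block_bytes;
}

constexpr double to_gib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0 * 1024.0);
}
```

With cap=262144 and width=25600, to_gib(mirror_bytes(...)) gives 25.0 for F32, 12.5 for F16/BF16, and 6.640625 for Q8_0, matching the rounded values in the list.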

Changes

  • Add DFLASH_FEATURE_DTYPE=f32|f16|bf16|q8_0 for the DFlash draft feature mirror.
  • Keep default behavior as F32.
  • Store the draft-side feature mirror as F16, BF16, or Q8_0 when requested.
  • Use ggml row sizing and type traits for dtype-aware mirror storage, including Q8_0 block layout.
  • Convert mirror rows back to F32 before feeding target_hidden_cat.
  • Disable direct mirror view for non-F32 mirror storage because the draft graph still expects F32 input.
  • Make Qwen35 target-layer-split draft feature snapshot/restore dtype-aware, so prefix-cache restore does not read or write reduced-precision mirror storage as raw F32.
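
To make the store-reduced, read-back-as-F32 round trip concrete, here is a self-contained sketch of a Q8_0-style block quantizer and dequantizer. It does not call ggml and is not the PR's code: the struct and function names are hypothetical, and the per-block scale is kept as a float for simplicity, whereas ggml's Q8_0 block stores it as a 16-bit half. It also assumes the row length is a multiple of 32.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kBlock = 32;  // elements per Q8_0-style block

// Simplified block: float scale here, fp16 scale in ggml's real layout.
struct BlockQ8 {
    float       d;           // per-block scale
    std::int8_t qs[kBlock];  // quantized values in [-127, 127]
};

// Quantize a row of n floats (n must be a multiple of kBlock).
std::vector<BlockQ8> quantize_row(const float * x, std::size_t n) {
    std::vector<BlockQ8> out(n / kBlock);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float * src = x + b * kBlock;
        float amax = 0.0f;  // largest magnitude in this block
        for (std::size_t i = 0; i < kBlock; ++i) {
            amax = std::max(amax, std::fabs(src[i]));
        }
        const float d  = amax / 127.0f;
        const float id = d != 0.0f ? 1.0f / d : 0.0f;
        out[b].d = d;
        for (std::size_t i = 0; i < kBlock; ++i) {
            out[b].qs[i] = static_cast<std::int8_t>(std::lround(src[i] * id));
        }
    }
    return out;
}

// Expand blocks back to F32, the form the draft graph input expects.
std::vector<float> dequantize_row(const std::vector<BlockQ8> & blocks) {
    std::vector<float> y(blocks.size() * kBlock);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        for (std::size_t i = 0; i < kBlock; ++i) {
            y[b * kBlock + i] = blocks[b].d * static_cast<float>(blocks[b].qs[i]);
        }
    }
    return y;
}
```

Because the stored bytes are blocks rather than raw floats, a consumer cannot alias them as an F32 view. That is consistent with the PR disabling the direct mirror view for non-F32 storage and making snapshot/restore dtype-aware.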

Notes

Local HumanEval A/B was run on dual Pro VII / gfx906 with Qwen3.6-27B Q4_K_M target, Qwen3.6 DFlash Q8_0 draft, HIP same-backend target layer split, local DFlash draft, DDTree enabled, and a 4K runtime cap. Each dtype ran the same 10 HumanEval prompts with 128 generated tokens.

| Mirror dtype | DFlash accepted | Server decode avg | Harness output avg | Peak VRAM sample (card0 / card1) |
|---|---|---|---|---|
| f32 | 1134/1968, 57.6% | 20.83 tok/s | 22.72 tok/s | 64% / 56% |
| bf16 | 1135/1968, 57.7% | 20.60 tok/s | 22.46 tok/s | 63% / 56% |
| f16 | 1129/2000, 56.5% | 20.16 tok/s | 21.97 tok/s | 63% / 56% |
| q8_0 | 1133/1984, 57.1% | 20.21 tok/s | 22.02 tok/s | 62% / 56% |

Q8_0 did not show the acceptance collapse we were concerned about in this HumanEval prompt run. Its acceptance stayed in the same range as F32/BF16/F16, with a small decode-speed cost. The VRAM sampler on this machine reports percent-level readings, so the sampled peak is coarse; the exact mirror allocation reduction is the formula above.

cubic-dev-ai Bot (Contributor) left a comment

1 issue found across 3 files

Comment thread server/src/common/dflash_feature_ring.cpp Outdated
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
PR Luce-Org#309 introduced configurable feature-mirror storage types. Keep the default F32 mirror on the previous CUDA BF16-to-F32 conversion path so the unchanged configuration does not regress through a host round trip.
@weicj weicj force-pushed the experiment-dflash-feature-dtype branch from ad5ac25 to ea6ac48 Compare May 29, 2026 16:34
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record the 2026-05-29 14:03 cron pass, confirming Luce-Org#309/Luce-Org#310 and all other current included PR heads remain ancestors of easel/auto-integration. Re-probe the remaining old non-draft PRs from the current integration tip and record their conflict sets and retained worktrees.