feat(dflash): reduce feature mirror memory with dtype policy by weicj · Pull Request #309 · Luce-Org/lucebox-hub

weicj · 2026-05-29T16:11:11Z

Summary

This PR adds a dtype policy for the DFlash draft-side feature mirror to reduce DFlash OOM risk while keeping the existing draft graph contract unchanged.

DFlash mirrors captured target hidden states on the draft side so the draft model can propose tokens from real target features. That mirror was always F32, which makes it a direct memory pressure point for DFlash, including target layer-split DFlash. This PR keeps F32 as the default and adds optional DFLASH_FEATURE_DTYPE=f16|bf16|q8_0 storage for the mirror. Rows are converted back to F32 before feeding target_hidden_cat, so the draft graph input contract remains unchanged.
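A minimal sketch of how such a dtype policy could be resolved from the environment. The function name `resolve_feature_dtype` is hypothetical. Case-insensitive matching and raising on unknown values are assumptions, since the PR text does not say how the server treats invalid input. Only the variable name, the four accepted values, and the F32 default come from the description above.

```python
import os

# Values listed in the PR description; "f32" is the documented default.
ALLOWED_DTYPES = ("f32", "f16", "bf16", "q8_0")


def resolve_feature_dtype(env=None):
    """Return the mirror storage dtype selected by DFLASH_FEATURE_DTYPE.

    `env` is any mapping of environment variables; it defaults to os.environ.
    Unset or empty means the default F32 mirror.
    """
    env = os.environ if env is None else env
    raw = env.get("DFLASH_FEATURE_DTYPE", "").strip().lower()
    if not raw:
        return "f32"
    if raw not in ALLOWED_DTYPES:
        # Assumption: reject unknown values instead of silently falling back.
        raise ValueError(
            f"DFLASH_FEATURE_DTYPE={raw!r} is not one of {ALLOWED_DTYPES}"
        )
    return raw


print(resolve_feature_dtype({}))
print(resolve_feature_dtype({"DFLASH_FEATURE_DTYPE": "bf16"}))
```

Passing the environment as a mapping keeps the sketch testable without mutating the process environment.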

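For the Qwen3.6-27B path, the directly reduced allocation is cap * captured_feature_width * bytes_per_element, where bytes_per_element is 4 for F32, 2 for F16/BF16, and 34/32 for Q8_0 (each 32-element block stores a 2-byte scale plus 32 one-byte values). At the full 256K context cap (cap=262144) and captured_feature_width=25600: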

F32 mirror: about 25.0 GiB
F16/BF16 mirror: about 12.5 GiB, saving about 12.5 GiB
Q8_0 mirror: about 6.6 GiB, saving about 18.4 GiB versus F32
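The figures above can be reproduced with a short calculation. This is a sketch with a hypothetical helper, `mirror_bytes`, not the PR's code. It assumes the ggml Q8_0 layout of 34 bytes per 32-element block and takes `cap` and `captured_feature_width` from the description.

```python
GIB = 1024 ** 3

# dtype -> (bytes per block, elements per block)
LAYOUT = {
    "f32": (4, 1),
    "f16": (2, 1),
    "bf16": (2, 1),
    "q8_0": (34, 32),  # 2-byte fp16 scale + 32 int8 values
}


def mirror_bytes(cap, width, dtype):
    """Bytes needed to store `cap` rows of `width` features in `dtype`."""
    block_bytes, block_elems = LAYOUT[dtype]
    # Block-quantized rows must be a whole number of blocks.
    assert width % block_elems == 0, "row width must be a multiple of the block size"
    row_bytes = (width // block_elems) * block_bytes
    return cap * row_bytes


cap, width = 262144, 25600
f32_size = mirror_bytes(cap, width, "f32")
for dtype in LAYOUT:
    size = mirror_bytes(cap, width, dtype)
    saved = f32_size - size
    print(f"{dtype:5s} {size / GIB:6.2f} GiB  (saves {saved / GIB:5.2f} GiB vs f32)")
```

With these inputs the sizes come out to 25.00, 12.50, 12.50, and 6.64 GiB, which matches the rounded values listed above.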

Changes

Add DFLASH_FEATURE_DTYPE=f32|f16|bf16|q8_0 for the DFlash draft feature mirror.
Keep default behavior as F32.
Store the draft-side feature mirror as F16, BF16, or Q8_0 when requested.
Use ggml row sizing and type traits for dtype-aware mirror storage, including Q8_0 block layout.
Convert mirror rows back to F32 before feeding target_hidden_cat.
Disable direct mirror view for non-F32 mirror storage because the draft graph still expects F32 input.
Make Qwen35 target-layer-split draft feature snapshot/restore dtype-aware, so prefix-cache restore does not read or write reduced-precision mirror storage as raw F32.
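To illustrate the store-then-convert-back step for the Q8_0 case, here is a pure-Python sketch of a Q8_0 row round trip. It follows the ggml reference block layout (fp16 scale `d = amax / 127` followed by 32 int8 values), but the function names are hypothetical and this is not the PR's implementation, which uses ggml row sizing and type traits. Python's `round` rounds ties to even, whereas the C reference uses `roundf`, so exact tie cases can differ by one quantization step.

```python
import struct

QK8_0 = 32                 # elements per Q8_0 block
BLOCK_FMT = "<e32b"        # little-endian fp16 scale + 32 signed bytes
BLOCK_BYTES = struct.calcsize(BLOCK_FMT)  # 34


def quantize_row_q8_0(row):
    """Pack a list of floats into Q8_0 blocks (mirror storage form)."""
    assert len(row) % QK8_0 == 0, "row length must be a multiple of 32"
    out = bytearray()
    for start in range(0, len(row), QK8_0):
        block = row[start:start + QK8_0]
        amax = max(abs(v) for v in block)
        d = amax / 127.0
        inv = 1.0 / d if d else 0.0
        qs = [int(round(v * inv)) for v in block]
        out += struct.pack(BLOCK_FMT, d, *qs)
    return bytes(out)


def dequantize_row_q8_0(buf):
    """Expand Q8_0 blocks back to floats (the form the draft graph expects)."""
    row = []
    for off in range(0, len(buf), BLOCK_BYTES):
        d, *qs = struct.unpack_from(BLOCK_FMT, buf, off)
        row.extend(d * q for q in qs)
    return row


features = [i / 10 - 3.0 for i in range(64)]   # two blocks of sample values
stored = quantize_row_q8_0(features)
restored = dequantize_row_q8_0(stored)
max_err = max(abs(a - b) for a, b in zip(features, restored))
print(len(stored), "bytes stored,", "max abs error", max_err)
```

The stored row takes 68 bytes instead of the 256 bytes the same 64 values would need as F32, and the restored values differ from the originals by at most roughly half a quantization step per block.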

Notes

Local HumanEval A/B was run on dual Pro VII / gfx906 with Qwen3.6-27B Q4_K_M target, Qwen3.6 DFlash Q8_0 draft, HIP same-backend target layer split, local DFlash draft, DDTree enabled, and a 4K runtime cap. Each dtype ran the same 10 HumanEval prompts with 128 generated tokens.

Mirror dtype	DFlash accepted	Server decode avg	Harness output avg	Peak VRAM sample card0/card1
f32	1134/1968, 57.6%	20.83 tok/s	22.72 tok/s	64% / 56%
bf16	1135/1968, 57.7%	20.60 tok/s	22.46 tok/s	63% / 56%
f16	1129/2000, 56.5%	20.16 tok/s	21.97 tok/s	63% / 56%
q8_0	1133/1984, 57.1%	20.21 tok/s	22.02 tok/s	62% / 56%

Q8_0 did not show the acceptance collapse we were concerned about in this HumanEval prompt run. Its acceptance stayed in the same range as F32/BF16/F16, with a small decode-speed cost. The VRAM sampler on this machine reports percent-level readings, so the sampled peak is coarse; the exact mirror allocation reduction is the formula above.
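The acceptance percentages in the table can be recomputed from the raw counts. The sketch below reads each `a/b` entry as accepted over proposed draft tokens, which is an assumption about the column's meaning. The helper name is hypothetical.

```python
# (accepted, total) counts copied from the "DFlash accepted" column.
runs = {
    "f32": (1134, 1968),
    "bf16": (1135, 1968),
    "f16": (1129, 2000),
    "q8_0": (1133, 1984),
}


def acceptance_pct(accepted, total):
    """Acceptance rate as a percentage."""
    return 100.0 * accepted / total


baseline = acceptance_pct(*runs["f32"])
deltas = {}
for dtype, (accepted, total) in runs.items():
    pct = acceptance_pct(accepted, total)
    deltas[dtype] = pct - baseline
    print(f"{dtype:5s} {pct:5.1f}%  ({deltas[dtype]:+.2f} points vs f32)")
```

On these counts every reduced-precision mirror stays within about 1.2 percentage points of the F32 baseline, with Q8_0 about 0.5 points below it.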

cubic-dev-ai

1 issue found across 3 files

PR Luce-Org#309 introduced configurable feature-mirror storage types. Keep the default F32 mirror on the previous CUDA BF16-to-F32 conversion path so the unchanged configuration does not regress through a host round trip.

Record the 2026-05-29 14:03 cron pass, confirming Luce-Org#309/Luce-Org#310 and all other current included PR heads remain ancestors of easel/auto-integration. Re-probe the remaining old non-draft PRs from the current integration tip and record their conflict sets and retained worktrees.

cubic-dev-ai Bot reviewed May 29, 2026

Comment thread server/src/common/dflash_feature_ring.cpp Outdated

feat(dflash): reduce feature mirror memory with dtype policy

ea6ac48

weicj force-pushed the experiment-dflash-feature-dtype branch from ad5ac25 to ea6ac48 Compare May 29, 2026 16:34

weicj mentioned this pull request May 29, 2026

feat(server): reduce layer-split activation memory with backend precision policy #310

Open

weicj wants to merge 1 commit into Luce-Org:main from weicj:experiment-dflash-feature-dtype
