perf(qwen35moe): pool-sized KV reservation keeps experts hot at high max_ctx by dusterbloom · Pull Request #428 · Luce-Org/lucebox-hub

dusterbloom · 2026-06-20T15:02:36Z

What

At high --max-ctx, the qwen35moe (Qwen3.6-35B-A3B hybrid SSM-MoE) expert
placement reserved KV for the full max_ctx, forcing experts cold even when
KVFlash bounds resident KV to a small pool. This reserves only pool-sized KV
when KVFlash is active, so experts stay hot.

Why it matters

Real claude-code client, RTX 3090 24 GB, Q3_K_M, --max-ctx 131072:

metric	cold cliff (before)	all-hot (after)
cold experts	2203	0
cold prefill (26.8K tok)	~108 s	16.1 s
decode	~43 tok/s	66 tok/s

Shape

New server/src/common/kvflash_placement.h — architecture-agnostic placement
decision; the next MoE backend (DS4/gemma4) can reuse the same reservation rule.
~280 LOC. Tests: test_kvflash_placement.cpp (5-case CPU unit) +
test_kvflash_moe_placement.sh (GPU, 2203 → 0 cold experts).

Base of the KVFlash-MoE series: placement → pager-serde → prefill-snapshot.

@131072

…max_ctx

The MoE expert placement reserved KV for max_ctx (10 GiB @131072) even with --kvflash, which forced experts cold and made the pool pure overhead. Reserve for the resident pool instead when the full reservation would force experts cold, so experts stay hot at high max_ctx; this decouples max_ctx from the expert-placement cliff.

A post-init gate disables KVFlash when it is redundant (full KV already fits with all experts hot). The gate is keyed on all-hot-with-full-KV, so it never disables a pool that is itself keeping experts hot.

The rule is a shared pure helper (common/kvflash_placement.h) so future MoE backends inherit it. Tests: a unit test (5 cases, no GPU) and a hardware-gated integration test (RTX 3090: 2203 cold -> 0 cold at max_ctx 131072, decode 43 -> 66 tok/s).
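The reservation rule and the post-init gate described in this commit message can be sketched as a small pure function. This is only an illustration, not the contents of common/kvflash_placement.h: the struct names, field names, and function names are hypothetical, the sketch assumes KV cost scales linearly with tokens (10 GiB at 131072 tokens gives 81920 bytes per token), and the 20 GiB budget and 16 GiB expert footprint in the example are made-up figures.

```cpp
#include <cassert>
#include <cstdint>

// All names here are hypothetical; the real helper may be shaped differently.
struct PlacementInput {
    std::uint64_t vram_budget_bytes;   // VRAM shared by KV and experts
    std::uint64_t expert_bytes_total;  // bytes needed to keep every expert hot
    std::uint64_t kv_bytes_per_token;  // assumed linear KV cost per token
    std::uint64_t max_ctx;             // value of --max-ctx
    std::uint64_t pool_tokens;         // resident KVFlash pool; 0 means KVFlash is off
};

struct PlacementDecision {
    std::uint64_t kv_reserve_bytes;  // KV reservation fed into expert placement
    bool all_hot_with_full_kv;       // would all experts fit next to full-context KV?
    bool kvflash_redundant;          // post-init gate: KVFlash can be disabled
};

// True when kv_bytes of KV plus every expert fit inside the VRAM budget.
constexpr bool fits_all_hot(const PlacementInput& in, std::uint64_t kv_bytes) {
    return kv_bytes <= in.vram_budget_bytes &&
           in.expert_bytes_total <= in.vram_budget_bytes - kv_bytes;
}

constexpr PlacementDecision decide_placement(const PlacementInput& in) {
    const std::uint64_t full_kv = in.kv_bytes_per_token * in.max_ctx;
    const std::uint64_t pool_kv = in.kv_bytes_per_token * in.pool_tokens;
    const bool kvflash_on = in.pool_tokens > 0;
    const bool all_hot_full = fits_all_hot(in, full_kv);
    // Reserve only the pool when KVFlash is on and the full reservation
    // would push experts cold.
    const bool use_pool = kvflash_on && !all_hot_full && pool_kv < full_kv;
    // The gate is keyed on all-hot-with-full-KV, so a pool that is what keeps
    // experts hot is never reported as redundant.
    return PlacementDecision{use_pool ? pool_kv : full_kv, all_hot_full,
                             kvflash_on && all_hot_full};
}

// Worked example with illustrative budget and expert sizes.
inline void placement_example() {
    const std::uint64_t GiB = 1ull << 30;
    const PlacementInput high_ctx{20 * GiB, 16 * GiB, 81920, 131072, 16384};
    const PlacementDecision d = decide_placement(high_ctx);
    assert(d.kv_reserve_bytes == 16384ull * 81920ull);  // 1.25 GiB, not 10 GiB
    assert(!d.all_hot_with_full_kv);
    assert(!d.kvflash_redundant);
}
```

Keeping the decision free of GPU calls is what makes a CPU-only unit test possible: each case is a plain struct in and a plain struct out.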

cubic-dev-ai

1 issue found across 8 files


Placement called kvflash_pool_from_env(max_context) with default args, which takes the no-budget fallback (max_ctx/2). The runtime sizes the same pool with the real VRAM budget + scorer policy and gets a speed-capped value (e.g. 16384). With DFLASH_KVFLASH=auto at high max_ctx this over-reserved KV by ~4x, under-budgeting experts and reducing hot placement.

Fix: extract make_kvflash_budget() + kvflash_scorer_expected() and call them from both sites so reservation and runtime allocation size the pool identically. Add a pure unit test pinning the budgeted vs no-budget divergence.
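The divergence this commit pins can be reproduced with a few lines of arithmetic. The sketch below is hypothetical: `KvflashBudget` and `size_kvflash_pool` are stand-ins, not the real make_kvflash_budget() or kvflash_pool_from_env() signatures. The budgeted branch (VRAM divided by bytes per token, then capped) is an assumed policy, and the 4 GiB pool budget is illustrative. Only the max_ctx/2 fallback and the 16384 example value come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical budget description; zero fields mean "not provided".
struct KvflashBudget {
    std::uint64_t vram_bytes;          // VRAM available for the resident pool
    std::uint64_t kv_bytes_per_token;  // assumed linear KV cost per token
    std::uint64_t speed_cap_tokens;    // cap implied by the scorer policy; 0 = none
};

// Size the resident pool in tokens. Without a budget, fall back to max_ctx/2.
constexpr std::uint64_t size_kvflash_pool(std::uint64_t max_ctx,
                                          const KvflashBudget& b) {
    if (b.vram_bytes == 0 || b.kv_bytes_per_token == 0) {
        return max_ctx / 2;  // no-budget fallback from the commit message
    }
    std::uint64_t tokens = std::min(max_ctx, b.vram_bytes / b.kv_bytes_per_token);
    if (b.speed_cap_tokens != 0) {
        tokens = std::min(tokens, b.speed_cap_tokens);
    }
    return tokens;
}

// Placement (default args) and runtime (real budget) disagree at max_ctx 131072.
inline void pool_divergence_example() {
    const std::uint64_t max_ctx = 131072;
    const std::uint64_t placement = size_kvflash_pool(max_ctx, KvflashBudget{});
    const std::uint64_t runtime =
        size_kvflash_pool(max_ctx, KvflashBudget{4ull << 30, 81920, 16384});
    assert(placement == 65536);          // max_ctx / 2
    assert(runtime == 16384);            // speed-capped
    assert(placement / runtime == 4);    // the ~4x over-reservation
}
```

Routing both call sites through one sizing function, as the commit does with the extracted helpers, removes the possibility of the two values drifting apart again.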

dusterbloom mentioned this pull request Jun 20, 2026

feat(qwen35moe): pooled chunked prefill + snapshot/restore over KVFlash #430

Open

cubic-dev-ai Bot reviewed Jun 20, 2026


Comment thread server/src/qwen35moe/qwen35moe_backend.cpp Outdated

dusterbloom wants to merge 2 commits into Luce-Org:main from dusterbloom:pr/kvflash-moe-placement
