perf(qwen35moe): pool-sized KV reservation keeps experts hot at high max_ctx#428

Open
dusterbloom wants to merge 2 commits into Luce-Org:main from dusterbloom:pr/kvflash-moe-placement

@dusterbloom dusterbloom commented Jun 20, 2026

Collaborator

What

At high --max-ctx, the qwen35moe (Qwen3.6-35B-A3B hybrid SSM-MoE) expert
placement reserved KV for the full max_ctx, forcing experts cold even when
KVFlash bounds resident KV to a small pool. This reserves only pool-sized KV
when KVFlash is active, so experts stay hot.

Why it matters

Real claude-code client, RTX 3090 24 GB, Q3_K_M, --max-ctx 131072:

| metric | cold cliff (before) | all-hot (after) |
| --- | --- | --- |
| cold experts | 2203 | 0 |
| cold prefill (26.8K tok) | ~108 s | 16.1 s |
| decode | ~43 tok/s | 66 tok/s |

Shape

  • New server/src/common/kvflash_placement.h — architecture-agnostic placement
    decision; the next MoE backend (DS4/gemma4) can reuse the same reservation rule.
  • ~280 LOC. Tests: test_kvflash_placement.cpp (5-case CPU unit) +
    test_kvflash_moe_placement.sh (GPU, 2203 → 0 cold experts).

Base of the KVFlash-MoE series: placement → pager-serde → prefill-snapshot.

…max_ctx

The MoE expert placement reserved KV for the full max_ctx (10 GiB @131072) even with
--kvflash, which forced experts cold and made the pool pure overhead. Reserve for the
resident pool instead when the full reservation would force experts cold, so
experts stay hot at high max_ctx (decouples max_ctx from the expert-placement
cliff). A post-init gate disables KVFlash when it is redundant (full KV already
fits all experts hot), keyed on all-hot-with-full-KV so it never disables a pool
that is itself keeping experts hot.

The rule is a shared pure helper (common/kvflash_placement.h) so future MoE
backends inherit it. Unit test (5 cases, no GPU) + hardware-gated integration
test (RTX 3090: 2203 cold -> 0 cold @max_ctx 131072, decode 43->66 tok/s).
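
A minimal sketch of the reservation rule and the redundancy gate described above, under stated assumptions: every struct, field, and function name below is hypothetical and is not taken from common/kvflash_placement.h. It also assumes KV cost scales linearly with context length, and it folds the post-init gate into the same pure function for brevity, whereas the commit describes the gate as a separate step.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: hypothetical names, not the API of kvflash_placement.h.
struct PlacementInput {
    std::uint64_t vram_budget_bytes;   // VRAM shared by experts and KV
    std::uint64_t expert_bytes_total;  // bytes needed to keep every expert hot
    std::uint64_t kv_bytes_per_token;  // assumed linear KV cost per token
    std::uint64_t max_ctx;             // value of --max-ctx
    std::uint64_t pool_tokens;         // KVFlash resident pool; 0 means KVFlash is off
};

struct PlacementDecision {
    std::uint64_t kv_reserve_bytes;  // KV to set aside before placing experts
    bool all_hot_with_full_kv;       // would a full max_ctx reservation keep all experts hot?
    bool keep_kvflash;               // false when KVFlash is off or redundant
};

// True when the KV reservation and all experts fit in the budget together.
inline bool all_experts_fit(std::uint64_t budget, std::uint64_t experts,
                            std::uint64_t kv) {
    return kv <= budget && experts <= budget - kv;
}

inline PlacementDecision decide_placement(const PlacementInput& in) {
    assert(in.max_ctx > 0 && in.kv_bytes_per_token > 0);
    const std::uint64_t full_kv = in.kv_bytes_per_token * in.max_ctx;
    const std::uint64_t pool_tok =
        in.pool_tokens < in.max_ctx ? in.pool_tokens : in.max_ctx;
    const std::uint64_t pool_kv = in.kv_bytes_per_token * pool_tok;
    const bool all_hot_full = all_experts_fit(
        in.vram_budget_bytes, in.expert_bytes_total, full_kv);
    const bool kvflash_on = in.pool_tokens > 0;

    // KVFlash off, or full KV already fits with all experts hot: reserve the
    // full max_ctx and treat the pool as redundant.
    if (!kvflash_on || all_hot_full) {
        return PlacementDecision{full_kv, all_hot_full, false};
    }
    // Full reservation would force experts cold: reserve only the pool and
    // keep KVFlash, since the pool is what keeps experts hot.
    return PlacementDecision{pool_kv, all_hot_full, true};
}
```

Because the gate is keyed on `all_hot_with_full_kv` rather than on the observed cold-expert count, a configuration that is all-hot only thanks to the pool-sized reservation still keeps KVFlash enabled. Under the linear assumption, the 10 GiB @131072 figure above corresponds to 80 KiB of KV per token.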

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor

1 issue found across 8 files

Comment thread server/src/qwen35moe/qwen35moe_backend.cpp Outdated
Placement called kvflash_pool_from_env(max_context) with default args, taking
the no-budget fallback (max_ctx/2). Runtime sizes the same pool with the real
VRAM budget + scorer policy, getting a speed-capped value (e.g. 16384). On
DFLASH_KVFLASH=auto at high max_ctx this over-reserved KV ~4x, under-budgeting
experts and reducing hot placement.

Extract make_kvflash_budget() + kvflash_scorer_expected() and call them from
both sites so reservation and runtime allocation size the pool identically.
Add a pure unit test pinning the budgeted vs no-budget divergence.
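
The divergence this commit fixes can be illustrated with a small sketch. The names `SketchBudget` and `sketch_pool_tokens` are hypothetical stand-ins, not the signatures of kvflash_pool_from_env(), make_kvflash_budget(), or kvflash_scorer_expected(). The only behaviour taken from the description above is the no-budget fallback of max_ctx/2 and a speed-capped budgeted result; taking the minimum of the VRAM limit and the cap is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the budget that both call sites should share.
struct SketchBudget {
    bool has_budget;                 // false reproduces the default-argument path
    std::uint64_t vram_tokens;       // tokens of KV the VRAM budget can hold (assumed input)
    std::uint64_t speed_cap_tokens;  // cap implied by the scorer policy (assumed input)
};

// Size the resident pool. Without a budget this falls back to max_ctx / 2;
// with one, the pool is bounded by max_ctx, the VRAM limit, and the speed cap.
inline std::uint64_t sketch_pool_tokens(
    std::uint64_t max_ctx,
    const SketchBudget& b = SketchBudget{false, 0, 0}) {
    assert(max_ctx > 0);
    if (!b.has_budget) {
        return max_ctx / 2;
    }
    return std::min({max_ctx, b.vram_tokens, b.speed_cap_tokens});
}
```

With max_ctx 131072, the default-argument call yields 65536 tokens, while a budget whose speed cap is 16384 yields 16384, which is the roughly 4x over-reservation noted above. Passing the same budget object from both placement and runtime removes that gap by construction.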