perf(qwen35moe): pool-sized KV reservation keeps experts hot at high max_ctx #428
Open
dusterbloom wants to merge 2 commits into
…max_ctx

The MoE expert placement reserved KV for the full max_ctx (10 GiB at 131072) even with --kvflash. That forced experts cold, so the pool was pure overhead. Reserve for the resident pool instead when the full reservation would force experts cold, so that experts stay hot at high max_ctx. This decouples max_ctx from the expert-placement cliff.

A post-init gate disables KVFlash when it is redundant, that is, when the full KV reservation already fits with all experts hot. The gate is keyed on all-hot-with-full-KV, so it never disables a pool that is itself keeping experts hot.

The rule is a shared pure helper (common/kvflash_placement.h), so future MoE backends inherit it.

Tests: a unit test (5 cases, no GPU) and a hardware-gated integration test (RTX 3090: 2203 cold -> 0 cold experts at max_ctx 131072, decode 43 -> 66 tok/s).
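The reservation rule and the post-init gate described in this commit message can be sketched as a single pure function. This is only an illustration under stated assumptions: `PlacementInput`, `PlacementDecision`, `fits_all_hot` and `decide_kv_reservation` are hypothetical names, not the contents of common/kvflash_placement.h, and the real helper may weigh more inputs than these byte counts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical inputs; the PR's real helper may use different fields.
struct PlacementInput {
    uint64_t vram_budget_bytes;  // VRAM shared by the KV cache and expert weights
    uint64_t full_kv_bytes;      // KV sized for the whole max_ctx
    uint64_t pool_kv_bytes;      // KV sized for the resident KVFlash pool
    uint64_t all_experts_bytes;  // bytes needed to keep every expert hot
    bool     kvflash_enabled;
};

struct PlacementDecision {
    uint64_t kv_reserve_bytes;   // KV bytes the expert placement should set aside
    bool     kvflash_redundant;  // true when the post-init gate may disable KVFlash
};

// True when `kv` plus all experts fit inside the budget (overflow-safe).
inline bool fits_all_hot(uint64_t budget, uint64_t kv, uint64_t experts) {
    return kv <= budget && experts <= budget - kv;
}

inline PlacementDecision decide_kv_reservation(const PlacementInput& in) {
    PlacementDecision d{in.full_kv_bytes, false};
    if (!in.kvflash_enabled) {
        return d;  // without KVFlash, reserve for the full max_ctx as before
    }
    if (fits_all_hot(in.vram_budget_bytes, in.full_kv_bytes, in.all_experts_bytes)) {
        d.kvflash_redundant = true;  // full KV already leaves every expert hot
        return d;
    }
    // The full reservation would force experts cold: reserve for the pool only.
    d.kv_reserve_bytes = std::min(in.pool_kv_bytes, in.full_kv_bytes);
    return d;
}
```

Keying the gate on "all hot with full KV" rather than "all hot with pool KV" is what keeps it from disabling a pool that is itself the reason the experts fit.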
1 issue found across 8 files
Placement called kvflash_pool_from_env(max_context) with default args, taking the no-budget fallback (max_ctx/2). Runtime sizes the same pool with the real VRAM budget + scorer policy, getting a speed-capped value (e.g. 16384). On DFLASH_KVFLASH=auto at high max_ctx this over-reserved KV ~4x, under-budgeting experts and reducing hot placement. Extract make_kvflash_budget() + kvflash_scorer_expected() and call them from both sites so reservation and runtime allocation size the pool identically. Add a pure unit test pinning the budgeted vs no-budget divergence.
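The divergence this commit fixes can be illustrated with a small sketch in which both call sites use one sizing function. The names `PoolBudget` and `pool_tokens` and the exact capping logic are assumptions made for illustration; only the max_ctx/2 fallback and the example value 16384 come from the commit message, and the real make_kvflash_budget() and kvflash_scorer_expected() signatures are not shown here. The sketch needs C++17 for std::optional.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical budget description; not the PR's actual struct.
struct PoolBudget {
    uint64_t kv_bytes_available;  // VRAM the runtime can spend on the pool
    uint64_t kv_bytes_per_token;  // KV footprint of one resident token
    uint32_t speed_cap_tokens;    // cap from the scorer policy (e.g. 16384)
};

// Pool size in tokens. Without a budget this takes the max_ctx/2 fallback,
// which is what the placement site used before the fix.
inline uint32_t pool_tokens(uint32_t max_ctx,
                            const std::optional<PoolBudget>& budget = std::nullopt) {
    if (!budget || budget->kv_bytes_per_token == 0) {
        return max_ctx / 2;
    }
    const uint64_t fit = budget->kv_bytes_available / budget->kv_bytes_per_token;
    const uint64_t capped = std::min<uint64_t>(fit, budget->speed_cap_tokens);
    return static_cast<uint32_t>(std::min<uint64_t>(capped, max_ctx));
}
```

With max_ctx 131072 the fallback gives 65536 tokens while a budget capped at 16384 gives 16384, which is the roughly 4x over-reservation the commit describes. Passing the same budget at both sites removes the mismatch.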
What
At high --max-ctx, the qwen35moe (Qwen3.6-35B-A3B hybrid SSM-MoE) expert placement reserved KV for the full max_ctx, forcing experts cold even when KVFlash bounds resident KV to a small pool. This reserves only pool-sized KV when KVFlash is active, so experts stay hot.
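To give a sense of scale, the following back-of-the-envelope arithmetic uses the 10 GiB at 131072 tokens figure from the commit message. It assumes that KV bytes scale linearly with resident tokens, and it borrows the 16384-token pool from the follow-up commit's example; both are illustrative assumptions, not measurements from this PR.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kGiB         = 1ull << 30;
constexpr uint64_t kFullCtx     = 131072;       // --max-ctx from the PR
constexpr uint64_t kFullKvBytes = 10 * kGiB;    // full-context reservation

// Assumption: the KV footprint is linear in the number of resident tokens.
constexpr uint64_t kBytesPerToken = kFullKvBytes / kFullCtx;

constexpr uint64_t kv_bytes_for(uint64_t tokens) {
    return tokens * kBytesPerToken;
}

static_assert(kBytesPerToken == 80 * 1024, "80 KiB of KV per token");
static_assert(kv_bytes_for(16384) == 5 * kGiB / 4, "a 16384-token pool needs 1.25 GiB");
static_assert(kFullKvBytes - kv_bytes_for(16384) == 35 * kGiB / 4,
              "8.75 GiB returned to expert placement");
```

Under these assumptions, a pool-sized reservation leaves about 8.75 GiB more VRAM for expert weights than a full max_ctx reservation.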
Why it matters
Real claude-code client, RTX 3090 24 GB, Q3_K_M, --max-ctx 131072:
Shape
server/src/common/kvflash_placement.h: architecture-agnostic placement decision; the next MoE backend (DS4/gemma4) can reuse the same reservation rule.
test_kvflash_placement.cpp (5-case CPU unit) + test_kvflash_moe_placement.sh (GPU, 2203 → 0 cold experts).
Base of the KVFlash-MoE series: placement → pager-serde → prefill-snapshot.