feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity #274
dusterbloom wants to merge 16 commits
12 issues found across 37 files
Replace size-byte heuristic at qwen3_loader.cpp:109 with the direct gguf_get_tensor_type(ctx, tensor_id) API as suggested in PR Luce-Org#274 review. Return type of gguf_find_tensor is int64_t; updated tidx declaration accordingly.
Line 301: dp50/base computes the inverse (slower = higher), should be base/dp50 so that speedup > 1 means ee7 is faster. The headline 1.82x and 9.29x numbers in PR Luce-Org#274 came from run_niah_ee7_broad.py and run_niah_ee7_longctx.py which had the correct formula; this script's printed table was not used as source data for the SUMMARY.md files.
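The corrected ratio can be sketched as follows. This is a minimal illustration, assuming `base` and `dp50` are wall-clock times in seconds for the baseline and the ee7 drafter; the function name and the sample numbers are hypothetical and not taken from the bench script.

```python
def speedup(base_s: float, ee7_s: float) -> float:
    """Baseline time divided by ee7 time, so a value > 1 means ee7 is faster."""
    if ee7_s <= 0:
        raise ValueError("ee7 time must be positive")
    return base_s / ee7_s


# Illustrative numbers only (not measurements from this PR).
print(speedup(10.0, 5.0))  # 2.0 -> ee7 takes half the time
print(speedup(5.0, 10.0))  # 0.5 -> ee7 is slower
```

Dividing the other way round (`dp50 / base`) would report 0.5 for the first case, which is the inversion the commit fixes.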
Addressed all cubic + howard0su review feedback. PR body updated with correct Q4_K_M target (was Q3_K_S, a factual error) and per-context drafter speedup table (1K-128K + per-client agentic). Howard0su's two fixes applied (gguf_get_tensor_type, passbv docstring). All P1+P2 cubic items fixed across 6 commits.
…eview) Pure helper with dflash::common namespace was mis-located under dflash/src/qwen3/. Move to dflash/src/common/ so qwen35, gemma4, and future backends can use the same scoring-range logic without reaching into the qwen3 source dir.
Why layer 7, and not some other layer? Env naming should be PFlash, not DFlash, to make it clearer that this is the prefill drafter.
Great question @davide221. I started by removing layers from layer 14 onward, and it worked. Then I pushed further and it worked at layer 7. I stopped there, but I am benchmarking right now to see if it works at layer 5 or even 3. Let's see. As for DFlash, it is a typo; agreed to rename everything to make it clear this is about prefill!
…RAFTER_* (PR Luce-Org#274 review) - These envs control the pflash drafter (prefill-flash KV compression) early-exit path, not the dflash spec-decode drafter — DFLASH_ prefix was misleading. - Pure rename across 15 files: qwen3_graph.cpp, score_range.h, all bench scripts, two historical SUMMARY docs, and the warm-path regression test. - No alias kept — breaking change by design, same scope as PFLASH_DRAFTER_SLIM. - Per davide221 review comment on PR Luce-Org#274.
1 issue found across 15 files (changes from recent commits).
…YERS in run_tier1_proof.py (PR Luce-Org#274 review) - cubic review caught: prior version set only SCORE_LAYERS so q8_l7 ran full 28 layers, invalidating the layer-subset measurement - Also clears EARLY_EXIT_N in baseline/Q8 conditions to prevent shell-env leakage - Historical tier1_proof results in 2026-05-21_tier1_proof/ are NOT being re-run — they served as the spike that discovered the scoring-clamp bug (fixed in be16529/90e6b27); preserved as historical record - Going forward, anyone re-running the bench gets correct measurements
…-7 at S%4096∈{1..7})
Root cause (codex + momus data audit):
tail_lo = S - n_lookahead; old guard: tail_lo < cs + cl
When S = k*chunk_size + r (r∈{1..7}, n_lookahead=8), tail_lo+n_lookahead
= cs + cl + r > cs + cl — ggml_view_3d reads r tokens past allocation.
Triggered deterministically at S mod 4096 ∈ {1..7}; ~0.2% of real prefills.
Fix (2 lines, both sites in qwen3_graph.cpp:463 and :516):
- if (tail_lo_nr >= cs && tail_lo_nr < cs + cl)
+ if (tail_lo_nr >= cs && tail_lo_nr + n_lookahead <= cs + cl)
- if (tail_lo >= cs && tail_lo < cs + cl)
+ if (tail_lo >= cs && tail_lo + n_lookahead <= cs + cl)
Semantics: skip tail-capture when n_lookahead tokens won't fit inside chunk.
Unit test (test_drafter_tail_capture_guard) GREEN; all existing tests PASS.
Resolves the 64K/128K NIAH crash-failures diagnosed in PR Luce-Org#274 review.
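The arithmetic behind the guard change can be checked with a small model. This is a sketch under stated assumptions: chunks are taken to cover `[cs, cs + cl)` with `cl = min(chunk_size, S - cs)`, and the helper names (`old_guard`, `new_guard`, `overrun`) are hypothetical stand-ins for the two conditions in the diff, not the code in qwen3_graph.cpp.

```python
CHUNK_SIZE = 4096   # from the "S % 4096" condition in the commit message
N_LOOKAHEAD = 8     # from the commit message


def old_guard(tail_lo, cs, cl):
    # Only checks that the tail START lies inside the chunk.
    return tail_lo >= cs and tail_lo < cs + cl


def new_guard(tail_lo, cs, cl, n_lookahead=N_LOOKAHEAD):
    # Also requires the whole n_lookahead window to END inside the chunk.
    return tail_lo >= cs and tail_lo + n_lookahead <= cs + cl


def overrun(S, guard):
    """Tokens a tail view would read past its chunk (0 if none, or if skipped)."""
    tail_lo = S - N_LOOKAHEAD
    worst = 0
    for cs in range(0, S, CHUNK_SIZE):
        cl = min(CHUNK_SIZE, S - cs)          # assumed chunking scheme
        if guard(tail_lo, cs, cl):
            worst = max(worst, tail_lo + N_LOOKAHEAD - (cs + cl))
    return worst


# Sweep every remainder r for S = 2 * CHUNK_SIZE + r.
bad_old = [r for r in range(CHUNK_SIZE) if overrun(2 * CHUNK_SIZE + r, old_guard) > 0]
bad_new = [r for r in range(CHUNK_SIZE) if overrun(2 * CHUNK_SIZE + r, new_guard) > 0]
print(bad_old)  # [1, 2, 3, 4, 5, 6, 7]
print(bad_new)  # []
```

Under this model the old condition overruns by exactly `r` tokens for `r` in 1..7, matching the commit message, while the new condition skips tail capture in those cases.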
- NIAH 3/3 at 32K/64K/128K for all four conditions; zero ggml_view_3d asserts
- ee3 drafter speedup: 6.9x@32K, 7.6x@64K, 24.3x@128K vs baseline
- Multi-client drafter_fwd mean: ee3=6.2x vs baseline, 1.84x vs ee7
- accept_rate unavailable (binary lacks bandit feature); wall_s for claude_code: ee3=1.08s vs ee7=1.29s
- Decision: ee3 passes all measurable gates; propose as new default after Luce-Org#274 merges
1 issue found across 66 files (changes from recent commits).
…cleanup) — runner scripts stay User directive: PR Luce-Org#274 ships ee7 source only; bench evidence not committed. Runner scripts in dflash/bench/*.py and dflash/bench/results/*/run_bench*.sh preserved for reproducibility. Result dirs under dflash/bench/results/ removed (SUMMARY.md, raw_results.json/jsonl, case*.json, vram.csv). Equivalent results can be regenerated via the kept scripts.
6 issues found
<file name="dflash/bench/run_ee_n_sweep_multihop.py">
<violation number="1" location="dflash/bench/run_ee_n_sweep_multihop.py:30">
P2: Context sweep parameter `ctx` is passed to `start_server` but ignored; `--max-ctx` is hardcoded to 139264, making all benchmark runs use the same allocation regardless of labeled context size</violation>
</file>
<file name="dflash/src/qwen3/anchor_scan.cpp">
<violation number="1" location="dflash/src/qwen3/anchor_scan.cpp:105">
P2: `prev_forced` is refreshed at the top of every transitive iteration, making `max_iters > 1` ineffective and preventing multi-iteration cascade expansion. For `it > 0`, `forced[c] && !prev_forced[c]` is always false because `prev_forced = forced` was just executed, so the rare-token worklist is empty and the loop breaks at `forced == prev_forced` before any iteration-1+ work can happen. The intended behavior requires carrying the pre-bottom-scan snapshot from the previous iteration so that newly-forced chunks from `scan_and_force` can be processed in the next rare-token pass.</violation>
</file>
dflash/bench/run_ee_n_sweep_multihop.py, line 30:
    CONDITION_SPECS = {"ee7": (7, 7)}  # (EARLY_EXIT_N, SCORE_LAYERS)

    def start_server(condition, ctx, log_path, compression_mode="always", keep_ratio=0.05):
        # server accepts off|auto|always; allow "none" as alias for "off"
        srv_compression = "off" if compression_mode == "none" else compression_mode
P2: Context sweep parameter ctx is passed to start_server but ignored; --max-ctx is hardcoded to 139264, so all benchmark runs use the same allocation regardless of the labeled context size.
dflash/src/qwen3/anchor_scan.cpp, line 105:
    // Cascade loop: expand pool with newly-forced tokens and re-scan.
    std::vector<uint8_t> prev_forced;
    for (int it = 0; it < max_iters; ++it) {
        prev_forced = forced;

        // Rare-token single-match: worklist-driven so cascades within a pass are
P2: prev_forced is refreshed at the top of every transitive iteration, making max_iters > 1 ineffective and preventing multi-iteration cascade expansion. For it > 0, forced[c] && !prev_forced[c] is always false because prev_forced = forced was just executed, so the rare-token worklist is empty and the loop breaks at forced == prev_forced before any iteration-1+ work can happen. The intended behavior requires carrying the pre-bottom-scan snapshot from the previous iteration so that newly-forced chunks from scan_and_force can be processed in the next rare-token pass.
The committed CONDITION_SPECS in 0e55f48 picked up a temporary {"ee7": (14, 14)} override that was set during an ee14 sweep bench run, instead of the intended {"ee7": (7, 7)}. Restores ee7 as the default condition value. Caught by cubic-dev-ai inline review on PR Luce-Org#274.
@cubic-dev-ai re: Looked at this carefully: the implementation handles the case you're describing, just differently than the diff suggests. Two pieces: …
So the rare-token closure transitively exhausts in a single outer iteration by design. The purpose of T6 in …
Not a bug; resolving as by design. Will add a clarifying inline comment if the area is touched again.
@dusterbloom You're right: this is by design, not a bug. T6 is the definitive proof. The two mechanisms you describe cover the concern fully: …
The …
If T6 passes with a 20-link chain and a cap of 5, the implementation is sound. No bug here.
- Rename ttft_s → latency_s in run_longbench_hotpotqa.py (stream=False measures total completion time, not TTFT) - Same rename in run_ee_n_sweep_multihop.py (identical stream=False pattern found in audit) - Add empty-input guard in run_longbench_hotpotqa.py (sys.exit on empty cases) - Drop unused ctx parameter from run_ee_n_sweep_multihop.py start_server Cubic flags P2 quality issues from inline review on commit 0e55f48.
…s on Luce-Org#274) PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 is the production default after ee_n sweep: 6.9x@32K, 24.3x@128K, accept_rate +1.2 pp vs ee7. Reproduce via dflash/bench/run_ee_n_sweep.sh + run_ee_n_multiclient.sh.
1ffc8e6 to 7810648
Per-session ε-greedy bandit that adjusts compression based on observed accept_rate. Opt-in via session_id; clients without it get the existing fixed-keep path, byte-identical to main. Includes: - Bandit state machine (LRU-bounded session map, cap 1024) - HTTP server session_id parsing + bandit hook - accept_rate plumbing from DFlash GenerateResult - CI submodule PAT fix for fork PRs - Harness session_id env-var wiring 5-turn trajectory + NIAH @16K/32K + 3-seed A/B/C evidence (reproducible via the follow-up bench PR; not committed here). Bench scripts + result artifacts split to follow-up PR. Bug Luce-Org#42 tail-capture fix moved to PR Luce-Org#274.
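The shape of a per-session ε-greedy bandit with an LRU-bounded session map can be sketched as below. This is not the PR's state machine: the arm grid, the ε value, the use of accept_rate directly as the reward, and the class and method names are all assumptions made for illustration. Only the cap of 1024 and the idea of adjusting compression from observed accept_rate come from the commit message.

```python
import random
from collections import OrderedDict


class SessionBandits:
    """Per-session epsilon-greedy bandits stored in an LRU-bounded map."""

    def __init__(self, arms=(0.025, 0.05, 0.10, 0.20), epsilon=0.1, cap=1024, seed=0):
        self.arms = arms          # candidate keep ratios (hypothetical grid)
        self.epsilon = epsilon    # exploration probability (hypothetical value)
        self.cap = cap            # max tracked sessions; 1024 per the commit message
        self._rng = random.Random(seed)
        self._sessions = OrderedDict()  # session_id -> [[pulls, mean_reward], ...]

    def _state(self, session_id):
        if session_id in self._sessions:
            self._sessions.move_to_end(session_id)      # mark as most recently used
        else:
            self._sessions[session_id] = [[0, 0.0] for _ in self.arms]
            if len(self._sessions) > self.cap:
                self._sessions.popitem(last=False)      # evict least recently used
        return self._sessions[session_id]

    def choose(self, session_id):
        """Return the index of the arm to use for this turn."""
        state = self._state(session_id)
        if self._rng.random() < self.epsilon:
            return self._rng.randrange(len(self.arms))  # explore
        return max(range(len(self.arms)), key=lambda i: state[i][1])  # exploit

    def update(self, session_id, arm, accept_rate):
        """Fold the observed accept_rate into the running mean for that arm."""
        stat = self._state(session_id)[arm]
        stat[0] += 1
        stat[1] += (accept_rate - stat[1]) / stat[0]


bandits = SessionBandits(epsilon=0.0)      # epsilon=0 makes the demo deterministic
bandits.update("session-a", 2, 0.9)        # pretend arm 2 (keep=0.10) scored well
print(bandits.arms[bandits.choose("session-a")])  # 0.1
```

A request without a session_id would bypass this object entirely and use the fixed keep ratio, which is how the commit message describes the opt-in behavior.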
Record the current squashed PR Luce-Org#274 head as integrated; resolved the CMake conflict by retaining the existing adaptive bandit tests from PR Luce-Org#264 while keeping the already-integrated early-exit drafter files.
cde47cd to 5037b28
Update the integration manifest after merging the latest PR Luce-Org#274 head (adaptive anchor radius and PFLASH_COMPRESS env rename). Record a fresh PR Luce-Org#266 worktree conflict attempt and current blocked classifications.
5037b28 to 9c9aee9
…g-42 tail-capture guard ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.
At >=32K context the needle text is more likely to straddle multiple chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks, ~160 tokens) loses the back half of the needle digits: the model retrieves '...is 4' but truncates/hallucinates the continuation. Adaptive scaling based on n_chunks:
- <32K context (<1024 chunks): radius=2, max_anchor_hits=8 (unchanged)
- 32-64K (1024-2047 chunks): radius=4, max_anchor_hits=16
- >=64K (>=2048 chunks): radius=8, max_anchor_hits=32
Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS env vars (legacy DFLASH_COMPRESS_* names still accepted). Validated at 49K context: NIAH needle 'kowefada 1596346' correctly retrieved (was: '1594' or hallucinated 'is 048394839483' before fix). Resolves the long-standing 'project_64k_quality_cliff' memory entry.
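The tiering described in this commit message can be written as a small lookup. This is a sketch rather than the C++ implementation: the function name `anchor_params` is hypothetical, the environment is passed in as a plain mapping, and the legacy DFLASH_COMPRESS_* fallback is left out because its precedence is not stated here.

```python
def anchor_params(n_chunks, env=None):
    """Return (anchor_radius, max_anchor_hits) for a prompt of n_chunks chunks."""
    env = env or {}
    if n_chunks < 1024:        # < 32K tokens at chunk_size=32
        radius, max_hits = 2, 8
    elif n_chunks < 2048:      # 32K to just under 64K
        radius, max_hits = 4, 16
    else:                      # >= 64K
        radius, max_hits = 8, 32
    # Explicit overrides win over the adaptive tier.
    radius = int(env.get("PFLASH_COMPRESS_ANCHOR_RADIUS", radius))
    max_hits = int(env.get("PFLASH_COMPRESS_MAX_ANCHOR_HITS", max_hits))
    return radius, max_hits


for n_tokens in (16_000, 49_000, 131_072):
    n_chunks = n_tokens // 32
    radius, max_hits = anchor_params(n_chunks)
    window_tokens = (2 * radius + 1) * 32   # chunks on both sides plus the anchor chunk
    print(n_tokens, radius, max_hits, window_tokens)
```

With radius=2 the window is 5 chunks (160 tokens), as the message states; the 49K validation case lands in the middle tier, where the window grows to 9 chunks (288 tokens).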
Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35 target loader and the dflash decode draft loader. After loading weight tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] / wk->ne[1] and compare against GGUF-declared values; set_last_error and return false on mismatch. Makes the 'stale scalar at graph-build time' bug class structurally impossible. Load-time only, no runtime cost. Existing well-formed GGUFs are unaffected (smoke verified).
When pflash compresses, set gen_req.fa_window_override = effective_prompt + 256 so spec-decode verify sees the entire compressed prompt. Pflash already paid compute to pick which tokens matter; verify never throws any of them away. When the override would exceed 2 * cfg_.fa_window (spec-decode's drafter cost stops earning its tok/J), the C2 gate in qwen35_backend's generate() falls back to AR (fa_window=0, full attention). AR sees every kept token at every context; we choose mechanism, not visibility. Zero new CLI flags. --draft remains the only knob for composition; all per-request adaptation is internal.
…scade default-on Adds backwards-compat fallback wrappers for 6 cascade env vars in both standard and bandit code paths, so harness scripts using either spelling work against this binary. Emits one-time WARN to stderr when the legacy DFLASH_* spelling is honored. Also flips the default for `use_transitive` from `false` to `true` because the gated rare-token bridge improves multi-hop F1 with zero downside in the cascade-already-firing case.
…th drift Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both qwen35_score_and_compress and drafter_score_and_compress call it. Removes two 70-LOC duplicate env-reading blocks and the duplicated anchor-radius comment. Also removes dead force_chunk_neighborhood (no callers) and collapses the 4-overload load_drafter pyramid to one canonical implementation + 3 thin forwarders.
- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks), remove early_exit_n alias (inline early_exit_pre at call site) - qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref; do_ar_decode budget-hook 15-line → 4-line + docs ref - http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref - model_backend.h: BudgetHook 23-line essay → 3-line + docs ref - gguf_target_loader.cpp: 4-line prose tail → 1-line - .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts - docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md, anchor-transitive.md (consolidated rationale)
…nking is off The hard-coded renderer appends a closed think prefill when thinking is disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix, leaving the model in the wrong decoding state for tool use. Mirror the hard-coded behavior here when the rendered prompt ends with a bare assistant generation prompt; tolerate trailing-whitespace variants (single \n, double \n\n, trailing space). Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%: the drafter was distilled with the closed-think suffix in its training distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target and drafter disagree on what comes after <|im_start|>assistant\n.
… only The previous commit applied the closed-think suffix to all Jinja-rendered prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja, defaulting to QWEN3, and guard the post-processing block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe inherits ChatFormat::QWEN3 by design (matches drafter distillation). 5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3 arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append when template already closes the think block. Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.
Extract the C2 spec-decode gate from an inline expression in qwen35_backend.cpp into a pure predicate header c2_gate.h. Zero behavior change. Identical math: (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg) The new header documents the empirically-derived rationale: at compressed KV sizes (pflash compression of long prompts), T_draft/T_target ratio approaches 1, eliminating spec-decode's profit margin over AR. Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when eff_fa_window > 2*fa_window_cfg. Adds 5 unit tests locking in the predicate's behavior with explicit Round 5 4-arm matrix bench citations. Files: - server/src/qwen35/c2_gate.h (new) - server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call) - server/test/test_server_unit.cpp (+60 LOC, 5 tests)
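The predicate quoted in this commit message, together with the `effective_prompt + 256` override it is fed with, can be mirrored in a few lines. This is a Python transliteration for illustration only; the function names are hypothetical, and the `fa_window_cfg=2048` used in the demo is an example value rather than a documented default.

```python
def fa_window_override_for(effective_prompt_tokens, margin=256):
    """Window that covers the whole compressed prompt plus a fixed margin."""
    return effective_prompt_tokens + margin


def c2_allows_spec_decode(fa_window_override, fa_window_cfg):
    """Same math as the commit message:
    (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)
    """
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg


FA_WINDOW_CFG = 2048  # example value (assumption)

for eff in (3000, 5000):
    override = fa_window_override_for(eff)
    mode = "spec-decode" if c2_allows_spec_decode(override, FA_WINDOW_CFG) else "AR fallback"
    print(eff, override, mode)
# 3000 3256 spec-decode
# 5000 5256 AR fallback
```

An override of 0 (no compression happened) always passes, so uncompressed requests keep their existing spec-decode behavior.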
2c19f66 to e64a2b8
…default 0.10)
- Gate context-window admission on post-compression effective size, not raw, so >128K-raw prompts compress to fit max_ctx instead of 400 / oversized KV reservation.
- Pre-compression keep-ratio sanity guard (raw*keep+max_out>max_ctx); the real effective-size gate runs post-compression in worker_loop.
- Default prefill-keep-ratio 0.05 -> 0.10: real ~2x compression on agentic content (0.25 over-forces anchor-transitive to ~100% = no-op + rejects >128K).
- Evidence (RTX3090, agentic replay, keep=0.10): 167K raw admitted -> 71K eff (42.6%), prefill 145s vs 845s forced; 32-128K real compression; tool-parse intact; 1629 unit asserts green; 14-cell P/PD sweep zero crashes.
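The two-stage admission described here can be sketched as a pair of checks. The pre-compression guard follows the `raw*keep+max_out>max_ctx` expression from the commit message; the post-compression check is assumed to be `effective + max_out <= max_ctx`, which the message implies but does not spell out. Function names and the `max_out=1024` value are hypothetical.

```python
def passes_keep_ratio_guard(raw_tokens, keep_ratio, max_out, max_ctx):
    """Cheap pre-compression sanity check: reject only if even the
    target-compressed prompt plus the output budget cannot fit."""
    return raw_tokens * keep_ratio + max_out <= max_ctx


def admit_after_compression(effective_tokens, max_out, max_ctx):
    """Real gate, run once the effective (post-compression) size is known.
    The exact inequality is an assumption."""
    return effective_tokens + max_out <= max_ctx


MAX_CTX = 139264          # from the PR description
MAX_OUT = 1024            # hypothetical output budget
raw, eff = 166_911, 71_110  # the >128K case reported in the PR description

print(raw <= MAX_CTX)                                        # False: a raw-size gate would reject
print(passes_keep_ratio_guard(raw, 0.10, MAX_OUT, MAX_CTX))  # True
print(admit_after_compression(eff, MAX_OUT, MAX_CTX))        # True
print(round(100 * eff / raw, 1))                             # 42.6
```

The effective size is far above `raw * keep` because forced anchors are retained on top of the keep ratio, which is why the pre-compression guard is only a sanity check and the real gate runs after compression.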
…tion, default-off) Pure per-request router that picks pFlash compression per prompt instead of a global keep. Ships DISABLED: the default policy is an exact no-op, so enabling is opt-in and cannot regress recall. - regime_router.h: pure decide_regime (v1) + decide_v2 (type-gate + sparse-prompt guard + recency floor) + recency_floor_for() size-adaptive helper. 172 unit tests, -Werror clean. Default policy returns FullCascade/full-keep for any input. - qwen3_drafter.cpp: PFLASH_RECENCY_FLOOR_TOKENS mechanism (0 = off default; -1 = auto = min(1024, 0.04*S)); env_int negative-sentinel fix. - CMakeLists: test_regime_router target. Evidence (bench/2026-05-30_*): indexer-prune dead (frac_prune=0 across scorers); the splitter is prompt TYPE, not cascade-expansion (rho=-0.27); sparse-plumbing turns fail under compression but pass at full keep (OFF=100%), so route them to full keep. Recency floor validated zero-sum at fixed keep (breaks controls), kept as off-by-default infra, not the mechanism. Router path: agentic-rich -> ~25% keep (~3x prefill/wall), agentic-sparse -> full keep, retrieval -> cascade. Not yet wired into the live request handler (follow-up).
…n mechanisms
Wires the compression-regime router into the request path (default-OFF via PFLASH_ROUTER_ENABLE, so this commit is a no-op until enabled) and prunes the router down to only what this session validated. Net -433 lines.
Live path (when enabled):
- detect agentic vs retrieval at admission (tool schemas / tool_use|tool_result blocks / tool_calls present). JSON-walking lives at the handler boundary (http_server.cpp); the router header stays pure (stdlib-only).
- agentic -> cascade off, keep_target 0.25 (~3x prefill/wall, content permitting)
- retrieval -> cascade on, full keep
- per-request use_transitive threaded through the compress path.
Pruned (mechanisms disproven this session, see bench/2026-05-30_*):
- v1 cascade-expansion router (R): refuted as keep predictor, Spearman rho=-0.27.
- recency-token floor: validated zero-sum at fixed keep (displaces top-K middle context, breaks controls). Removed from header + drafter.
- decide_v2 sparse-prompt guard (new_content_tokens): does not separate fails-from-passes under compression.
regime_router.h: 271 -> 96 LOC, zero non-stdlib includes, standalone -Werror, 43 unit tests (decide_v2 type-gate + detect_request_type truth-table).
Known follow-ups (NOT in this commit):
(1) hard empty/degenerate safety guard (result.tokens.size()/degenerate_decode_close already at the update site) to catch empty responses;
(2) reconcile bandit keep range [0.025,0.20] vs router floor 0.25 (bandit currently wins per-session);
(3) confident-off-task closure needs client/app feedback, not a label-free reward (Momus-confirmed).
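The type gate can be illustrated with a small request classifier and decision table. This is a sketch, not the code in regime_router.h or http_server.cpp: the JSON field layout (a top-level `tools` list, per-message `tool_calls`, and content blocks typed `tool_use` / `tool_result`) is assumed from common chat-completion request shapes, and "full keep" is modeled as "no router-imposed keep target", which is an interpretation.

```python
def detect_request_type(body):
    """Classify a request as 'agentic' or 'retrieval' from tool markers."""
    if body.get("tools"):
        return "agentic"
    for msg in body.get("messages", []):
        if msg.get("tool_calls"):
            return "agentic"
        content = msg.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") in ("tool_use", "tool_result"):
                    return "agentic"
    return "retrieval"


def decide(request_type, router_enabled):
    """Return per-request settings, or None when the router is off (no-op)."""
    if not router_enabled:
        return None
    if request_type == "agentic":
        return {"use_transitive": False, "keep_target": 0.25}
    # "full keep": modeled here as leaving the keep target unset (assumption).
    return {"use_transitive": True, "keep_target": None}


agentic_req = {"tools": [{"name": "read_file"}], "messages": [{"role": "user", "content": "hi"}]}
plain_req = {"messages": [{"role": "user", "content": "find the needle"}]}

print(decide(detect_request_type(agentic_req), router_enabled=True))
print(decide(detect_request_type(plain_req), router_enabled=True))
print(decide(detect_request_type(agentic_req), router_enabled=False))
```

Keeping the JSON walk in `detect_request_type` and the policy in `decide` mirrors the split the commit describes, where the handler does the parsing and the router stays a pure function.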
…Luce-Org#10)
Closes the two validated pieces of the adaptive-keep path (the label-free quality-reward idea was dropped: Momus-confirmed it can't catch confident off-task). Default-OFF; router gates these to agentic-routed requests.
- regime_router.h: two pure helpers (stdlib-only, TDD'd):
  - clamp_keep_to_floor(bandit_keep, router_floor, agentic): agentic effective keep = max(bandit_keep, floor) so the bandit's 0.20 ceiling can no longer silently undercut the router's 0.25 floor.
  - compression_failed(tokens, degenerate_close, agentic_compressed, min=8): true on empty/degenerate output of an agentic compressed turn.
- adaptive_keep_ratio.h: per-session recover_full_next flag (+ set/consume).
- http_server.cpp: floor clamp at keep-apply; at the post-generate update site, on compression_failed → skip the bandit update (failure noise) and set the session to full keep for the next turn (deterministic recovery from the empty-response failure class, e.g. LONG_B t10). PFLASH_GUARD_MIN_TOKENS env (default 8) tunes the guard threshold.
- 59 standalone unit tests, -Werror.
LIVE-VALIDATED on RTX 3090 (server up on :18097, 34K-token prompts):
- type-gate: agentic→keep 0.250/cascade-off, retrieval→cascade-on.
- guard recovery loop: turn1 compression_failed→full-keep-next (resp_tokens=13, bandit update skipped); turn2 same session recover_full_next consumed→keep 1.0.
- floor clamp fired: agentic bandit 0.100 < floor 0.250 → 0.250.
Launch config (24GB): GGML_CUDA_NO_VMM=1 + --max-ctx 49152 (139264 KV OOMs the 3090; that was the pre-existing bad_alloc, not this change). Still default-OFF via PFLASH_ROUTER_ENABLE.
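The two helpers named in this commit message can be approximated as follows. The signatures follow the message, but the bodies are assumptions: in particular, whether the guard compares with `<` or `<=`, and how the degenerate-close flag combines with the token count, are not stated in the source.

```python
def clamp_keep_to_floor(bandit_keep, router_floor, agentic):
    """Agentic requests never go below the router floor; others pass through."""
    return max(bandit_keep, router_floor) if agentic else bandit_keep


def compression_failed(n_tokens, degenerate_close, agentic_compressed, min_tokens=8):
    """True when an agentic compressed turn produced an empty or degenerate reply.
    The '<' comparison and the OR with degenerate_close are assumptions."""
    if not agentic_compressed:
        return False
    return n_tokens < min_tokens or degenerate_close


# Floor clamp, using the values from the live validation above.
print(clamp_keep_to_floor(0.100, 0.250, agentic=True))   # 0.25
print(clamp_keep_to_floor(0.100, 0.250, agentic=False))  # 0.1

# Guard: an empty reply on a compressed agentic turn triggers recovery.
session = {"recover_full_next": False}
if compression_failed(0, degenerate_close=False, agentic_compressed=True):
    session["recover_full_next"] = True   # next turn runs at full keep; bandit update skipped
print(session)  # {'recover_full_next': True}
```

Skipping the bandit update on a failed turn keeps a one-off empty response from being learned as a low reward for that keep ratio, which matches the "failure noise" rationale in the message.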
Merge latest feat/pflash-drafter-ee7 head and resolve conflicts by preserving the integration stack's pFlash curve/bandit/upstream/SWA behavior while adding the opt-in PFLASH_ROUTER_ENABLE regime router and standalone router test.
Correct the final included-head count after PR Luce-Org#274 advanced; the latest head replaced the prior included head, leaving 21 included and 6 non-ancestor ready PRs.
TL;DR
Adaptive composition stack: pflash compression + dflash spec-decode + ee7 drafter + anchor-transitive cascade + gated rare-token bridge + adaptive anchor_radius (64K NIAH cliff fix) + per-request fa_window_override + C2 mechanism gate + type-gate regime router (default-OFF).
Prefill speedup is content-dependent:
Decode at parity (composition); +35% with the router on (shorter compressed KV).
Adaptive levers
- Adaptive anchor_radius (4a838dd): <32K=2 / 32–64K=4 / ≥64K=8 based on n_chunks. Eliminates the 64K NIAH cliff. Override via PFLASH_COMPRESS_ANCHOR_RADIUS.
- Per-request fa_window_override (2726608, Design 1): when pflash compresses, http_server sets fa_window_override = effective_prompt + 256. Never caps visibility.
- C2 mechanism gate (2726608 + refactor 548be0a): if eff_fa_window ≤ 2 × cfg_.fa_window (= 4096 default), spec-decode runs. Else AR fallback. Extracted to server/src/qwen35/c2_gate.h as a pure tested predicate + 5 unit tests. Zero behavior change.
- Effective-size admission gate + keep-ratio guard (8c1f37d): the context-window admission check gates on the post-compression effective size, not the raw prompt: a prompt larger than max_ctx that compresses to fit is admitted (previously a hard 400) without forcing an oversized KV reservation. Default --prefill-keep-ratio 0.05→0.10 (anchors yield ~40% effective; keep=0.25 over-forces the cascade to ~100% = near-no-op). Verified at >128K: raw 166,911 → eff 71,110 (42.6%), admitted at max_ctx=139264.
- Type-gate regime router (fbc2d41 / b31544f / 8fc961b, default-OFF behind PFLASH_ROUTER_ENABLE): correct-by-construction admission-time gate. server/src/common/regime_router.h is pure stdlib (96 LOC, 59 standalone -Werror tests). Logic:
  - Agentic requests (tools array / tool_use / tool_calls present) → keep floor 0.25, transitive cascade OFF. The cascade keys on rare-token co-occurrence, and the 151K drafter vocab makes ~every token "rare" → it over-forces dense agentic content to 70–100% keep (a near-no-op). Disabling it + flooring at 0.25 is what delivers the win below.
  - … PFLASH_GUARD_MIN_TOKENS, default 8).
Verified A/B (router ON vs OFF), 2026-05-31
Binary built from this branch tip (8fc961b), verified to contain both the admission-gate (dflash::common::check_admission symbol) and the router. 4 cold-prefill turns on real organic agentic sessions (claude-code transcripts, user turns at 34–37K est tokens, native tool blocks + tools array → agentic path fires). Fresh server per arm (true cold prefill). Run dir: bench/2026-05-31_definitive_ab/run_20260531_111334/.
All 4 ON turns routed type=agentic, keep=0.25, cascade=off. 0 floor-clamps, 0 guard fires. The admission-gate does not suppress the cascade over-forcing (OFF still 72% avg), so the router win is real on the shipped base. N=4 cold turns; numbers match the earlier-base measurement (2.85×/2.89×) within noise.
Reproducibility
Env: PFLASH_COMPRESS_ANCHOR_TRANSITIVE=1 PFLASH_DRAFTER_EARLY_EXIT_N=7 PFLASH_DRAFTER_SCORE_LAYERS=7 DFLASH27B_KV_K=tq3_0 DFLASH27B_KV_V=tq3_0. Router: add PFLASH_ROUTER_ENABLE=1.
Bench dashboard: bench/2026-05-28_adaptive_stack/Round5_4arm_matrix_174641/DASHBOARD.md.
Public evidence: https://github.com/dusterbloom/pflash-evidence.
Notes