feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity by dusterbloom · Pull Request #274 · Luce-Org/lucebox-hub

dusterbloom · 2026-05-24T08:12:11Z

TL;DR

Adaptive composition stack: pflash compression + dflash spec-decode + ee7 drafter + anchor-transitive cascade + gated rare-token bridge + adaptive anchor_radius (64K NIAH cliff fix) + per-request fa_window_override + C2 mechanism gate + type-gate regime router (default-OFF).

Prefill speedup is content-dependent:

~12× prefill-TPS on sparse/NIAH cells (Round5),
~2.2× wall on dense agentic 64–128K (full-sweep 2026-05-29),
3.00× prefill-TPS / 2.91× wall on real dense >32K agentic turns with the router ON (verified on this branch's tip, 2026-05-31 — see §Verified A/B).

Decode at parity (composition); +35% with the router on (shorter compressed KV).

Adaptive levers

1. Adaptive anchor_radius (4a838dd): <32K=2 / 32–64K=4 / ≥64K=8 based on n_chunks. Eliminates the 64K NIAH cliff. Override via PFLASH_COMPRESS_ANCHOR_RADIUS.
2. Per-request fa_window_override (2726608, Design 1): when pflash compresses, http_server sets fa_window_override = effective_prompt + 256. Never caps visibility.
3. C2 mechanism gate (2726608 + refactor 548be0a): if eff_fa_window ≤ 2 × cfg_.fa_window (= 4096 default), spec-decode runs. Else AR fallback. Extracted to server/src/qwen35/c2_gate.h as a pure tested predicate + 5 unit tests. Zero behavior change.
4. Effective-size admission gate + keep-ratio guard (8c1f37d): the context-window admission check gates on the post-compression effective size, not the raw prompt. A prompt larger than max_ctx that compresses to fit is admitted (previously a hard 400) without forcing an oversized KV reservation. Default --prefill-keep-ratio 0.05→0.10 (anchors yield ~40% effective; keep=0.25 over-forces the cascade to ~100% = near-no-op). Verified at >128K: raw 166,911 → eff 71,110 (42.6%), admitted at max_ctx=139264.
5. Type-gate regime router (fbc2d41 / b31544f / 8fc961b, default-OFF behind PFLASH_ROUTER_ENABLE): correct-by-construction admission-time gate. server/src/common/regime_router.h is pure stdlib (96 LOC, 59 standalone -Werror tests). Logic:
- agentic request (structured tools array / tool_use / tool_calls present) → keep floor 0.25, transitive cascade OFF. The cascade keys on rare-token co-occurrence, and the 151K drafter vocab makes ~every token "rare" → it over-forces dense agentic content to 70–100% keep (a near-no-op). Disabling it + flooring at 0.25 is what delivers the win below.
- retrieval request → full keep, cascade ON (cascade is correct for QA/NIAH recall).
- empty/degenerate guard: a compressed agentic turn that returns empty/degenerate skips the keep-bandit and one-shot recovers to full keep next turn (PFLASH_GUARD_MIN_TOKENS, default 8).
- floor-clamp: the online keep-bandit (PR feat(pflash): adaptive keep_ratio bandit MVP #264) is clamped up to the 0.25 agentic floor so it can't undercut the router.
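
The routing rules above can be sketched in a few lines. This is a Python illustration only, not the C++ in regime_router.h: the `Decision` type, the function names, the request fields inspected, and modelling "full keep" as a keep ratio of 1.0 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

AGENTIC_KEEP_FLOOR = 0.25  # agentic keep floor quoted in the PR description
FULL_KEEP = 1.0            # assumption: "full keep" modelled as keep ratio 1.0


@dataclass(frozen=True)
class Decision:
    request_type: str  # "agentic" or "retrieval"
    keep: float        # keep target handed to the compressor
    cascade: bool      # whether the transitive cascade runs


def is_agentic(request: dict) -> bool:
    """Hypothetical detector: any tool schema or tool block marks the request agentic."""
    if request.get("tools"):
        return True
    for message in request.get("messages", []):
        if message.get("tool_calls"):
            return True
        content = message.get("content")
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") in ("tool_use", "tool_result"):
                    return True
    return False


def decide(request: dict, router_enabled: bool) -> Optional[Decision]:
    """Return None when the router is off, so the default path is untouched."""
    if not router_enabled:
        return None
    if is_agentic(request):
        return Decision("agentic", AGENTIC_KEEP_FLOOR, cascade=False)
    return Decision("retrieval", FULL_KEEP, cascade=True)
```

Returning `None` when the router is disabled is one way to express "merging changes no default behavior": the caller falls through to the existing fixed-keep path.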

Verified A/B (router ON vs OFF) — 2026-05-31

Binary built from this branch tip (8fc961b), verified to contain both the admission-gate (dflash::common::check_admission symbol) and the router. 4 cold-prefill turns on real organic agentic sessions (claude-code transcripts, user turns at 34–37K est tokens, native tool blocks + tools array → agentic path fires). Fresh server per arm (true cold prefill). Run dir: bench/2026-05-31_definitive_ab/run_20260531_111334/.

| Metric | ROUTER OFF | ROUTER ON | Speedup |
|---|---|---|---|
| keep % (mean, range) | 72.3% (44.7–100) | 24.9% (25) | n/a |
| prefill_TPS | 1248 | 3747 | 3.00× |
| wall_s | 64.9 | 22.3 | 2.91× |
| decode_TPS | 20.6 | 27.8 | +35% |
| tool_parse | 4/4 OK | 4/4 OK | held |

All 4 ON turns routed type=agentic, keep=0.25, cascade=off. 0 floor-clamps, 0 guard fires. The admission-gate does not suppress the cascade over-forcing (OFF still 72% avg), so the router win is real on the shipped base. N=4 cold turns; numbers match the earlier-base measurement (2.85×/2.89×) within noise.
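
As a quick arithmetic check, the speedup column can be recomputed from the OFF and ON values in the table. The helper names below are made up for the example; the only point is the direction of each ratio (ON/OFF for throughput, OFF/ON for wall time).

```python
# Values copied from the A/B table above.
ROUTER_OFF = {"prefill_tps": 1248, "wall_s": 64.9, "decode_tps": 20.6}
ROUTER_ON = {"prefill_tps": 3747, "wall_s": 22.3, "decode_tps": 27.8}


def throughput_speedup(off: float, on: float) -> float:
    """Higher is better, so the speedup is ON / OFF."""
    return on / off


def latency_speedup(off: float, on: float) -> float:
    """Lower is better, so the speedup is OFF / ON."""
    return off / on


prefill = throughput_speedup(ROUTER_OFF["prefill_tps"], ROUTER_ON["prefill_tps"])
wall = latency_speedup(ROUTER_OFF["wall_s"], ROUTER_ON["wall_s"])
decode_gain_pct = (throughput_speedup(ROUTER_OFF["decode_tps"], ROUTER_ON["decode_tps"]) - 1.0) * 100.0

print(f"prefill {prefill:.2f}x, wall {wall:.2f}x, decode +{decode_gain_pct:.0f}%")
# prints: prefill 3.00x, wall 2.91x, decode +35%
```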

Reproducibility

./server/build/dflash_server /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --draft /path/to/dflash-draft-3.6-q4_k_m.gguf \
  --prefill-drafter /path/to/Qwen3-0.6B-Q8_0.gguf \
  --prefill-compression always --prefill-keep-ratio 0.10 \
  --prefill-threshold 32000 --max-ctx 139264 \
  --chat-template-file qwen3-coder-chat-template.jinja

Env: PFLASH_COMPRESS_ANCHOR_TRANSITIVE=1 PFLASH_DRAFTER_EARLY_EXIT_N=7 PFLASH_DRAFTER_SCORE_LAYERS=7 DFLASH27B_KV_K=tq3_0 DFLASH27B_KV_V=tq3_0. Router: add PFLASH_ROUTER_ENABLE=1.

Bench dashboard: bench/2026-05-28_adaptive_stack/Round5_4arm_matrix_174641/DASHBOARD.md. Public evidence: https://github.com/dusterbloom/pflash-evidence.

Notes

The "+47% decode TPS" figure is C_pflash_only vs A_baseline, not D_composition vs A_baseline. Pflash compression delivers it. Composition decode TPS is at parity-or-slightly-below pflash-only because the C2 gate (correctly) disables spec-decode on compressed KV.
The router (lever 5) is default-OFF; merging it changes no default behavior. It adds opt-in code + 401 LOC of tests.

cubic-dev-ai

12 issues found across 37 files

Replace size-byte heuristic at qwen3_loader.cpp:109 with the direct gguf_get_tensor_type(ctx, tensor_id) API as suggested in PR Luce-Org#274 review. Return type of gguf_find_tensor is int64_t; updated tidx declaration accordingly.

Line 301: dp50/base computes the inverse (slower = higher), should be base/dp50 so that speedup > 1 means ee7 is faster. The headline 1.82x and 9.29x numbers in PR Luce-Org#274 came from run_niah_ee7_broad.py and run_niah_ee7_longctx.py which had the correct formula; this script's printed table was not used as source data for the SUMMARY.md files.

dusterbloom · 2026-05-24T09:11:33Z

Addressed all cubic + howard0su review feedback. PR body updated with correct Q4_K_M target (was Q3_K_S — factual error) and per-context drafter speedup table (1K-128K + per-client agentic). Howard0su's two fixes applied (gguf_get_tensor_type, passbv docstring). All P1+P2 cubic items fixed across 6 commits.

…eview) Pure helper with dflash::common namespace was mis-located under dflash/src/qwen3/. Move to dflash/src/common/ so qwen35, gemma4, and future backends can use the same scoring-range logic without reaching into the qwen3 source dir.

davide221 · 2026-05-24T10:56:04Z

Why layer 7, and not some other layer? Env naming should be PFlash, not DFlash, to make it clearer that this is the prefill drafter.

dusterbloom · 2026-05-24T11:47:41Z

Great question @davide221. I started by cutting at layer 14 and it worked. Then I pushed further and it worked at layer 7. I stopped there, but I'm actually benching right now to see if it works at layer 5 or even 3. Let's see.

As for DFlash, that's a typo. Agreed, I'll rename them all to make it clear this is about prefill!

…RAFTER_* (PR Luce-Org#274 review)

- These envs control the pflash drafter (prefill-flash KV compression) early-exit path, not the dflash spec-decode drafter; the DFLASH_ prefix was misleading.
- Pure rename across 15 files: qwen3_graph.cpp, score_range.h, all bench scripts, two historical SUMMARY docs, and the warm-path regression test.
- No alias kept: breaking change by design, same scope as PFLASH_DRAFTER_SLIM.
- Per davide221 review comment on PR Luce-Org#274.

cubic-dev-ai

1 issue found across 15 files (changes from recent commits).

…YERS in run_tier1_proof.py (PR Luce-Org#274 review)

- cubic review caught: prior version set only SCORE_LAYERS so q8_l7 ran full 28 layers, invalidating the layer-subset measurement
- Also clears EARLY_EXIT_N in baseline/Q8 conditions to prevent shell-env leakage
- Historical tier1_proof results in 2026-05-21_tier1_proof/ are NOT being re-run; they served as the spike that discovered the scoring-clamp bug (fixed in be16529/90e6b27) and are preserved as historical record
- Going forward, anyone re-running the bench gets correct measurements

…-7 at S%4096∈{1..7})

Root cause (codex + momus data audit): tail_lo = S - n_lookahead; old guard: tail_lo < cs + cl. When S = k*chunk_size + r (r∈{1..7}, n_lookahead=8), tail_lo + n_lookahead = cs + cl + r > cs + cl, so ggml_view_3d reads r tokens past the allocation. Triggered deterministically at S mod 4096 ∈ {1..7}; ~0.2% of real prefills.

Fix (2 lines, both sites in qwen3_graph.cpp:463 and :516):

- if (tail_lo_nr >= cs && tail_lo_nr < cs + cl)
+ if (tail_lo_nr >= cs && tail_lo_nr + n_lookahead <= cs + cl)
- if (tail_lo >= cs && tail_lo < cs + cl)
+ if (tail_lo >= cs && tail_lo + n_lookahead <= cs + cl)

Semantics: skip tail-capture when n_lookahead tokens won't fit inside the chunk. Unit test (test_drafter_tail_capture_guard) GREEN; all existing tests PASS. Resolves the 64K/128K NIAH crash-failures diagnosed in PR Luce-Org#274 review.
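
The off-by-r condition in this commit message can be reproduced with plain integer arithmetic. The sketch below is a Python model of the two guards, not the C++ in qwen3_graph.cpp; the `chunks` and `overflows` helpers are hypothetical and exist only to enumerate which sequence lengths let a tail view run past its chunk.

```python
CHUNK_SIZE = 4096  # drafter chunk length, from the commit message
N_LOOKAHEAD = 8    # tail-capture window, from the commit message


def chunks(seq_len, chunk_size=CHUNK_SIZE):
    """Yield (chunk_start, chunk_len) pairs covering [0, seq_len)."""
    for cs in range(0, seq_len, chunk_size):
        yield cs, min(chunk_size, seq_len - cs)


def old_guard(tail_lo, cs, cl):
    """Pre-fix check: only the start of the tail must lie inside the chunk."""
    return cs <= tail_lo < cs + cl


def new_guard(tail_lo, cs, cl, n_lookahead=N_LOOKAHEAD):
    """Post-fix check: the whole n_lookahead window must fit inside the chunk."""
    return tail_lo >= cs and tail_lo + n_lookahead <= cs + cl


def overflows(seq_len, guard):
    """True if `guard` admits a tail view that extends past its chunk."""
    tail_lo = seq_len - N_LOOKAHEAD
    return any(
        guard(tail_lo, cs, cl) and tail_lo + N_LOOKAHEAD > cs + cl
        for cs, cl in chunks(seq_len)
    )


# Sweep every residue r = S mod 4096 for a prompt a little over two chunks long.
bad_residues = [r for r in range(CHUNK_SIZE) if overflows(2 * CHUNK_SIZE + r, old_guard)]
print(bad_residues, len(bad_residues) / CHUNK_SIZE)
```

Under this model the old guard overflows for exactly the residues 1 through 7, which is 7/4096 ≈ 0.17% of lengths and lines up with the "~0.2% of real prefills" figure if prompt lengths are roughly uniform modulo 4096.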

- NIAH 3/3 at 32K/64K/128K for all four conditions; zero ggml_view_3d asserts
- ee3 drafter speedup: 6.9x@32K, 7.6x@64K, 24.3x@128K vs baseline
- Multi-client drafter_fwd mean: ee3=6.2x vs baseline, 1.84x vs ee7
- accept_rate unavailable (binary lacks bandit feature); wall_s for claude_code: ee3=1.08s vs ee7=1.29s
- Decision: ee3 passes all measurable gates; propose as new default after Luce-Org#274 merges

cubic-dev-ai

1 issue found across 66 files (changes from recent commits).

…cleanup) — runner scripts stay User directive: PR Luce-Org#274 ships ee7 source only; bench evidence not committed. Runner scripts in dflash/bench/*.py and dflash/bench/results/*/run_bench*.sh preserved for reproducibility. Result dirs under dflash/bench/results/ removed (SUMMARY.md, raw_results.json/jsonl, case*.json, vram.csv). Equivalent results can be regenerated via the kept scripts.

cubic-dev-ai

6 issues found

Prompt for AI agents (unresolved issues)


Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/bench/run_ee_n_sweep_multihop.py">

<violation number="1" location="dflash/bench/run_ee_n_sweep_multihop.py:30">
P2: Context sweep parameter `ctx` is passed to `start_server` but ignored; `--max-ctx` is hardcoded to 139264, making all benchmark runs use the same allocation regardless of labeled context size</violation>
</file>

<file name="dflash/src/qwen3/anchor_scan.cpp">

<violation number="1" location="dflash/src/qwen3/anchor_scan.cpp:105">
P2: `prev_forced` is refreshed at the top of every transitive iteration, making `max_iters > 1` ineffective and preventing multi-iteration cascade expansion. For `it > 0`, `forced[c] && !prev_forced[c]` is always false because `prev_forced = forced` was just executed, so the rare-token worklist is empty and the loop breaks at `forced == prev_forced` before any iteration-1+ work can happen. The intended behavior requires carrying the pre-bottom-scan snapshot from the previous iteration so that newly-forced chunks from `scan_and_force` can be processed in the next rare-token pass.</violation>
</file>

cubic-dev-ai · 2026-05-26T12:49:19Z

+CONDITION_SPECS = {"ee7": (7, 7)}  # (EARLY_EXIT_N, SCORE_LAYERS)
+
+
+def start_server(condition, ctx, log_path, compression_mode="always", keep_ratio=0.05):


P2: Context sweep parameter ctx is passed to start_server but ignored; --max-ctx is hardcoded to 139264, making all benchmark runs use the same allocation regardless of labeled context size

cubic-dev-ai · 2026-05-26T12:49:19Z

+    // Cascade loop: expand pool with newly-forced tokens and re-scan.
+    std::vector<uint8_t> prev_forced;
+    for (int it = 0; it < max_iters; ++it) {
+        prev_forced = forced;


P2: prev_forced is refreshed at the top of every transitive iteration, making max_iters > 1 ineffective and preventing multi-iteration cascade expansion. For it > 0, forced[c] && !prev_forced[c] is always false because prev_forced = forced was just executed, so the rare-token worklist is empty and the loop breaks at forced == prev_forced before any iteration-1+ work can happen. The intended behavior requires carrying the pre-bottom-scan snapshot from the previous iteration so that newly-forced chunks from scan_and_force can be processed in the next rare-token pass.

The committed CONDITION_SPECS in 0e55f48 picked up a temporary {"ee7": (14, 14)} override that was set during an ee14 sweep bench run, instead of the intended {"ee7": (7, 7)}. Restores ee7 as the default condition value. Caught by cubic-dev-ai inline review on PR Luce-Org#274.

dusterbloom · 2026-05-26T12:57:22Z

@cubic-dev-ai re: prev_forced is refreshed at the top of every transitive iteration:

Looked at this carefully — the implementation handles the case you're describing, just differently than the diff suggests. Two pieces:

1. anchor_scan.cpp:115-120 seeds the worklist from forced directly on it==0, not from forced && !prev_forced. So pass-1's anchor results enter the rare-token worklist regardless of the snapshot.
2. anchor_scan.cpp:133 makes the worklist self-cascading within a single iteration. When a rare-token hop forces a new chunk, that chunk is push_back'd into the same worklist and processed by the same for-loop (line 121 indexes a growing vector by wi). The header comment on lines 107-108 says it: "worklist-driven so cascades within a pass are caught (e.g. hop3 forces hop2 which forces hop1 in one outer iteration)."

So the rare-token closure transitively exhausts in a single outer iteration by design. The purpose of it > 0 is the 4-gram re-scan at lines 148-159, which is gated by a fresh prev_forced reassignment on line 158 right before scan_and_force runs — so iteration N+1's rare-token phase does see new chunks from the 4-gram re-scan via the it==0 all-forced seeding mechanism applied at the start.

T6 in test_anchor_transitive.cpp:298-344 is the executable proof: it builds a 20-link rare-token chain with max_iters=25 and max_forced_count=5, and asserts the cap engages. If the worklist were empty after it==0 as your comment describes, the chain couldn't propagate past chunk 0 and the cap would never fire. T6 passes — the in-iteration worklist cascades chunks 0→1→...→N until the cap stops it.

Not a bug — resolving as by design. Will add a clarifying inline comment if the area is touched again.
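
The self-cascading worklist described in this reply is the usual "iterate by index over a growing list" closure. The sketch below is a simplified Python model, not the code in anchor_scan.cpp: the `links` map (chunk to chunks it forces via a shared rare token) and the exact cap semantics of `max_forced_count` are assumptions for illustration.

```python
def rare_token_closure(seed_chunks, links, max_forced_count):
    """Expand the forced set by following rare-token links in one pass.

    Returns (forced, cap_hit). Newly forced chunks are appended to the same
    worklist, so a chain 0 -> 1 -> 2 -> ... is exhausted without an outer loop.
    """
    forced = set(seed_chunks)
    worklist = list(seed_chunks)
    wi = 0
    while wi < len(worklist):  # the list grows while we index into it
        chunk = worklist[wi]
        wi += 1
        for nxt in links.get(chunk, ()):
            if nxt in forced:
                continue
            if len(forced) >= max_forced_count:
                return forced, True  # assumed: cap counts all forced chunks
            forced.add(nxt)
            worklist.append(nxt)
    return forced, False


# A 20-link chain, shaped like the T6 scenario described above.
chain = {i: [i + 1] for i in range(20)}
forced, cap_hit = rare_token_closure([0], chain, max_forced_count=5)
print(sorted(forced), cap_hit)
```

With a cap of 5 the closure stops after chunks 0 to 4 and reports the cap as hit; with a large cap the whole chain is forced in the same single pass, which is the property the reply relies on.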

cubic-dev-ai · 2026-05-26T12:57:41Z

> @cubic-dev-ai re: prev_forced is refreshed at the top of every transitive iteration:
>
> Looked at this carefully: the implementation handles the case you're describing, just differently than the diff suggests. Two pieces:
>
> anchor_scan.cpp:115-120 seeds the worklist from forced directly on it==0, not from forced && !prev_forced. So pass-1's anchor results enter the rare-token worklist regardless of the snapshot.
> ...

@dusterbloom You're right — this is by design, not a bug. T6 is the definitive proof.

The two mechanisms you describe cover the concern fully:

1. it==0 seeding (lines 115-120): seeding from all forced chunks on the first iteration ensures pass-1's 4-gram anchor results enter the rare-token worklist immediately, regardless of the prev_forced snapshot. The snapshot only gates the 4-gram re-scan in subsequent iterations, not the rare-token cascade.
2. Self-cascading worklist (line 133 worklist.push_back inside the loop): the rare-token phase exhausts its own transitive closure within a single outer iteration because newly-forced chunks get appended to the same worklist and processed by the same wi loop. This is the same pattern as BFS with a FIFO queue; there is no need for multiple outer iterations to propagate across hops.

The it > 0 iterations are there for the 4-gram re-scan (lines 148-159), which does need prev_forced to detect delta between the expanded pool from the rare-token cascade and the previous round's forced set. That checkpoint on line 158 (prev_forced = forced) is correct — it happens after the rare-token phase completes for that iteration, so the next iteration's 4-gram scan sees the full set.

If T6 passes with a 20-link chain and a cap of 5, the implementation is sound. No bug here.

- Rename ttft_s → latency_s in run_longbench_hotpotqa.py (stream=False measures total completion time, not TTFT)
- Same rename in run_ee_n_sweep_multihop.py (identical stream=False pattern found in audit)
- Add empty-input guard in run_longbench_hotpotqa.py (sys.exit on empty cases)
- Drop unused ctx parameter from run_ee_n_sweep_multihop.py start_server

Cubic flags P2 quality issues from inline review on commit 0e55f48.

cubic-dev-ai · 2026-05-26T12:58:33Z

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

…s on Luce-Org#274) PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 is the production default after ee_n sweep: 6.9x@32K, 24.3x@128K, accept_rate +1.2 pp vs ee7. Reproduce via dflash/bench/run_ee_n_sweep.sh + run_ee_n_multiclient.sh.

Per-session ε-greedy bandit that adjusts compression based on observed accept_rate. Opt-in via session_id; clients without it get the existing fixed-keep path, byte-identical to main. Includes:

- Bandit state machine (LRU-bounded session map, cap 1024)
- HTTP server session_id parsing + bandit hook
- accept_rate plumbing from DFlash GenerateResult
- CI submodule PAT fix for fork PRs
- Harness session_id env-var wiring

5-turn trajectory + NIAH @16K/32K + 3-seed A/B/C evidence (reproducible via the follow-up bench PR; not committed here). Bench scripts + result artifacts split to follow-up PR. Bug Luce-Org#42 tail-capture fix moved to PR Luce-Org#274.

Record the current squashed PR Luce-Org#274 head as integrated; resolved the CMake conflict by retaining the existing adaptive bandit tests from PR Luce-Org#264 while keeping the already-integrated early-exit drafter files.

Update the integration manifest after merging the latest PR Luce-Org#274 head (adaptive anchor radius and PFLASH_COMPRESS env rename). Record a fresh PR Luce-Org#266 worktree conflict attempt and current blocked classifications.

…g-42 tail-capture guard ee7 truncates drafter forward at layer 7 of 28, scoring only those layers. 9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter). Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF). Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}. 5 unit tests included. Bench scripts split to follow-up PR.

…de env vars)

At >=32K context the needle text is more likely to straddle multiple chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks, ~160 tokens) loses the back half of the needle digits: the model retrieves '...is 4' but truncates/hallucinates the continuation.

Adaptive scaling based on n_chunks:

- <32K context (<1024 chunks): radius=2, max_anchor_hits=8 (unchanged)
- 32-64K (1024-2047 chunks): radius=4, max_anchor_hits=16
- >=64K (>=2048 chunks): radius=8, max_anchor_hits=32

Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS env vars (legacy DFLASH_COMPRESS_* names still accepted). Validated at 49K context: NIAH needle 'kowefada 1596346' correctly retrieved (was: '1594' or hallucinated 'is 048394839483' before fix). Resolves the long-standing 'project_64k_quality_cliff' memory entry.
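
The tiering in this commit message reduces to a small lookup on n_chunks. The sketch below is a Python illustration, not the server code; the function names and the way overrides are passed in (as optional integers rather than read from the environment) are assumptions.

```python
CHUNK_SIZE = 32  # tokens per chunk, from the commit message


def adaptive_anchor_params(n_chunks, radius_override=None, max_hits_override=None):
    """Return (anchor_radius, max_anchor_hits) for a prompt of n_chunks chunks."""
    if n_chunks < 1024:        # under 32K tokens
        radius, max_hits = 2, 8
    elif n_chunks < 2048:      # 32K up to (not including) 64K tokens
        radius, max_hits = 4, 16
    else:                      # 64K tokens and above
        radius, max_hits = 8, 32
    # Explicit overrides (e.g. parsed from the PFLASH_COMPRESS_* env vars) win.
    if radius_override is not None:
        radius = radius_override
    if max_hits_override is not None:
        max_hits = max_hits_override
    return radius, max_hits


def window_tokens(radius, chunk_size=CHUNK_SIZE):
    """Tokens covered by an anchor window of `radius` chunks on each side."""
    return (2 * radius + 1) * chunk_size


n_chunks_49k = 49_000 // CHUNK_SIZE
print(adaptive_anchor_params(n_chunks_49k), window_tokens(2), window_tokens(4))
```

The window arithmetic shows why radius 2 was tight: (2·2+1)·32 = 160 tokens, versus 288 tokens at radius 4, the tier a 49K prompt falls into.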

Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35 target loader and the dflash decode draft loader. After loading weight tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] / wk->ne[1] and compare against GGUF-declared values; set_last_error and return false on mismatch. Makes the 'stale scalar at graph-build time' bug class structurally impossible. Load-time only, no runtime cost. Existing well-formed GGUFs are unaffected (smoke verified).

When pflash compresses, set gen_req.fa_window_override = effective_prompt + 256 so spec-decode verify sees the entire compressed prompt. Pflash already paid compute to pick which tokens matter; verify never throws any of them away. When the override would exceed 2 * cfg_.fa_window (spec-decode's drafter cost stops earning its tok/J), the C2 gate in qwen35_backend's generate() falls back to AR (fa_window=0, full attention). AR sees every kept token at every context; we choose mechanism, not visibility. Zero new CLI flags. --draft remains the only knob for composition; all per-request adaptation is internal.

…scade default-on Adds backwards-compat fallback wrappers for 6 cascade env vars in both standard and bandit code paths, so harness scripts using either spelling work against this binary. Emits one-time WARN to stderr when the legacy DFLASH_* spelling is honored. Also flips the default for `use_transitive` from `false` to `true` because the gated rare-token bridge improves multi-hop F1 with zero downside in the cascade-already-firing case.

…th drift Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both qwen35_score_and_compress and drafter_score_and_compress call it. Removes two 70-LOC duplicate env-reading blocks and the duplicated anchor-radius comment. Also removes dead force_chunk_neighborhood (no callers) and collapses the 4-overload load_drafter pyramid to one canonical implementation + 3 thin forwarders.

- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks), remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref; do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md, anchor-transitive.md (consolidated rationale)

…nking is off The hard-coded renderer appends a closed think prefill when thinking is disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix, leaving the model in the wrong decoding state for tool use. Mirror the hard-coded behavior here when the rendered prompt ends with a bare assistant generation prompt; tolerate trailing-whitespace variants (single \n, double \n\n, trailing space). Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%: the drafter was distilled with the closed-think suffix in its training distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target and drafter disagree on what comes after <|im_start|>assistant\n.

… only The previous commit applied the closed-think suffix to all Jinja-rendered prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja, defaulting to QWEN3, and guard the post-processing block with arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe inherits ChatFormat::QWEN3 by design (matches drafter distillation). 5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3 arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append when template already closes the think block. Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.
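
The post-processing described in these two commits can be modelled as a small string fix-up. The Python sketch below is illustrative only: the function name is hypothetical, the exact suffix string is an assumption (it follows the common Qwen3 convention of an empty think block when thinking is off), and normalising trailing whitespace to a single newline is a simplification.

```python
ASSISTANT_PROMPT = "<|im_start|>assistant"
CLOSED_THINK = "<think>\n\n</think>\n\n"  # assumed suffix; check the real renderer


def align_generation_prompt(rendered, thinking_enabled, is_qwen3):
    """Append a closed think block to a bare assistant generation prompt."""
    if thinking_enabled or not is_qwen3:
        return rendered  # other modes and architectures are left untouched
    stripped = rendered.rstrip(" \n")  # tolerate "\n", "\n\n" and trailing spaces
    if not stripped.endswith(ASSISTANT_PROMPT):
        return rendered  # template already closed the think block, or no gen prompt
    return stripped + "\n" + CLOSED_THINK
```

Because the check looks only at the end of the prompt, a template that already emits the closed think block is returned unchanged, so applying the function twice does not double-append.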

Extract the C2 spec-decode gate from an inline expression in qwen35_backend.cpp into a pure predicate header c2_gate.h. Zero behavior change. Identical math: (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg) The new header documents the empirically-derived rationale: at compressed KV sizes (pflash compression of long prompts), T_draft/T_target ratio approaches 1, eliminating spec-decode's profit margin over AR. Empirical at D_composition 128K replay: AR=27.5 tok/s vs forced spec-decode=5.74 tok/s. The gate correctly blocks spec-decode when eff_fa_window > 2*fa_window_cfg. Adds 5 unit tests locking in the predicate's behavior with explicit Round 5 4-arm matrix bench citations. Files: - server/src/qwen35/c2_gate.h (new) - server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call) - server/test/test_server_unit.cpp (+60 LOC, 5 tests)
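
The predicate quoted in this commit message, combined with the `effective_prompt + 256` override from the Design 1 commit, gives a compact mechanism selector. The following is a Python sketch of that logic, not the contents of c2_gate.h; the function names are hypothetical, an override of 0 is assumed to mean "no override", and the `fa_window_cfg` value in the usage lines is an illustrative number.

```python
OVERRIDE_MARGIN = 256  # tokens added on top of the compressed prompt (from the PR)


def compute_fa_window_override(effective_prompt_tokens, compressed):
    """Per-request window: cover the whole compressed prompt, or 0 for no override."""
    return effective_prompt_tokens + OVERRIDE_MARGIN if compressed else 0


def c2_allows_spec_decode(fa_window_override, fa_window_cfg):
    """Same math as the commit: (override == 0) || (override <= 2 * cfg)."""
    return fa_window_override == 0 or fa_window_override <= 2 * fa_window_cfg


def pick_mechanism(effective_prompt_tokens, compressed, fa_window_cfg):
    """Choose spec-decode when the gate passes, otherwise fall back to AR."""
    override = compute_fa_window_override(effective_prompt_tokens, compressed)
    return "spec-decode" if c2_allows_spec_decode(override, fa_window_cfg) else "ar"


FA_WINDOW_CFG = 2048  # illustrative value only
print(pick_mechanism(3_000, True, FA_WINDOW_CFG))   # small compressed prompt
print(pick_mechanism(71_110, True, FA_WINDOW_CFG))  # large compressed prompt
```

The gate only selects the decode mechanism; in both branches every kept token stays visible, which matches the "we choose mechanism, not visibility" framing above.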

…default 0.10)

- Gate context-window admission on post-compression effective size, not raw, so >128K-raw prompts compress to fit max_ctx instead of 400 / oversized KV reservation.
- Pre-compression keep-ratio sanity guard (raw*keep+max_out>max_ctx); the real effective-size gate runs post-compression in worker_loop.
- Default prefill-keep-ratio 0.05 -> 0.10: real ~2x compression on agentic content (0.25 over-forces anchor-transitive to ~100% = no-op + rejects >128K).
- Evidence (RTX3090, agentic replay, keep=0.10): 167K raw admitted -> 71K eff (42.6%), prefill 145s vs 845s forced; 32-128K real compression; tool-parse intact; 1629 unit asserts green; 14-cell P/PD sweep zero crashes.
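
The two-stage admission check reads naturally as two predicates. This is a minimal Python sketch under stated assumptions: the pre-compression guard uses the inequality quoted in the commit (raw*keep + max_out > max_ctx rejects), while the post-compression form (effective + max_out must fit in max_ctx) and the `max_out` value in the usage lines are assumptions, since the commit does not spell them out.

```python
def passes_keep_ratio_guard(raw_tokens, keep_ratio, max_out, max_ctx):
    """Pre-compression sanity check: even the target keep must fit in max_ctx."""
    return raw_tokens * keep_ratio + max_out <= max_ctx


def admits_effective_size(effective_tokens, max_out, max_ctx):
    """Post-compression gate (assumed form): the compressed prompt plus output must fit."""
    return effective_tokens + max_out <= max_ctx


# Figures from the PR description: raw 166,911 -> effective 71,110 at max_ctx=139264.
RAW, EFFECTIVE, MAX_CTX = 166_911, 71_110, 139_264
MAX_OUT = 4_096  # hypothetical generation budget

print(passes_keep_ratio_guard(RAW, 0.10, MAX_OUT, MAX_CTX))  # guard passes
print(admits_effective_size(EFFECTIVE, MAX_OUT, MAX_CTX))    # admitted after compression
print(admits_effective_size(RAW, MAX_OUT, MAX_CTX))          # a raw-size gate would reject
print(f"{EFFECTIVE / RAW:.1%}")                               # effective keep fraction
```

The last line reproduces the 42.6% effective-size figure quoted for this prompt.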

…tion, default-off)

Pure per-request router that picks pFlash compression per prompt instead of a global keep. Ships DISABLED: the default policy is an exact no-op, so enabling is opt-in and cannot regress recall.

- regime_router.h: pure decide_regime (v1) + decide_v2 (type-gate + sparse-prompt guard + recency floor) + recency_floor_for() size-adaptive helper. 172 unit tests, -Werror clean. Default policy returns FullCascade/full-keep for any input.
- qwen3_drafter.cpp: PFLASH_RECENCY_FLOOR_TOKENS mechanism (0 = off default; -1 = auto = min(1024, 0.04*S)); env_int negative-sentinel fix.
- CMakeLists: test_regime_router target.

Evidence (bench/2026-05-30_*): indexer-prune dead (frac_prune=0 across scorers); the splitter is prompt TYPE, not cascade-expansion (rho=-0.27); sparse-plumbing turns fail under compression but pass at full keep (OFF=100%), so route them to full keep. Recency floor validated zero-sum at fixed keep (breaks controls), kept as off-by-default infra, not the mechanism. Router path: agentic-rich -> ~25% keep (~3x prefill/wall), agentic-sparse -> full keep, retrieval -> cascade. Not yet wired into the live request handler (follow-up).

…n mechanisms

Wires the compression-regime router into the request path (default-OFF via PFLASH_ROUTER_ENABLE, so this commit is a no-op until enabled) and prunes the router down to only what this session validated. Net -433 lines.

Live path (when enabled):

- detect agentic vs retrieval at admission (tool schemas / tool_use|tool_result blocks / tool_calls present). JSON-walking lives at the handler boundary (http_server.cpp); the router header stays pure (stdlib-only).
- agentic -> cascade off, keep_target 0.25 (~3x prefill/wall, content permitting)
- retrieval -> cascade on, full keep
- per-request use_transitive threaded through the compress path.

Pruned (mechanisms disproven this session, see bench/2026-05-30_*):

- v1 cascade-expansion router (R): refuted as keep predictor, Spearman rho=-0.27.
- recency-token floor: validated zero-sum at fixed keep (displaces top-K middle context, breaks controls). Removed from header + drafter.
- decide_v2 sparse-prompt guard (new_content_tokens): does not separate fails-from-passes under compression.

regime_router.h: 271 -> 96 LOC, zero non-stdlib includes, standalone -Werror, 43 unit tests (decide_v2 type-gate + detect_request_type truth-table).

Known follow-ups (NOT in this commit):

1. hard empty/degenerate safety guard (result.tokens.size()/degenerate_decode_close already at the update site) to catch empty responses;
2. reconcile bandit keep range [0.025,0.20] vs router floor 0.25 (bandit currently wins per-session);
3. confident-off-task closure needs client/app feedback, not a label-free reward (Momus-confirmed).

…Luce-Org#10)

Closes the two validated pieces of the adaptive-keep path (the label-free quality-reward idea was dropped; Momus-confirmed it can't catch confident off-task). Default-OFF; router gates these to agentic-routed requests.

- regime_router.h: two pure helpers (stdlib-only, TDD'd):
  - clamp_keep_to_floor(bandit_keep, router_floor, agentic): agentic effective keep = max(bandit_keep, floor) so the bandit's 0.20 ceiling can no longer silently undercut the router's 0.25 floor.
  - compression_failed(tokens, degenerate_close, agentic_compressed, min=8): true on empty/degenerate output of an agentic compressed turn.
- adaptive_keep_ratio.h: per-session recover_full_next flag (+ set/consume).
- http_server.cpp: floor clamp at keep-apply; at the post-generate update site, on compression_failed → skip the bandit update (failure noise) and set the session to full keep for the next turn (deterministic recovery from the empty-response failure class, e.g. LONG_B t10). PFLASH_GUARD_MIN_TOKENS env (default 8) tunes the guard threshold.
- 59 standalone unit tests, -Werror.

LIVE-VALIDATED on RTX 3090 (server up on :18097, 34K-token prompts):

- type-gate: agentic→keep 0.250/cascade-off, retrieval→cascade-on.
- guard recovery loop: turn1 compression_failed→full-keep-next (resp_tokens=13, bandit update skipped); turn2 same session recover_full_next consumed→keep 1.0.
- floor clamp fired: agentic bandit 0.100 < floor 0.250 → 0.250.

Launch config (24GB): GGML_CUDA_NO_VMM=1 + --max-ctx 49152 (139264 KV OOMs the 3090; that was the pre-existing bad_alloc, not this change). Still default-OFF via PFLASH_ROUTER_ENABLE.
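
The floor clamp and the one-shot recovery flag described in this commit fit in a few lines. The Python sketch below mirrors the two helper signatures named above, but the bodies are assumptions: in particular, treating a turn as failed when it is shorter than `min_tokens` or flagged degenerate is a guess at the predicate, and the `Session` class and turn helpers are hypothetical.

```python
def clamp_keep_to_floor(bandit_keep, router_floor, agentic):
    """Agentic requests never go below the router floor; others are untouched."""
    return max(bandit_keep, router_floor) if agentic else bandit_keep


def compression_failed(tokens, degenerate_close, agentic_compressed, min_tokens=8):
    """Assumed predicate: empty/short or degenerate output on a compressed agentic turn."""
    if not agentic_compressed:
        return False
    return tokens < min_tokens or degenerate_close


class Session:
    """Hypothetical per-session state holding the one-shot recovery flag."""

    def __init__(self):
        self.recover_full_next = False

    def consume_recovery(self):
        flag, self.recover_full_next = self.recover_full_next, False
        return flag


def keep_for_turn(session, bandit_keep, router_floor=0.25, agentic=True):
    """Full keep once after a failure, otherwise the floor-clamped bandit keep."""
    if session.consume_recovery():
        return 1.0
    return clamp_keep_to_floor(bandit_keep, router_floor, agentic)


def after_generate(session, tokens, degenerate_close, agentic_compressed):
    """Return True if the bandit should be updated with this turn's outcome."""
    if compression_failed(tokens, degenerate_close, agentic_compressed):
        session.recover_full_next = True  # next turn runs at full keep
        return False                      # failure noise is not fed to the bandit
    return True
```

Keeping the recovery flag one-shot means a single bad turn costs one full-keep prefill and then the session returns to the clamped bandit value.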

Merge latest feat/pflash-drafter-ee7 head and resolve conflicts by preserving the integration stack's pFlash curve/bandit/upstream/SWA behavior while adding the opt-in PFLASH_ROUTER_ENABLE regime router and standalone router test.

Correct the final included-head count after PR Luce-Org#274 advanced; the latest head replaced the prior included head, leaving 21 included and 6 non-ancestor ready PRs.

howard0su reviewed May 24, 2026

Comment thread dflash/src/qwen3/qwen3_loader.cpp Outdated

howard0su reviewed May 24, 2026

Comment thread dflash/bench/run_agentic_ee7_passbv.py Outdated

cubic-dev-ai Bot reviewed May 24, 2026

howard0su reviewed May 24, 2026

Comment thread server/src/common/score_range.h

cubic-dev-ai Bot reviewed May 24, 2026

Comment thread dflash/bench/run_tier1_proof.py Outdated

dusterbloom changed the title ~~feat(drafter): ee7 early-exit drafter saves up to 9.3× drafter wall at 128K~~ feat(pFlash): ee7 early-exit drafter saves up to 9.3× drafter wall at 128K May 24, 2026

cubic-dev-ai Bot reviewed May 24, 2026

Comment thread dflash/bench/run_ee_n_multiclient.sh Outdated

dusterbloom mentioned this pull request May 24, 2026

feat(drafter): ee3 as production default (depends on #274) #275

Draft

cubic-dev-ai Bot reviewed May 26, 2026

dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from 1ffc8e6 to 7810648 May 27, 2026 07:13

dusterbloom mentioned this pull request May 27, 2026

feat(pflash): adaptive keep_ratio bandit MVP #264

Merged

davide221 mentioned this pull request May 27, 2026

chore(uv): refresh lock for relocated workspace members #283

Merged

2 tasks

dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from cde47cd to 5037b28 May 27, 2026 13:14

dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from 5037b28 to 9c9aee9 May 28, 2026 14:18

dusterbloom changed the title ~~feat(pFlash): ee7 early-exit drafter saves up to 9.3× drafter wall at 128K~~ feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS, NIAH cliff fix May 28, 2026

dusterbloom changed the title ~~feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS, NIAH cliff fix~~ feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS May 28, 2026

This was referenced May 28, 2026

pflash + dflash optimization on top of qwen35moe (PR #262) #280

Open

fix(chat_template): append closed <think> prefill in Jinja when thinking is off (Qwen3 only) #293

Closed

dusterbloom added 12 commits May 28, 2026 19:44

refactor(pflash): rename DFLASH_COMPRESS_* → PFLASH_COMPRESS_* (cascade env vars)

94907a4

bench: add eval_quality_compare.py for LongBench F1 regression detection

766e46d

dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from 2c19f66 to e64a2b8 May 28, 2026 17:55

dusterbloom changed the title ~~feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS~~ feat(pflash): prefill compression + >128K admission 2-12× prefill (content-dependent), decode at parity May 29, 2026

dusterbloom changed the title ~~feat(pflash): prefill compression + >128K admission 2-12× prefill (content-dependent), decode at parity~~ feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity May 29, 2026

dusterbloom added 3 commits May 31, 2026 10:15

		CONDITION_SPECS = {"ee7": (7, 7)} # (EARLY_EXIT_N, SCORE_LAYERS)


		def start_server(condition, ctx, log_path, compression_mode="always", keep_ratio=0.05):

Conversation

dusterbloom commented May 24, 2026 (edited)

TL;DR

Adaptive levers

Verified A/B (router ON vs OFF) — 2026-05-31

Reproducibility

Notes

cubic-dev-ai Bot left a comment (edited)

dusterbloom commented May 24, 2026

davide221 commented May 24, 2026 (edited)

dusterbloom commented May 24, 2026

cubic-dev-ai Bot left a comment (edited)

cubic-dev-ai Bot left a comment (edited)

cubic-dev-ai Bot left a comment (edited)

cubic-dev-ai Bot May 26, 2026

cubic-dev-ai Bot May 26, 2026

dusterbloom commented May 26, 2026

cubic-dev-ai Bot commented May 26, 2026

cubic-dev-ai Bot commented May 26, 2026
