
feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity#274

Open
dusterbloom wants to merge 16 commits into Luce-Org:main from dusterbloom:feat/pflash-drafter-ee7

Conversation

Collaborator

@dusterbloom dusterbloom commented May 24, 2026

TL;DR

Adaptive composition stack: pflash compression + dflash spec-decode + ee7 drafter + anchor-transitive cascade + gated rare-token bridge + adaptive anchor_radius (64K NIAH cliff fix) + per-request fa_window_override + C2 mechanism gate + type-gate regime router (default-OFF).

Prefill speedup is content-dependent:

  • ~12× prefill-TPS on sparse/NIAH cells (Round5),
  • ~2.2× wall on dense agentic 64–128K (full-sweep 2026-05-29),
  • 3.00× prefill-TPS / 2.91× wall on real dense >32K agentic turns with the router ON (verified on this branch's tip, 2026-05-31 — see §Verified A/B).

Decode at parity (composition); +35% with the router on (shorter compressed KV).

Adaptive levers

  1. Adaptive anchor_radius (4a838dd): <32K=2 / 32–64K=4 / ≥64K=8 based on n_chunks. Eliminates the 64K NIAH cliff. Override via PFLASH_COMPRESS_ANCHOR_RADIUS.

  2. Per-request fa_window_override (2726608, Design 1): when pflash compresses, http_server sets fa_window_override = effective_prompt + 256. Never caps visibility.

  3. C2 mechanism gate (2726608 + refactor 548be0a): if eff_fa_window ≤ 2 × cfg_.fa_window (= 4096 default), spec-decode runs. Else AR fallback. Extracted to server/src/qwen35/c2_gate.h as a pure tested predicate + 5 unit tests. Zero behavior change.

  4. Effective-size admission gate + keep-ratio guard (8c1f37d): the context-window admission check gates on the post-compression effective size, not the raw prompt — a prompt larger than max_ctx that compresses to fit is admitted (previously a hard 400) without forcing an oversized KV reservation. Default --prefill-keep-ratio 0.05→0.10 (anchors yield ~40% effective; keep=0.25 over-forces the cascade to ~100% = near-no-op). Verified at >128K: raw 166,911 → eff 71,110 (42.6%), admitted at max_ctx=139264.

  5. Type-gate regime router (fbc2d41 / b31544f / 8fc961b, default-OFF behind PFLASH_ROUTER_ENABLE): correct-by-construction admission-time gate. server/src/common/regime_router.h is pure stdlib (96 LOC, 59 standalone -Werror tests). Logic:

    • agentic request (structured tools array / tool_use / tool_calls present) → keep floor 0.25, transitive cascade OFF. The cascade keys on rare-token co-occurrence, and the 151K drafter vocab makes ~every token "rare" → it over-forces dense agentic content to 70–100% keep (a near-no-op). Disabling it + flooring at 0.25 is what delivers the win below.
    • retrieval request → full keep, cascade ON (cascade is correct for QA/NIAH recall).
    • empty/degenerate guard: a compressed agentic turn that returns empty/degenerate skips the keep-bandit and one-shot recovers to full keep next turn (PFLASH_GUARD_MIN_TOKENS, default 8).
    • floor-clamp: the online keep-bandit (PR feat(pflash): adaptive keep_ratio bandit MVP #264) is clamped up to the 0.25 agentic floor so it can't undercut the router.
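The routing rules in lever 5 can be sketched in a few lines. This is a minimal Python illustration, not the code in server/src/common/regime_router.h: the function names, the request shape, and the reading of "full keep" as "the router imposes no floor" are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

AGENTIC_KEEP_FLOOR = 0.25  # agentic floor quoted in the PR text


@dataclass(frozen=True)
class Route:
    kind: str                    # "agentic" or "retrieval"
    keep_floor: Optional[float]  # None: no floor imposed (assumed reading of "full keep")
    cascade: bool                # transitive cascade on/off


def is_agentic(request: dict) -> bool:
    """Hypothetical detector: a non-empty tools array, or any tool_use / tool_calls block."""
    if request.get("tools"):
        return True
    for msg in request.get("messages", []):
        if msg.get("tool_calls"):
            return True
        content = msg.get("content")
        if isinstance(content, list) and any(
            isinstance(part, dict) and part.get("type") == "tool_use" for part in content
        ):
            return True
    return False


def route(request: dict) -> Route:
    if is_agentic(request):
        return Route("agentic", AGENTIC_KEEP_FLOOR, cascade=False)
    return Route("retrieval", None, cascade=True)


def clamp_bandit_keep(bandit_keep: float, r: Route) -> float:
    """Floor-clamp: the bandit may raise keep, but never undercut the router floor."""
    if r.keep_floor is None:
        return bandit_keep
    return max(bandit_keep, r.keep_floor)
```

With this shape, a bandit proposal of 0.10 on an agentic turn is raised to 0.25, while a proposal of 0.40 passes through unchanged.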

Verified A/B (router ON vs OFF) — 2026-05-31

Binary built from this branch tip (8fc961b), verified to contain both the admission-gate (dflash::common::check_admission symbol) and the router. 4 cold-prefill turns on real organic agentic sessions (claude-code transcripts, user turns at 34–37K est tokens, native tool blocks + tools array → agentic path fires). Fresh server per arm (true cold prefill). Run dir: bench/2026-05-31_definitive_ab/run_20260531_111334/.

| Metric | ROUTER OFF | ROUTER ON | Speedup |
|---|---|---|---|
| keep % | 72.3% (44.7–100) | 24.9% (25) | |
| prefill_TPS | 1248 | 3747 | 3.00× |
| wall_s | 64.9 | 22.3 | 2.91× |
| decode_TPS | 20.6 | 27.8 | +35% |
| tool_parse | 4/4 OK | 4/4 OK | held |

All 4 ON turns routed type=agentic, keep=0.25, cascade=off. 0 floor-clamps, 0 guard fires. The admission-gate does not suppress the cascade over-forcing (OFF still 72% avg), so the router win is real on the shipped base. N=4 cold turns; numbers match the earlier-base measurement (2.85×/2.89×) within noise.
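The Speedup column can be re-derived from the two arms. A short Python check that uses only the numbers in the table above (the dictionary keys are just labels for this example):

```python
# Values copied from the router OFF / router ON table.
off = {"prefill_tps": 1248, "wall_s": 64.9, "decode_tps": 20.6}
on = {"prefill_tps": 3747, "wall_s": 22.3, "decode_tps": 27.8}

prefill_speedup = on["prefill_tps"] / off["prefill_tps"]   # higher TPS is better
wall_speedup = off["wall_s"] / on["wall_s"]                # lower wall time is better
decode_gain_pct = (on["decode_tps"] / off["decode_tps"] - 1) * 100

# prints "prefill 3.00x, wall 2.91x, decode +35%"
print(f"prefill {prefill_speedup:.2f}x, wall {wall_speedup:.2f}x, decode {decode_gain_pct:+.0f}%")
```

Note that the throughput ratio is ON/OFF while the wall-time ratio is OFF/ON, so that "greater than 1" means "router ON is better" in both rows.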

Reproducibility

./server/build/dflash_server /path/to/Qwen3.6-27B-Q4_K_M.gguf \
  --draft /path/to/dflash-draft-3.6-q4_k_m.gguf \
  --prefill-drafter /path/to/Qwen3-0.6B-Q8_0.gguf \
  --prefill-compression always --prefill-keep-ratio 0.10 \
  --prefill-threshold 32000 --max-ctx 139264 \
  --chat-template-file qwen3-coder-chat-template.jinja

Env: PFLASH_COMPRESS_ANCHOR_TRANSITIVE=1 PFLASH_DRAFTER_EARLY_EXIT_N=7 PFLASH_DRAFTER_SCORE_LAYERS=7 DFLASH27B_KV_K=tq3_0 DFLASH27B_KV_V=tq3_0. Router: add PFLASH_ROUTER_ENABLE=1.

Bench dashboard: bench/2026-05-28_adaptive_stack/Round5_4arm_matrix_174641/DASHBOARD.md. Public evidence: https://github.com/dusterbloom/pflash-evidence.

Notes

  • The "+47% decode TPS" figure is C_pflash_only vs A_baseline, not D_composition vs A_baseline. Pflash compression delivers it. Composition decode TPS is at parity-or-slightly-below pflash-only because the C2 gate (correctly) disables spec-decode on compressed KV.
  • The router (lever 5) is default-OFF; merging it changes no default behavior. It adds opt-in code + 401 LOC of tests.

Comment thread dflash/src/qwen3/qwen3_loader.cpp Outdated
Comment thread dflash/bench/run_agentic_ee7_passbv.py Outdated
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


12 issues found across 37 files


Comment thread dflash/bench/run_skip_park_32k.py Outdated
Comment thread dflash/bench/skip_park_exp.sh Outdated
Comment thread dflash/bench/run_skip_park_32k.sh Outdated
Comment thread server/test/test_drafter_early_exit_score_range.cpp
Comment thread dflash/bench/results/2026-05-21_ee7_broad/SUMMARY.md Outdated
Comment thread dflash/bench/run_niah_ee7_longctx.py Outdated
Comment thread dflash/bench/run_skip_park_32k.sh Outdated
Comment thread dflash/bench/results/2026-05-22_multiclient_ee7/run_bench.sh Outdated
Comment thread dflash/bench/results/2026-05-21_ee7_broad/raw_results.json Outdated
Comment thread dflash/bench/run_early_exit_bench.py Outdated
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
Replace size-byte heuristic at qwen3_loader.cpp:109 with the
direct gguf_get_tensor_type(ctx, tensor_id) API as suggested
in PR Luce-Org#274 review. Return type of gguf_find_tensor is int64_t;
updated tidx declaration accordingly.
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
Line 301: dp50/base computes the inverse (slower = higher), should be
base/dp50 so that speedup > 1 means ee7 is faster. The headline 1.82x
and 9.29x numbers in PR Luce-Org#274 came from run_niah_ee7_broad.py and
run_niah_ee7_longctx.py which had the correct formula; this script's
printed table was not used as source data for the SUMMARY.md files.
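The direction of the ratio is the whole fix. A minimal sketch, assuming `base` and `dp50` are median drafter wall times in seconds for the baseline and ee7 arms (that meaning is inferred from the commit message, and the sample timings are made up for illustration):

```python
def speedup(base_s: float, ee7_s: float) -> float:
    """Speedup of ee7 over baseline: a value above 1 means ee7 is faster."""
    if ee7_s <= 0:
        raise ValueError("ee7 time must be positive")
    return base_s / ee7_s


base, dp50 = 9.0, 3.0              # illustrative timings only, not measured data
assert speedup(base, dp50) == 3.0  # corrected form: ee7 is 3x faster
inverse = dp50 / base              # the buggy form: a slower arm scores higher
assert inverse < 1.0
```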
@dusterbloom
Collaborator Author

Addressed all cubic + howard0su review feedback. PR body updated with correct Q4_K_M target (was Q3_K_S — factual error) and per-context drafter speedup table (1K-128K + per-client agentic). Howard0su's two fixes applied (gguf_get_tensor_type, passbv docstring). All P1+P2 cubic items fixed across 6 commits.

Comment thread server/src/common/score_range.h
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
…eview)

Pure helper with dflash::common namespace was mis-located under
dflash/src/qwen3/. Move to dflash/src/common/ so qwen35, gemma4, and
future backends can use the same scoring-range logic without
reaching into the qwen3 source dir.
@davide221
Contributor

davide221 commented May 24, 2026

Why layer 7 and not some other layer? Env naming should use PFlash, not DFlash, to make it clearer that this is the prefill drafter.

@dusterbloom
Collaborator Author

Great question, @davide221. I started removing from layer 14 and it worked. Then I pushed further and it worked at layer 7. I stopped there, but I'm benching right now to see whether it works at layer 5 or even 3. Let's see.

As for the DFlash naming, it is a typo. Agreed, I'll rename everything to make it clear this is about prefill!

dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
…RAFTER_* (PR Luce-Org#274 review)

- These envs control the pflash drafter (prefill-flash KV compression) early-exit
  path, not the dflash spec-decode drafter — DFLASH_ prefix was misleading.
- Pure rename across 15 files: qwen3_graph.cpp, score_range.h, all bench scripts,
  two historical SUMMARY docs, and the warm-path regression test.
- No alias kept — breaking change by design, same scope as PFLASH_DRAFTER_SLIM.
- Per davide221 review comment on PR Luce-Org#274.
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 15 files (changes from recent commits).


Comment thread dflash/bench/run_tier1_proof.py Outdated
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
…YERS in run_tier1_proof.py (PR Luce-Org#274 review)

- cubic review caught: prior version set only SCORE_LAYERS so q8_l7 ran full 28 layers, invalidating the layer-subset measurement
- Also clears EARLY_EXIT_N in baseline/Q8 conditions to prevent shell-env leakage
- Historical tier1_proof results in 2026-05-21_tier1_proof/ are NOT being re-run — they served as the spike that discovered the scoring-clamp bug (fixed in be16529/90e6b27); preserved as historical record
- Going forward, anyone re-running the bench gets correct measurements
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
…-7 at S%4096∈{1..7})

Root cause (codex + momus data audit):
  tail_lo = S - n_lookahead; old guard: tail_lo < cs + cl
  When S = k*chunk_size + r (r∈{1..7}, n_lookahead=8), tail_lo+n_lookahead
  = cs + cl + r > cs + cl — ggml_view_3d reads r tokens past allocation.
  Triggered deterministically at S mod 4096 ∈ {1..7}; ~0.2% of real prefills.

Fix (2 lines, both sites in qwen3_graph.cpp:463 and :516):
  - if (tail_lo_nr >= cs && tail_lo_nr < cs + cl)
  + if (tail_lo_nr >= cs && tail_lo_nr + n_lookahead <= cs + cl)

  - if (tail_lo >= cs && tail_lo < cs + cl)
  + if (tail_lo >= cs && tail_lo + n_lookahead <= cs + cl)

Semantics: skip tail-capture when n_lookahead tokens won't fit inside chunk.
Unit test (test_drafter_tail_capture_guard) GREEN; all existing tests PASS.
Resolves the 64K/128K NIAH crash-failures diagnosed in PR Luce-Org#274 review.
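The off-by-r condition in this commit message can be replayed without ggml. The following Python sketch models chunks as (start, length) pairs and compares the old and new guards; the chunking helper and the `overrun` measure are illustrative constructions, not code from qwen3_graph.cpp.

```python
CHUNK_SIZE = 4096   # chunk size implied by "S mod 4096" in the commit message
N_LOOKAHEAD = 8     # n_lookahead value quoted in the commit message


def old_guard(tail_lo: int, cs: int, cl: int) -> bool:
    return cs <= tail_lo < cs + cl


def new_guard(tail_lo: int, cs: int, cl: int, n_lookahead: int = N_LOOKAHEAD) -> bool:
    return tail_lo >= cs and tail_lo + n_lookahead <= cs + cl


def chunks(S: int, chunk_size: int = CHUNK_SIZE):
    """Yield (chunk_start, chunk_len) pairs covering S tokens."""
    for cs in range(0, S, chunk_size):
        yield cs, min(chunk_size, S - cs)


def overrun(S: int, guard) -> int:
    """Tokens a tail view of N_LOOKAHEAD tokens reads past its chunk, if the guard admits it."""
    tail_lo = S - N_LOOKAHEAD
    worst = 0
    for cs, cl in chunks(S):
        if guard(tail_lo, cs, cl):
            worst = max(worst, tail_lo + N_LOOKAHEAD - (cs + cl))
    return worst
```

For S = 2·4096 + 3 the old guard admits the view in the second chunk and `overrun` reports 3 tokens, matching the "reads r tokens past allocation" description; the new guard admits no chunk for that S, so tail capture is skipped.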
@dusterbloom dusterbloom changed the title from "feat(drafter): ee7 early-exit drafter saves up to 9.3× drafter wall at 128K" to "feat(pFlash): ee7 early-exit drafter saves up to 9.3× drafter wall at 128K" on May 24, 2026
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
- NIAH 3/3 at 32K/64K/128K for all four conditions; zero ggml_view_3d asserts
- ee3 drafter speedup: 6.9x@32K, 7.6x@64K, 24.3x@128K vs baseline
- Multi-client drafter_fwd mean: ee3=6.2x vs baseline, 1.84x vs ee7
- accept_rate unavailable (binary lacks bandit feature); wall_s for claude_code: ee3=1.08s vs ee7=1.29s
- Decision: ee3 passes all measurable gates; propose as new default after Luce-Org#274 merges
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 66 files (changes from recent commits).


Comment thread dflash/bench/run_ee_n_multiclient.sh Outdated
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 24, 2026
…cleanup) — runner scripts stay

User directive: PR Luce-Org#274 ships ee7 source only; bench evidence not committed.
Runner scripts in dflash/bench/*.py and dflash/bench/results/*/run_bench*.sh preserved for reproducibility.
Result dirs under dflash/bench/results/ removed (SUMMARY.md, raw_results.json/jsonl, case*.json, vram.csv).
Equivalent results can be regenerated via the kept scripts.
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment


6 issues found


Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="dflash/bench/run_ee_n_sweep_multihop.py">

<violation number="1" location="dflash/bench/run_ee_n_sweep_multihop.py:30">
P2: Context sweep parameter `ctx` is passed to `start_server` but ignored; `--max-ctx` is hardcoded to 139264, making all benchmark runs use the same allocation regardless of labeled context size</violation>
</file>

<file name="dflash/src/qwen3/anchor_scan.cpp">

<violation number="1" location="dflash/src/qwen3/anchor_scan.cpp:105">
P2: `prev_forced` is refreshed at the top of every transitive iteration, making `max_iters > 1` ineffective and preventing multi-iteration cascade expansion. For `it > 0`, `forced[c] && !prev_forced[c]` is always false because `prev_forced = forced` was just executed, so the rare-token worklist is empty and the loop breaks at `forced == prev_forced` before any iteration-1+ work can happen. The intended behavior requires carrying the pre-bottom-scan snapshot from the previous iteration so that newly-forced chunks from `scan_and_force` can be processed in the next rare-token pass.</violation>
</file>


Comment thread pflash/tests/ruler_multihop_gen.py Outdated
Comment thread dflash/bench/run_longbench_hotpotqa.py Outdated
Comment thread dflash/bench/run_longbench_hotpotqa.py Outdated
Comment thread dflash/bench/run_longbench_hotpotqa.py Outdated
Comment thread dflash/bench/run_ee_n_sweep_multihop.py Outdated
CONDITION_SPECS = {"ee7": (7, 7)} # (EARLY_EXIT_N, SCORE_LAYERS)


def start_server(condition, ctx, log_path, compression_mode="always", keep_ratio=0.05):
Contributor


P2: Context sweep parameter ctx is passed to start_server but ignored; --max-ctx is hardcoded to 139264, making all benchmark runs use the same allocation regardless of labeled context size

<file context>
@@ -0,0 +1,276 @@
+CONDITION_SPECS = {"ee7": (7, 7)}  # (EARLY_EXIT_N, SCORE_LAYERS)
+
+
+def start_server(condition, ctx, log_path, compression_mode="always", keep_ratio=0.05):
+    # server accepts off|auto|always; allow "none" as alias for "off"
+    srv_compression = "off" if compression_mode == "none" else compression_mode
</file context>

// Cascade loop: expand pool with newly-forced tokens and re-scan.
std::vector<uint8_t> prev_forced;
for (int it = 0; it < max_iters; ++it) {
    prev_forced = forced;
Contributor


P2: prev_forced is refreshed at the top of every transitive iteration, making max_iters > 1 ineffective and preventing multi-iteration cascade expansion. For it > 0, forced[c] && !prev_forced[c] is always false because prev_forced = forced was just executed, so the rare-token worklist is empty and the loop breaks at forced == prev_forced before any iteration-1+ work can happen. The intended behavior requires carrying the pre-bottom-scan snapshot from the previous iteration so that newly-forced chunks from scan_and_force can be processed in the next rare-token pass.

<file context>
@@ -0,0 +1,169 @@
+    // Cascade loop: expand pool with newly-forced tokens and re-scan.
+    std::vector<uint8_t> prev_forced;
+    for (int it = 0; it < max_iters; ++it) {
+        prev_forced = forced;
+
+        // Rare-token single-match: worklist-driven so cascades within a pass are
</file context>

dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 26, 2026
The committed CONDITION_SPECS in 0e55f48 picked up a temporary {"ee7": (14, 14)}
override that was set during an ee14 sweep bench run, instead of the intended
{"ee7": (7, 7)}. Restores ee7 as the default condition value.

Caught by cubic-dev-ai inline review on PR Luce-Org#274.
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 26, 2026
…-7 at S%4096∈{1..7})

Root cause (codex + momus data audit):
  tail_lo = S - n_lookahead; old guard: tail_lo < cs + cl
  When S = k*chunk_size + r (r∈{1..7}, n_lookahead=8), tail_lo+n_lookahead
  = cs + cl + r > cs + cl — ggml_view_3d reads r tokens past allocation.
  Triggered deterministically at S mod 4096 ∈ {1..7}; ~0.2% of real prefills.

Fix (2 lines, both sites in qwen3_graph.cpp:463 and :516):
  - if (tail_lo_nr >= cs && tail_lo_nr < cs + cl)
  + if (tail_lo_nr >= cs && tail_lo_nr + n_lookahead <= cs + cl)

  - if (tail_lo >= cs && tail_lo < cs + cl)
  + if (tail_lo >= cs && tail_lo + n_lookahead <= cs + cl)

Semantics: skip tail-capture when n_lookahead tokens won't fit inside chunk.
Unit test (test_drafter_tail_capture_guard) GREEN; all existing tests PASS.
Resolves the 64K/128K NIAH crash-failures diagnosed in PR Luce-Org#274 review.
@dusterbloom
Collaborator Author

@cubic-dev-ai re: prev_forced is refreshed at the top of every transitive iteration:

Looked at this carefully — the implementation handles the case you're describing, just differently than the diff suggests. Two pieces:

  1. anchor_scan.cpp:115-120 seeds the worklist from forced directly on it==0, not from forced && !prev_forced. So pass-1's anchor results enter the rare-token worklist regardless of the snapshot.

  2. anchor_scan.cpp:133 makes the worklist self-cascading within a single iteration. When a rare-token hop forces a new chunk, that chunk is push_back'd into the same worklist and processed by the same for-loop (line 121 indexes a growing vector by wi). The header comment on lines 107-108 says it: "worklist-driven so cascades within a pass are caught (e.g. hop3 forces hop2 which forces hop1 in one outer iteration)."

So the rare-token closure transitively exhausts in a single outer iteration by design. The purpose of it > 0 is the 4-gram re-scan at lines 148-159, which is gated by a fresh prev_forced reassignment on line 158 right before scan_and_force runs — so iteration N+1's rare-token phase does see new chunks from the 4-gram re-scan via the it==0 all-forced seeding mechanism applied at the start.

T6 in test_anchor_transitive.cpp:298-344 is the executable proof: it builds a 20-link rare-token chain with max_iters=25 and max_forced_count=5, and asserts the cap engages. If the worklist were empty after it==0 as your comment describes, the chain couldn't propagate past chunk 0 and the cap would never fire. T6 passes — the in-iteration worklist cascades chunks 0→1→...→N until the cap stops it.

Not a bug — resolving as by design. Will add a clarifying inline comment if the area is touched again.
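For readers without the source open, the self-cascading worklist described above looks roughly like this. It is a Python sketch of the pattern only: the `links` map stands in for the real rare-token matching, and how the real cap counts forced chunks is an assumption.

```python
def rare_token_closure(seed_chunks, links, max_forced_count):
    """Force every chunk reachable from the seeds, stopping once the cap is hit.

    `links[c]` lists the chunks that chunk `c` forces through a shared rare token
    (a stand-in for the real rare-token match, which is not reproduced here).
    Returns (forced_set, cap_engaged).
    """
    forced = set(seed_chunks)
    worklist = list(seed_chunks)
    wi = 0
    while wi < len(worklist):          # the list grows while we index into it
        chunk = worklist[wi]
        wi += 1
        for nxt in links.get(chunk, ()):
            if nxt in forced:
                continue
            if len(forced) >= max_forced_count:
                return forced, True    # cap engaged
            forced.add(nxt)
            worklist.append(nxt)       # newly forced chunk is handled in the same pass
    return forced, False


# A 20-link chain in the spirit of T6: chunk i forces chunk i+1.
chain = {i: [i + 1] for i in range(20)}
forced, capped = rare_token_closure([0], chain, max_forced_count=5)
assert capped and forced == {0, 1, 2, 3, 4}
```

Because appended chunks are visited by the same loop, the whole chain propagates in one call; no second outer iteration is needed for the rare-token closure.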

@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented May 26, 2026

@cubic-dev-ai re: prev_forced is refreshed at the top of every transitive iteration:

Looked at this carefully — the implementation handles the case you're describing, just differently than the diff suggests. Two pieces:

  1. anchor_scan.cpp:115-120 seeds the worklist from forced directly on it==0, not from forced && !prev_forced. So pass-1's anchor results enter the rare-token worklist regardless of the snapshot.
    ...

@dusterbloom You're right — this is by design, not a bug. T6 is the definitive proof.

The two mechanisms you describe cover the concern fully:

  1. it==0 seeding (lines 115-120): seeding from all forced chunks on the first iteration ensures pass-1's 4-gram anchor results enter the rare-token worklist immediately, regardless of the prev_forced snapshot. The snapshot only gates the 4-gram re-scan in subsequent iterations, not the rare-token cascade.

  2. Self-cascading worklist (line 133 worklist.push_back inside the loop): the rare-token phase exhausts its own transitive closure within a single outer iteration because newly-forced chunks get appended to the same worklist and processed by the same wi loop. This is the same pattern as BFS with a FIFO queue — no need for multiple outer iterations to propagate across hops.

The it > 0 iterations are there for the 4-gram re-scan (lines 148-159), which does need prev_forced to detect delta between the expanded pool from the rare-token cascade and the previous round's forced set. That checkpoint on line 158 (prev_forced = forced) is correct — it happens after the rare-token phase completes for that iteration, so the next iteration's 4-gram scan sees the full set.

If T6 passes with a 20-link chain and a cap of 5, the implementation is sound. No bug here.


dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 26, 2026
- Rename ttft_s → latency_s in run_longbench_hotpotqa.py (stream=False
  measures total completion time, not TTFT)
- Same rename in run_ee_n_sweep_multihop.py (identical stream=False pattern
  found in audit)
- Add empty-input guard in run_longbench_hotpotqa.py (sys.exit on empty cases)
- Drop unused ctx parameter from run_ee_n_sweep_multihop.py start_server

Cubic flags P2 quality issues from inline review on commit 0e55f48.
@cubic-dev-ai
Contributor

cubic-dev-ai Bot commented May 26, 2026

You're iterating quickly on this pull request. To help protect your rate limits, cubic has paused automatic reviews on new pushes for now—when you're ready for another review, comment @cubic-dev-ai review.

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 26, 2026
…s on Luce-Org#274)

PFLASH_DRAFTER_EARLY_EXIT_N=3 PFLASH_DRAFTER_SCORE_LAYERS=3 is the production
default after ee_n sweep: 6.9x@32K, 24.3x@128K, accept_rate +1.2 pp vs ee7.
Reproduce via dflash/bench/run_ee_n_sweep.sh + run_ee_n_multiclient.sh.
@dusterbloom dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from 1ffc8e6 to 7810648 on May 27, 2026 07:13
dusterbloom added a commit to dusterbloom/lucebox-hub that referenced this pull request May 27, 2026
Per-session ε-greedy bandit that adjusts compression based on observed
accept_rate. Opt-in via session_id; clients without it get the existing
fixed-keep path, byte-identical to main.

Includes:
- Bandit state machine (LRU-bounded session map, cap 1024)
- HTTP server session_id parsing + bandit hook
- accept_rate plumbing from DFlash GenerateResult
- CI submodule PAT fix for fork PRs
- Harness session_id env-var wiring

5-turn trajectory + NIAH @16K/32K + 3-seed A/B/C evidence
(reproducible via the follow-up bench PR; not committed here).

Bench scripts + result artifacts split to follow-up PR.
Bug Luce-Org#42 tail-capture fix moved to PR Luce-Org#274.
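A per-session ε-greedy bandit with an LRU-bounded session map can be sketched as below. This is an illustration of the general technique in Python, not the MVP's state machine: the arm set, the use of raw accept_rate as the reward, and the class and method names are all assumptions.

```python
import random
from collections import OrderedDict

KEEP_ARMS = (0.05, 0.10, 0.25)   # hypothetical arm set; the commit does not list the real one
MAX_SESSIONS = 1024              # LRU cap quoted in the commit message


class KeepBandit:
    def __init__(self, epsilon=0.1, max_sessions=MAX_SESSIONS, rng=None):
        self.epsilon = epsilon
        self.max_sessions = max_sessions
        self.rng = rng or random.Random()
        self.sessions = OrderedDict()   # session_id -> {arm: [pulls, mean_reward]}

    def _state(self, session_id):
        if session_id in self.sessions:
            self.sessions.move_to_end(session_id)          # mark as most recently used
        else:
            self.sessions[session_id] = {a: [0, 0.0] for a in KEEP_ARMS}
            if len(self.sessions) > self.max_sessions:
                self.sessions.popitem(last=False)          # evict least recently used
        return self.sessions[session_id]

    def choose(self, session_id):
        state = self._state(session_id)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(KEEP_ARMS)              # explore
        return max(KEEP_ARMS, key=lambda a: state[a][1])   # exploit best mean so far

    def update(self, session_id, arm, accept_rate):
        state = self._state(session_id)
        pulls, mean = state[arm]
        pulls += 1
        state[arm] = [pulls, mean + (accept_rate - mean) / pulls]   # running mean
```

A caller would run `choose` before compression and `update` once the turn's accept_rate is known; requests without a session_id would bypass the bandit entirely and use the fixed keep ratio, as the commit message describes.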
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
Record the current squashed PR Luce-Org#274 head as integrated; resolved the CMake conflict by retaining the existing adaptive bandit tests from PR Luce-Org#264 while keeping the already-integrated early-exit drafter files.
@dusterbloom dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from cde47cd to 5037b28 on May 27, 2026 13:14
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 27, 2026
Update the integration manifest after merging the latest PR Luce-Org#274 head (adaptive anchor radius and PFLASH_COMPRESS env rename). Record a fresh PR Luce-Org#266 worktree conflict attempt and current blocked classifications.
@dusterbloom dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from 5037b28 to 9c9aee9 on May 28, 2026 14:18
@dusterbloom dusterbloom changed the title from "feat(pFlash): ee7 early-exit drafter saves up to 9.3× drafter wall at 128K" to "feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS, NIAH cliff fix" on May 28, 2026
@dusterbloom dusterbloom changed the title from "feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS, NIAH cliff fix" to "feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS" on May 28, 2026
…g-42 tail-capture guard

ee7 truncates drafter forward at layer 7 of 28, scoring only those layers.
9.3× drafter wall at 128K (RTX 3090, Qwen3.6-27B-Q4_K_M target + Qwen2.5-0.5B-BF16 drafter).
Anchor-transitive cascade rescues multi-hop on bimodal-density prompts (gated, default OFF).
Bug Luce-Org#42 fix: tail-capture view-bounds guard at S%4096 in {1..7}.

5 unit tests included. Bench scripts split to follow-up PR.
At >=32K context the needle text is more likely to straddle multiple
chunks (chunk_size=32), and the fixed anchor_radius=2 window (5 chunks
~160 tokens) loses the back half of the needle digits — the model
retrieves '...is 4' but truncates/hallucinates the continuation.

Adaptive scaling based on n_chunks:
  <32K  context (<1024 chunks): radius=2,  max_anchor_hits=8   (unchanged)
  32-64K (1024-2047 chunks):    radius=4,  max_anchor_hits=16
  >=64K (>=2048 chunks):        radius=8,  max_anchor_hits=32

Override via PFLASH_COMPRESS_ANCHOR_RADIUS / PFLASH_COMPRESS_MAX_ANCHOR_HITS
env vars (legacy DFLASH_COMPRESS_* names still accepted).

Validated at 49K context: NIAH needle 'kowefada 1596346' correctly
retrieved (was: '1594' or hallucinated 'is 048394839483' before fix).
Resolves the long-standing 'project_64k_quality_cliff' memory entry.
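The scaling table above maps directly to a small lookup. A Python sketch follows; the floor division used to get n_chunks, the env parsing, and the `window_tokens` helper (derived from the "5 chunks ~160 tokens" remark, read as 2·radius + 1 chunks) are assumptions, while the thresholds and env names come from the commit message.

```python
import os

CHUNK_SIZE = 32  # tokens per chunk, as stated in the commit message


def anchor_params(n_tokens, env=os.environ):
    """Return (anchor_radius, max_anchor_hits) for a prompt of n_tokens."""
    n_chunks = n_tokens // CHUNK_SIZE
    if n_chunks < 1024:        # < 32K context
        radius, hits = 2, 8
    elif n_chunks < 2048:      # 32K to 64K
        radius, hits = 4, 16
    else:                      # >= 64K
        radius, hits = 8, 32
    # Env overrides win over the adaptive defaults (parsing details are assumed).
    if "PFLASH_COMPRESS_ANCHOR_RADIUS" in env:
        radius = int(env["PFLASH_COMPRESS_ANCHOR_RADIUS"])
    if "PFLASH_COMPRESS_MAX_ANCHOR_HITS" in env:
        hits = int(env["PFLASH_COMPRESS_MAX_ANCHOR_HITS"])
    return radius, hits


def window_tokens(radius):
    """Tokens covered by an anchor window of `radius` chunks on each side."""
    return (2 * radius + 1) * CHUNK_SIZE
```

Under these assumptions the 49K validation prompt lands in the middle tier (radius 4, a 288-token window) rather than the 160-token window that truncated the needle.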
Mirror the gemma4_backend.cpp:75-104 defensive pattern for the qwen35
target loader and the dflash decode draft loader. After loading weight
tensors, derive head_dim / n_head / n_head_kv from wq->ne[1] /
wk->ne[1] and compare against GGUF-declared values; set_last_error
and return false on mismatch.

Makes the 'stale scalar at graph-build time' bug class structurally
impossible. Load-time only, no runtime cost. Existing well-formed
GGUFs are unaffected (smoke verified).
When pflash compresses, set gen_req.fa_window_override =
effective_prompt + 256 so spec-decode verify sees the entire
compressed prompt. Pflash already paid compute to pick which tokens
matter; verify never throws any of them away.

When the override would exceed 2 * cfg_.fa_window (spec-decode's
drafter cost stops earning its tok/J), the C2 gate in
qwen35_backend's generate() falls back to AR (fa_window=0, full
attention). AR sees every kept token at every context; we choose
mechanism, not visibility.

Zero new CLI flags. --draft remains the only knob for composition;
all per-request adaptation is internal.
…scade default-on

Adds backwards-compat fallback wrappers for 6 cascade env vars in both
standard and bandit code paths, so harness scripts using either spelling
work against this binary. Emits one-time WARN to stderr when the legacy
DFLASH_* spelling is honored.

Also flips the default for `use_transitive` from `false` to `true` because
the gated rare-token bridge improves multi-hop F1 with zero downside in
the cascade-already-firing case.
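The fallback-wrapper pattern described in the first paragraph of this commit message is sketched below in Python. The helper name, the precedence when both spellings are set (new name wins), and the WARN wording are assumptions; only the PFLASH_/DFLASH_ prefixes and the one-time stderr warning come from the commit message.

```python
import os
import sys

_warned = set()  # legacy names already reported, so each WARN prints once


def read_env(suffix, default=None, env=os.environ, err=sys.stderr):
    """Look up PFLASH_<suffix>, falling back to the legacy DFLASH_<suffix>."""
    new_name, old_name = f"PFLASH_{suffix}", f"DFLASH_{suffix}"
    if new_name in env:
        return env[new_name]
    if old_name in env:
        if old_name not in _warned:
            _warned.add(old_name)
            print(f"WARN: {old_name} is deprecated, use {new_name}", file=err)
        return env[old_name]
    return default
```

A harness that still exports `DFLASH_COMPRESS_ANCHOR_RADIUS=8` would get the value "8" from `read_env("COMPRESS_ANCHOR_RADIUS")` plus a single warning, however many times the variable is read.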
…th drift

Single helper reads all 10 PFLASH_*/DFLASH_* env vars once. Both
qwen35_score_and_compress and drafter_score_and_compress call it.
Removes two 70-LOC duplicate env-reading blocks and the duplicated
anchor-radius comment. Also removes dead force_chunk_neighborhood
(no callers) and collapses the 4-overload load_drafter pyramid to
one canonical implementation + 3 thin forwarders.
- qwen3_graph.cpp: collapse 18-line alg-note, trim VRAM prose (3 blocks),
  remove early_exit_n alias (inline early_exit_pre at call site)
- qwen35_backend.cpp: C2 gate 9-line → 2-line + docs ref;
  do_ar_decode budget-hook 15-line → 4-line + docs ref
- http_server.cpp: Design 1 rationale 13-line → 2-line + docs ref
- model_backend.h: BudgetHook 23-line essay → 3-line + docs ref
- gguf_target_loader.cpp: 4-line prose tail → 1-line
- .gitignore: ignore *.git-head / *.pre-pflash-rename workdir artifacts
- docs/: pflash-compress-cfg.md, pflash-adaptive-composition.md,
  anchor-transitive.md (consolidated rationale)
…nking is off

The hard-coded renderer appends a closed think prefill when thinking is
disabled. Some Qwen3.6 Jinja templates omit that final assistant suffix,
leaving the model in the wrong decoding state for tool use. Mirror the
hard-coded behavior here when the rendered prompt ends with a bare
assistant generation prompt; tolerate trailing-whitespace variants
(single \n, double \n\n, trailing space).

Diagnosed by Round 5b D peer-chat showing dflash drafter accept_rate=0.0%:
the drafter was distilled with the closed-think suffix in its training
distribution; the Unsloth Qwen3-Coder template doesn't emit it, so target
and drafter disagree on what comes after <|im_start|>assistant\n.
… only

The previous commit applied the closed-think suffix to all Jinja-rendered
prompts. Add arch_hint (ChatFormat) parameter to render_chat_template_jinja,
defaulting to QWEN3, and guard the post-processing block with
arch_hint == ChatFormat::QWEN3. Call site in http_server.cpp passes
chat_format_ so other archs (Laguna, Gemma4) are unaffected. qwen35moe
inherits ChatFormat::QWEN3 by design (matches drafter distillation).

5 unit tests cover: thinking-off appends, thinking-on no-append, non-Qwen3
arch no-append (Laguna + Gemma4), qwen35moe inherits QWEN3, no double-append
when template already closes the think block.

Diagnosis + verification protocol in docs/pflash-drafter-template-alignment.md.
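The post-processing step from these two commits can be illustrated as follows. This Python sketch is not the C++ in render_chat_template_jinja: the exact suffix string is an assumption (the empty think block that Qwen3 templates emit when thinking is disabled), as is the choice to normalise trailing whitespace to a single newline before appending.

```python
import re
from enum import Enum


class ChatFormat(Enum):
    QWEN3 = "qwen3"
    LAGUNA = "laguna"
    GEMMA4 = "gemma4"


ASSISTANT_PROMPT = "<|im_start|>assistant"
# Assumed suffix: the empty think block Qwen3 templates emit when thinking is disabled.
CLOSED_THINK = "<think>\n\n</think>\n\n"
# A bare generation prompt, optionally followed by "\n", "\n\n" or trailing spaces.
_BARE_TAIL = re.compile(re.escape(ASSISTANT_PROMPT) + r"[ \n]*\Z")


def align_prompt(rendered, thinking, arch_hint=ChatFormat.QWEN3):
    """Append the closed think block when a Qwen3 prompt ends on a bare assistant turn."""
    if thinking or arch_hint is not ChatFormat.QWEN3:
        return rendered
    match = _BARE_TAIL.search(rendered)
    if match is None:
        return rendered            # template already emitted something after the prompt
    return rendered[: match.start()] + ASSISTANT_PROMPT + "\n" + CLOSED_THINK
```

Anchoring the pattern at the end of the string is what prevents a double append: once the think block is present, the prompt no longer ends on a bare assistant turn and the function returns its input unchanged.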
Extract the C2 spec-decode gate from an inline expression in
qwen35_backend.cpp into a pure predicate header c2_gate.h.

Zero behavior change. Identical math:
  (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)

The new header documents the empirically-derived rationale: at
compressed KV sizes (pflash compression of long prompts), T_draft/T_target
ratio approaches 1, eliminating spec-decode's profit margin over AR.
Measured on the D_composition 128K replay: AR=27.5 tok/s vs forced
spec-decode=5.74 tok/s. The gate therefore blocks spec-decode (AR fallback) when
eff_fa_window > 2*fa_window_cfg.

Adds 5 unit tests locking in the predicate's behavior with explicit
Round 5 4-arm matrix bench citations.

Files:
- server/src/qwen35/c2_gate.h (new)
- server/src/qwen35/qwen35_backend.cpp (+1 include, inline -> call)
- server/test/test_server_unit.cpp (+60 LOC, 5 tests)
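
For readers without the header at hand, the predicate quoted above can be written as a small pure function. This is a sketch only: the function name `c2_allows_spec_decode` and the integer types are assumptions, and the 2048 used in the checks follows from the PR description's "2 × cfg_.fa_window (= 4096 default)".

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical name and signature; mirrors the expression in the commit message:
//   (fa_window_override == 0) || (fa_window_override <= 2 * fa_window_cfg)
// An override of 0 means "no override", so spec-decode stays enabled.
constexpr bool c2_allows_spec_decode(std::int64_t fa_window_override,
                                     std::int64_t fa_window_cfg) {
    return fa_window_override == 0 || fa_window_override <= 2 * fa_window_cfg;
}

inline void c2_gate_example() {
    assert(c2_allows_spec_decode(0, 2048));      // no override: spec-decode runs
    assert(c2_allows_spec_decode(4096, 2048));   // exactly 2x: still allowed
    assert(!c2_allows_spec_decode(4097, 2048));  // above 2x: AR fallback
}
```

Keeping the predicate free of backend state is what makes it testable in isolation, which matches the "pure predicate header" intent stated in the commit.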
@dusterbloom dusterbloom force-pushed the feat/pflash-drafter-ee7 branch from 2c19f66 to e64a2b8 Compare May 28, 2026 17:55
…default 0.10)

- Gate context-window admission on post-compression effective size, not raw size, so
  >128K-raw prompts that compress to fit max_ctx are admitted instead of being
  rejected with a 400 or forcing an oversized KV reservation.
- Pre-compression keep-ratio sanity guard (raw*keep+max_out>max_ctx); the real
  effective-size gate runs post-compression in worker_loop.
- Default prefill-keep-ratio 0.05 -> 0.10: real ~2x compression on agentic content
  (0.25 over-forces anchor-transitive to ~100% = no-op + rejects >128K).
- Evidence (RTX3090, agentic replay, keep=0.10): 167K raw admitted -> 71K eff (42.6%),
  prefill 145s vs 845s forced; 32-128K real compression; tool-parse intact; 1629 unit asserts green; 14-cell P/PD sweep zero crashes.
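
The two checks in this commit can be illustrated with the numbers from the evidence line. The sketch below is an assumption-laden illustration: the function names are hypothetical, `max_out = 1024` is an arbitrary example value, and adding `max_out` to the effective size in the post-compression gate is an assumption (the commit only states that the gate uses the effective size).

```cpp
#include <cassert>
#include <cstdint>

// Pre-compression sanity guard from the commit: raw*keep + max_out > max_ctx.
// Returns true when the request should be rejected before compressing.
inline bool keep_ratio_guard_rejects(std::int64_t raw_tokens, double keep_ratio,
                                     std::int64_t max_out, std::int64_t max_ctx) {
    return static_cast<double>(raw_tokens) * keep_ratio +
               static_cast<double>(max_out) >
           static_cast<double>(max_ctx);
}

// Post-compression gate (assumed form): admit when the effective prompt plus
// the output budget fits in the context window.
inline bool effective_size_admits(std::int64_t effective_tokens,
                                  std::int64_t max_out, std::int64_t max_ctx) {
    return effective_tokens + max_out <= max_ctx;
}

inline void admission_example() {
    const std::int64_t raw = 166911, eff = 71110, max_ctx = 139264;
    const std::int64_t max_out = 1024;  // hypothetical output budget
    assert(!keep_ratio_guard_rejects(raw, 0.10, max_out, max_ctx));  // passes guard
    assert(effective_size_admits(eff, max_out, max_ctx));            // admitted
    assert(!effective_size_admits(raw, max_out, max_ctx));           // raw would not fit
}
```

Gating on the raw size corresponds to the last check, which is the case the commit describes as previously rejected.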
@dusterbloom dusterbloom changed the title feat(pflash): adaptive composition — 13× prefill TPS, +47% decode TPS feat(pflash): prefill compression + >128K admission 2-12× prefill (content-dependent), decode at parity May 29, 2026
@dusterbloom dusterbloom changed the title feat(pflash): prefill compression + >128K admission 2-12× prefill (content-dependent), decode at parity feat(pflash): prefill compress up to 128k -> 2-12× prefill (content-dependent), decode at parity May 29, 2026
…tion, default-off)

Pure per-request router that picks pFlash compression per prompt instead of a
global keep. Ships DISABLED: the default policy is an exact no-op, so enabling
is opt-in and cannot regress recall.

- regime_router.h: pure decide_regime (v1) + decide_v2 (type-gate + sparse-prompt
  guard + recency floor) + recency_floor_for() size-adaptive helper. 172 unit
  tests, -Werror clean. Default policy returns FullCascade/full-keep for any input.
- qwen3_drafter.cpp: PFLASH_RECENCY_FLOOR_TOKENS mechanism (0 = off default;
  -1 = auto = min(1024, 0.04*S)); env_int negative-sentinel fix.
- CMakeLists: test_regime_router target.
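
The PFLASH_RECENCY_FLOOR_TOKENS semantics described in this commit (0 = off, -1 = auto = min(1024, 0.04*S)) might resolve to a token count roughly like this. The function name is hypothetical, treating every negative value as "auto" is an assumption, and the 4% is computed with integer arithmetic (rounding down), which the commit does not specify.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical resolver for the recency floor setting.
//   setting == 0 : mechanism off (the default)
//   setting <  0 : auto, min(1024, 4% of prompt tokens); assumption: any negative
//   setting >  0 : explicit token count
inline std::int64_t resolve_recency_floor(std::int64_t setting,
                                          std::int64_t prompt_tokens) {
    if (setting == 0) return 0;
    if (setting < 0) {
        const std::int64_t four_percent = prompt_tokens * 4 / 100;  // rounds down
        return std::min<std::int64_t>(1024, four_percent);
    }
    return setting;
}

inline void recency_floor_example() {
    assert(resolve_recency_floor(0, 100000) == 0);      // off by default
    assert(resolve_recency_floor(-1, 10000) == 400);    // 4% of a 10K prompt
    assert(resolve_recency_floor(-1, 100000) == 1024);  // capped at 1024
}
```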

Evidence (bench/2026-05-30_*): indexer-prune dead (frac_prune=0 across scorers);
the splitter is prompt TYPE, not cascade-expansion (rho=-0.27); sparse-plumbing
turns fail under compression but pass at full keep (OFF=100%), so route them to
full keep. Recency floor validated zero-sum at fixed keep (breaks controls),
kept as off-by-default infra, not the mechanism. Router path: agentic-rich ->
~25% keep (~3x prefill/wall), agentic-sparse -> full keep, retrieval -> cascade.
Not yet wired into the live request handler (follow-up).
…n mechanisms

Wires the compression-regime router into the request path (default-OFF via
PFLASH_ROUTER_ENABLE, so this commit is a no-op until enabled) and prunes the
router down to only what this session validated. Net -433 lines.

Live path (when enabled):
- detect agentic vs retrieval at admission (tool schemas / tool_use|tool_result
  blocks / tool_calls present). JSON-walking lives at the handler boundary
  (http_server.cpp); the router header stays pure (stdlib-only).
- agentic -> cascade off, keep_target 0.25 (~3x prefill/wall, content permitting)
- retrieval -> cascade on, full keep
- per-request use_transitive threaded through the compress path.
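
A compact way to picture the live path above is a two-step pure function: classify the request, then map the class to a regime. The sketch below uses hypothetical types (`RequestSignals`, `RegimeDecision`) and function names, and encodes "full keep" as a keep target of 1.0, which is an assumption; the real `detect_request_type` and `decide_v2` signatures are not shown in this PR text.

```cpp
#include <cassert>

enum class RequestType { Agentic, Retrieval };

// Hypothetical flags that a handler could extract while walking the JSON body.
struct RequestSignals {
    bool has_tool_schemas = false;
    bool has_tool_blocks = false;  // tool_use / tool_result blocks
    bool has_tool_calls = false;
};

struct RegimeDecision {
    bool use_transitive;  // anchor-transitive cascade on/off
    double keep_target;   // 1.0 = full keep (assumed encoding)
};

inline RequestType classify_request(const RequestSignals& s) {
    const bool agentic = s.has_tool_schemas || s.has_tool_blocks || s.has_tool_calls;
    return agentic ? RequestType::Agentic : RequestType::Retrieval;
}

inline RegimeDecision decide_regime_sketch(bool router_enabled, RequestType type) {
    // Router disabled (the default): cascade on, full keep, for any input.
    if (!router_enabled) return {true, 1.0};
    if (type == RequestType::Agentic) return {false, 0.25};  // cascade off
    return {true, 1.0};                                      // retrieval
}

inline void router_example() {
    RequestSignals s;
    s.has_tool_calls = true;
    const RegimeDecision d = decide_regime_sketch(true, classify_request(s));
    assert(!d.use_transitive && d.keep_target == 0.25);
}
```

Keeping the JSON inspection outside these functions mirrors the split described in the commit: the handler produces plain flags, and the router logic stays dependency-free.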

Pruned (mechanisms disproven this session, see bench/2026-05-30_*):
- v1 cascade-expansion router (R): refuted as keep predictor, Spearman rho=-0.27.
- recency-token floor: validated zero-sum at fixed keep (displaces top-K middle
  context, breaks controls). Removed from header + drafter.
- decide_v2 sparse-prompt guard (new_content_tokens): does not separate
  fails-from-passes under compression.

regime_router.h: 271 -> 96 LOC, zero non-stdlib includes, standalone -Werror,
43 unit tests (decide_v2 type-gate + detect_request_type truth-table).

Known follow-ups (NOT in this commit): (1) hard empty/degenerate safety guard
(result.tokens.size()/degenerate_decode_close already at the update site) to
catch empty responses; (2) reconcile bandit keep range [0.025,0.20] vs router
floor 0.25 (bandit currently wins per-session); (3) confident-off-task closure
needs client/app feedback, not a label-free reward (Momus-confirmed).
…Luce-Org#10)

Closes the two validated pieces of the adaptive-keep path (the label-free
quality-reward idea was dropped — Momus-confirmed it can't catch confident
off-task). Default-OFF; router gates these to agentic-routed requests.

- regime_router.h: two pure helpers (stdlib-only, TDD'd) —
  clamp_keep_to_floor(bandit_keep, router_floor, agentic): agentic effective
    keep = max(bandit_keep, floor) so the bandit's 0.20 ceiling can no longer
    silently undercut the router's 0.25 floor.
  compression_failed(tokens, degenerate_close, agentic_compressed, min=8):
    true on empty/degenerate output of an agentic compressed turn.
- adaptive_keep_ratio.h: per-session recover_full_next flag (+ set/consume).
- http_server.cpp: floor clamp at keep-apply; at the post-generate update site,
  on compression_failed → skip the bandit update (failure noise) and set the
  session to full keep for the next turn (deterministic recovery from the
  empty-response failure class, e.g. LONG_B t10). PFLASH_GUARD_MIN_TOKENS env
  (default 8) tunes the guard threshold.
- 59 standalone unit tests, -Werror.
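
The two helpers named in this commit could look roughly like the following. The parameter lists follow the commit message, but the types, the exact meaning of "empty" (here: fewer than `min_tokens` response tokens), and the behavior for non-agentic requests are assumptions made for this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Agentic requests never go below the router floor; others keep the bandit value.
inline double clamp_keep_to_floor(double bandit_keep, double router_floor,
                                  bool agentic) {
    return agentic ? std::max(bandit_keep, router_floor) : bandit_keep;
}

// True when an agentic, compressed turn produced an empty or degenerate reply.
// Assumption: "empty" means fewer than min_tokens response tokens.
inline bool compression_failed(std::size_t tokens, bool degenerate_close,
                               bool agentic_compressed,
                               std::size_t min_tokens = 8) {
    if (!agentic_compressed) return false;
    return degenerate_close || tokens < min_tokens;
}

inline void guard_example() {
    // Matches the live-validation line: bandit 0.100 < floor 0.250 -> 0.250.
    assert(clamp_keep_to_floor(0.100, 0.250, true) == 0.250);
    assert(compression_failed(0, false, true));    // empty agentic response
    assert(!compression_failed(0, false, false));  // guard ignores other requests
}
```

On a `true` result the commit skips the bandit update and marks the session for full keep on the next turn; that session bookkeeping is left out of this sketch.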

LIVE-VALIDATED on RTX 3090 (server up on :18097, 34K-token prompts):
- type-gate: agentic→keep 0.250/cascade-off, retrieval→cascade-on.
- guard recovery loop: turn1 compression_failed→full-keep-next (resp_tokens=13,
  bandit update skipped); turn2 same session recover_full_next consumed→keep 1.0.
- floor clamp fired: agentic bandit 0.100 < floor 0.250 → 0.250.
Launch config (24GB): GGML_CUDA_NO_VMM=1 + --max-ctx 49152 (139264 KV OOMs the
3090 — that was the pre-existing bad_alloc, not this change). Still default-OFF
via PFLASH_ROUTER_ENABLE.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Merge latest feat/pflash-drafter-ee7 head and resolve conflicts by preserving the integration stack's pFlash curve/bandit/upstream/SWA behavior while adding the opt-in PFLASH_ROUTER_ENABLE regime router and standalone router test.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Correct the final included-head count after PR Luce-Org#274 advanced; the latest head replaced the prior included head, leaving 21 included and 6 non-ancestor ready PRs.