feat(qwen35moe): pooled chunked prefill + snapshot/restore over KVFlash #430

Open
dusterbloom wants to merge 17 commits into Luce-Org:main from dusterbloom:pr/kvflash-moe-prefill-snapshot

Conversation

@dusterbloom dusterbloom commented Jun 20, 2026

Collaborator

What

Pooled chunked prefill for qwen35moe (Qwen3.6-35B-A3B) over KVFlash: when the
prompt exceeds the resident pool, prefill loops hybrid_forward_batch over
chunk-sized slices with live eviction instead of refusing. Plus pooled
snapshot/restore (save/restore the bounded pool across requests) and a
complexity-only refactor (dedup the two identical restore chunk loops, extract
chunked_prefill, inline a single-caller helper — net −25 LOC, behaviour-identical).

Stacking

This is the tip of the KVFlash-MoE stack and depends on:

Until those merge this PR's diff includes their commits; rebasing after they land
leaves only the prefill-snapshot + refactor commits.

Tests

test_kvflash_moe_paged.sh — GPU silent-corruption gate: a sink fact in the
first (protected) chunk is recalled after the middle is evicted, and the greedy
(temp-0) answer is identical across two pool sizes. Green on RTX 3090 / Q3_K_M.

…max_ctx

The MoE expert placement reserved KV for max_ctx (10 GiB @131072) even with
--kvflash, forcing experts cold -> the pool was pure overhead. Reserve for the
resident pool instead when the full reservation would force experts cold, so
experts stay hot at high max_ctx (decouples max_ctx from the expert-placement
cliff). A post-init gate disables KVFlash when it is redundant (full KV already
fits all experts hot), keyed on all-hot-with-full-KV so it never disables a pool
that is itself keeping experts hot.

The rule is a shared pure helper (common/kvflash_placement.h) so future MoE
backends inherit it. Unit test (5 cases, no GPU) + hardware-gated integration
test (RTX 3090: 2203 cold -> 0 cold @max_ctx 131072, decode 43->66 tok/s).
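
The placement rule described in this commit can be illustrated with a small sketch. It is not the code in common/kvflash_placement.h: the function name, its arguments, and the returned dictionary are hypothetical, and the 8192-token pool size is an assumption. Only the budget figures (24 GiB card, ~80 KiB/token KV, experts ~13.19 GiB, core ~3.12 GiB, draft ~1.2 GiB) are taken from the unit-test comments in this PR.

```python
GiB = 1024 ** 3

def plan_kv_reservation(vram, fixed, experts, kv_per_tok,
                        max_ctx, pool_tokens, kvflash_requested):
    """Hypothetical helper: decide how much KV to reserve before placing experts."""
    full_kv = kv_per_tok * max_ctx
    pool_kv = kv_per_tok * min(pool_tokens, max_ctx)
    all_hot_with_full_kv = fixed + experts + full_kv <= vram

    # KVFlash is redundant when the full KV reservation already leaves every
    # expert hot, so fall back to the plain full-KV layout in that case.
    if not kvflash_requested or all_hot_with_full_kv:
        return {"kv_bytes": full_kv, "kvflash": False,
                "experts_all_hot": all_hot_with_full_kv}

    # Otherwise reserve only the resident pool so the experts can stay hot.
    return {"kv_bytes": pool_kv, "kvflash": True,
            "experts_all_hot": fixed + experts + pool_kv <= vram}

budget = dict(vram=24 * GiB,
              fixed=int(3.12 * GiB) + int(1.2 * GiB),  # core + draft
              experts=int(13.19 * GiB),
              kv_per_tok=80 * 1024,
              pool_tokens=8192)                         # assumed pool size

at_128k = plan_kv_reservation(max_ctx=131072, kvflash_requested=True, **budget)
at_64k = plan_kv_reservation(max_ctx=65536, kvflash_requested=True, **budget)
no_pool = plan_kv_reservation(max_ctx=131072, kvflash_requested=False, **budget)
```

Under these assumed numbers the 131072-token case keeps the pool (full KV would not fit next to the experts), while the 65536-token case drops it as redundant.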

Add serialize()/deserialize() to KvFlashPager (snapshot the full resident+paged
KV in logical chunk order; header-validated against layout) and a factored
for_each_segment() helper. serde uses synchronous get/set and adapts to the
pinned void* host_data of the async-DMA path (Luce-Org#408). Add critical-chunk pinning
(pin_range/is_pinned/unpin_all + a best-effort deadlock floor) OR-ed into the
ensure_free_block + reselect protections; empty by default (byte-identical
non-pin path). CPU unit test (no GPU) covers serde round-trip, header-guard
reject, pinning, deadlock guard, reset.
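
As a rough illustration of a header-validated round trip like the one this commit describes, here is a minimal sketch. The header fields, the `KVFP` magic, and the function signatures are all hypothetical; the real KvFlashPager layout is not shown in this PR text.

```python
import struct

MAGIC = b"KVFP"                    # hypothetical magic, not the real one
HEADER = struct.Struct("<4sIII")   # magic, version, n_chunks, chunk_bytes

def serialize(chunks, chunk_bytes):
    """Pack equally sized chunks, given in logical order, behind a header."""
    for c in chunks:
        if len(c) != chunk_bytes:
            raise ValueError("chunk size does not match layout")
    return HEADER.pack(MAGIC, 1, len(chunks), chunk_bytes) + b"".join(chunks)

def deserialize(blob, chunk_bytes):
    """Unpack a blob, rejecting it when the header disagrees with the layout."""
    if len(blob) < HEADER.size:
        raise ValueError("blob shorter than header")
    magic, version, n_chunks, blob_chunk_bytes = HEADER.unpack_from(blob)
    if magic != MAGIC or version != 1:
        raise ValueError("unrecognized snapshot header")
    if blob_chunk_bytes != chunk_bytes:
        raise ValueError("snapshot layout does not match current pager layout")
    if len(blob) != HEADER.size + n_chunks * chunk_bytes:
        raise ValueError("truncated or oversized payload")
    body = memoryview(blob)[HEADER.size:]
    return [bytes(body[i * chunk_bytes:(i + 1) * chunk_bytes])
            for i in range(n_chunks)]

chunks = [b"a" * 8, b"b" * 8]
blob = serialize(chunks, chunk_bytes=8)
restored = deserialize(blob, chunk_bytes=8)
```

Checking the header before touching the payload means a snapshot taken under a different layout is refused rather than reinterpreted.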
…r KVFlash

Drive the MoE cold-expert hybrid path through KVFlash's resident pool: prompts
larger than the pool prefill via a chunk loop over hybrid_forward_batch (eviction
automatic in alloc_span); the restore residual delta routes through the same
chunked path. Pooled snapshot save/restore serializes the pager into the prefix
snapshot (PrefixSnapshot += is_pooled + blob; snapshot_target_cache/restore gain
skip_kv; the blob rides the disk prefix-cache via a named tensor so cross-turn
128K restore composes). Drafter-scorer residency + DFLASH_KVFLASH_PIN_SPANS
critical-chunk pinning wired in. Composes with the landed KVFlash (Luce-Org#373/Luce-Org#408/Luce-Org#385)
and MoE restore (Luce-Org#362); serde adapts to the async pinned host_data.

GPU gate (RTX 3090): pooled prefill preserves sink context + stable across pool
sizes; cross-turn disk restore round-trips losslessly.
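
The chunk loop can be pictured with the sketch below. It is a toy model under stated assumptions: `forward_batch` stands in for hybrid_forward_batch, `ToyPool` is a hypothetical stand-in for the resident pool, and its eviction policy (protect the first chunk, drop the oldest of the rest) is assumed for illustration rather than taken from alloc_span.

```python
def chunked_prefill(tokens, chunk_size, forward_batch):
    """Feed `tokens` to `forward_batch` in slices of at most `chunk_size`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    pos = 0
    while pos < len(tokens):
        n = min(chunk_size, len(tokens) - pos)
        forward_batch(tokens[pos:pos + n], pos)
        pos += n
    return pos


class ToyPool:
    """Bounded pool of chunk start positions; needs capacity_chunks >= 2."""

    def __init__(self, capacity_chunks):
        self.capacity = capacity_chunks
        self.resident = []  # start positions, oldest first

    def forward_batch(self, chunk_tokens, start):
        if len(self.resident) == self.capacity:
            # Evict the oldest chunk after the protected first (sink) chunk.
            del self.resident[1]
        self.resident.append(start)


pool = ToyPool(capacity_chunks=3)
n_done = chunked_prefill(list(range(10)), chunk_size=2,
                         forward_batch=pool.forward_batch)
```

With a 10-token prompt, 2-token chunks, and room for 3 chunks, the whole prompt is processed while the first chunk stays resident and the middle is evicted.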
…gment

Three complexity cuts, no behavior change (GPU sink-recall gate + serde/
placement unit tests green):
- merge restore residual's identical snap_pooled/else chunk loops into one
  (the else ct ternary already subsumes the pooled case)
- extract chunked_prefill() shared by generate_impl kvf_paged + restore residual
- inline single-caller for_each_segment template into serialize

net -25 lines (54 ins / 79 del).

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


6 issues found across 16 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/test/test_kvflash_placement.cpp">

<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.h">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>

<file name="server/src/qwen35/qwen35_target_graph.cpp">

<violation number="1" location="server/src/qwen35/qwen35_target_graph.cpp:1572">
P2: Blob refresh on reuse can silently drop KVFlash data when blob presence changes, because no blob tensor is created outside the alloc path.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>

<file name="server/test/test_kvflash_moe_paged.sh">

<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>

<file name="server/src/common/moe_hybrid_ffn_eval.cpp">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>


// Restore snapshot (skip KV copy when pooled; pager handles KV separately).
const PrefixSnapshot & snap_ref = prefix_snapshots_[slot];
const bool snap_pooled = snap_ref.is_pooled;
restore_target_cache(snap_ref, cache_, snap_pooled);

P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 899:

<comment>restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</comment>

<file context>
@@ -851,16 +893,29 @@ GenerateResult Qwen35Backend::restore_and_generate_impl(int slot,
+    // Restore snapshot (skip KV copy when pooled; pager handles KV separately).
+    const PrefixSnapshot & snap_ref = prefix_snapshots_[slot];
+    const bool snap_pooled = snap_ref.is_pooled;
+    restore_target_cache(snap_ref, cache_, snap_pooled);
+
+    // Pooled restore: rebuild pager from blob so KV rows are accessible.
</file context>
Suggested change
-    restore_target_cache(snap_ref, cache_, snap_pooled);
+    if (!restore_target_cache(snap_ref, cache_, snap_pooled)) {
+        result.error = "restore";
+        out_io.emit(-1);
+        return result;
+    }

int32_t next = 0;
for (int s = 0; s < n_used; ++s) {
if (hot_wts[base + s] > 0.0f) continue;
while ([&]{ for (int k=0; k<n_used; ++k) if (k!=s && hot_sel[base+k]==next) return true; return false; }())

P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/moe_hybrid_ffn_eval.cpp, line 1076:

<comment>This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</comment>

<file context>
@@ -1066,6 +1066,19 @@ static bool eval_moe_hybrid_ffn_batched_core(
+            int32_t next = 0;
+            for (int s = 0; s < n_used; ++s) {
+                if (hot_wts[base + s] > 0.0f) continue;
+                while ([&]{ for (int k=0; k<n_used; ++k) if (k!=s && hot_sel[base+k]==next) return true; return false; }())
+                    if (++next >= n_hot_init) next = 0;
+                hot_sel[base + s] = next++;
</file context>
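
One way to make such a scan terminate is to bound it to `n_hot_init` candidates and report failure when no unused index exists. The sketch below models that idea on a single row; `hot_sel`, `hot_wts`, and `n_hot_init` mirror the names in the snippet, while the function name and the choice to return `False` (leaving the caller to fall back to another path) are assumptions, not the fix adopted in this PR.

```python
def fill_padding_slots(hot_sel, hot_wts, n_hot_init):
    """Give each zero-weight slot a hot index unused by the other slots.

    Mutates hot_sel in place. Returns False instead of looping forever when
    every index in [0, n_hot_init) is already taken by another slot.
    """
    n_used = len(hot_sel)
    nxt = 0
    for s in range(n_used):
        if hot_wts[s] > 0.0:
            continue  # slot already routed to a real hot expert
        taken = {hot_sel[k] for k in range(n_used) if k != s}
        for _ in range(n_hot_init):       # at most one full pass over indices
            if nxt not in taken:
                break
            nxt = (nxt + 1) % n_hot_init
        else:
            return False                   # no unique index exists
        hot_sel[s] = nxt
        nxt = (nxt + 1) % n_hot_init
    return True


enough = [0, 1, 0, 0]
ok_enough = fill_padding_slots(enough, [1.0, 1.0, 0.0, 0.0], n_hot_init=4)

too_few = [0, 1, 0, 0]
ok_too_few = fill_padding_slots(too_few, [1.0, 1.0, 0.0, 0.0], n_hot_init=2)
```

With four hot experts the two padding slots receive indices 2 and 3; with only two, the scan gives up after one pass rather than spinning.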

Comment thread server/src/qwen35/qwen35_target_graph.cpp
kill -0 "$pid" 2>/dev/null || break
sleep 2
done
curl -fsS "http://$HOST:$PORT/v1/chat/completions" -H 'Content-Type: application/json' \

P2: Don't use || true to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/test/test_kvflash_moe_paged.sh, line 61:

<comment>Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</comment>

<file context>
@@ -0,0 +1,83 @@
+        kill -0 "$pid" 2>/dev/null || break
+        sleep 2
+    done
+    curl -fsS "http://$HOST:$PORT/v1/chat/completions" -H 'Content-Type: application/json' \
+        --data @"$REQ" 2>/dev/null \
+        | python3 -c 'import sys,json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])' \
</file context>

// qwen3.6-35B-A3B-like budget on a 24 GiB card:
// ~80 KiB/token KV (5 GiB @ 65536, 10 GiB @ 131072)
// experts ~13.19 GiB, core ~3.12 GiB, draft ~1.2 GiB present.
const uint64_t MiB = 1024ull * 1024;

P3: Missing #include <cstdint> for uint64_t. Test file relies on transitive include from the header kvflash_placement.h, which makes it fragile against future header cleanup.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/test/test_kvflash_placement.cpp, line 26:

<comment>Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</comment>

<file context>
@@ -0,0 +1,85 @@
+    // qwen3.6-35B-A3B-like budget on a 24 GiB card:
+    //   ~80 KiB/token KV  (5 GiB @ 65536, 10 GiB @ 131072)
+    //   experts ~13.19 GiB, core ~3.12 GiB, draft ~1.2 GiB present.
+    const uint64_t MiB = 1024ull * 1024;
+    const uint64_t GiB = 1024ull * MiB;
+    const uint64_t kv_per_tok = 80 * 1024;            // bytes/token
</file context>


// Persistent pipelined state (initialized once, reused across requests)
std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;

P3: New private members are unused dead code (hybrid_spec_graph_cache_, spec_microbench_done_). Drop them until the cache/microbench path is actually implemented.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35moe/qwen35moe_backend.h, line 111:

<comment>New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</comment>

<file context>
@@ -83,13 +96,20 @@ class Qwen35MoeBackend : public Qwen35Backend {
 
     // Persistent pipelined state (initialized once, reused across requests)
     std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
+    std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;
+    bool spec_microbench_done_ = false;
     bool ensure_pipe_state(int kv_start);
</file context>

…ectness fixes

DRAFTER CONVERTER (config-driven):
- convert_dflash_to_gguf.py reads all architecture params from config.json
  (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for
  SWA, sliding_window). No hardcoded constants.
- quantize_draft_q8.py shares load_arch with the converter.
- GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[],
  dflash.block_size, attention.sliding_window + pattern.
- draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF
  metadata, block_size override, SWA pattern from metadata.
- draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from
  config.json.
- gguf_target_loader.cpp: respect drafter-specified capture layers instead
  of overwriting with evenly-spaced heuristic.
- qwen35_backend.cpp: early-read capture sync + mask token propagation.
- internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16.
- dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16.

SPEC-DECODE PERFORMANCE:
- graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx
  alive + n_tokens matches (centralized guard; was per-call-site).
- qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for
  graph persistence; kFastRollbackThreshold env-tunable
  (DFLASH_FAST_ROLLBACK_MIN, default 5).
- dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard
  (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB).
- graph_builders.cpp: 4MB ctx alloc (was 64MB).
- step_graph.h: graph_ctx_len + graph_used_view tracking fields.

SPEC-DECODE CORRECTNESS:
- qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded
  4096 feature ring cap. Default 4096 causes acceptance collapse from 85%
  to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features).
- qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP.
- qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax
  at long context (NaN/Inf → clamp to 0, verify rejects gracefully).

MOE EXPERIMENTAL (behind flags):
- qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid
  storage even with 0 cold experts to enable pipelined spec-decode verify.
- Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in
  hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback).
- Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards).

VALIDATED:
- 27B dense + reconverted drafter: 57% accept on code gen, 85% on short
  prompts. block=16 gives 252 tok/s (2.2x AR) on code generation.
- 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR).
- Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K.
- Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls
  valid), pFlash cuts 34K prefill from 475s to 208s (2.3x).
- repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.
…ash env vars

- DD path: dflash-draft-3.6-bf16-reconverted.gguf (old GGUF had garbage metadata)
- DFLASH_DRAFT_BLOCK_SIZE=16 (model card sweet spot)
- DFLASH_FEAT_RING_CAP=16384 (default 4096 collapses acceptance at the ring boundary)
… full ctx

- drafter GGUF baked rope.freq_base=1M but trains/serves at the target's 10M
  (converter bug); the unpark guard only corrected the 8-layer drafter, so the
  6-layer drafter ran at 1M vs target 10M. Align dw_.rope_theta to the target at
  both load sites (initial + unpark).
- DFLASH_FEAT_RING_CAP default 4096 wrapped the target-feature ring above 4K ctx,
  feeding the drafter stale features and collapsing accept to 0.1% at 27K. Default
  to max_ctx so the ring covers the full reserved context; env lowers it for VRAM.
- both restore dFlash spec-decode acceptance on long-context MoE (0.1% -> ~16% on
  27K agentic; content-dependent ceiling otherwise).
- harness: repo_inspection path dflash/->server/ (repo renamed in 39fe251);
  run_claude_code flags fixed to --allowedTools/--dangerously-skip-permissions
  (the old --tools/--permission-mode dontAsk are invalid on claude-code 2.x);
  session_inject_proxy gains --force-temperature, thinking injection and body dump
  for bench control; add qwen35moe dflash gate harness.

dFlash spec-decode is content-dependent: it wins big on verbatim/copyable
output (drafter accept ~80%, ~235 tok/s) but is 2-4x SLOWER than plain AR on
novel/high-entropy output (accept ~6-16%) — and on this MoE the rejected tokens
still pay full expert-routing verify cost. Gate it on target entropy so the
decoder automatically picks the faster path, transparently, no knobs.

- per decision point compute target top-1 prob p1 (cheap entropy proxy = expected
  acceptance) from the logits we already have.
- keep spec at the trained full block (16) when confidence is high; floor the
  remainder of the turn to the efficient do_ar_decode (real AR ~100+ tok/s) when
  the drafter is losing.
- hysteresis: 1-step probe + sustained-low streak (DFLASH_ENTROPY_SUSTAIN, def 2)
  holds full blocks through transient dips ("big blocks on uncertain transitions");
  near-tie immediate floor (DFLASH_ENTROPY_TIE_P1, def 0.45) turns verify off when
  the argmax is ambiguous.
- threshold DFLASH_ENTROPY_AR_P1 (def 0.90) swept for the Pareto point; gate
  default-on, DFLASH_ENTROPY_GATE=0 disables, DFLASH_ENTROPY_DEBUG traces p1.
- measured: verbatim 236 / code-gen->AR 117 / novel->AR 83 tok/s, always >= AR.
- temp 0: semantically equivalent to AR (spec verifies vs target argmax; both take
  the argmax). Not bit-identical — near-tie argmax flips via verify-batch FP
  reduction order, the established spec-decode bar.
…after cliff

Two changes that make dFlash spec-decode safe and useful across content and
context length without per-model tuning.

1. Long-context drafter cliff fix. The block-diffusion drafter's prediction
   collapses when it self-attends more than ~2048 tokens (measured: 93% accept
   at draft_ctx<=2048 vs 6% at 4096, independent of total prompt context). The
   old default ran it at max(2048, draft_ctx_max=4096)=4096 — past the drafter's
   effective limit — so spec-decode died above ~2K context. Cap the drafter's
   self-attention at 2048 by default; spec now holds 77-93% accept / 110-200
   tok/s out to 35K context for recent-derived output. DFLASH_DRAFT_CTX_MAX
   overrides for drafters with a larger usable window.

2. Self-calibrating commit-EMA gate (replaces the p1-entropy gate). dFlash wins
   only when its realized throughput beats AR; that break-even is model- and
   context-dependent (a fixed entropy threshold over-floored dense, under-floored
   MoE). Measure t_ar once per process (cached on the backend, no per-turn warmup
   tax), then floor the remainder of a turn to the efficient AR path when the EMA
   of commit_n*t_ar/step_wall stays below 1.0 (spec slower than AR) for a few
   steps. Knob-free, never slower than AR; floors novel/high-entropy turns,
   keeps spec on code/structured. Env: DFLASH_SPEC_GATE(=1), _MARGIN, _SUSTAIN,
   _WARMUP, _DEBUG. Applies to both base (do_spec_decode) and MoE hybrid
   (do_hybrid_spec_decode) paths. Temp 0: semantically equivalent to AR.
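
The commit-EMA gate in item 2 can be modelled roughly as follows. This is a sketch under assumptions: the class name, the smoothing factor `alpha`, and the default values are hypothetical, and `margin`, `sustain`, and `warmup` are only guessed to correspond to the `_MARGIN`, `_SUSTAIN`, and `_WARMUP` env suffixes named above.

```python
class CommitEmaGate:
    """Toy throughput gate: floor to AR once spec stops beating it."""

    def __init__(self, t_ar, alpha=0.3, margin=1.0, sustain=3, warmup=2):
        self.t_ar = t_ar        # measured seconds per AR token
        self.alpha = alpha      # EMA smoothing factor (assumed)
        self.margin = margin    # break-even ratio; 1.0 means "as fast as AR"
        self.sustain = sustain  # consecutive low steps before flooring
        self.warmup = warmup    # initial steps that never trigger the floor
        self.ema = None
        self.steps = 0
        self.low_streak = 0
        self.floored = False

    def observe(self, commit_n, step_wall):
        """Record one spec step; return True if the rest of the turn uses AR."""
        if self.floored:
            return True
        # Tokens committed per step, expressed in units of AR token time.
        speedup = commit_n * self.t_ar / step_wall
        if self.ema is None:
            self.ema = speedup
        else:
            self.ema = self.alpha * speedup + (1 - self.alpha) * self.ema
        self.steps += 1
        if self.steps <= self.warmup:
            return False
        self.low_streak = self.low_streak + 1 if self.ema < self.margin else 0
        if self.low_streak >= self.sustain:
            self.floored = True
        return self.floored


losing = CommitEmaGate(t_ar=0.01)
losing_trace = [losing.observe(commit_n=1, step_wall=0.04) for _ in range(5)]

winning = CommitEmaGate(t_ar=0.01)
winning_trace = [winning.observe(commit_n=8, step_wall=0.04) for _ in range(10)]
```

Here a step that commits 1 token in 40 ms against a 10 ms AR token has a ratio of 0.25, so the gate floors after the warmup plus three low steps, while a step that commits 8 tokens (ratio 2.0) never trips it.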
…code; bound MoE prefill sync

- root cause: the long-context accept cliff was the target-feature ring cap
  (FEAT_RING_CAP), NOT a drafter 2048 self-attention limit. When prompt_tokens >
  ring_cap the ring wraps, the drafter cross-attends stale features, and the
  commit-EMA gate floors to AR. Verified by a fully-crossed draft_ctx x ring_cap
  2x2 (momus-reviewed).
- ring_cap must be >= max_ctx (mandatory for correctness); the shipped RECIPE's
  FEAT_RING_CAP=4096 reintroduced the cliff.
- DRAFT_CTX_MAX=2048 was an amputation, not a fix: needle-in-middle control shows
  it craters distant recall ~46pp (marker 2.6K from end: 76.8% -> 30.9%).
  draft_ctx=8192 is the VRAM-max uncrippled window (~4x recall reach), with
  DFLASH_FEATURE_DTYPE=f16 freeing the mirror headroom.
- KV quant does not move the draft_ctx ceiling (q4/q8/tq3 all cap at 8192) -- the
  limit is the draft compute graph's decode scratch, not KV reservation;
  draft_ctx>8192 is self-defeating (gate floors to AR even when f16 makes it fit).
- bound the MoE prefill feature-sync at qwen35moe_backend.cpp:1205 to
  min(committed, cap) (was raw committed, silently no-ops when committed>cap ->
  stale features), matching the restore-path and base-path patterns.
- add ctxsweep harness + prompt fixtures + analysis docs documenting the 2x2,
  KV grid, and needle controls.
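
The ring-wrap failure mode and the min(committed, cap) bound can be shown with a toy model. Everything here is illustrative: `FeatureRing`, `stale_positions`, and `sync_count` are hypothetical names, and the toy tags each slot with its position so staleness is detectable, which the real feature ring is not stated to do.

```python
class FeatureRing:
    """Toy ring of per-position features; slot index is position % cap."""

    def __init__(self, cap):
        self.cap = cap
        self.slots = [None] * cap

    def write(self, pos, feat):
        self.slots[pos % self.cap] = (pos, feat)

    def read(self, pos):
        """Return the feature for `pos`, or None if the slot was overwritten."""
        entry = self.slots[pos % self.cap]
        if entry is None or entry[0] != pos:
            return None
        return entry[1]


def stale_positions(prompt_tokens, cap):
    """Positions whose features no longer match after writing the prompt."""
    ring = FeatureRing(cap)
    for pos in range(prompt_tokens):
        ring.write(pos, f"feat{pos}")
    return [p for p in range(prompt_tokens) if ring.read(p) is None]


def sync_count(committed, cap):
    # Bound described in the commit message: never sync more than the ring holds.
    return min(committed, cap)


fits = stale_positions(prompt_tokens=6, cap=8)
wrapped = stale_positions(prompt_tokens=10, cap=8)
```

When the prompt fits within the cap nothing is stale; once it exceeds the cap, the earliest positions are overwritten, which is why the commit asks for a cap of at least max_ctx.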

@cubic-dev-ai cubic-dev-ai Bot left a comment

Contributor


12 issues found across 53 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_ffn_eval.cpp">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>

<file name="server/test/test_kvflash_placement.cpp">

<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.h">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>

<file name="server/src/qwen35/qwen35_target_graph.cpp">

<violation number="1" location="server/src/qwen35/qwen35_target_graph.cpp:1572">
P2: Blob refresh on reuse can silently drop KVFlash data when blob presence changes, because no blob tensor is created outside the alloc path.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>

<file name="server/test/test_kvflash_moe_paged.sh">

<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>

<file name="bench/abc_cache_harness/replay_harness.py">

<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>

<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>

<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>

<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>

<file name="server/src/qwen35/gguf_target_loader.cpp">

<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>

<file name="server/src/draft/draft_gguf_loader.cpp">

<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>

<file name="harness/clients/session_inject_proxy.py">

<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.

(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.) [FEEDBACK_USED]</violation>

<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>

<file name="harness/clients/run_claude_code.sh">

<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>

<file name="bench/qwen35moe_dflash/RECIPE.md">

<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>

// Read target_layer_ids array (exact capture positions from training).
std::snprintf(key, sizeof(key), "%s.%s", A.c_str(), "dflash.target_layer_ids");
int64_t tli_id = gguf_find_key(gctx, key);
if (tli_id >= 0 && gguf_get_kv_type(gctx, tli_id) == GGUF_TYPE_ARRAY) {

P1: target_layer_ids element type is not validated before casting to int32_t*. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/draft/draft_gguf_loader.cpp, line 158:

<comment>`target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</comment>

<file context>
@@ -117,6 +117,65 @@ int count_swa_layers(const DraftWeights & w) {
+    // Read target_layer_ids array (exact capture positions from training).
+    std::snprintf(key, sizeof(key), "%s.%s", A.c_str(), "dflash.target_layer_ids");
+    int64_t tli_id = gguf_find_key(gctx, key);
+    if (tli_id >= 0 && gguf_get_kv_type(gctx, tli_id) == GGUF_TYPE_ARRAY) {
+        const size_t n = std::min((size_t)gguf_get_arr_n(gctx, tli_id),
+                                  (size_t)max_ids);
</file context>
Suggested change
-    if (tli_id >= 0 && gguf_get_kv_type(gctx, tli_id) == GGUF_TYPE_ARRAY) {
+    if (tli_id >= 0 && gguf_get_kv_type(gctx, tli_id) == GGUF_TYPE_ARRAY &&
+        gguf_get_arr_type(gctx, tli_id) == GGUF_TYPE_INT32) {

while not log_path.exists() and time.time() < deadline:
time.sleep(1)

cache_off = done_off = spec_off = ar_off = pflash_off = survival_off = 0

P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/abc_cache_harness/replay_harness.py, line 723:

<comment>Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</comment>

<file context>
@@ -0,0 +1,1361 @@
+    while not log_path.exists() and time.time() < deadline:
+        time.sleep(1)
+
+    cache_off = done_off = spec_off = ar_off = pflash_off = survival_off = 0
+
+    results = []
</file context>
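
A common way to avoid re-parsing old log lines is to carry one byte offset across repeats and advance it only past complete lines. The sketch below shows that pattern in isolation; `read_new_lines`, `run_repeats`, and the `emit` callback are hypothetical and do not reflect how replay_harness.py is actually structured.

```python
import os
import tempfile

def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new byte offset)."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Consume whole lines only, so a half-written line is re-read next time.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end

def run_repeats(path, n_repeats, emit):
    """Initialise the offset once, outside the loop, and carry it forward."""
    offset = os.path.getsize(path) if os.path.exists(path) else 0
    per_repeat = []
    for i in range(n_repeats):
        emit(i)  # stands in for replaying one repeat against the server
        lines, offset = read_new_lines(path, offset)
        per_repeat.append(lines)
    return per_repeat

with tempfile.TemporaryDirectory() as d:
    log = os.path.join(d, "server.log")

    def emit(i):
        with open(log, "a") as f:
            f.write(f"done repeat={i}\n")

    per_repeat = run_repeats(log, 3, emit)
```

Each repeat then sees only the lines written during that repeat, so metrics from earlier repeats cannot leak into later ones.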

// If N changed from default 5, the IDs were definitely set by
// early-read and should be respected.
const bool was_early_read = (N != DFLASH27B_DRAFT_N_TARGET_LAYERS);
if (was_early_read) {

P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/gguf_target_loader.cpp, line 480:

<comment>Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</comment>

<file context>
@@ -463,12 +463,41 @@ bool load_target_gguf_partial(const std::string & path,
+            // If N changed from default 5, the IDs were definitely set by
+            // early-read and should be respected.
+            const bool was_early_read = (N != DFLASH27B_DRAFT_N_TARGET_LAYERS);
+            if (was_early_read) {
+                std::printf("[loader] using drafter-specified capture layers (%d)\n", N);
+            } else {
</file context>

obj["extra_body"]["session_id"] = self.session_id
if self.force_temperature is not None:
obj["temperature"] = self.force_temperature
if self.think_budget and path.startswith("/v1/messages"):

P2: think_budget uses truthiness, so 0 is treated as "unset" and skips thinking injection for /v1/messages.

(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/session_inject_proxy.py, line 125:

<comment>`think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.

(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.) </comment>

<file context>
@@ -99,14 +102,28 @@ def do_POST(self):
+                        obj["extra_body"]["session_id"] = self.session_id
+                if self.force_temperature is not None:
+                    obj["temperature"] = self.force_temperature
+                if self.think_budget and path.startswith("/v1/messages"):
+                    obj["thinking"] = {"type": "enabled", "budget_tokens": self.think_budget}
                 body = json.dumps(obj).encode("utf-8")
</file context>
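
One way to address the truthiness issue above is to compare against `None` explicitly, so a budget of `0` still triggers injection. This is a minimal sketch, not the proxy's actual code: `inject_thinking` is a hypothetical helper standing in for the inline logic in `do_POST`, and the thread does not establish whether the upstream API accepts a zero `budget_tokens`.

```python
def inject_thinking(obj, path, think_budget):
    """Hypothetical helper: add a `thinking` block when a budget is configured."""
    # `is not None` keeps an explicit 0 distinct from "unset" (None).
    if think_budget is not None and path.startswith("/v1/messages"):
        obj["thinking"] = {"type": "enabled", "budget_tokens": think_budget}
    return obj
```

With this shape, only an unset budget (`None`) or a non-matching path leaves the request body untouched.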

--model "$MODEL_ID" \
--tools "$CLAUDE_TOOLS" \
--permission-mode dontAsk \
--dangerously-skip-permissions \

P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/run_claude_code.sh, line 79:

<comment>`CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</comment>

<file context>
@@ -69,9 +76,9 @@ timeout "${CLAUDE_TIMEOUT}s" "$CLAUDE_BIN" \
   --model "$MODEL_ID" \
-  --tools "$CLAUDE_TOOLS" \
-  --permission-mode dontAsk \
+  --dangerously-skip-permissions \
   --no-session-persistence \
+  "${CLAUDE_EXTRA[@]}" \
</file context>
Suggested change
-  --dangerously-skip-permissions \
+  --tools "$CLAUDE_TOOLS" \
+  --dangerously-skip-permissions \

str(SERVER_BIN),
str(TGT),
"--host", HOST,
"--port", str(PORT),

P2: Configured `--port` is ignored when launching the server; server and client can target different ports.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/abc_cache_harness/replay_harness.py, line 514:

<comment>Configured `--port` is ignored when launching the server; server and client can target different ports.</comment>

<file context>
@@ -0,0 +1,1361 @@
+        str(SERVER_BIN),
+        str(TGT),
+        "--host", HOST,
+        "--port", str(PORT),
+        "--max-ctx", str(MAX_CTX),
+        "--cache-type-k", ctk,
</file context>

| f16 | 18.0s | 174 | 76.8 | 12.86 |
| q4_0 | 18.0s | 167 | 76.8 | 12.86 |
| q8_0 | 18.1s | 143 | 66.4 | 11.25 |
| tq3_0 | 23.6s | 109 | 76.8 | 12.86 |

P3: Truncated sentence in KV precision sweep analysis: `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/NOTES.md, line 51:

<comment>Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</comment>

<file context>
@@ -0,0 +1,56 @@
+| f16 | 18.0s | 174 | 76.8 | 12.86 |
+| q4_0 | 18.0s | 167 | 76.8 | 12.86 |
+| q8_0 | 18.1s | 143 | 66.4 | 11.25 |
+| tq3_0 | 23.6s | 109 | 76.8 | 12.86 |
+f16 best; q4_0 EQUAL (free VRAM saver, no accept/AL cost); q8_0 ANOMALOUS (lower accept 66.4
+## KVFlash added to the 35B agentic config?
</file context>


if not args.session_id:
print("[session-proxy] WARNING: no session_id set; proxy is pass-through only", flush=True)
if not args.session_id and args.force_temperature is None:

P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because the proxy is not pass-through in that mode.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/session_inject_proxy.py, line 143:

<comment>Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</comment>

<file context>
@@ -120,19 +137,23 @@ def main():
 
-    if not args.session_id:
-        print("[session-proxy] WARNING: no session_id set; proxy is pass-through only", flush=True)
+    if not args.session_id and args.force_temperature is None:
+        print("[session-proxy] WARNING: no session_id or force_temperature set; proxy is pass-through only", flush=True)
 
</file context>
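
The warning condition discussed above could be factored into one predicate that covers every injection knob. A minimal sketch under stated assumptions: `is_pass_through` is a hypothetical name, and it assumes the think budget is exposed as a value that defaults to `None` when `THINK_BUDGET` is not set, which the diff context does not show.

```python
def is_pass_through(session_id, force_temperature, think_budget):
    """Hypothetical predicate: True only when no injection knob is configured."""
    # session_id is treated as unset when empty or None;
    # the numeric knobs are unset only when they are None, so 0 counts as set.
    return (not session_id
            and force_temperature is None
            and think_budget is None)
```

The startup warning would then be printed only when this predicate returns `True`.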

Comment thread bench/qwen35moe_dflash/RECIPE.md Outdated
- ❌ `DFLASH_DRAFT_CTX_MAX` < 8192 — amputates distant recall (see recall-horizon table).
- ❌ a different `draft_ctx`/ring/rope without re-checking accept — these are the documented footguns (see GOTCHAS.md).

See `GOTCHAS.md` (same dir) for the full footgun list, `charbench/NOTES.md` and

P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/RECIPE.md, line 123:

<comment>Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</comment>

<file context>
@@ -0,0 +1,124 @@
+- ❌ `DFLASH_DRAFT_CTX_MAX` < 8192 — amputates distant recall (see recall-horizon table).
+- ❌ a different `draft_ctx`/ring/rope without re-checking accept — these are the documented footguns (see GOTCHAS.md).
+
+See `GOTCHAS.md` (same dir) for the full footgun list, `charbench/NOTES.md` and
+`ctxsweep/NOTES.md` for the supporting measurements.
</file context>

…ignored draft_ctx knob

Two methodology flaws in the prior sweeps invalidated part of the recipe; a
clean cold re-baseline (one prompt per fresh server, temp 0) establishes the truth.

- DFLASH_DRAFT_CTX_MAX is IGNORED on the MoE backend: qwen35moe_backend.cpp:2267
  caps draft_ctx at max(2048, cfg_.draft_ctx_max=4096); the getenv exists only in
  the dense qwen35 backend. Every 2048/8192/16384 sweep changed an ignored var —
  draft_ctx was always 4096. The "draft_ctx=8192 uncripples distant recall"
  narrative was warm-EMA request-ordering + variance, not draft_ctx.
- DFLASH_FEATURE_DTYPE=f16 floors spec-decode to AR on every prompt (quantizes the
  cross-attended target features). Dropped — f32 mirror required.
- Distant recall works at the pinned draft_ctx=4096 via the drafter's cross-attention
  to the target-feature ring; needle 12K-deep holds 28.7% + reproduces the marker.
  So FEAT_RING_CAP=max_ctx is the sole real lever (no draft_ctx env port needed).
- Corrected recipe: FEAT_RING_CAP=max_ctx + f32 mirror + q4_0 KV, nothing else.
- Honest cold numbers (ring=max_ctx, f32, q4_0): recent/copy 76.8% / AL 12.86 /
  ~172 tok/s through 35K (~2.0x the AR ~86 floor); distant 12K-deep ~29%. The prior
  "92.7%" was a warm-EMA artifact.
- Add clean_rebaseline.md (authoritative) + earlyexit/asym/f16-isolation evidence;
  banner the superseded draft_ctx-varying docs.

@cubic-dev-ai (bot) left a comment

1 issue found across 13 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_ffn_eval.cpp">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>

<file name="server/test/test_kvflash_placement.cpp">

<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.h">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>

<file name="server/test/test_kvflash_moe_paged.sh">

<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>

<file name="bench/abc_cache_harness/replay_harness.py">

<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>

<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>

<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>

<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>

<file name="server/src/qwen35/gguf_target_loader.cpp">

<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>

<file name="server/src/draft/draft_gguf_loader.cpp">

<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>

<file name="harness/clients/session_inject_proxy.py">

<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.

(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.) [FEEDBACK_USED]</violation>

<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>

<file name="harness/clients/run_claude_code.sh">

<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>

<file name="bench/qwen35moe_dflash/RECIPE.md">

<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json:89">
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</violation>
</file>

"mirror_cap": 40960,
"prompt": "needle_06k",
"status": "OK",
"accept_pct": 92.7,

P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json, line 89:

<comment>Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</comment>

<file context>
@@ -0,0 +1,130 @@
+    "mirror_cap": 40960,
+    "prompt": "needle_06k",
+    "status": "OK",
+    "accept_pct": 92.7,
+    "avg_commit": 14.83,
+    "decode_tps_spec": 220.57,
</file context>

…restore works

At ≥128K with the KVFlash pool active, turn 1 never saved a prefix snapshot —
the pooled-prefill branch was stubbed to a diagnostic ("boundary snapshot
skipped: pooled prefill relocates chunks") and returned without saving. So turn
2 found nothing to restore (prefix_len=0), fell back to a full cold re-prefill
(0.8s→77.6s), decode regressed 80→20 tok/s, and turn 3 crashed. The all-hot
35B-A3B runs the dense Qwen35Backend path (moe_hybrid==nullptr), so this was the
live bug for the user's deep-context (>128K = 39% of real prompts) workload.

- add KvFlashPager::serialize(max_chunks) to capture only chunks [0, max_chunks)
  — the chunk-aligned turn boundary, not the whole prompt.
- add Qwen35Backend::snapshot_save_pooled_at(slot, boundary): floor the requested
  snap_pos to a chunk multiple, set cur_pos to that boundary, serialize the
  partial pager blob, and save it (the restore/deserialize path already existed
  and was correct — only the save was missing).
- replace the pooled-prefill skip stub at the chunk-aligned boundary with the
  real save; mirror the same save on the qwen35moe hybrid path.
- unit tests: floor_to_chunk + serialize(max_chunks) partial round-trip
  (bit-identical first k chunks).

131K 3-turn smoke: turn-2 restore=true prefix_len=34077 (97.5% hit), turn-3
restore=true, no crash, tool_call_valid=1.0, decode recovered 20→56-59 tok/s.
Known follow-up: warm-prefill at 131K is still ~44s (deserialize re-pages the
whole pool) — correctness/crash/decode are fixed; restoring only resident chunks
is the next optimization.
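
The boundary arithmetic described in the commit message above (floor the requested position to a chunk multiple, then serialize only the chunks below it) can be sketched as follows. This is an illustration only: the real code is C++ in `Qwen35Backend` and `KvFlashPager`, the Python signatures here are assumptions, and the chunk size of 4096 in the usage note is an arbitrary value, not the pager's actual chunk size.

```python
def floor_to_chunk(pos, chunk_size):
    """Round a token position down to the nearest multiple of chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return (pos // chunk_size) * chunk_size


def chunks_to_serialize(snap_pos, chunk_size):
    """Number of whole chunks [0, max_chunks) that lie below the floored boundary."""
    return floor_to_chunk(snap_pos, chunk_size) // chunk_size
```

For example, with an illustrative chunk size of 4096, a requested position of 10000 floors to 8192, so two chunks would be captured and the remaining tokens are re-prefilled on restore.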
Placement called kvflash_pool_from_env(max_context) with default args, taking
the no-budget fallback (max_ctx/2). Runtime sizes the same pool with the real
VRAM budget + scorer policy, getting a speed-capped value (e.g. 16384). On
DFLASH_KVFLASH=auto at high max_ctx this over-reserved KV ~4x, under-budgeting
experts and reducing hot placement.

Extract make_kvflash_budget() + kvflash_scorer_expected() and call them from
both sites so reservation and runtime allocation size the pool identically.
Add a pure unit test pinning the budgeted vs no-budget divergence.

(cherry picked from commit 656accd)
@dusterbloom force-pushed the pr/kvflash-moe-prefill-snapshot branch from 9b501bd to fc90d1e (June 22, 2026 19:08)
…raph-refactor scaffolding

The full evidence trail from the deep-dive on Qwen3.6-27B dFlash vs the published
Qwen3.5-27B blog, the user's real session distribution, and the CUDA-graph decode
refactor plan.

- 27B beat-blog (model_ab_3.6_vs_3.5, beat_blog_results): best Qwen3.6-27B-Q4 =
  124.8 tok/s mean (96.4% of the blog's 129.52) at --ddtree-budget 16, AL 11.15 vs
  blog 8.31 (+34% — our drafter accepts more). Per-step decomposition: verify cost
  identical (35ms); the 1.39x per-step gap is the 48 GatedDeltaNet SSM layers of the
  3.6 hybrid architecture (16 attn + 48 SSM of 64), not config or implementation.
  q8_0 drafter dead on Ampere (scalar fallback). The gap is the model, not us.
- Real session distribution (session_distribution + analyze_sessions): 117 sessions,
  median prompt 37 tok / max 119k, median CONTEXT 94k, 39% of prompts land >128k —
  the workload lives in deep context, which is why cache-persistence + long-context
  decode are the load-bearing levers, not HumanEval-short.
- Equity audit + AR-vs-dFlash scaling + dense-vs-MoE best-config: dense was
  under-benched (q4-only, no ddtree); decode is under-tuned not ceiling; synthetic
  copy prompts inflate dFlash ~1.6-2x vs real agentic.
- Graph-refactor scaffolding: bit_identity_gate.py (4K/32K/71K token-for-token AR
  gate) + thoughts/shared/plans/cuda_graph_replay_team_plan.md (token-sized A/C/D
  plan; A draft exists uncommitted, validated behind the gate before merge).
- New long-context prompt fixtures (57k/64k/128k) + clean bench drivers.

@cubic-dev-ai (bot) left a comment

33 issues found across 22 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>

<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">

<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: kill_server sends SIGKILL without reaping the child; add proc.wait() to avoid zombie accumulation</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">

<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>

<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
</file>

proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
print(f"Server PID: {proc.pid}")

healthy = wait_healthy()

P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py, line 204:

<comment>Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</comment>

<file context>
@@ -0,0 +1,328 @@
+    proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
+    print(f"Server PID: {proc.pid}")
+
+    healthy = wait_healthy()
+    if not healthy:
+        print("ERROR: Server did not become healthy within timeout")
</file context>
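
A process-bound health check, as the comment above asks for, might look like the sketch below. It is not the script's actual `wait_healthy()`: the `proc`, `url`, and `probe` parameters are assumptions introduced for illustration, and the health URL is whatever endpoint the server really exposes. It only rules out the case where the child has already exited; a foreign server answering while the child is still alive would need an additional identity check that is not shown here.

```python
import time
import urllib.error
import urllib.request


def wait_healthy(proc, url, timeout_s=120.0, poll_s=0.5, probe=None):
    """Poll `url` until it answers, failing fast if the spawned child exits.

    `proc` is any object with a Popen-style poll() method.
    `probe` is an optional callable(url) -> bool, injectable for testing.
    """
    if probe is None:
        def probe(u):
            try:
                with urllib.request.urlopen(u, timeout=2) as resp:
                    return resp.status == 200
            except (urllib.error.URLError, OSError):
                return False

    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # A non-None return code means the child is gone: whatever answers
        # on the port from now on is not the server we launched.
        if proc.poll() is not None:
            return False
        if probe(url):
            # Re-check liveness so a child that died mid-probe is not accepted.
            return proc.poll() is None
        time.sleep(poll_s)
    return False
```

Passing the `Popen` handle into the check also shortens failure detection, since a crashed server is reported immediately instead of after the full timeout.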


## Verdict

**The 15% gap is PRIMARILY THE MODEL, not the config.**

P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md, line 129:

<comment>Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</comment>

<file context>
@@ -0,0 +1,155 @@
+
+## Verdict
+
+**The 15% gap is PRIMARILY THE MODEL, not the config.**
+
+Evidence:
</file context>

return cmd


def launch_server(log_path):

P1: --run-server path omits the documented flock GPU lock because launch logic is duplicated and inconsistent between launch_server_cmd() and launch_server(). This can cause GPU contention and corrupt benchmark validity.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py, line 159:

<comment>`--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</comment>

<file context>
@@ -0,0 +1,586 @@
+    return cmd
+
+
+def launch_server(log_path):
+    """Spawn the server in a child process. Returns (proc, log_fh)."""
+    env = os.environ.copy()
</file context>
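
A possible shape for the fix is a single lock helper that both launch paths go through, so the lock cannot be skipped by one of them. This is a sketch under assumptions: `gpu_lock` is a hypothetical helper, the lock path is left as a parameter, and `fcntl` is POSIX-only.

```python
import contextlib
import fcntl
import os
import tempfile


@contextlib.contextmanager
def gpu_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        # Blocks until every other holder of the lock has released it.
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield fd
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

Wrapping the whole launch-benchmark-teardown sequence in `with gpu_lock(...)` keeps the GPU exclusive for the full measurement, not just for the server spawn.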


for line in lines:
    line = line.strip()
    if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():

P1: CUDA error detection is broken due to a case mismatch: line.lower() is checked against the mixed-case literal "CUDA error", so that branch can never match and CUDA errors may be missed.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 190:

<comment>CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</comment>

<file context>
@@ -0,0 +1,408 @@
+
+    for line in lines:
+        line = line.strip()
+        if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():
+            result["oom"] = True
+        if "[spec-decode]" in line and "tokens=" in line and "accepted=" in line:
</file context>
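
The mismatch is easy to reproduce in isolation. A minimal sketch of a corrected predicate follows; the helper name `line_signals_gpu_failure` is hypothetical, and the case-sensitive `"OOM"` check is kept as in the quoted code.

```python
def line_signals_gpu_failure(line: str) -> bool:
    """Return True if a log line looks like an OOM or CUDA failure."""
    lowered = line.lower()
    return (
        "out of memory" in lowered
        or "OOM" in line             # case-sensitive, as in the quoted code
        or "cuda error" in lowered   # lower-case literal matches lowered text
    )


sample = "CUDA error: an illegal memory access was encountered"
# The quoted check can never match: the lowered line has no upper-case "CUDA".
original_check = "CUDA error" in sample.lower()
fixed_check = line_signals_gpu_failure(sample)
```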

deadline = time.time() + timeout
while time.time() < deadline:
    try:
        result = subprocess.run(

P1: Request failures are silently ignored; send_request does not check result.returncode, and run_cell never validates the response before extracting metrics.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 69:

<comment>Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</comment>

<file context>
@@ -0,0 +1,408 @@
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        try:
+            result = subprocess.run(
+                ["curl", "-sf", f"http://127.0.0.1:{port}/health"],
+                capture_output=True, text=True, timeout=5
</file context>
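
A sketch of what failing loudly could look like for a subprocess-based request is shown below. The names `run_json_command` and `RequestFailed` are hypothetical, and the assumption that the response body is JSON is not confirmed by the excerpt, which does not show `send_request` itself.

```python
import json
import subprocess
import sys


class RequestFailed(RuntimeError):
    """Raised when the request command exits non-zero or returns garbage."""


def run_json_command(cmd, timeout_s=60):
    """Run cmd, require exit status 0, and parse stdout as JSON."""
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    if result.returncode != 0:
        # Surface the failure instead of letting metrics be read from nothing.
        raise RequestFailed(f"exit {result.returncode}: {result.stderr.strip()}")
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        raise RequestFailed(f"non-JSON response: {result.stdout[:200]!r}") from exc
```

With this shape a caller such as `run_cell` can catch `RequestFailed` and mark the cell as failed rather than recording empty metrics.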

wall_s = parse_wall_s(parsed["server_done"])
prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
gate_line = parsed["spec_gate"]
is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None

P2: is_ar classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py, line 283:

<comment>`is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</comment>

<file context>
@@ -0,0 +1,437 @@
+    wall_s     = parse_wall_s(parsed["server_done"])
+    prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
+    gate_line  = parsed["spec_gate"]
+    is_ar      = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+
+    gate_floor_reason = "N/A"
</file context>
Suggested change
-    is_ar      = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+    is_ar      = parsed["spec_decode"] is None and parsed["ar_decode"] is not None

action="append",
default=[],
metavar="ARG",
help="Extra arg to pass to BOTH server binaries (repeatable). "

P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 358:

<comment>Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</comment>

<file context>
@@ -0,0 +1,452 @@
+        action="append",
+        default=[],
+        metavar="ARG",
+        help="Extra arg to pass to BOTH server binaries (repeatable). "
+             "E.g. --extra-server-arg --cache-type-k --extra-server-arg f16",
+    )
</file context>
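
The parse failure can be reproduced with a stripped-down parser. The sketch below assumes only the `add_argument` options visible in the excerpt; `build_parser` is a hypothetical helper. `argparse` treats a bare `--cache-type-k` as a new option, so the value has to be attached with `=`.

```python
import argparse


def build_parser():
    """Parser with only the option under discussion."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--extra-server-arg",
        action="append",
        default=[],
        metavar="ARG",
    )
    return parser


# The `=` form lets an option-like value through unchanged.
args = build_parser().parse_args(
    ["--extra-server-arg=--cache-type-k", "--extra-server-arg=f16"]
)
print(args.extra_server_arg)
```

The help text could therefore show `--extra-server-arg=--cache-type-k --extra-server-arg=f16` as its example.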

SEED = 42
N_GEN = 128 # decode tokens per probe
SERVER_READY_TIMEOUT_S = 300 # seconds to wait for server health
CHARS_PER_TOKEN = 4.0 # empirical: ctx_032768.json = 131072 chars / 32768 tokens

P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 61:

<comment>Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</comment>

<file context>
@@ -0,0 +1,452 @@
+SEED            = 42
+N_GEN           = 128            # decode tokens per probe
+SERVER_READY_TIMEOUT_S = 300     # seconds to wait for server health
+CHARS_PER_TOKEN = 4.0            # empirical: ctx_032768.json = 131072 chars / 32768 tokens
+
+CTXSWEEP_DIR = os.path.dirname(os.path.abspath(__file__))
</file context>


| Bench | Blog Target | This Run | Status |
|-----------------------------|-------------|------------------|--------------------|
| Binary md5 | — | e9cb2790bb8ede64 | — |

P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md, line 131:

<comment>Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</comment>

<file context>
@@ -0,0 +1,143 @@
+
+| Bench                       | Blog Target  | This Run         | Status             |
+|-----------------------------|-------------|------------------|--------------------|
+| Binary md5                  | —           | e9cb2790bb8ede64 | —                  |
+| HumanEval mean tok/s        | 129.52      | **110.21**       | FAIL -19.3 tok/s   |
+| HumanEval mean AL           | 8.31        | **11.04**        | PASS +2.73         |
</file context>

- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
- B — build flag: DONE (server/build GRAPHS=ON).

P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses GRAPHS=ON but the actual CMake flag and the rest of the plan use GGML_CUDA_GRAPHS=ON. This could cause implementers to invoke the wrong build toggle.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At thoughts/shared/plans/cuda_graph_replay_team_plan.md, line 20:

<comment>Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</comment>

<file context>
@@ -0,0 +1,32 @@
+- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
+- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
+- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
+- B — build flag: DONE (server/build GRAPHS=ON).
+Total ~970K tokens.
+
</file context>

@cubic-dev-ai cubic-dev-ai Bot left a comment

33 issues found across 22 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>

<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">

<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: kill_server sends SIGKILL without reaping the child; add proc.wait() to avoid zombie accumulation</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">

<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>

<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
</file>

Tip: Review your code locally with the cubic CLI to iterate faster.

Re-trigger cubic

proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
print(f"Server PID: {proc.pid}")

healthy = wait_healthy()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py, line 204:

<comment>Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</comment>

<file context>
@@ -0,0 +1,328 @@
+    proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
+    print(f"Server PID: {proc.pid}")
+
+    healthy = wait_healthy()
+    if not healthy:
+        print("ERROR: Server did not become healthy within timeout")
</file context>


## Verdict

**The 15% gap is PRIMARILY THE MODEL, not the config.**

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md, line 129:

<comment>Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</comment>

<file context>
@@ -0,0 +1,155 @@
+
+## Verdict
+
+**The 15% gap is PRIMARILY THE MODEL, not the config.**
+
+Evidence:
</file context>

return cmd


def launch_server(log_path):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: --run-server path omits the documented flock GPU lock because launch logic is duplicated and inconsistent between launch_server_cmd() and launch_server(). This can cause GPU contention and corrupt benchmark validity.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py, line 159:

<comment>`--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</comment>

<file context>
@@ -0,0 +1,586 @@
+    return cmd
+
+
+def launch_server(log_path):
+    """Spawn the server in a child process. Returns (proc, log_fh)."""
+    env = os.environ.copy()
</file context>


for line in lines:
line = line.strip()
if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: CUDA error detection is broken due to a case mismatch: line.lower() is checked against the mixed-case literal "CUDA error", so that branch can never match and CUDA errors may be missed.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 190:

<comment>CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</comment>

<file context>
@@ -0,0 +1,408 @@
+
+    for line in lines:
+        line = line.strip()
+        if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():
+            result["oom"] = True
+        if "[spec-decode]" in line and "tokens=" in line and "accepted=" in line:
</file context>

deadline = time.time() + timeout
while time.time() < deadline:
try:
result = subprocess.run(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Request failures are silently ignored; send_request does not check result.returncode, and run_cell never validates the response before extracting metrics.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 69:

<comment>Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</comment>

<file context>
@@ -0,0 +1,408 @@
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        try:
+            result = subprocess.run(
+                ["curl", "-sf", f"http://127.0.0.1:{port}/health"],
+                capture_output=True, text=True, timeout=5
</file context>

wall_s = parse_wall_s(parsed["server_done"])
prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
gate_line = parsed["spec_gate"]
is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: is_ar classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py, line 283:

<comment>`is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</comment>

<file context>
@@ -0,0 +1,437 @@
+    wall_s     = parse_wall_s(parsed["server_done"])
+    prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
+    gate_line  = parsed["spec_gate"]
+    is_ar      = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+
+    gate_floor_reason = "N/A"
</file context>
Suggested change
is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None
is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is not None

action="append",
default=[],
metavar="ARG",
help="Extra arg to pass to BOTH server binaries (repeatable). "

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 358:

<comment>Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</comment>

<file context>
@@ -0,0 +1,452 @@
+        action="append",
+        default=[],
+        metavar="ARG",
+        help="Extra arg to pass to BOTH server binaries (repeatable). "
+             "E.g. --extra-server-arg --cache-type-k --extra-server-arg f16",
+    )
</file context>

SEED = 42
N_GEN = 128 # decode tokens per probe
SERVER_READY_TIMEOUT_S = 300 # seconds to wait for server health
CHARS_PER_TOKEN = 4.0 # empirical: ctx_032768.json = 131072 chars / 32768 tokens

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 61:

<comment>Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</comment>

<file context>
@@ -0,0 +1,452 @@
+SEED            = 42
+N_GEN           = 128            # decode tokens per probe
+SERVER_READY_TIMEOUT_S = 300     # seconds to wait for server health
+CHARS_PER_TOKEN = 4.0            # empirical: ctx_032768.json = 131072 chars / 32768 tokens
+
+CTXSWEEP_DIR = os.path.dirname(os.path.abspath(__file__))
</file context>


| Bench | Blog Target | This Run | Status |
|-----------------------------|-------------|------------------|--------------------|
| Binary md5 | — | e9cb2790bb8ede64 | — |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md, line 131:

<comment>Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</comment>

<file context>
@@ -0,0 +1,143 @@
+
+| Bench                       | Blog Target  | This Run         | Status             |
+|-----------------------------|-------------|------------------|--------------------|
+| Binary md5                  | —           | e9cb2790bb8ede64 | —                  |
+| HumanEval mean tok/s        | 129.52      | **110.21**       | FAIL -19.3 tok/s   |
+| HumanEval mean AL           | 8.31        | **11.04**        | PASS +2.73         |
</file context>

- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
- B — build flag: DONE (server/build GRAPHS=ON).

P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses GRAPHS=ON but the actual CMake flag and the rest of the plan use GGML_CUDA_GRAPHS=ON. This could cause implementers to invoke the wrong build toggle.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At thoughts/shared/plans/cuda_graph_replay_team_plan.md, line 20:

<comment>Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</comment>

<file context>
@@ -0,0 +1,32 @@
+- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
+- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
+- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
+- B — build flag: DONE (server/build GRAPHS=ON).
+Total ~970K tokens.
+
</file context>

…or QK×PR372 composition

The library foundation of the snapshot×ledger unification (plan in thoughts/),
so the proven QK residency scorer composes with PR#372 across restore at ≥128K.

- Phase 1 (kvflash_pager.h): per-chunk ledger in serialize/deserialize —
  was_resident + qk_score + KV dtype enum; magic bumped KVFLASH1 (old blobs
  cleanly miss); deserialize re-pages only resident chunks; dtype-guard closes
  the latent equal-rowsize swap trap. Unit-tested (ledger round-trip).
- Phase 2 (kvflash_qk.h): rebuild/seed the QK pool from the restored ledger so
  the scorer is warm on turn N+1 instead of scoring every restored chunk as
  missing(-2.0). Unit-tested (8 new checks, restored scores != missing).
- Research/evidence: phase0_bitplane_lsh (the SimHash-on-quant-bits kill-test —
  surprise: MSB ρ=0.871 vs true QK, refutes "≈random", but modest given diffuse
  attention; sign bit carries the ranking); tbq4/tq3 fast-FA prior art.

Phase 3 (consume restored KV instead of re-prefill — VALIDATED: 36.5x warm
prefill, AR greedy bit-identity PASS, binary 0b70418a) is preserved as a patch
(/tmp/b_phase23_plus_blockerA_*.patch); its qwen35_backend.cpp integration is
interleaved with an uncommitted CUDA-graph blocker-A draft and will land after a
clean un-interleave.

@cubic-dev-ai cubic-dev-ai Bot left a comment

7 issues found across 10 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:5">
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md:6">
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</violation>
</file>

<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">

<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</violation>
</file>

<file name="bench/bitplane_lsh_experiment.py">

<violation number="1" location="bench/bitplane_lsh_experiment.py:335">
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:253">
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:273">
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</violation>
</file>

<file name="server/src/common/kvflash_pager.h">

<violation number="1" location="server/src/common/kvflash_pager.h:589">
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</violation>
</file>

        print("Phase 3 KV+SSM seam bug confirmed. Target attention diverges.")
        print("The feature mirror is NOT the cause (both arms use AR without draft).")
        sys.exit(1)
    if c0_self and c0_c1:

P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/abc_cache_harness/phase3_gate_intraproc.py, line 220:

<comment>Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</comment>

<file context>
@@ -0,0 +1,231 @@
+        print("Phase 3 KV+SSM seam bug confirmed. Target attention diverges.")
+        print("The feature mirror is NOT the cause (both arms use AR without draft).")
+        sys.exit(1)
+    if c0_self and c0_c1:
+        print(f"GATE: PASS (AR mode) — C0 self-consistent AND C1 identical to C0.")
+        print(f"Phase 3 KV+SSM seam is correct. Warm-prefill speedup: {p0:.3f}s -> {p1:.3f}s ({speedup:.1f}x)")
</file context>
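
A minimal sketch of a stricter verdict, assuming the harness can obtain some positive signal that the consume=1 arm reused the snapshot. The names `gate_verdict` and `restore_observed_in_log` are hypothetical, and the log marker is a placeholder: the real server log line, if one exists, is not shown in this PR. Only `c0_self` and `c0_c1` come from the reviewed code.

```python
def restore_observed_in_log(log_text: str, marker: str) -> bool:
    """True if any log line contains `marker` (a placeholder string, an assumption)."""
    return any(marker in line for line in log_text.splitlines())


def gate_verdict(c0_self: bool, c0_c1: bool, restore_observed: bool) -> str:
    """Combine output identity with evidence that the restore path actually ran."""
    if not c0_self:
        return "FAIL: C0 is not self-consistent"
    if not c0_c1:
        return "FAIL: C1 differs from C0"
    if not restore_observed:
        # Identical output alone is also what a silent re-prefill would produce.
        return "INCONCLUSIVE: no evidence that C1 restored from the snapshot"
    return "PASS"
```

The point of the third state is that identical tokens are necessary but not sufficient: if the consume arm silently falls back to re-prefill, both arms match trivially and the gate proves nothing about the seam.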

@@ -0,0 +1,179 @@
# Fast FlashAttention for very-low-bit (3-bit / ternary) KV cache — prior art

P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md, line 5:

<comment>External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</comment>

<file context>
@@ -0,0 +1,179 @@
+
+**Problem:** `tq3_0` KV in llama.cpp/ggml-cuda decodes ~2× slower than `q4_0`/`f16` because there is no fast tensor-core FlashAttention kernel for it. This document surveys how the community (llama.cpp maintainers, research literature, production engines) handles fast attention over sub-4-bit KV.
+
+Research date: 2026-06-22.
+
+---
</file context>

            break

    # Spearman rank correlation
    from scipy.stats import spearmanr

P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 335:

<comment>scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</comment>

<file context>
@@ -0,0 +1,392 @@
+            break
+
+    # Spearman rank correlation
+    from scipy.stats import spearmanr
+    rho_1bit, _ = spearmanr(s_true, s_1bit)
+    rho_2bit, _ = spearmanr(s_true, s_2bit)
</file context>
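
A fail-fast check at script start is one way to avoid losing a long run to a late `ImportError`. The sketch below uses the standard-library `importlib.util.find_spec`; the helper names `missing_modules` and `require_modules` are hypothetical and not part of the experiment script. Pass top-level package names only (for example `scipy`, not `scipy.stats`).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found on this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def require_modules(names: Iterable[str]) -> None:
    """Exit before any expensive work if a required module is absent."""
    missing = missing_modules(names)
    if missing:
        raise SystemExit(f"missing required modules: {', '.join(missing)}")
```

Calling `require_modules(["numpy", "scipy"])` as the first statement of `main()` surfaces the problem in seconds. Moving the `from scipy.stats import spearmanr` line to the top of the file has the same effect and is the simpler fix if nothing else depends on the lazy import.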

@@ -0,0 +1,277 @@
# TBQ4 fused-dequant FlashAttention — extracted technique (Indras-Mirror/llama.cpp-turboq-mtp)

P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md, line 273:

<comment>External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</comment>

<file context>
@@ -0,0 +1,277 @@
+## Source URLs (all fetched 2026-06-22)
+
+- Repo: https://github.com/Indras-Mirror/llama.cpp-turboq-mtp
+- Kernel: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/fattn-mma-tbq4.cuh
+- Launcher: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/fattn-mma-tbq4-launch.cuh
+- Centroids/WHT: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/tbq4-cuda.cuh
</file context>

@@ -0,0 +1,277 @@
# TBQ4 fused-dequant FlashAttention — extracted technique (Indras-Mirror/llama.cpp-turboq-mtp)

P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md, line 253:

<comment>MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</comment>

<file context>
@@ -0,0 +1,277 @@
+  or the visible commit list (see Caveats). The mechanism that *produces* that result — fused
+  dequant, no HBM FP16 KV — is confirmed in code.
+
+## License / attribution
+
+- **MIT** (llama.cpp upstream license; fork shows an MIT badge). Reusing the kernel is permitted
</file context>

        if (n < expected) return false;

        // Read ledger into a temp buffer before reset() clears state.
        std::vector<uint8_t>  ledger_was_res(nc, 1u); // default: treat as resident

P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided nc before using it to allocate ledger/host buffers and resize chunks_. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/kvflash_pager.h, line 589:

<comment>deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</comment>

<file context>
@@ -515,42 +530,79 @@ class KvFlashPager {
+        if (n < expected) return false;
+
+        // Read ledger into a temp buffer before reset() clears state.
+        std::vector<uint8_t>  ledger_was_res(nc, 1u); // default: treat as resident
+        std::vector<float>    ledger_scores(nc, -std::numeric_limits<float>::infinity());
+        if (has_led) {
</file context>

**Verdict: PARTIAL-REFUTES Momus.**

1-bit MSB is NOT random — Spearman ρ=0.87 vs FULL-QK. It strongly ranks keys.
But 1-bit mass-recall@10% = 0.80 (vs full 0.86), and reaches 0.9 only at k=30%.

P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md, line 6:

<comment>Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</comment>

<file context>
@@ -0,0 +1,100 @@
+**Verdict: PARTIAL-REFUTES Momus.**
+
+1-bit MSB is NOT random — Spearman ρ=0.87 vs FULL-QK. It strongly ranks keys.
+But 1-bit mass-recall@10% = 0.80 (vs full 0.86), and reaches 0.9 only at k=30%.
+2-bit (magnitude only, no sign) = worse than random at count-recall. 3-bit ≈ full (ρ=0.97).
+
</file context>

…efill (opt-in)

Pooled restore consumes the deserialized KV for the chunk-aligned prefix
[0, snap_pos) and prefills only the suffix [snap_pos, prompt_len), behind
KVFLASH_RESTORE_CONSUME (default 0 = legacy re-prefill).

Validated: turn-2 prefill 36.9x faster (36.9s->1.0s) at ~35K tokens, with greedy
AR output token-for-token IDENTICAL to the re-prefill path. Completes the
snapshot x ledger x QK-pool composition (Phases 1-3).

KNOWN CEILING: above ~35K tokens the AR output DIVERGES from full re-prefill
(reused pooled KV differs from recompute once the 8192-pool evicts at scale).
Do NOT enable default-on for deep context until that divergence is root-caused
(acceptable KV-reuse near-tie flip vs real seam bug). Default-0 keeps production
on the safe re-prefill path.
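
The prefix/suffix split described above can be illustrated with a small sketch. This is an assumption-laden illustration, not the backend's logic: `split_restore` is a hypothetical name, the chunk size of 512 in the usage note is made up, and rounding `snap_pos` down to a chunk boundary is only one possible policy (the backend may instead reject a misaligned snapshot).

```python
from typing import Tuple

Range = Tuple[int, int]  # half-open token range [start, end)


def split_restore(snapshot_len: int, prompt_len: int,
                  chunk_size: int) -> Tuple[Range, Range]:
    """Split a prompt into a reused prefix and a suffix that still needs prefill.

    Assumption: snap_pos is the largest multiple of chunk_size that is covered
    by both the snapshot and the new prompt.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if snapshot_len < 0 or prompt_len < 0:
        raise ValueError("lengths must be non-negative")
    usable = min(snapshot_len, prompt_len)
    snap_pos = (usable // chunk_size) * chunk_size
    return (0, snap_pos), (snap_pos, prompt_len)
```

For example, `split_restore(35000, 36000, 512)` yields a reused prefix of `(0, 34816)` and a suffix of `(34816, 36000)`, so only 1184 tokens are prefilled on the warm turn.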

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 1 file (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="server/src/common/moe_hybrid_ffn_eval.cpp">

<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>

<file name="server/test/test_kvflash_placement.cpp">

<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>

<file name="server/src/qwen35moe/qwen35moe_backend.h">

<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>

<file name="server/src/qwen35/qwen35_backend.cpp">

<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>

<violation number="2" location="server/src/qwen35/qwen35_backend.cpp:1198">
P2: Restore-consume misalignment path logs 'falling back to re-prefill' but actually hard-fails the request by returning -1.</violation>
</file>

<file name="server/test/test_kvflash_moe_paged.sh">

<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>

<file name="bench/abc_cache_harness/replay_harness.py">

<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>

<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>

<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>

<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>

<file name="server/src/qwen35/gguf_target_loader.cpp">

<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>

<file name="server/src/draft/draft_gguf_loader.cpp">

<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>

<file name="harness/clients/session_inject_proxy.py">

<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.

(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.) [FEEDBACK_USED]</violation>

<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>

<file name="harness/clients/run_claude_code.sh">

<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>

<file name="bench/qwen35moe_dflash/RECIPE.md">

<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json:89">
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>

<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">

<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: kill_server sends SIGKILL without reaping the child; add proc.wait() to avoid zombie accumulation</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">

<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>

<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>
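
One possible shape for a readiness wait that is tied to the launched process, sketched under the assumption that the server is started with `subprocess.Popen` and that readiness means "accepts a TCP connection". The signature is hypothetical and differs from the script's existing `wait_for_server()`.

```python
import socket
import time


def wait_for_server(proc, host, port, timeout_s=60.0, poll_s=0.2):
    """Wait until `port` accepts connections while `proc` is still alive."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            # Fail fast instead of waiting out the full timeout.
            raise RuntimeError(f"server exited early with code {rc}")
        try:
            with socket.create_connection((host, port), timeout=poll_s):
                pass
        except OSError:
            time.sleep(poll_s)
            continue
        # Re-check liveness: a listener on the port is only ours if the
        # launched process is still running.
        if proc.poll() is None:
            return
    raise TimeoutError(f"no listener on {host}:{port} after {timeout_s}s")
```

This narrows the window for a false pass against an unrelated listener but does not close it; a stricter check would also confirm an identifying response from the server.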

<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
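
The parse failure can be reproduced with a small parser. `--extra-server-arg` is the option named in the finding; declaring it with `action="append"` and the forwarded value `--hypothetical-flag=1` are assumptions for illustration.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed declaration: a repeatable option that collects pass-through args.
parser.add_argument("--extra-server-arg", action="append", default=[])

# The "=" form works even when the value itself starts with "--".
args = parser.parse_args([
    "--extra-server-arg=--kvflash",
    "--extra-server-arg=--hypothetical-flag=1",
])
print(args.extra_server_arg)

# The space-separated form makes argparse read "--kvflash" as another option,
# so it reports "expected one argument" and exits.
try:
    parser.parse_args(["--extra-server-arg", "--kvflash"])
except SystemExit as exc:
    print("space-separated form rejected, exit code", exc.code)
```

A help string that shows the `--extra-server-arg=--flag` form avoids steering users into the failing invocation.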
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:253">
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</violation>

<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:273">
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</violation>
</file>

<file name="server/src/common/kvflash_pager.h">

<violation number="1" location="server/src/common/kvflash_pager.h:589">
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</violation>
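
The bound check can be sketched language-independently. The Python below is an illustration only: the 4-byte little-endian count header, `MAX_CHUNKS` and `RECORD_BYTES` are hypothetical and do not describe the real snapshot layout. The point is the comparison shape: divide the remaining bytes by the record size rather than multiplying the untrusted count, so the same check stays overflow-safe when written in C++.

```python
import struct

MAX_CHUNKS = 1 << 20   # hypothetical hard cap on chunk count
RECORD_BYTES = 16      # hypothetical per-chunk ledger record size


def read_chunk_count(blob: bytes, offset: int = 0) -> int:
    """Read and validate the chunk count before any allocation uses it."""
    if len(blob) - offset < 4:
        raise ValueError("snapshot truncated before chunk count")
    (nc,) = struct.unpack_from("<I", blob, offset)
    remaining = len(blob) - offset - 4
    # Division form: never computes nc * RECORD_BYTES from untrusted input.
    if nc > MAX_CHUNKS or nc > remaining // RECORD_BYTES:
        raise ValueError(f"implausible chunk count {nc}")
    return nc


good = struct.pack("<I", 2) + bytes(2 * RECORD_BYTES)
print(read_chunk_count(good))
```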
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:5">
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</violation>
</file>

<file name="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md">

<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md:6">
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</violation>
</file>

<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">

<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</violation>
</file>

<file name="bench/bitplane_lsh_experiment.py">

<violation number="1" location="bench/bitplane_lsh_experiment.py:335">
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</violation>
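
A small preflight check, run before the expensive work starts, is one way to fail fast. `require_modules` is a hypothetical helper; it relies only on the standard library's `importlib.util.find_spec`, which returns `None` for a top-level module that is not installed.

```python
import importlib.util


def require_modules(names):
    """Raise ImportError up front if any top-level module is missing."""
    missing = [n for n in names if importlib.util.find_spec(n) is None]
    if missing:
        raise ImportError("missing required modules: " + ", ".join(missing))


# At the top of the experiment script one might call:
#     require_modules(["scipy"])
# Demonstrated here with a stdlib module so the snippet runs anywhere.
require_modules(["json"])
```

Declaring scipy as a dependency remains the primary fix; the preflight only makes a missing install visible before hours of computation are spent.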
</file>

Comment thread server/src/qwen35/qwen35_backend.cpp Outdated
"[kvflash] restore-consume: kv_offset=%d not chunk-aligned "
"(chunk_tokens=%d) — falling back to re-prefill\n",
kv_offset, prefill_ubatch);
set_last_error("kvflash: restore-consume misaligned offset");

P2: Restore-consume misalignment path logs 'falling back to re-prefill' but actually hard-fails the request by returning -1.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 1198:

<comment>Restore-consume misalignment path logs 'falling back to re-prefill' but actually hard-fails the request by returning -1.</comment>

<file context>
@@ -1141,20 +1174,35 @@ int Qwen35Backend::do_prefill(const std::vector<int32_t> & tokens,
+                "[kvflash] restore-consume: kv_offset=%d not chunk-aligned "
+                "(chunk_tokens=%d) — falling back to re-prefill\n",
+                kv_offset, prefill_ubatch);
+            set_last_error("kvflash: restore-consume misaligned offset");
+            return -1;
+        }
</file context>

…; enable consume default-on

The consume-restored-KV path zero-padded kvflash_history_ for the restored prefix,
poisoning the drafter residency scorer under DFLASH_KVFLASH+draft+qk-policy.
Reconstruct it from the Phase-1 ledger scores so the drafter sees correct residency.
Validated under the production spec-decode path: needle retrieved + drafter accept
healthy at 64K/114K under consume. KVFLASH_RESTORE_CONSUME now defaults on
(env=0 force-disables).

Validation (35B-A3B-Q3_K_XL + dflash drafter + kvflash-policy=qk + q4_0 KV):
  ctx  | C0 needle | C1 needle | C0 accept | C1 accept | C0 t3_s | C1 t3_s | speedup
  64K  | RETRIEVED | RETRIEVED | 10.9%     | 10.9%     | 132.7   | 0.2     | 663x
  114K | RETRIEVED | RETRIEVED | 10.9%     | 10.9%     | 165.7   | 1.7     | 97x