feat(qwen35moe): pooled chunked prefill + snapshot/restore over KVFlash #430
dusterbloom wants to merge 17 commits into
…max_ctx
The MoE expert placement reserved KV for max_ctx (10 GiB @131072) even with --kvflash, forcing experts cold -> the pool was pure overhead. Reserve for the resident pool instead when the full reservation would force experts cold, so experts stay hot at high max_ctx (decouples max_ctx from the expert-placement cliff). A post-init gate disables KVFlash when it is redundant (full KV already fits all experts hot), keyed on all-hot-with-full-KV so it never disables a pool that is itself keeping experts hot. The rule is a shared pure helper (common/kvflash_placement.h) so future MoE backends inherit it. Unit test (5 cases, no GPU) + hardware-gated integration test (RTX 3090: 2203 cold -> 0 cold @max_ctx 131072, decode 43->66 tok/s).
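The placement rule in this commit message can be sketched as a pure function. This is a minimal Python model, not the code in common/kvflash_placement.h: the function names, the 2 GiB pool size, and the plain sum-against-VRAM fit test are assumptions for illustration. The budget figures (24 GiB card, ~80 KiB/token KV, experts ~13.19 GiB, core ~3.12 GiB, draft ~1.2 GiB) are the ones quoted in the PR's placement unit test.

```python
MiB = 1024 * 1024
GiB = 1024 * MiB
KV_PER_TOKEN = 80 * 1024  # bytes/token, figure quoted in the PR's unit test


def full_kv_bytes(max_ctx: int) -> int:
    """KV reservation if every token up to max_ctx stays resident."""
    return max_ctx * KV_PER_TOKEN


def kv_reservation(vram, fixed, experts, max_ctx, pool_bytes):
    """Return (bytes to reserve for KV, whether the pool stays enabled).

    Hypothetical rule: keep the full reservation when all experts still fit
    hot next to it (the pool is then redundant); otherwise reserve only the
    resident pool so the experts can stay hot.
    """
    full = full_kv_bytes(max_ctx)
    if fixed + experts + full <= vram:
        return full, False   # full KV fits with all experts hot
    return pool_bytes, True  # the pool is what keeps experts hot


if __name__ == "__main__":
    vram = 24 * GiB
    fixed = int(3.12 * GiB) + int(1.2 * GiB)  # core + draft
    experts = int(13.19 * GiB)
    for ctx in (65536, 131072):
        kv, pooled = kv_reservation(vram, fixed, experts, ctx, 2 * GiB)
        print(ctx, kv / GiB, pooled)
```

Under these assumed numbers the full reservation still fits at 65536 tokens, while at 131072 the 10 GiB reservation no longer fits next to the experts, so only the pool is reserved.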
Add serialize()/deserialize() to KvFlashPager (snapshot the full resident+paged KV in logical chunk order; header-validated against layout) and a factored for_each_segment() helper. serde uses synchronous get/set and adapts to the pinned void* host_data of the async-DMA path (Luce-Org#408). Add critical-chunk pinning (pin_range/is_pinned/unpin_all + a best-effort deadlock floor) OR-ed into the ensure_free_block + reselect protections; empty by default (byte-identical non-pin path). CPU unit test (no GPU) covers serde round-trip, header-guard reject, pinning, deadlock guard, reset.
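A header-validated snapshot of the kind described here could look roughly like the following. This is only a Python sketch of the idea (chunks written in logical order behind a layout header, with a reject on mismatch); the magic tag, the header fields, and both function signatures are hypothetical and do not reflect KvFlashPager's actual blob format.

```python
import struct

MAGIC = b"KVFP"                   # hypothetical magic tag
HEADER = struct.Struct("<4sIII")  # magic, n_layers, chunk_tokens, n_chunks


def serialize(chunks, n_layers, chunk_tokens):
    """Pack chunks in logical order behind a layout header.

    `chunks` is a list of equal-length bytes objects; index = logical chunk.
    """
    head = HEADER.pack(MAGIC, n_layers, chunk_tokens, len(chunks))
    return head + b"".join(chunks)


def deserialize(blob, n_layers, chunk_tokens, chunk_bytes):
    """Return the chunk list, or None when the header does not match the layout."""
    if len(blob) < HEADER.size:
        return None
    magic, layers, tokens, n_chunks = HEADER.unpack_from(blob)
    if (magic, layers, tokens) != (MAGIC, n_layers, chunk_tokens):
        return None  # header guard: blob was written for a different layout
    body = blob[HEADER.size:]
    if len(body) != n_chunks * chunk_bytes:
        return None  # truncated or padded payload
    return [body[i * chunk_bytes:(i + 1) * chunk_bytes] for i in range(n_chunks)]
```

Returning None on any mismatch keeps the caller's failure path explicit, which matters for the restore path discussed later in this review.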
…r KVFlash
Drive the MoE cold-expert hybrid path through KVFlash's resident pool: prompts larger than the pool prefill via a chunk loop over hybrid_forward_batch (eviction automatic in alloc_span); the restore residual delta routes through the same chunked path. Pooled snapshot save/restore serializes the pager into the prefix snapshot (PrefixSnapshot += is_pooled + blob; snapshot_target_cache/restore gain skip_kv; the blob rides the disk prefix-cache via a named tensor so cross-turn 128K restore composes). Drafter-scorer residency + DFLASH_KVFLASH_PIN_SPANS critical-chunk pinning wired in. Composes with the landed KVFlash (Luce-Org#373/Luce-Org#408/Luce-Org#385) and MoE restore (Luce-Org#362); serde adapts to the async pinned host_data. GPU gate (RTX 3090): pooled prefill preserves sink context + stable across pool sizes; cross-turn disk restore round-trips losslessly.
…gment
Three complexity cuts, no behavior change (GPU sink-recall gate + serde/placement unit tests green):
- merge restore residual's identical snap_pooled/else chunk loops into one (the else ct ternary already subsumes the pooled case)
- extract chunked_prefill() shared by generate_impl kvf_paged + restore residual
- inline single-caller for_each_segment template into serialize
net -25 lines (54 ins / 79 del).
6 issues found across 16 files
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_target_graph.cpp">
<violation number="1" location="server/src/qwen35/qwen35_target_graph.cpp:1572">
P2: Blob refresh on reuse can silently drop KVFlash data when blob presence changes, because no blob tensor is created outside the alloc path.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.
At server/src/qwen35/qwen35_backend.cpp, line 899:
<file context>
@@ -851,16 +893,29 @@ GenerateResult Qwen35Backend::restore_and_generate_impl(int slot,
+ // Restore snapshot (skip KV copy when pooled; pager handles KV separately).
+ const PrefixSnapshot & snap_ref = prefix_snapshots_[slot];
+ const bool snap_pooled = snap_ref.is_pooled;
+ restore_target_cache(snap_ref, cache_, snap_pooled);
+
+ // Pooled restore: rebuild pager from blob so KV rows are accessible.
</file context>
Suggested change:
- restore_target_cache(snap_ref, cache_, snap_pooled);
+ if (!restore_target_cache(snap_ref, cache_, snap_pooled)) {
+     result.error = "restore";
+     out_io.emit(-1);
+     return result;
+ }
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.
At server/src/common/moe_hybrid_ffn_eval.cpp, line 1076:
<file context>
@@ -1066,6 +1066,19 @@ static bool eval_moe_hybrid_ffn_batched_core(
+ int32_t next = 0;
+ for (int s = 0; s < n_used; ++s) {
+ if (hot_wts[base + s] > 0.0f) continue;
+ while ([&]{ for (int k=0; k<n_used; ++k) if (k!=s && hot_sel[base+k]==next) return true; return false; }())
+ if (++next >= n_hot_init) next = 0;
+ hot_sel[base + s] = next++;
</file context>
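One way to make the scan terminate is to bound the probe to a single lap over the hot indices and report failure when no unused index exists. The sketch below models that in Python for a single token (so `base` is dropped); the function name and the boolean failure return are assumptions, and the actual fix in the C++ path may differ.

```python
def assign_padding_slots(hot_sel, hot_wts, n_hot_init):
    """Give each zero-weight slot a hot-expert index that no other slot uses.

    Returns False instead of spinning when every index in [0, n_hot_init)
    is already taken by another slot.
    """
    n_used = len(hot_sel)
    nxt = 0
    for s in range(n_used):
        if hot_wts[s] > 0.0:
            continue  # slot already routed to a real hot expert
        taken = {hot_sel[k] for k in range(n_used) if k != s}
        for _ in range(n_hot_init):  # at most one full lap over the indices
            if nxt not in taken:
                break
            nxt = (nxt + 1) % n_hot_init
        else:
            return False  # no free index: caller must fall back
        hot_sel[s] = nxt
        nxt = (nxt + 1) % n_hot_init
    return True
```

With `n_hot_init` smaller than the number of routed slots, the original `while` never finds a free value; here the same input returns False after `n_hot_init` probes.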
P2: Don't use `|| true` to swallow pipeline errors; store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.
At server/test/test_kvflash_moe_paged.sh, line 61:
<file context>
@@ -0,0 +1,83 @@
+ kill -0 "$pid" 2>/dev/null || break
+ sleep 2
+ done
+ curl -fsS "http://$HOST:$PORT/v1/chat/completions" -H 'Content-Type: application/json' \
+ --data @"$REQ" 2>/dev/null \
+ | python3 -c 'import sys,json; print(json.load(sys.stdin)["choices"][0]["message"]["content"])' \
</file context>
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.
At server/test/test_kvflash_placement.cpp, line 26:
<file context>
@@ -0,0 +1,85 @@
+ // qwen3.6-35B-A3B-like budget on a 24 GiB card:
+ // ~80 KiB/token KV (5 GiB @ 65536, 10 GiB @ 131072)
+ // experts ~13.19 GiB, core ~3.12 GiB, draft ~1.2 GiB present.
+ const uint64_t MiB = 1024ull * 1024;
+ const uint64_t GiB = 1024ull * MiB;
+ const uint64_t kv_per_tok = 80 * 1024; // bytes/token
</file context>
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.
At server/src/qwen35moe/qwen35moe_backend.h, line 111:
<file context>
@@ -83,13 +96,20 @@ class Qwen35MoeBackend : public Qwen35Backend {
// Persistent pipelined state (initialized once, reused across requests)
std::unique_ptr<struct PipelinedDecodeState> pipe_state_;
+ std::unique_ptr<HybridSpecGraphCache> hybrid_spec_graph_cache_;
+ bool spec_microbench_done_ = false;
bool ensure_pipe_state(int kv_start);
</file context>
…ectness fixes
DRAFTER CONVERTER (config-driven):
- convert_dflash_to_gguf.py reads all architecture params from config.json (hidden_size, n_layer, mask_token_id, target_layer_ids, layer_types for SWA, sliding_window). No hardcoded constants.
- quantize_draft_q8.py shares load_arch with the converter.
- GGUF metadata: dflash.mask_token_id, dflash.target_layer_ids[], dflash.block_size, attention.sliding_window + pattern.
- draft_gguf_loader.cpp: read_draft_capture_config(), mask from GGUF metadata, block_size override, SWA pattern from metadata.
- draft_safetensors_loader.cpp: dynamic layer count, SWA+mask from config.json.
- gguf_target_loader.cpp: respect drafter-specified capture layers instead of overwriting with evenly-spaced heuristic.
- qwen35_backend.cpp: early-read capture sync + mask token propagation.
- internal.h: capture_layer_ids[16], DFLASH_MAX_CAPTURE_LAYERS=16.
- dflash27b.h: DFLASH_MAX_CAPTURE_LAYERS=16.
SPEC-DECODE PERFORMANCE:
- graph_builders.cpp: build_lm_head_projection_step skips rebuild when ctx alive + n_tokens matches (centralized guard; was per-call-site).
- qwen35_backend.cpp: do_spec_decode uses member draft_sg_ (not local) for graph persistence; kFastRollbackThreshold env-tunable (DFLASH_FAST_ROLLBACK_MIN, default 5).
- dflash_draft_graph.cpp: exact-ctx_len non-view reuse guard (DFLASH_DRAFT_GRAPH_REUSE, default ON). 4MB ctx alloc (was 256MB).
- graph_builders.cpp: 4MB ctx alloc (was 64MB).
- step_graph.h: graph_ctx_len + graph_used_view tracking fields.
SPEC-DECODE CORRECTNESS:
- qwen35_target_graph.cpp: DFLASH_FEAT_RING_CAP env overrides the hardcoded 4096 feature ring cap. Default 4096 causes acceptance collapse from 85% to 7.7% EXACTLY at 4096 prompt tokens (ring wrap corrupts features).
- qwen35_backend.cpp: mirror init honors DFLASH_FEAT_RING_CAP.
- qwen35_dflash_target.cpp: guard against invalid token IDs from GPU argmax at long context (NaN/Inf → clamp to 0, verify rejects gracefully).
MOE EXPERIMENTAL (behind flags):
- qwen35moe_backend.cpp: DFLASH_MOE_ALLHOT_HYBRID=1 builds moe_hybrid storage even with 0 cold experts to enable pipelined spec-decode verify.
- Persistent moe_hybrid_logits_sg_ graph (was 64MB per-token alloc in hybrid_forward_one_token). GPU argmax (4 bytes vs 1MB vocab readback).
- Batched verify/replay via hybrid_forward_batch (was 8 sequential forwards).
VALIDATED:
- 27B dense + reconverted drafter: 57% accept on code gen, 85% on short prompts. block=16 gives 252 tok/s (2.2x AR) on code generation.
- 35B-A3B MoE + reconverted new drafter: 86% accept, 245 tok/s (2.1x AR).
- Feature ring cap=16384: 85% holds to 5K tokens, 58% to 10K.
- Full pFlash + dFlash stack: goldgate agentic trace passes (100% tool calls valid), pFlash cuts 34K prefill from 475s to 208s (2.3x).
- repo_inspection prompt: correct answers, spec at 33.8% accept, 34 tok/s.
…ash env vars
- DD path: dflash-draft-3.6-bf16-reconverted.gguf (old GGUF had garbage metadata)
- DFLASH_DRAFT_BLOCK_SIZE=16 (model card sweet spot)
- DFLASH_FEAT_RING_CAP=16384 (default 4096 collapses acceptance at the ring boundary)
… full ctx
- drafter GGUF baked rope.freq_base=1M but trains/serves at the target's 10M (converter bug); the unpark guard only corrected the 8-layer drafter, so the 6-layer drafter ran at 1M vs target 10M. Align dw_.rope_theta to the target at both load sites (initial + unpark).
- DFLASH_FEAT_RING_CAP default 4096 wrapped the target-feature ring above 4K ctx, feeding the drafter stale features and collapsing accept to 0.1% at 27K. Default to max_ctx so the ring covers the full reserved context; env lowers it for VRAM.
- both restore dFlash spec-decode acceptance on long-context MoE (0.1% -> ~16% on 27K agentic; content-dependent ceiling otherwise).
- harness: repo_inspection path dflash/->server/ (repo renamed in 39fe251); run_claude_code flags fixed to --allowedTools/--dangerously-skip-permissions (the old --tools/--permission-mode dontAsk are invalid on claude-code 2.x); session_inject_proxy gains --force-temperature, thinking injection and body dump for bench control; add qwen35moe dflash gate harness.
dFlash spec-decode is content-dependent: it wins big on verbatim/copyable
output (drafter accept ~80%, ~235 tok/s) but is 2-4x SLOWER than plain AR on
novel/high-entropy output (accept ~6-16%) — and on this MoE the rejected tokens
still pay full expert-routing verify cost. Gate it on target entropy so the
decoder automatically picks the faster path, transparently, no knobs.
- per decision point compute target top-1 prob p1 (cheap entropy proxy = expected
acceptance) from the logits we already have.
- keep spec at the trained full block (16) when confidence is high; floor the
remainder of the turn to the efficient do_ar_decode (real AR ~100+ tok/s) when
the drafter is losing.
- hysteresis: 1-step probe + sustained-low streak (DFLASH_ENTROPY_SUSTAIN, def 2)
holds full blocks through transient dips ("big blocks on uncertain transitions");
near-tie immediate floor (DFLASH_ENTROPY_TIE_P1, def 0.45) turns verify off when
the argmax is ambiguous.
- threshold DFLASH_ENTROPY_AR_P1 (def 0.90) swept for the Pareto point; gate
default-on, DFLASH_ENTROPY_GATE=0 disables, DFLASH_ENTROPY_DEBUG traces p1.
- measured: verbatim 236 / code-gen->AR 117 / novel->AR 83 tok/s, always >= AR.
- temp 0: semantically equivalent to AR (spec verifies vs target argmax; both take
the argmax). Not bit-identical — near-tie argmax flips via verify-batch FP
reduction order, the established spec-decode bar.
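The gate described in this commit message can be modeled as a small state machine. The sketch below is a Python illustration under stated assumptions: the class and method names are hypothetical, the "1-step probe" is not modeled, and only the three thresholds and their defaults (0.90, 0.45, sustain 2) are taken from the message.

```python
import math


def top1_prob(logits):
    """Softmax probability of the argmax token (the p1 proxy)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return max(exps) / sum(exps)


class EntropyGate:
    """Decides per step whether to keep speculating or floor the turn to AR."""

    def __init__(self, ar_p1=0.90, tie_p1=0.45, sustain=2):
        self.ar_p1 = ar_p1      # below this, the step counts as low confidence
        self.tie_p1 = tie_p1    # below this, floor immediately (near-tie argmax)
        self.sustain = sustain  # consecutive low steps needed to floor
        self.low_streak = 0
        self.floored = False

    def use_spec(self, p1):
        if self.floored:
            return False        # floored for the remainder of the turn
        if p1 < self.tie_p1:
            self.floored = True
            return False
        if p1 < self.ar_p1:
            self.low_streak += 1
            if self.low_streak >= self.sustain:
                self.floored = True
                return False
        else:
            self.low_streak = 0  # a confident step clears the streak
        return True
```

The streak counter is what gives the hysteresis: a single dip below 0.90 does not end speculation, but two in a row (or one near-tie) does.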
…after cliff
Two changes that make dFlash spec-decode safe and useful across content and context length without per-model tuning.
1. Long-context drafter cliff fix. The block-diffusion drafter's prediction collapses when it self-attends more than ~2048 tokens (measured: 93% accept at draft_ctx<=2048 vs 6% at 4096, independent of total prompt context). The old default ran it at max(2048, draft_ctx_max=4096)=4096, past the drafter's effective limit, so spec-decode died above ~2K context. Cap the drafter's self-attention at 2048 by default; spec now holds 77-93% accept / 110-200 tok/s out to 35K context for recent-derived output. DFLASH_DRAFT_CTX_MAX overrides for drafters with a larger usable window.
2. Self-calibrating commit-EMA gate (replaces the p1-entropy gate). dFlash wins only when its realized throughput beats AR; that break-even is model- and context-dependent (a fixed entropy threshold over-floored dense, under-floored MoE). Measure t_ar once per process (cached on the backend, no per-turn warmup tax), then floor the remainder of a turn to the efficient AR path when the EMA of commit_n*t_ar/step_wall stays below 1.0 (spec slower than AR) for a few steps. Knob-free, never slower than AR; floors novel/high-entropy turns, keeps spec on code/structured. Env: DFLASH_SPEC_GATE(=1), _MARGIN, _SUSTAIN, _WARMUP, _DEBUG. Applies to both base (do_spec_decode) and MoE hybrid (do_hybrid_spec_decode) paths. Temp 0: semantically equivalent to AR.
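The commit-EMA gate reduces to tracking one ratio, commit_n*t_ar/step_wall, which is the realized speedup over AR for that step. A rough Python model follows; the class name, the smoothing factor `alpha`, and the default values of `margin`, `sustain` and `warmup` are assumptions chosen for the example, not the values behind the DFLASH_SPEC_GATE knobs.

```python
class CommitEmaGate:
    """Floors a turn to AR once the smoothed spec speedup stays below margin."""

    def __init__(self, t_ar, alpha=0.5, margin=1.0, sustain=2, warmup=0):
        self.t_ar = t_ar        # seconds per AR token, measured once
        self.alpha = alpha      # EMA smoothing factor (assumed)
        self.margin = margin    # break-even speedup
        self.sustain = sustain  # consecutive below-margin steps before flooring
        self.warmup = warmup    # initial steps that never floor
        self.ema = None
        self.steps = 0
        self.low_streak = 0
        self.floored = False

    def observe(self, commit_n, step_wall):
        """Record one spec step; return True while spec should continue."""
        if self.floored:
            return False
        speedup = commit_n * self.t_ar / step_wall
        if self.ema is None:
            self.ema = speedup
        else:
            self.ema = self.alpha * speedup + (1.0 - self.alpha) * self.ema
        self.steps += 1
        if self.steps <= self.warmup:
            return True
        self.low_streak = self.low_streak + 1 if self.ema < self.margin else 0
        if self.low_streak >= self.sustain:
            self.floored = True
        return not self.floored
```

For example, with t_ar = 10 ms and 40 ms spec steps, a step committing 8 tokens has speedup 2.0, while a step committing 1 token has speedup 0.25; a run of the latter pulls the EMA under 1.0 and floors the turn.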
…code; bound MoE prefill sync
- root cause: the long-context accept cliff was the target-feature ring cap (FEAT_RING_CAP), NOT a drafter 2048 self-attention limit. When prompt_tokens > ring_cap the ring wraps, the drafter cross-attends stale features, and the commit-EMA gate floors to AR. Verified by a fully-crossed draft_ctx x ring_cap 2x2 (momus-reviewed).
- ring_cap must be >= max_ctx (mandatory for correctness); the shipped RECIPE's FEAT_RING_CAP=4096 reintroduced the cliff.
- DRAFT_CTX_MAX=2048 was an amputation, not a fix: needle-in-middle control shows it craters distant recall ~46pp (marker 2.6K from end: 76.8% -> 30.9%). draft_ctx=8192 is the VRAM-max uncrippled window (~4x recall reach), with DFLASH_FEATURE_DTYPE=f16 freeing the mirror headroom.
- KV quant does not move the draft_ctx ceiling (q4/q8/tq3 all cap at 8192) -- the limit is the draft compute graph's decode scratch, not KV reservation; draft_ctx>8192 is self-defeating (gate floors to AR even when f16 makes it fit).
- bound the MoE prefill feature-sync at qwen35moe_backend.cpp:1205 to min(committed, cap) (was raw committed, silently no-ops when committed>cap -> stale features), matching the restore-path and base-path patterns.
- add ctxsweep harness + prompt fixtures + analysis docs documenting the 2x2, KV grid, and needle controls.
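The ring-wrap failure and the min(committed, cap) bound can be illustrated with a toy model. The Python sketch below assumes the feature ring is indexed by position modulo ring_cap, which the commit message implies but does not state; all three function names are hypothetical.

```python
def ring_contents(n_tokens, ring_cap):
    """Token index held by each ring slot after writing n_tokens features.

    Assumes slot = position % ring_cap; None marks a never-written slot.
    """
    slots = [None] * ring_cap
    for pos in range(n_tokens):
        slots[pos % ring_cap] = pos
    return slots


def oldest_surviving(n_tokens, ring_cap):
    """Earliest prompt position whose feature is still in the ring."""
    return max(0, n_tokens - ring_cap)


def sync_count(committed, ring_cap):
    """Feature rows to sync, bounded as in the commit: min(committed, cap)."""
    return min(committed, ring_cap)
```

With 6 tokens and a cap of 4, slots 0 and 1 now hold tokens 4 and 5, so a reader that still expects tokens 0 and 1 there gets the wrong features. Setting ring_cap >= max_ctx removes the wrap entirely.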
12 issues found across 53 files (changes from recent commits).
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_target_graph.cpp">
<violation number="1" location="server/src/qwen35/qwen35_target_graph.cpp:1572">
P2: Blob refresh on reuse can silently drop KVFlash data when blob presence changes, because no blob tensor is created outside the alloc path.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="bench/abc_cache_harness/replay_harness.py">
<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>
<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>
<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>
<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>
<file name="harness/clients/session_inject_proxy.py">
<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)</violation>
<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>
<file name="harness/clients/run_claude_code.sh">
<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>
<file name="bench/qwen35moe_dflash/RECIPE.md">
<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.
At server/src/draft/draft_gguf_loader.cpp, line 158:
<file context>
@@ -117,6 +117,65 @@ int count_swa_layers(const DraftWeights & w) {
+ // Read target_layer_ids array (exact capture positions from training).
+ std::snprintf(key, sizeof(key), "%s.%s", A.c_str(), "dflash.target_layer_ids");
+ int64_t tli_id = gguf_find_key(gctx, key);
+ if (tli_id >= 0 && gguf_get_kv_type(gctx, tli_id) == GGUF_TYPE_ARRAY) {
+ const size_t n = std::min((size_t)gguf_get_arr_n(gctx, tli_id),
+ (size_t)max_ids);
</file context>
Suggested change:
- if (tli_id >= 0 && gguf_get_kv_type(gctx, tli_id) == GGUF_TYPE_ARRAY) {
+ if (tli_id >= 0 && gguf_get_kv_type(gctx, tli_id) == GGUF_TYPE_ARRAY &&
+     gguf_get_arr_type(gctx, tli_id) == GGUF_TYPE_INT32) {
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.
At bench/abc_cache_harness/replay_harness.py, line 723:
<file context>
@@ -0,0 +1,1361 @@
+ while not log_path.exists() and time.time() < deadline:
+ time.sleep(1)
+
+ cache_off = done_off = spec_off = ar_off = pflash_off = survival_off = 0
+
+ results = []
</file context>
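A common way to avoid re-parsing old lines is to carry the byte offset forward between reads instead of resetting it. The helper below is a minimal sketch of that pattern; `read_new_lines` is a hypothetical name, not a function from replay_harness.py, and it assumes the log is only ever appended to.

```python
def read_new_lines(path, offset):
    """Return (lines appended since `offset`, new offset) for an append-only log."""
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    lines = data.decode("utf-8", errors="replace").splitlines()
    return lines, offset + len(data)
```

Each repeat would then start from the offset returned by the previous one, so its metrics are parsed only from lines the server wrote during that repeat. A production version would also need to handle a partially written final line.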
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.
At server/src/qwen35/gguf_target_loader.cpp, line 480:
<file context>
@@ -463,12 +463,41 @@ bool load_target_gguf_partial(const std::string & path,
+ // If N changed from default 5, the IDs were definitely set by
+ // early-read and should be respected.
+ const bool was_early_read = (N != DFLASH27B_DRAFT_N_TARGET_LAYERS);
+ if (was_early_read) {
+ std::printf("[loader] using drafter-specified capture layers (%d)\n", N);
+ } else {
</file context>
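The range check this comment asks for is small. The following Python sketch shows the shape of it; the function name and the None-means-fall-back convention are assumptions, while the limit of 16 mirrors DFLASH_MAX_CAPTURE_LAYERS from the earlier commit message.

```python
def validate_capture_layers(ids, n_layer, max_ids=16):
    """Return the ids if every one is a valid layer index, else None.

    None tells the caller to fall back (for example to the evenly spaced
    heuristic) instead of capturing from layers that do not exist.
    """
    if not ids or len(ids) > max_ids:
        return None
    if any(i < 0 or i >= n_layer for i in ids):
        return None
    return list(ids)
```

Rejecting the whole list, rather than dropping the bad entries, avoids handing the drafter a capture vector with fewer layers than it was trained on.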
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)
At harness/clients/session_inject_proxy.py, line 125:
<file context>
@@ -99,14 +102,28 @@ def do_POST(self):
+ obj["extra_body"]["session_id"] = self.session_id
+ if self.force_temperature is not None:
+ obj["temperature"] = self.force_temperature
+ if self.think_budget and path.startswith("/v1/messages"):
+ obj["thinking"] = {"type": "enabled", "budget_tokens": self.think_budget}
body = json.dumps(obj).encode("utf-8")
</file context>
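The usual fix for this class of bug is an explicit `is not None` test, as the neighbouring `force_temperature` check already does. A standalone sketch follows; `inject_thinking` is a hypothetical free function standing in for the proxy's method body. Whether the upstream API accepts a zero `budget_tokens` is not something this sketch asserts.

```python
def inject_thinking(obj, think_budget, path):
    """Add a `thinking` block whenever a budget was configured, including 0."""
    if think_budget is not None and path.startswith("/v1/messages"):
        obj["thinking"] = {"type": "enabled", "budget_tokens": think_budget}
    return obj
```

With this form, `think_budget=0` is forwarded as a configured value and only `None` means "unset".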
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.
At harness/clients/run_claude_code.sh, line 79:
<file context>
@@ -69,9 +76,9 @@ timeout "${CLAUDE_TIMEOUT}s" "$CLAUDE_BIN" \
--model "$MODEL_ID" \
- --tools "$CLAUDE_TOOLS" \
- --permission-mode dontAsk \
+ --dangerously-skip-permissions \
--no-session-persistence \
+ "${CLAUDE_EXTRA[@]}" \
</file context>
Suggested change:
- --dangerously-skip-permissions \
+ --tools "$CLAUDE_TOOLS" \
+ --dangerously-skip-permissions \
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.
At bench/abc_cache_harness/replay_harness.py, line 514:
<file context>
@@ -0,0 +1,1361 @@
+ str(SERVER_BIN),
+ str(TGT),
+ "--host", HOST,
+ "--port", str(PORT),
+ "--max-ctx", str(MAX_CTX),
+ "--cache-type-k", ctk,
</file context>
P3: Truncated sentence in KV precision sweep analysis: `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.
At bench/qwen35moe_dflash/ctxsweep/NOTES.md, line 51:
<file context>
@@ -0,0 +1,56 @@
+| f16 | 18.0s | 174 | 76.8 | 12.86 |
+| q4_0 | 18.0s | 167 | 76.8 | 12.86 |
+| q8_0 | 18.1s | 143 | 66.4 | 11.25 |
+| tq3_0 | 23.6s | 109 | 76.8 | 12.86 |
+f16 best; q4_0 EQUAL (free VRAM saver, no accept/AL cost); q8_0 ANOMALOUS (lower accept 66.4
+## KVFlash added to the 35B agentic config?
</file context>
| if not args.session_id: | ||
| print("[session-proxy] WARNING: no session_id set; proxy is pass-through only", flush=True) | ||
| if not args.session_id and args.force_temperature is None: |
P3: Startup warning is inaccurate when only THINK_BUDGET is configured. It can mislead debugging because proxy is not pass-through in that mode.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At harness/clients/session_inject_proxy.py, line 143:
<comment>Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</comment>
<file context>
@@ -120,19 +137,23 @@ def main():
- if not args.session_id:
- print("[session-proxy] WARNING: no session_id set; proxy is pass-through only", flush=True)
+ if not args.session_id and args.force_temperature is None:
+ print("[session-proxy] WARNING: no session_id or force_temperature set; proxy is pass-through only", flush=True)
</file context>
| - ❌ `DFLASH_DRAFT_CTX_MAX` < 8192 — amputates distant recall (see recall-horizon table). | ||
| - ❌ a different `draft_ctx`/ring/rope without re-checking accept — these are the documented footguns (see GOTCHAS.md). | ||
| See `GOTCHAS.md` (same dir) for the full footgun list, `charbench/NOTES.md` and |
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/RECIPE.md, line 123:
<comment>Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</comment>
<file context>
@@ -0,0 +1,124 @@
+- ❌ `DFLASH_DRAFT_CTX_MAX` < 8192 — amputates distant recall (see recall-horizon table).
+- ❌ a different `draft_ctx`/ring/rope without re-checking accept — these are the documented footguns (see GOTCHAS.md).
+
+See `GOTCHAS.md` (same dir) for the full footgun list, `charbench/NOTES.md` and
+`ctxsweep/NOTES.md` for the supporting measurements.
</file context>
…ignored draft_ctx knob
Two methodology flaws in the prior sweeps invalidated part of the recipe; a clean cold re-baseline (one prompt per fresh server, temp 0) establishes the truth.
- DFLASH_DRAFT_CTX_MAX is IGNORED on the MoE backend: qwen35moe_backend.cpp:2267 caps draft_ctx at max(2048, cfg_.draft_ctx_max=4096); the getenv exists only in the dense qwen35 backend. Every 2048/8192/16384 sweep changed an ignored var; draft_ctx was always 4096. The "draft_ctx=8192 uncripples distant recall" narrative was warm-EMA request-ordering + variance, not draft_ctx.
- DFLASH_FEATURE_DTYPE=f16 floors spec-decode to AR on every prompt (quantizes the cross-attended target features). Dropped; f32 mirror required.
- Distant recall works at the pinned draft_ctx=4096 via the drafter's cross-attention to the target-feature ring; needle 12K-deep holds 28.7% + reproduces the marker. So FEAT_RING_CAP=max_ctx is the sole real lever (no draft_ctx env port needed).
- Corrected recipe: FEAT_RING_CAP=max_ctx + f32 mirror + q4_0 KV, nothing else.
- Honest cold numbers (ring=max_ctx, f32, q4_0): recent/copy 76.8% / AL 12.86 / ~172 tok/s through 35K (~2.0x the AR ~86 floor); distant 12K-deep ~29%. The prior "92.7%" was a warm-EMA artifact.
- Add clean_rebaseline.md (authoritative) + earlyexit/asym/f16-isolation evidence; banner the superseded draft_ctx-varying docs.
1 issue found across 13 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="bench/abc_cache_harness/replay_harness.py">
<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>
<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>
<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>
<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>
<file name="harness/clients/session_inject_proxy.py">
<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.) [FEEDBACK_USED]</violation>
<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>
<file name="harness/clients/run_claude_code.sh">
<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>
<file name="bench/qwen35moe_dflash/RECIPE.md">
<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json:89">
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</violation>
</file>
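The `think_budget` finding above (session_inject_proxy.py:125) is a truthiness check that treats a zero budget as "unset". A minimal sketch of the explicit `None` comparison follows; the function name and the payload shape are illustrative assumptions, not the proxy's actual code.

```python
def build_thinking(think_budget):
    """Return a `thinking` payload, or None when no budget is configured.

    Hypothetical helper: the payload keys are an assumption for illustration.
    """
    # `if think_budget:` would skip injection for 0; compare against None
    # so that a zero budget is still forwarded.
    if think_budget is None:
        return None
    return {"type": "enabled", "budget_tokens": think_budget}
```

The same `is not None` comparison would apply to any metric where 0 or 0.0 is a valid value.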
| "mirror_cap": 40960, | ||
| "prompt": "needle_06k", | ||
| "status": "OK", | ||
| "accept_pct": 92.7, |
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json, line 89:
<comment>Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</comment>
<file context>
@@ -0,0 +1,130 @@
+ "mirror_cap": 40960,
+ "prompt": "needle_06k",
+ "status": "OK",
+ "accept_pct": 92.7,
+ "avg_commit": 14.83,
+ "decode_tps_spec": 220.57,
</file context>
…restore works
At ≥128K with the KVFlash pool active, turn 1 never saved a prefix snapshot —
the pooled-prefill branch was stubbed to a diagnostic ("boundary snapshot
skipped: pooled prefill relocates chunks") and returned without saving. So turn
2 found nothing to restore (prefix_len=0), fell back to a full cold re-prefill
(0.8s→77.6s), decode regressed 80→20 tok/s, and turn 3 crashed. The all-hot
35B-A3B runs the dense Qwen35Backend path (moe_hybrid==nullptr), so this was the
live bug for the user's deep-context (>128K = 39% of real prompts) workload.
- add KvFlashPager::serialize(max_chunks) to capture only chunks [0, max_chunks)
— the chunk-aligned turn boundary, not the whole prompt.
- add Qwen35Backend::snapshot_save_pooled_at(slot, boundary): floor the requested
snap_pos to a chunk multiple, set cur_pos to that boundary, serialize the
partial pager blob, and save it (the restore/deserialize path already existed
and was correct — only the save was missing).
- replace the pooled-prefill skip stub at the chunk-aligned boundary with the
real save; mirror the same save on the qwen35moe hybrid path.
- unit tests: floor_to_chunk + serialize(max_chunks) partial round-trip
(bit-identical first k chunks).
131K 3-turn smoke: turn-2 restore=true prefix_len=34077 (97.5% hit), turn-3
restore=true, no crash, tool_call_valid=1.0, decode recovered 20→56-59 tok/s.
Known follow-up: warm-prefill at 131K is still ~44s (deserialize re-pages the
whole pool) — correctness/crash/decode are fixed; restoring only resident chunks
is the next optimization.
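The boundary rule described in this commit (floor the requested snapshot position to a chunk multiple, then serialize only the chunks below it) can be sketched as follows. This is a Python illustration only: the real helper lives in the C++ backend, and the chunk size used in the usage line is an assumed value, not one taken from the PR.

```python
def floor_to_chunk(pos: int, chunk_tokens: int) -> int:
    """Round `pos` down to the nearest multiple of `chunk_tokens`."""
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    return (pos // chunk_tokens) * chunk_tokens


def chunks_to_serialize(snap_pos: int, chunk_tokens: int) -> int:
    """Number of whole chunks below the floored boundary, i.e. the
    `max_chunks` argument in the commit's serialize(max_chunks) wording."""
    return floor_to_chunk(snap_pos, chunk_tokens) // chunk_tokens


# Assumed chunk size of 256 tokens, for illustration only.
boundary = floor_to_chunk(1000, 256)
```

Tokens between the floored boundary and the requested position are not part of the snapshot, so they would have to be re-prefilled after a restore.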
Placement called kvflash_pool_from_env(max_context) with default args, taking the no-budget fallback (max_ctx/2). Runtime sizes the same pool with the real VRAM budget + scorer policy, getting a speed-capped value (e.g. 16384). On DFLASH_KVFLASH=auto at high max_ctx this over-reserved KV ~4x, under-budgeting experts and reducing hot placement.
Extract make_kvflash_budget() + kvflash_scorer_expected() and call them from both sites so reservation and runtime allocation size the pool identically. Add a pure unit test pinning the budgeted vs no-budget divergence.
(cherry picked from commit 656accd)
9b501bd to fc90d1e
…raph-refactor scaffolding
The full evidence trail from the deep-dive on Qwen3.6-27B dFlash vs the published Qwen3.5-27B blog, the user's real session distribution, and the CUDA-graph decode refactor plan.
- 27B beat-blog (model_ab_3.6_vs_3.5, beat_blog_results): best Qwen3.6-27B-Q4 = 124.8 tok/s mean (96.4% of the blog's 129.52) at --ddtree-budget 16, AL 11.15 vs blog 8.31 (+34%; our drafter accepts more). Per-step decomposition: verify cost identical (35ms); the 1.39x per-step gap is the 48 GatedDeltaNet SSM layers of the 3.6 hybrid architecture (16 attn + 48 SSM of 64), not config or implementation. q8_0 drafter dead on Ampere (scalar fallback). The gap is the model, not us.
- Real session distribution (session_distribution + analyze_sessions): 117 sessions, median prompt 37 tok / max 119k, median CONTEXT 94k, 39% of prompts land >128k. The workload lives in deep context, which is why cache-persistence + long-context decode are the load-bearing levers, not HumanEval-short.
- Equity audit + AR-vs-dFlash scaling + dense-vs-MoE best-config: dense was under-benched (q4-only, no ddtree); decode is under-tuned not ceiling; synthetic copy prompts inflate dFlash ~1.6-2x vs real agentic.
- Graph-refactor scaffolding: bit_identity_gate.py (4K/32K/71K token-for-token AR gate) + thoughts/shared/plans/cuda_graph_replay_team_plan.md (token-sized A/C/D plan; A draft exists uncommitted, validated behind the gate before merge).
- New long-context prompt fixtures (57k/64k/128k) + clean bench drivers.
33 issues found across 22 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>
<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">
<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: kill_server sends SIGKILL without reaping the child; add proc.wait() to avoid zombie accumulation</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">
<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>
<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
</file>
| proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path) | ||
| print(f"Server PID: {proc.pid}") | ||
| healthy = wait_healthy() |
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py, line 204:
<comment>Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</comment>
<file context>
@@ -0,0 +1,328 @@
+ proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
+ print(f"Server PID: {proc.pid}")
+
+ healthy = wait_healthy()
+ if not healthy:
+ print("ERROR: Server did not become healthy within timeout")
</file context>
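One way to bind the health wait to the spawned process is to poll the child's exit status on every iteration, so a dead child fails fast instead of letting another server on the same port answer for it. The sketch below assumes a caller-supplied `probe` callable (hypothetical) that returns True when the health endpoint responds; it is not the script's actual `wait_healthy()`.

```python
import subprocess
import sys
import time


def wait_healthy(proc, probe, timeout_s=300.0, poll_s=0.5):
    """Return True once `probe()` succeeds while `proc` is still running.

    `probe` is a hypothetical callable supplied by the caller, e.g. a
    function that performs the HTTP /health request.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        # If our child has exited, anything answering on the port is not it.
        if proc.poll() is not None:
            return False
        if probe():
            return True
        time.sleep(poll_s)
    return False
```

This only rules out the case where the child has already died. A stricter check would also need to rule out a stale server holding the port while the child is still starting, for example by choosing a free port per run.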
| ## Verdict | ||
| **The 15% gap is PRIMARILY THE MODEL, not the config.** |
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md, line 129:
<comment>Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</comment>
<file context>
@@ -0,0 +1,155 @@
+
+## Verdict
+
+**The 15% gap is PRIMARILY THE MODEL, not the config.**
+
+Evidence:
</file context>
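The best-case figure in this finding can be reproduced from the two throughput numbers quoted in the commit message earlier in the thread (124.8 tok/s against the blog's 129.52). The worst-case ~5.6% figure depends on data in the report file that is not shown here, so it is left out of this sketch.

```python
blog_tps = 129.52  # blog's published mean, as quoted in the commit message
ours_tps = 124.8   # best Qwen3.6-27B-Q4 mean from the same message

ratio = ours_tps / blog_tps        # fraction of the blog's throughput
gap_pct = (1.0 - ratio) * 100.0    # shortfall relative to the blog

print(f"{ratio:.1%} of blog, gap {gap_pct:.1f}%")
# prints: 96.4% of blog, gap 3.6%
```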
| return cmd | ||
| def launch_server(log_path): |
P1: --run-server path omits the documented flock GPU lock because launch logic is duplicated and inconsistent between launch_server_cmd() and launch_server(). This can cause GPU contention and corrupt benchmark validity.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py, line 159:
<comment>`--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</comment>
<file context>
@@ -0,0 +1,586 @@
+ return cmd
+
+
+def launch_server(log_path):
+ """Spawn the server in a child process. Returns (proc, log_fh)."""
+ env = os.environ.copy()
</file context>
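A minimal sketch of taking the GPU lock in-process with `fcntl.flock`, so that every launch path shares one acquisition point. The default path is the one named elsewhere in this review (`/tmp/lucebox_gpu.lock`); the function name is an assumption, and this is POSIX-only.

```python
import fcntl
import os


def acquire_gpu_lock(path="/tmp/lucebox_gpu.lock"):
    """Block until an exclusive advisory lock on `path` is held.

    Returns the open file descriptor. The lock is released when the
    descriptor is closed or the process exits.
    """
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)
    except BaseException:
        os.close(fd)
        raise
    return fd
```

Calling this once at startup, before any server is spawned, and keeping the descriptor open for the whole run would cover both `launch_server_cmd()` and `launch_server()` without duplicating the lock logic.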
| for line in lines: | ||
| line = line.strip() | ||
| if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower(): |
P1: CUDA error detection is broken due to a case mismatch: line.lower() is checked against the mixed-case literal "CUDA error", so that branch can never match and CUDA errors may be missed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 190:
<comment>CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</comment>
<file context>
@@ -0,0 +1,408 @@
+
+ for line in lines:
+ line = line.strip()
+ if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():
+ result["oom"] = True
+ if "[spec-decode]" in line and "tokens=" in line and "accepted=" in line:
</file context>
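The mismatch is that a lowercased string can never contain the mixed-case literal `"CUDA error"`. A small sketch of a corrected predicate, with a hypothetical function name:

```python
def looks_like_gpu_failure(line: str) -> bool:
    """Return True if a log line reports an OOM or a CUDA error."""
    lowered = line.lower()
    return (
        "out of memory" in lowered
        or "cuda error" in lowered  # lowercase literal against lowercased line
        # "OOM" stays case-sensitive so words like "room" do not match.
        or "OOM" in line
    )
```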
| deadline = time.time() + timeout | ||
| while time.time() < deadline: | ||
| try: | ||
| result = subprocess.run( |
P1: Request failures are silently ignored; send_request does not check result.returncode, and run_cell never validates the response before extracting metrics.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 69:
<comment>Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</comment>
<file context>
@@ -0,0 +1,408 @@
+ deadline = time.time() + timeout
+ while time.time() < deadline:
+ try:
+ result = subprocess.run(
+ ["curl", "-sf", f"http://127.0.0.1:{port}/health"],
+ capture_output=True, text=True, timeout=5
</file context>
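A helper along these lines would surface request failures instead of letting an empty response flow into metric extraction. The name `run_checked` and the error format are assumptions for illustration; only `subprocess.run` and its documented fields are relied on.

```python
import subprocess


def run_checked(cmd, timeout_s=30):
    """Run `cmd` and return its stdout; raise if it exits non-zero."""
    result = subprocess.run(
        cmd, capture_output=True, text=True, timeout=timeout_s
    )
    if result.returncode != 0:
        # Include a bounded slice of stderr so the failure is diagnosable.
        detail = result.stderr.strip()[:200]
        raise RuntimeError(f"{cmd[0]} exited {result.returncode}: {detail}")
    return result.stdout
```

The caller can then treat a raised `RuntimeError` as a failed cell rather than parsing metrics from a missing response.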
| wall_s = parse_wall_s(parsed["server_done"]) | ||
| prompt_tok = parse_prompt_tok_from_done(parsed["server_done"]) | ||
| gate_line = parsed["spec_gate"] | ||
| is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None |
P2: is_ar classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py, line 283:
<comment>`is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</comment>
<file context>
@@ -0,0 +1,437 @@
+ wall_s = parse_wall_s(parsed["server_done"])
+ prompt_tok = parse_prompt_tok_from_done(parsed["server_done"])
+ gate_line = parsed["spec_gate"]
+ is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None
+
+ gate_floor_reason = "N/A"
</file context>
| is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is None | |
| is_ar = parsed["spec_decode"] is None and parsed["ar_decode"] is not None |
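The suggested change separates "AR floor" from "no telemetry at all". A three-way classification makes that distinction explicit; the sketch below assumes each argument is the parsed log line or `None`, and the function name and labels are hypothetical.

```python
def classify_decode(spec_decode, ar_decode):
    """Classify a run from its parsed decode telemetry.

    Each argument is assumed to be the matching log line, or None when
    that line was not found.
    """
    if spec_decode is not None:
        return "spec"     # speculative decode was active
    if ar_decode is not None:
        return "ar"       # fell back to the autoregressive floor
    return "missing"      # no decode telemetry; not evidence of AR
```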
| action="append", | ||
| default=[], | ||
| metavar="ARG", | ||
| help="Extra arg to pass to BOTH server binaries (repeatable). " |
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 358:
<comment>Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</comment>
<file context>
@@ -0,0 +1,452 @@
+ action="append",
+ default=[],
+ metavar="ARG",
+ help="Extra arg to pass to BOTH server binaries (repeatable). "
+ "E.g. --extra-server-arg --cache-type-k --extra-server-arg f16",
+ )
</file context>
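The parse failure is easy to reproduce in isolation. The sketch below rebuilds only the one option from the snippet (the parser name `gate_sketch` and the helper `try_parse` are illustrative) and shows that argparse rejects an option-like value passed as a separate token, while the `--opt=value` form is accepted.

```python
import argparse


def build_parser():
    """A parser with only the option under discussion."""
    parser = argparse.ArgumentParser(prog="gate_sketch")
    parser.add_argument("--extra-server-arg", action="append",
                        default=[], metavar="ARG")
    return parser


def try_parse(argv):
    """Return the collected extra args, or None if argparse rejects argv."""
    try:
        return build_parser().parse_args(argv).extra_server_arg
    except SystemExit:
        # argparse prints a usage error to stderr and exits with status 2
        return None


# Form from the current help text: "--cache-type-k" looks like an option,
# so "--extra-server-arg" is left without a value.
rejected = try_parse(["--extra-server-arg", "--cache-type-k",
                      "--extra-server-arg", "f16"])

# "=" form: the value is bound to the option explicitly.
accepted = try_parse(["--extra-server-arg=--cache-type-k",
                      "--extra-server-arg=f16"])
```

One possible fix is therefore to change the help example to the `=` form; whether the script should also accept the space-separated form is a separate design choice.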
| SEED = 42 | ||
| N_GEN = 128 # decode tokens per probe | ||
| SERVER_READY_TIMEOUT_S = 300 # seconds to wait for server health | ||
| CHARS_PER_TOKEN = 4.0 # empirical: ctx_032768.json = 131072 chars / 32768 tokens |
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py, line 61:
<comment>Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</comment>
<file context>
@@ -0,0 +1,452 @@
+SEED = 42
+N_GEN = 128 # decode tokens per probe
+SERVER_READY_TIMEOUT_S = 300 # seconds to wait for server health
+CHARS_PER_TOKEN = 4.0 # empirical: ctx_032768.json = 131072 chars / 32768 tokens
+
+CTXSWEEP_DIR = os.path.dirname(os.path.abspath(__file__))
</file context>
| | Bench | Blog Target | This Run | Status | | ||
| |-----------------------------|-------------|------------------|--------------------| | ||
| | Binary md5 | — | e9cb2790bb8ede64 | — | |
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md, line 131:
<comment>Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</comment>
<file context>
@@ -0,0 +1,143 @@
+
+| Bench | Blog Target | This Run | Status |
+|-----------------------------|-------------|------------------|--------------------|
+| Binary md5 | — | e9cb2790bb8ede64 | — |
+| HumanEval mean tok/s | 129.52 | **110.21** | FAIL -19.3 tok/s |
+| HumanEval mean AL | 8.31 | **11.04** | PASS +2.73 |
</file context>
| - D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens. | ||
| - gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens. | ||
| - int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens. | ||
| - B — build flag: DONE (server/build GRAPHS=ON). |
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses GRAPHS=ON but the actual CMake flag and the rest of the plan use GGML_CUDA_GRAPHS=ON. This could cause implementers to invoke the wrong build toggle.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At thoughts/shared/plans/cuda_graph_replay_team_plan.md, line 20:
<comment>Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</comment>
<file context>
@@ -0,0 +1,32 @@
+- D — bucket FA read-window to a 4096 stride (re-capture once/4096 tok). Owner: GLM5.2. ~120K tokens.
+- gate — bit-identity harness 4K/32K/71K token-for-token temp-0 + nsys. Owner: Claude. ~100K tokens.
+- int — integrate A+C+D, per-stage gate, nsys verify, review. Owner: Claude. ~150K tokens.
+- B — build flag: DONE (server/build GRAPHS=ON).
+Total ~970K tokens.
+
</file context>
33 issues found across 22 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>
<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">
<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: kill_server sends SIGKILL without reaping the child; add proc.wait() to avoid zombie accumulation</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">
<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>
<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
</file>
| proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path) | ||
| print(f"Server PID: {proc.pid}") | ||
| healthy = wait_healthy() |
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py, line 204:
<comment>Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</comment>
<file context>
@@ -0,0 +1,328 @@
+ proc, log_fd = launch_server(dtype, draft_ctx_max_str, log_path)
+ print(f"Server PID: {proc.pid}")
+
+ healthy = wait_healthy()
+ if not healthy:
+ print("ERROR: Server did not become healthy within timeout")
</file context>
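One way to bind the health check to the spawned process is sketched below. It assumes `proc` is the `subprocess.Popen` object returned by the launcher and that `probe` is any zero-argument callable returning `True` once `/health` answers; `port_in_use` and this signature of `wait_healthy` are illustrative, not the script's existing helpers.

```python
import socket
import time


def port_in_use(port, host="127.0.0.1"):
    """True if something already accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        return sock.connect_ex((host, port)) == 0


def wait_healthy(proc, probe, timeout_s=300.0, interval_s=0.5):
    """Poll probe() until it succeeds, giving up as soon as proc has exited."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return False  # our server died; a healthy port would be someone else's
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling `port_in_use(port)` before launching and aborting when it returns `True` covers the stale-server case; checking `proc.poll()` inside the loop covers a child that crashes during startup.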
| ## Verdict | ||
| **The 15% gap is PRIMARILY THE MODEL, not the config.** |
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md, line 129:
<comment>Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</comment>
<file context>
@@ -0,0 +1,155 @@
+
+## Verdict
+
+**The 15% gap is PRIMARILY THE MODEL, not the config.**
+
+Evidence:
</file context>
| return cmd | ||
| def launch_server(log_path): |
P1: --run-server path omits the documented flock GPU lock because launch logic is duplicated and inconsistent between launch_server_cmd() and launch_server(). This can cause GPU contention and corrupt benchmark validity.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py, line 159:
<comment>`--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</comment>
<file context>
@@ -0,0 +1,586 @@
+ return cmd
+
+
+def launch_server(log_path):
+ """Spawn the server in a child process. Returns (proc, log_fh)."""
+ env = os.environ.copy()
</file context>
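If both launch paths should share one lock, taking it in-process is one option. The sketch below uses `fcntl.flock`, which a neighbouring comment in this review names as the convention for `/tmp/lucebox_gpu.lock`; the helper names are hypothetical, the lock path is a parameter rather than a hard-coded value, and the code is Unix-only.

```python
import fcntl
import os


def acquire_gpu_lock(lock_path, blocking=True):
    """Take an exclusive flock on lock_path; return the fd that holds it."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
    try:
        fcntl.flock(fd, flags)
    except OSError:
        os.close(fd)  # do not leak the descriptor when the lock is busy
        raise
    return fd


def release_gpu_lock(fd):
    """Drop the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

The lock lives as long as the returned descriptor stays open, so the caller would hold it for the whole benchmark run and release it in a `finally` block.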
| for line in lines: | ||
| line = line.strip() | ||
| if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower(): |
P1: CUDA error detection is broken due to a case mismatch: line.lower() is checked against the mixed-case literal "CUDA error", so that branch can never match and CUDA errors may be missed.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 190:
<comment>CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</comment>
<file context>
@@ -0,0 +1,408 @@
+
+ for line in lines:
+ line = line.strip()
+ if "out of memory" in line.lower() or "OOM" in line or "CUDA error" in line.lower():
+ result["oom"] = True
+ if "[spec-decode]" in line and "tokens=" in line and "accepted=" in line:
</file context>
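The mismatch is that a lowercased string can never contain the mixed-case literal `"CUDA error"`. A minimal corrected check, written as a standalone helper (the name `line_signals_gpu_failure` is illustrative), might look like this:

```python
def line_signals_gpu_failure(line):
    """True if a log line looks like an OOM or a CUDA error."""
    lowered = line.lower()
    return (
        "out of memory" in lowered
        or "OOM" in line             # case-sensitive on purpose: "room" must not match
        or "cuda error" in lowered   # literal is lowercase to match lowered text
    )
```

The same pattern applies to the identical check flagged in `run_dense27b_rebaseline.py`.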
| deadline = time.time() + timeout | ||
| while time.time() < deadline: | ||
| try: | ||
| result = subprocess.run( |
P1: Request failures are silently ignored; send_request does not check result.returncode, and run_cell never validates the response before extracting metrics.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py, line 69:
<comment>Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</comment>
<file context>
@@ -0,0 +1,408 @@
+ deadline = time.time() + timeout
+ while time.time() < deadline:
+ try:
+ result = subprocess.run(
+ ["curl", "-sf", f"http://127.0.0.1:{port}/health"],
+ capture_output=True, text=True, timeout=5
</file context>
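A request helper that refuses to return on failure could be sketched as follows. It assumes the request is issued through `subprocess.run` (as the health check in the snippet is) and that the response body is JSON; `run_json_command` and `RequestFailed` are hypothetical names, not functions from the script.

```python
import json
import subprocess


class RequestFailed(RuntimeError):
    """Raised when the command exits non-zero or returns a non-JSON body."""


def run_json_command(cmd, timeout_s=30):
    """Run cmd, check its exit status, and return the parsed JSON body."""
    result = subprocess.run(cmd, capture_output=True, text=True,
                            timeout=timeout_s)
    if result.returncode != 0:
        raise RequestFailed(
            f"exit {result.returncode}: {result.stderr.strip()}")
    try:
        return json.loads(result.stdout)
    except json.JSONDecodeError as exc:
        raise RequestFailed(f"non-JSON response: {exc}") from exc
```

With this shape, the caller extracts metrics only after a successful return, so a failed request surfaces as an exception instead of an empty cell.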
…or QK×PR372 composition
The library foundation of the snapshot×ledger unification (plan in thoughts/), so the proven QK residency scorer composes with PR#372 across restore at ≥128K.
- Phase 1 (kvflash_pager.h): per-chunk ledger in serialize/deserialize (was_resident + qk_score + KV dtype enum); magic bumped KVFLASH1 (old blobs cleanly miss); deserialize re-pages only resident chunks; dtype-guard closes the latent equal-rowsize swap trap. Unit-tested (ledger round-trip).
- Phase 2 (kvflash_qk.h): rebuild/seed the QK pool from the restored ledger so the scorer is warm on turn N+1 instead of scoring every restored chunk as missing(-2.0). Unit-tested (8 new checks, restored scores != missing).
- Research/evidence: phase0_bitplane_lsh (the SimHash-on-quant-bits kill-test; surprise: MSB ρ=0.871 vs true QK, refutes "≈random", but modest given diffuse attention; sign bit carries the ranking); tbq4/tq3 fast-FA prior art.
Phase 3 (consume restored KV instead of re-prefill; VALIDATED: 36.5x warm prefill, AR greedy bit-identity PASS, binary 0b70418a) is preserved as a patch (/tmp/b_phase23_plus_blockerA_*.patch); its qwen35_backend.cpp integration is interleaved with an uncommitted CUDA-graph blocker-A draft and will land after a clean un-interleave.
7 issues found across 10 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:5">
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md:6">
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</violation>
</file>
<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">
<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</violation>
</file>
<file name="bench/bitplane_lsh_experiment.py">
<violation number="1" location="bench/bitplane_lsh_experiment.py:335">
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:253">
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:273">
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</violation>
</file>
<file name="server/src/common/kvflash_pager.h">
<violation number="1" location="server/src/common/kvflash_pager.h:589">
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</violation>
</file>
| print("Phase 3 KV+SSM seam bug confirmed. Target attention diverges.") | ||
| print("The feature mirror is NOT the cause (both arms use AR without draft).") | ||
| sys.exit(1) | ||
| if c0_self and c0_c1: |
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/abc_cache_harness/phase3_gate_intraproc.py, line 220:
<comment>Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</comment>
<file context>
@@ -0,0 +1,231 @@
+ print("Phase 3 KV+SSM seam bug confirmed. Target attention diverges.")
+ print("The feature mirror is NOT the cause (both arms use AR without draft).")
+ sys.exit(1)
+ if c0_self and c0_c1:
+ print(f"GATE: PASS (AR mode) — C0 self-consistent AND C1 identical to C0.")
+ print(f"Phase 3 KV+SSM seam is correct. Warm-prefill speedup: {p0:.3f}s -> {p1:.3f}s ({speedup:.1f}x)")
</file context>
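The missing precondition can be expressed as a third input to the verdict. The sketch below is a simplification of the gate's branch logic; `c1_restored` is a hypothetical boolean standing for whatever evidence the harness can collect that the consume=1 arm really restored from the snapshot (the review does not say which signal is available).

```python
def gate_verdict(c0_self, c0_c1, c1_restored):
    """Return PASS only when identity holds AND the restore path was exercised."""
    if not c1_restored:
        # Identical output proves nothing if C1 silently re-prefilled.
        return "INCONCLUSIVE"
    if c0_self and c0_c1:
        return "PASS"
    return "FAIL"
```

Returning a distinct `INCONCLUSIVE` keeps a run that never touched the seam from being counted as either a pass or a seam bug.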
| @@ -0,0 +1,179 @@ | |||
| # Fast FlashAttention for very-low-bit (3-bit / ternary) KV cache — prior art | |||
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md, line 5:
<comment>External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</comment>
<file context>
@@ -0,0 +1,179 @@
+
+**Problem:** `tq3_0` KV in llama.cpp/ggml-cuda decodes ~2× slower than `q4_0`/`f16` because there is no fast tensor-core FlashAttention kernel for it. This document surveys how the community (llama.cpp maintainers, research literature, production engines) handles fast attention over sub-4-bit KV.
+
+Research date: 2026-06-22.
+
+---
</file context>
break

# Spearman rank correlation
from scipy.stats import spearmanr
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/bitplane_lsh_experiment.py, line 335:
<comment>scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</comment>
<file context>
@@ -0,0 +1,392 @@
+ break
+
+ # Spearman rank correlation
+ from scipy.stats import spearmanr
+ rho_1bit, _ = spearmanr(s_true, s_1bit)
+ rho_2bit, _ = spearmanr(s_true, s_2bit)
</file context>
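A fail-fast preflight is one possible remedy, sketched here with hypothetical helper names (`missing_modules`, `preflight`): check that optional dependencies are importable before the long computation starts, instead of discovering the gap at the final import.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    # find_spec returns None for an absent top-level module without importing it.
    return [n for n in names if importlib.util.find_spec(n) is None]


def preflight(names=("scipy",)):
    """Exit before any expensive work if a dependency is absent."""
    missing = missing_modules(names)
    if missing:
        sys.exit("missing dependencies: " + ", ".join(missing))
```

Calling `preflight()` as the first statement of the experiment's entry point would surface the missing package in seconds; declaring scipy in the project's dependency list remains the more complete fix.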
@@ -0,0 +1,277 @@
# TBQ4 fused-dequant FlashAttention — extracted technique (Indras-Mirror/llama.cpp-turboq-mtp)
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md, line 273:
<comment>External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</comment>
<file context>
@@ -0,0 +1,277 @@
+## Source URLs (all fetched 2026-06-22)
+
+- Repo: https://github.com/Indras-Mirror/llama.cpp-turboq-mtp
+- Kernel: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/fattn-mma-tbq4.cuh
+- Launcher: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/fattn-mma-tbq4-launch.cuh
+- Centroids/WHT: https://raw.githubusercontent.com/Indras-Mirror/llama.cpp-turboq-mtp/master/ggml/src/ggml-cuda/tbq4-cuda.cuh
</file context>
@@ -0,0 +1,277 @@
# TBQ4 fused-dequant FlashAttention — extracted technique (Indras-Mirror/llama.cpp-turboq-mtp)
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md, line 253:
<comment>MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</comment>
<file context>
@@ -0,0 +1,277 @@
+ or the visible commit list (see Caveats). The mechanism that *produces* that result — fused
+ dequant, no HBM FP16 KV — is confirmed in code.
+
+## License / attribution
+
+- **MIT** (llama.cpp upstream license; fork shows an MIT badge). Reusing the kernel is permitted
</file context>
if (n < expected) return false;

// Read ledger into a temp buffer before reset() clears state.
std::vector<uint8_t> ledger_was_res(nc, 1u); // default: treat as resident
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided nc before using it to allocate ledger/host buffers and resize chunks_. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/common/kvflash_pager.h, line 589:
<comment>deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</comment>
<file context>
@@ -515,42 +530,79 @@ class KvFlashPager {
+ if (n < expected) return false;
+
+ // Read ledger into a temp buffer before reset() clears state.
+ std::vector<uint8_t> ledger_was_res(nc, 1u); // default: treat as resident
+ std::vector<float> ledger_scores(nc, -std::numeric_limits<float>::infinity());
+ if (has_led) {
</file context>
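The shape of the missing check can be sketched in Python for brevity; the actual fix belongs in the C++ `deserialize()`, where the arithmetic is fixed-width. Everything here is an assumption about the blob layout: `header_bytes`, `bytes_per_chunk` (a minimum per-chunk footprint) and `max_chunks` are hypothetical parameters, not the pager's real fields.

```python
def chunk_count_is_sane(nc: int, blob_len: int, header_bytes: int,
                        bytes_per_chunk: int, max_chunks: int) -> bool:
    """Validate a blob-provided chunk count before any allocation.

    Assumes every chunk occupies at least bytes_per_chunk bytes in the blob.
    """
    if nc < 0 or nc > max_chunks:
        return False
    if bytes_per_chunk <= 0 or blob_len < header_bytes:
        return False
    # Divide instead of multiplying: in fixed-width C++ arithmetic,
    # nc * bytes_per_chunk could wrap, while this quotient cannot.
    return nc <= (blob_len - header_bytes) // bytes_per_chunk
```

Running a check of this kind before `ledger_was_res` and `ledger_scores` are sized would reject a corrupted `nc` without allocating anything.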
**Verdict: PARTIAL-REFUTES Momus.**

1-bit MSB is NOT random — Spearman ρ=0.87 vs FULL-QK. It strongly ranks keys.
But 1-bit mass-recall@10% = 0.80 (vs full 0.86), and reaches 0.9 only at k=30%.
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md, line 6:
<comment>Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</comment>
<file context>
@@ -0,0 +1,100 @@
+**Verdict: PARTIAL-REFUTES Momus.**
+
+1-bit MSB is NOT random — Spearman ρ=0.87 vs FULL-QK. It strongly ranks keys.
+But 1-bit mass-recall@10% = 0.80 (vs full 0.86), and reaches 0.9 only at k=30%.
+2-bit (magnitude only, no sign) = worse than random at count-recall. 3-bit ≈ full (ρ=0.97).
+
</file context>
…efill (opt-in)

Pooled restore consumes the deserialized KV for the chunk-aligned prefix [0, snap_pos) and prefills only the suffix [snap_pos, prompt_len), behind KVFLASH_RESTORE_CONSUME (default 0 = legacy re-prefill). Validated: turn-2 prefill 36.9x faster (36.9s -> 1.0s) at ~35K tokens, with greedy AR output token-for-token identical to the re-prefill path. Completes the snapshot x ledger x QK-pool composition (Phases 1-3).

KNOWN CEILING: above ~35K tokens the AR output DIVERGES from full re-prefill (reused pooled KV differs from recompute once the 8192-pool evicts at scale). Do NOT enable default-on for deep context until that divergence is root-caused (acceptable KV-reuse near-tie flip vs real seam bug). Default-0 keeps production on the safe re-prefill path.
2 issues found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/common/moe_hybrid_ffn_eval.cpp">
<violation number="1" location="server/src/common/moe_hybrid_ffn_eval.cpp:1076">
P1: This uniqueness scan can become non-terminating when initialized hot experts are fewer than routed slots. A low-hot-budget/all-cold batch can hang in the cached path.</violation>
</file>
<file name="server/test/test_kvflash_placement.cpp">
<violation number="1" location="server/test/test_kvflash_placement.cpp:26">
P3: Missing `#include <cstdint>` for `uint64_t`. Test file relies on transitive include from the header `kvflash_placement.h`, which makes it fragile against future header cleanup.</violation>
</file>
<file name="server/src/qwen35moe/qwen35moe_backend.h">
<violation number="1" location="server/src/qwen35moe/qwen35moe_backend.h:111">
P3: New private members are unused dead code (`hybrid_spec_graph_cache_`, `spec_microbench_done_`). Drop them until the cache/microbench path is actually implemented.</violation>
</file>
<file name="server/src/qwen35/qwen35_backend.cpp">
<violation number="1" location="server/src/qwen35/qwen35_backend.cpp:899">
P1: restore_and_generate ignores restore_target_cache failure. This can continue generation from invalid cache state instead of returning an error.</violation>
<violation number="2" location="server/src/qwen35/qwen35_backend.cpp:1198">
P2: Restore-consume misalignment path logs 'falling back to re-prefill' but actually hard-fails the request by returning -1.</violation>
</file>
<file name="server/test/test_kvflash_moe_paged.sh">
<violation number="1" location="server/test/test_kvflash_moe_paged.sh:61">
P2: Don't use `|| true` to swallow pipeline errors — store the exit code and include it in the failure diagnosis so debugging doesn't require reading tea leaves from an empty answer.</violation>
</file>
<file name="bench/abc_cache_harness/replay_harness.py">
<violation number="1" location="bench/abc_cache_harness/replay_harness.py:514">
P2: Configured `--port` is ignored when launching the server; server and client can target different ports.</violation>
<violation number="2" location="bench/abc_cache_harness/replay_harness.py:723">
P1: Per-repeat log offsets are reset to zero, so repeats after the first parse old log lines and report incorrect metrics.</violation>
<violation number="3" location="bench/abc_cache_harness/replay_harness.py:1177">
P2: Provenance always records tq3_0 cache types even when the selected arm runs with different KV cache types.</violation>
<violation number="4" location="bench/abc_cache_harness/replay_harness.py:1321">
P2: Summary print uses `log_path` outside its scope, crashing restart-per-turn executions.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/NOTES.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/NOTES.md:51">
P3: Truncated sentence in KV precision sweep analysis — `f16 best; q4_0 EQUAL ... q8_0 ANOMALOUS (lower accept 66.4` cuts off mid-thought with no closing paren or wrap-up for the section.</violation>
</file>
<file name="server/src/qwen35/gguf_target_loader.cpp">
<violation number="1" location="server/src/qwen35/gguf_target_loader.cpp:480">
P2: Drafter-provided capture layer IDs are trusted without range validation. Invalid IDs can silently skip feature capture and feed incomplete/stale capture vectors to the drafter path.</violation>
</file>
<file name="server/src/draft/draft_gguf_loader.cpp">
<violation number="1" location="server/src/draft/draft_gguf_loader.cpp:158">
P1: `target_layer_ids` element type is not validated before casting to `int32_t*`. A malformed or hostile GGUF can trigger invalid reads/UB during early metadata parsing.</violation>
</file>
<file name="harness/clients/session_inject_proxy.py">
<violation number="1" location="harness/clients/session_inject_proxy.py:125">
P2: `think_budget` uses truthiness, so `0` is treated as "unset" and skips `thinking` injection for `/v1/messages`.
(Based on your team's feedback about preserving meaningful zero-valued budget/count fields.)</violation>
<violation number="2" location="harness/clients/session_inject_proxy.py:143">
P3: Startup warning is inaccurate when only `THINK_BUDGET` is configured. It can mislead debugging because proxy is not pass-through in that mode.</violation>
</file>
<file name="harness/clients/run_claude_code.sh">
<violation number="1" location="harness/clients/run_claude_code.sh:79">
P2: `CLAUDE_TOOLS` config is now ignored because `--tools` was removed from the Claude CLI invocation. Re-add the flag so env-based tool scoping still works.</violation>
</file>
<file name="bench/qwen35moe_dflash/RECIPE.md">
<violation number="1" location="bench/qwen35moe_dflash/RECIPE.md:123">
P3: Broken reference: GOTCHAS.md does not exist in the recipe directory — readers following the link will hit a dead end.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/isolation2x2_results.json:89">
P2: Row 8 has gate_floor="slow" but populates spec-decode fields (accept_pct, avg_commit, decode_tps_spec) — contradicts the pattern in the other 3 slow-gated rows where those fields are null. Either gate_floor should be null (spec was active) or the spec fields should be null (spec was off).</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_results.md:131">
P3: Binary MD5 checksum in the summary table is truncated and inconsistent with the full 32-character MD5 in the header.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:204">
P1: Health check not tied to spawned server process, so benchmark could run against an unrelated server on the same fixed port</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:212">
P2: Configuration verification is non-enforcing: parsed mirror dtype/cap are printed but never compared to the expected values, so a misconfiguration silently corrupts benchmark attribution.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_isolation_2x2.py:315">
P2: Truthiness-based selection drops valid 0.0 TPS values in the summary table. Use explicit `is not None` checks, consistent with the adjacent metric lines.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/session_distribution.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/session_distribution.md:48">
P2: Cumulative context methodology is defined inconsistently: the methodology paragraph says tool-result/tool-use text is included in cumulative context, but section 2 defines it as only user typed-text + assistant text. This makes the distribution non-reproducible and can mislead readers about KV/pool pressure. Also reconcile the earlier statement about tool-use with the analyzer, which does not currently count tool-use content.</violation>
</file>
<file name="thoughts/shared/plans/cuda_graph_replay_team_plan.md">
<violation number="1" location="thoughts/shared/plans/cuda_graph_replay_team_plan.md:20">
P3: Inconsistent CUDA-graph build flag name in plan: blocker B uses `GRAPHS=ON` but the actual CMake flag and the rest of the plan use `GGML_CUDA_GRAPHS=ON`. This could cause implementers to invoke the wrong build toggle.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bench_equity_audit.md:89">
P2: Build flag in Arm B uses the shorthand `FA_ALL_QUANTS=OFF` instead of the actual CMake option `DFLASH27B_FA_ALL_QUANTS=OFF`, risking a misconfigured benchmark build.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/dense27b_rebaseline_results.json:10">
P2: `wall_s` is null in the rebaseline results even though the total wall time is present in `server_done`; the parser's regex does not match the actual log format.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/ar_vs_dflash_context_scaling.md:3">
P2: Provenance guarantee is not met: several table entries use abbreviated or missing file/path references, making benchmark numbers unverifiable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:44">
P2: Conflicting HumanEval+ dataset paths in the setup guide: section 1 references a non-existent `dflash/eval/humanevalplus.jsonl` while section 3 and the actual driver use `server/eval/humaneval_plus/humanevalplus.jsonl`. This could cause failed benchmark setup.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/beat_blog_setup.md:58">
P2: Inconsistent `--max-tokens` value for the 128K beat target: Section 2 uses 200 while Section 4 and the blog use 256, making benchmark results incomparable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:118">
P2: Benchmark report treats equal verify cost as a proven fact and uses it to conclude the performance gap is primarily the model, even though the document explicitly states the 3.5 target GGUF is unavailable and model vs implementation factors cannot be isolated in this environment. This overstates causality and could mislead readers.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:129">
P1: Verdict headline claims a '15% gap' but the file's own data shows a best-case gap of ~3.6% and a worst-case gap of ~5.6%, making the headline inconsistent with the reported benchmark results.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/model_ab_3.6_vs_3.5.md:139">
P2: Incorrect arithmetic in the TPS/AL decomposition invalidates the claim that AL masks ~42 tok/s of SSM overhead. The formula as written evaluates to ~179.5 tok/s, not 83, and the corrected normalization yields ~93.4 tok/s with a ~31 tok/s benefit.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:72">
P2: Hardcoded absolute `/home/peppi/...` input and output paths make the analyzer non-portable and fragile outside the author's environment.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/analyze_sessions.py:241">
P2: Context estimator implementation does not match its own methodology: tool_use blocks are omitted entirely and tool_result blocks are only counted for synthetic user messages, causing cumulative context statistics to be underestimated and the report's context-tier conclusions to be unreliable.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/humaneval_ddtree_results.json:4">
P2: Committed benchmark metadata contains non-portable absolute local paths (`/home/peppi/...`, `/tmp/...`) that leak environment details and break reproducibility on other machines or CI.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:98">
P2: kill_server sends SIGKILL without reaping the child; add proc.wait() to avoid zombie accumulation</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_earlyexit_frontier.py:199">
P2: Health check is not process-bound; a stale or external server on port 18081 can contaminate benchmark results.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:159">
P1: `--run-server` path omits the documented `flock` GPU lock because launch logic is duplicated and inconsistent between `launch_server_cmd()` and `launch_server()`. This can cause GPU contention and corrupt benchmark validity.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_humaneval_ddtree.py:545">
P2: When `--run-server` is used, the launched server endpoint is fixed to PORT (18081), but the benchmark traffic is sent to `args.url` which can be overridden via `--url`. This allows a user to accidentally launch a server on one port while benchmarking another endpoint, producing misleading results and incorrect cleanup. Either reject `--url` when `--run-server` is used, or derive the launch/poll URL from the user-supplied `--url`.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/ctx_065536.json">
<violation number="1">
P2: qwen35moe ctxsweep fixture uses model "luce-dflash-27b" instead of "luce-dflash".</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:69">
P1: Request failures are silently ignored; `send_request` does not check `result.returncode`, and `run_cell` never validates the response before extracting metrics.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_clean_rebaseline.py:190">
P1: CUDA error detection is broken due to a case mismatch: `line.lower()` is checked against the mixed-case literal `"CUDA error"`, so that branch can never match and CUDA errors may be missed.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:30">
P2: The benchmark table does not clarify that `prefill_tps` is computed from total prompt tokens (including the restored prefix), while `fresh_prefill` only counts uncached tokens. Without a note, the warm-cache rows look dramatically faster than the actual fresh-token throughput and can mislead readers comparing dense vs MoE performance.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/agentic_bestconfig_dense_vs_moe.md:96">
P2: Side-by-side table mixes metrics from different MoE configurations in the same "best" comparison row</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:156">
P2: Case-mismatched CUDA error check makes the CUDA error branch unreachable, so CUDA failures without the OOM literal are not detected and the OOM fallback is skipped.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:283">
P2: `is_ar` classification is inverted: it labels missing decode telemetry as AR floor and hides actual AR floor events.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:355">
P1: GPU_LOCK is defined and printed as an active flock path, but the script never acquires the lock. Concurrent GPU runs can overlap and contaminate benchmark results. Follow the convention used by neighboring scripts (`run_earlyexit_frontier.py`, `bit_identity_gate.py`) and acquire `/tmp/lucebox_gpu.lock` with `fcntl.flock` at startup.</violation>
<violation number="4" location="bench/qwen35moe_dflash/ctxsweep/run_dense27b_rebaseline.py:373">
P2: Fallback run errors are not checked in the fatal-stop logic. The `LOAD_FAIL` early-exit condition only checks `cell` (the first attempt) and ignores `cell2` (the fallback run), so a drafter load failure during the fallback would not stop the benchmark and subsequent cells would continue to run.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:61">
P2: Bit-identity gate uses approximate character-based token sizing instead of actual tokenization, weakening correctness guarantees at claimed context tiers</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:136">
P1: wait_for_server() checks a fixed port without referencing the launched subprocess, risking slow failure detection and false passes against an unrelated service on port 18081.</violation>
<violation number="3" location="bench/qwen35moe_dflash/ctxsweep/bit_identity_gate.py:358">
P2: Help text example for --extra-server-arg uses an argparse-unfriendly form for option-like values, causing missing-argument parse failures.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:253">
P2: MIT-licensed code snippets are included without the required copyright and permission notice text in the file; only a prose note is present, and no repository NOTICE file covers this document.</violation>
<violation number="2" location="bench/qwen35moe_dflash/ctxsweep/tbq4_kernel_technique.md:273">
P2: External source URLs use the upstream master branch instead of an immutable commit SHA, making the extracted technique documentation non-reproducible and prone to source drift.</violation>
</file>
<file name="server/src/common/kvflash_pager.h">
<violation number="1" location="server/src/common/kvflash_pager.h:589">
P2: deserialize() lacks an explicit, overflow-safe upper bound on the blob-provided `nc` before using it to allocate ledger/host buffers and resize `chunks_`. A corrupted snapshot can therefore drive oversized allocations or trigger overflow-prone size arithmetic.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/tq3_fast_attention_prior_art.md:5">
P2: External technical sources are not pinned to specific revisions, risking silent documentation drift for design-critical guidance.</violation>
</file>
<file name="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md">
<violation number="1" location="bench/qwen35moe_dflash/ctxsweep/phase0_bitplane_lsh.md:6">
P3: Factual inconsistency: the opening summary claims 1-bit mass-recall reaches 0.9 only at k=30%, but the presented table already shows ~0.89 at k=20% and contains no k=30% data, making the threshold misleading.</violation>
</file>
<file name="bench/abc_cache_harness/phase3_gate_intraproc.py">
<violation number="1" location="bench/abc_cache_harness/phase3_gate_intraproc.py:220">
P1: Gate can report PASS without verifying that the consume=1 arm actually restored from the snapshot at the seam.</violation>
</file>
<file name="bench/bitplane_lsh_experiment.py">
<violation number="1" location="bench/bitplane_lsh_experiment.py:335">
P2: scipy is imported only at the end of a long-running experiment and is not declared as a project dependency. A runtime environment without scipy will crash after all computation completes, producing no results.</violation>
</file>
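Two of the findings above (run_clean_rebaseline.py and run_dense27b_rebaseline.py) describe the same defect: a lowercased log line compared against the mixed-case literal "CUDA error". A minimal sketch of a corrected classifier follows; the function name `classify_log_line` and the "out of memory" literal are assumptions for illustration, not taken from the scripts.

```python
from typing import Optional


def classify_log_line(line: str) -> Optional[str]:
    """Classify a server log line as OOM, CUDA_ERROR, or neither."""
    lowered = line.lower()
    # Every literal compared against `lowered` must itself be lowercase.
    if "out of memory" in lowered:  # hypothetical OOM marker
        return "OOM"
    if "cuda error" in lowered:
        return "CUDA_ERROR"
    return None
```

Checking the more specific OOM marker first preserves an OOM fallback path for lines that mention both.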
"[kvflash] restore-consume: kv_offset=%d not chunk-aligned "
"(chunk_tokens=%d) — falling back to re-prefill\n",
kv_offset, prefill_ubatch);
set_last_error("kvflash: restore-consume misaligned offset");
P2: Restore-consume misalignment path logs 'falling back to re-prefill' but actually hard-fails the request by returning -1.
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/qwen35/qwen35_backend.cpp, line 1198:
<comment>Restore-consume misalignment path logs 'falling back to re-prefill' but actually hard-fails the request by returning -1.</comment>
<file context>
@@ -1141,20 +1174,35 @@ int Qwen35Backend::do_prefill(const std::vector<int32_t> & tokens,
+ "[kvflash] restore-consume: kv_offset=%d not chunk-aligned "
+ "(chunk_tokens=%d) — falling back to re-prefill\n",
+ kv_offset, prefill_ubatch);
+ set_last_error("kvflash: restore-consume misaligned offset");
+ return -1;
+ }
</file context>
…; enable consume default-on

The consume-restored-KV path zero-padded kvflash_history_ for the restored prefix, poisoning the drafter residency scorer under DFLASH_KVFLASH+draft+qk-policy. Reconstruct it from the Phase-1 ledger scores so the drafter sees correct residency. Validated under the production spec-decode path: needle retrieved + drafter accept healthy at 64K/114K under consume. KVFLASH_RESTORE_CONSUME now defaults on (env=0 force-disables).

Validation (35B-A3B-Q3_K_XL + dflash drafter + kvflash-policy=qk + q4_0 KV):

ctx  | C0 needle | C1 needle | C0 accept | C1 accept | C0 t3_s | C1 t3_s | speedup
64K  | RETRIEVED | RETRIEVED | 10.9%     | 10.9%     | 132.7   | 0.2     | 663x
114K | RETRIEVED | RETRIEVED | 10.9%     | 10.9%     | 165.7   | 1.7     | 97x
What
Pooled chunked prefill for qwen35moe (Qwen3.6-35B-A3B) over KVFlash: when the
prompt exceeds the resident pool, prefill loops hybrid_forward_batch over
chunk-sized slices with live eviction instead of refusing. Plus pooled
snapshot/restore (save/restore the bounded pool across requests) and a
complexity-only refactor (dedup the two identical restore chunk loops, extract
chunked_prefill, inline a single-caller helper; net −25 LOC, behaviour-identical).

Stacking
This is the tip of the KVFlash-MoE stack and depends on:
Until those merge, this PR's diff includes their commits; rebasing after they land
leaves only the prefill-snapshot + refactor commits.

Tests
test_kvflash_moe_paged.sh (GPU silent-corruption gate): a sink fact in the
first (protected) chunk is recalled after the middle is evicted, and the greedy
(temp-0) answer is identical across two pool sizes. Green on RTX 3090 / Q3_K_M.
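
The chunk loop described under What can be illustrated with a toy model. This is a sketch under stated assumptions, not the backend's code: `forward_batch` stands in for hybrid_forward_batch, the pool is a list of chunk indices, chunk 0 is treated as the protected sink, and eviction is oldest-first among the rest (the real pager's policy and signatures are not shown in this PR text).

```python
def chunked_prefill(tokens, chunk_tokens, pool_chunks, forward_batch):
    """Feed tokens chunk by chunk through a pool holding pool_chunks chunks.

    Returns the chunk indices still resident at the end. Chunk 0 is never
    evicted; oldest-first eviction of the others is an assumption.
    """
    if chunk_tokens <= 0 or pool_chunks < 2:
        raise ValueError("need chunk_tokens > 0 and room for sink + one chunk")
    resident = []
    for idx, start in enumerate(range(0, len(tokens), chunk_tokens)):
        if len(resident) == pool_chunks:
            resident.pop(1)  # evict the oldest non-sink chunk
        resident.append(idx)
        forward_batch(tokens[start:start + chunk_tokens], start)
    return resident
```

With ten tokens, two-token chunks and a three-chunk pool, every slice is still forwarded in order while the middle chunks are dropped from the pool, which is the property the sink-recall gate in Tests exercises.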