bench: faithful Claude Code transcript replay + repro & build-flag fix for empty-response bug by easel · Pull Request #1 · dusterbloom/lucebox-hub

easel · 2026-04-30T15:48:53Z

Targets feat/prefix-cache (PR Luce-Org#59). Drafted for review.

Five commits, fairly cleanly split: bench loader → bench tooling fixes → repro script → upstream cuda fix.

1. Faithful transcript replay (`bench_agent_loop.py`)

extract_user_turns reads only type=user records where content is a str. Real Claude Code transcripts have user records made of tool_result blocks with content as a list, and those are silently skipped. As a result, the bench replays only typed human prompts with bench-synthesised assistant replies and no tool I/O.

The replacement loader (_load_transcript + _to_openai_messages) coalesces consecutive same-role records into single turns, converts Anthropic blocks → OpenAI structured messages (tool_use → assistant.tool_calls, tool_result → role=tool, thinking dropped), and at each assistant index sends the exact prefix that was sent at that point. Validated against a 65-record session: old loader extracts 7 typed-user turns; new loader walks all 32 call points with prefix at the last call = 95,536 chars. Targets dflash/scripts/server_tools.py (the production agent endpoint).
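A minimal sketch of the three loader steps described above (coalesce, convert, walk call points). It is not the code in bench_agent_loop.py: the simplified record shape (`role` + `content`) and the helper names `coalesce`, `to_openai_messages` and `call_prefixes` are assumptions made for illustration.

```python
import json


def coalesce(records):
    """Merge consecutive same-role records into one turn with a flat block list."""
    turns = []
    for rec in records:
        content = rec["content"]
        # A plain string is treated as a single text block.
        blocks = ([{"type": "text", "text": content}]
                  if isinstance(content, str) else list(content))
        if turns and turns[-1]["role"] == rec["role"]:
            turns[-1]["blocks"].extend(blocks)
        else:
            turns.append({"role": rec["role"], "blocks": blocks})
    return turns


def to_openai_messages(turns):
    """Convert Anthropic-style blocks to OpenAI-style structured messages."""
    messages = []
    for turn in turns:
        texts, tool_calls, tool_msgs = [], [], []
        for block in turn["blocks"]:
            kind = block.get("type")
            if kind == "text":
                texts.append(block["text"])
            elif kind == "tool_use":
                tool_calls.append({
                    "id": block["id"],
                    "type": "function",
                    "function": {"name": block["name"],
                                 "arguments": json.dumps(block["input"])},
                })
            elif kind == "tool_result":
                body = block.get("content", "")
                if isinstance(body, list):
                    body = "".join(p.get("text", "") for p in body
                                   if p.get("type") == "text")
                tool_msgs.append({"role": "tool",
                                  "tool_call_id": block["tool_use_id"],
                                  "content": body})
            # Any other block type (e.g. thinking) is dropped.
        if turn["role"] == "assistant":
            msg = {"role": "assistant", "content": "".join(texts) or None}
            if tool_calls:
                msg["tool_calls"] = tool_calls
            messages.append(msg)
        else:
            messages.extend(tool_msgs)
            if texts:
                messages.append({"role": "user", "content": "".join(texts)})
    return messages


def call_prefixes(messages):
    """For each assistant message, return everything that preceded it."""
    return [messages[:i] for i, m in enumerate(messages)
            if m["role"] == "assistant"]
```

Each element of `call_prefixes(...)` is the message list a replay would send at that call point; the recorded assistant turn, not a synthesised one, then becomes part of the next prefix.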

Also: streaming SSE for separate TTFT measurement, and a token-count fallback that counts content/reasoning/tool deltas when usage.completion_tokens isn't emitted (PR Luce-Org#59's server doesn't honour stream_options.include_usage).
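The fallback can be sketched roughly as follows. This is an illustration under assumptions, not the body of _stream_chat: the function name `count_completion_tokens` is hypothetical, and the chunk layout assumed is the usual OpenAI-compatible `choices[].delta` shape with `content`, `reasoning_content` and `tool_calls` fields.

```python
import json


def count_completion_tokens(sse_lines):
    """Prefer usage.completion_tokens; otherwise count non-empty deltas as a proxy."""
    usage_tokens = None
    delta_count = 0
    for line in sse_lines:
        if not line.startswith("data:"):
            continue  # skip comments, blank keep-alives, etc.
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage")
        if usage and usage.get("completion_tokens") is not None:
            usage_tokens = usage["completion_tokens"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if (delta.get("content") or delta.get("reasoning_content")
                    or delta.get("tool_calls")):
                delta_count += 1
    return usage_tokens if usage_tokens is not None else delta_count
```

The delta count is only a proxy: one streamed delta is not guaranteed to be exactly one token, so it is used only when the server emits no usage chunk.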

2. Standalone empty-response reproducer (`repro_empty_response.py`)

Self-contained, no transcript dependency. Generates 6 growing-prefix calls (130 → 5K → 10K → 15K → 20K → 26K chars) with synthetic pylint-style tool results, runs against slots=0 (control) then slots=2 (repro), prints a side-by-side table.

Pre-fix (slots=2): warm call 2 returns content=0 / comp_tok=0 / finish=stop after 9.7s of real prefill; calls 3-6 return empty in <50ms (cache "hits" on a corrupted slot). Reproduces deterministically.

3. Root cause + upstream fix (`dflash/deps/llama.cpp`)

The bug is in upstream ggml-cuda's VMM allocator, not in PR Luce-Org#59 itself. PR Luce-Org#59 just reliably triggers it because its inline-snap pattern interleaves compute with allocations on the same backend.

Mechanism:

1. The VMM pool's alloc() path extends the pool via cuMemCreate + cuMemMap + cuMemSetAccess when an allocation can't fit in the existing mapped region.
2. cuMemSetAccess races with previously-queued async work and returns CUDA_ERROR_NOT_READY.
3. CU_CHECK fires GGML_ABORT, but in some process configurations the abort doesn't terminate (gdb-fork backtrace path interferes, SIGABRT swallowed somewhere; exact mechanism unclear, but reproducible).
4. The mapping is registered but access permissions are never set.
5. Subsequent reads/writes silently return zeroed/garbled data. For the prefix cache, snapshots get stored into the broken region and restore as garbage, making the model emit 0 tokens.

Fix is one cudaDeviceSynchronize() before the cuMem* sequence inside the pool extension branch:

if (size > avail) {
    // ... compute reserve_size ...
    GGML_ASSERT(pool_size + reserve_size <= CUDA_POOL_VMM_MAX_SIZE);

    CUDA_CHECK(cudaDeviceSynchronize());   // ← new

    // allocate more physical memory
    CU_CHECK(cuMemCreate(...));
    ...
}

Only fires when the pool actually needs to grow, so steady-state hot-path allocations are unaffected.

llama.cpp PR: Luce-Org/llama.cpp-dflash-ggml#4

This PR's submodule URL is temporarily pointed at easel's fork (fix/cuda-vmm-pool-extension-race branch) until that PR merges; revert to Luce-Org/llama.cpp.git@luce-dflash after merge.

Also bumps prefix_cache.startup_sync's await_reply timeout 10s → 60s for daemons with multi-slot snap pools at large --max-ctx.

Validation

repro_empty_response.py exits 0 (repro confirmed) pre-fix, exits 1 ("DID NOT REPRO") post-fix.

4-session sweep on RTX 3090 Ti, CUDA 13.2, Qwen 3.6 27B, --max-ctx=24576 --warm-slots=2:

session	calls	TTFT cold	TTFT warm	TTFT ×	wall ×	empties
lucebox-hub (≤11K char prefix)	10	40.0s	39.9s	1.00×	0.77×	0/10
nexiq-small (≤25K)	6	55.9s	44.9s	1.24×	0.75×	0/6
axon-med (≤39K)	10	133.8s	51.7s	2.59×	1.27×	0/10
helix-large (≤71K)	10	242.1s	97.3s	2.49×	1.52×	0/10

Zero empty responses across 36 warm calls. Cache wins scale with prefix size — 2.49–2.59× TTFT and 1.27–1.52× wall on real long-prefix agentic sessions is the headline win this PR (and PR Luce-Org#59) set out to validate. Short sessions barely break even, as expected (cache-creation overhead vs. limited reuse).
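As a quick check of the table's arithmetic, the TTFT × column is simply cold TTFT divided by warm TTFT, and the 36 warm calls are the sum of the calls column. The snippet below recomputes both from the figures in the table; the variable names are illustrative only.

```python
# (cold TTFT s, warm TTFT s, calls) copied from the sweep table above.
sessions = {
    "lucebox-hub": (40.0, 39.9, 10),
    "nexiq-small": (55.9, 44.9, 6),
    "axon-med": (133.8, 51.7, 10),
    "helix-large": (242.1, 97.3, 10),
}

# Speedup = cold / warm, rounded to two decimals as in the table.
ttft_speedup = {name: round(cold / warm, 2)
                for name, (cold, warm, _calls) in sessions.items()}
total_warm_calls = sum(calls for _cold, _warm, calls in sessions.values())

for name, ratio in ttft_speedup.items():
    print(f"{name:12s} {ratio:.2f}x")
print("warm calls:", total_warm_calls)
```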

Test plan

repro_empty_response.py — pre-fix repros, post-fix doesn't
4-session sweep — 36/36 warm calls real
Build: clean nvcc, no warnings
Cross-check on Blackwell / sm_120 — only validated on RTX 3090 Ti / sm_86
Confirm test_multi_turn_prefix_cache.py still passes — should, since it's checking turn-2-5 ratios on a small fixture that probably didn't trigger the bug anyway

Status

Draft. Two PRs need review:

This PR (lucebox-hub): bench tooling + repro + submodule bump.
llama.cpp PR (fix(cuda): sync device before cuMem pool extension to fix VMM allocator race Luce-Org/llama.cpp-dflash-ggml#4): the actual fix.

Happy to split this PR if reviewer prefers (bench-only vs submodule-bump), but the bench changes don't make sense without the underlying fix being available.

🤖 Generated with Claude Code

Adds a snapshot/restore mechanism so the C++ daemon can preserve target KV + SSM/conv + target_feat state across HTTP requests. Subsequent turns in an agent loop that share a system prompt skip the system-prefill cost (previously paid in full on every turn since the daemon called free_target_cache + create_target_cache between requests).

C++ side
--------
- New PrefixSnapshot struct (internal.h): owns its own ggml_context + backend buffer, holds slim KV per layer + SSM/conv/target_feat per layer + cur_pos + last_tok + kv_k_type + max_ctx for sanity checks. Skips ssm_intermediate / conv_input_cache (within-decode rollback buffers, regenerated on first decode step after restore).
- snapshot_target_cache, restore_target_cache, free_prefix_snapshot in qwen35_target_graph.cpp using ggml_backend_tensor_copy. Lazy alloc (first SNAPSHOT call), reuse on subsequent refreshes.
- TargetCache gains a last_tok field, used solely by the prefix-cache bridge: when restored cur_pos == prompt_len the prefill loop runs zero iterations and the decode seed comes from the restored last_tok.

Daemon protocol (test_dflash.cpp)
---------------------------------
- Adds 4 new commands on stdin, dispatched before the legacy bare prompt line: SNAPSHOT N, RESTORE N <prompt> <n_gen>, FREE_SNAPSHOT N, LIST_SLOTS. Replies on stdout: [snap] slot=N cur_pos=P / [snap] freed slot=N / [snap] slots=A,B,C.
- prefill loop reads from cache.cur_pos as start (0 for fresh, >0 after restore). Restored cache + matching-length prompt -> zero-iter prefill, decode seeds from cache.last_tok.
- Hard cap of PREFIX_CACHE_SLOTS = 8 in the daemon.
- End-of-iteration writes cache.cur_pos = out_all.size() and cache.last_tok so the next SNAPSHOT command captures correct boundary.
- Frees all snapshot slots on daemon exit.

Python side
-----------
- New scripts/prefix_cache.py:
  * DaemonStdoutBus owns the stdout read loop, routes [snap]-prefixed lines to waiting coroutines, suppresses noisy [step]/[timing] logs.
  * PrefixCache stores hash -> slot_id LRU. lookup() returns (slot_id, prefix_len) or None. maybe_snapshot() does a SECOND n_gen=0 prefill of the prefix-only tokens, then SNAPSHOT. This aligns the snapshot's cur_pos exactly with the cache key's prefix length (one extra system prefill on cold turns, recovered many times over on subsequent warm turns).
  * find_prefix_boundary auto-detects the FIRST end-of-system-message boundary in Qwen chat templates, allowing one intervening newline token between im_end and im_start.
  * hash_prefix uses SHA-1 truncated to 16 bytes over (token ids, kv_k_type, fa_window).
  * DAEMON_MAX_SLOTS = 8 clamp; cap > limit emits a warning.
- server.py + server_tools.py:
  * --prefix-cache-slots N CLI flag (default 4, 0 disables).
  * Daemon spawn now uses stdout=PIPE so DaemonStdoutBus can route protocol replies.
  * Resolve effective KV-K type + fa_window from DFLASH27B_* env vars at daemon spawn time (mirrors C++ daemon's env parsing) and pass into PrefixCache so they're part of the hash key; a daemon restart with different flags can't return stale state.
  * 4 lookup/maybe_snapshot call sites per file (stream + non-stream for /v1/chat/completions and /v1/messages). On miss send the bare prompt line, then maybe_snapshot drains via _drain_pipe_to_sentinel helper so the next protocol command is clean.

Verification
------------
- nm: new symbols snapshot_target_cache, restore_target_cache, free_prefix_snapshot in libdflash27b.a.
- C++ smoke (manual /tmp/smoke_restore.py):
    cold prompt n_gen=8 -> [a,b,c,d,e,f,g,h]
    cold same prompt n_gen=4 + SNAPSHOT 0 -> shared_4 = [a,b,c,d]
    RESTORE 0 + n_gen=4 -> warm_4 = [e,f,g,h]
  byte-equal continuation.
- End-to-end (test_server_prefix_cache.py): 5K-token system prompt, three turns at max_tokens=8.
    turn_1 9.87s (cold + snapshot warm-up)
    turn_2 0.48s ratio_2/1 = 0.05
    turn_3 0.44s ratio_3/1 = 0.04
  All replies non-empty and consistent. ~20x speedup on warm turns.

Reviewed by codex; this commit incorporates the two correctness fixes flagged: hash inputs now use real env-var-derived values instead of hardcoded "q8_0"/2048 literals, and Python cap is clamped to the daemon's PREFIX_CACHE_SLOTS = 8 hard limit so configurations above it can't cause silent SNAPSHOT failures. The third codex finding (boundary detector won't handle tool-definition preambles or multi-segment system messages) is documented as a follow-up under server_tools.py: the current detector covers the simple Qwen system+user case; tool-using clients fall back to no-cache silently.

Plan file: ~/.claude/plans/yes-please-plan-for-luminous-pudding.md

Phase A (~1 week scope) of a 4-phase agentic-friendly KV/state plan. Phase B (block-chain mid-conversation cache), Phase C (sliding KV growth), Phase E (tool-loop incremental tokenization) are deferred to follow-up commits.
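To illustrate why kv_k_type and fa_window belong in the cache key, here is a hedged sketch of a SHA-1 key truncated to 16 bytes. The byte layout (little-endian packing, NUL separator) is an assumption for illustration and is not taken from scripts/prefix_cache.py.

```python
import hashlib
import struct


def hash_prefix(token_ids, kv_k_type, fa_window):
    """SHA-1 over (token ids, kv_k_type, fa_window), truncated to 16 bytes."""
    h = hashlib.sha1()
    # Config goes in first so the same tokens under different daemon
    # flags can never collide on the same key.
    h.update(kv_k_type.encode("utf-8"))
    h.update(b"\0")
    h.update(struct.pack("<q", fa_window))
    for tok in token_ids:
        h.update(struct.pack("<i", tok))
    return h.digest()[:16]


key_q8 = hash_prefix([1, 2, 3], "q8_0", 2048)
key_tq3 = hash_prefix([1, 2, 3], "tq3_0", 2048)
print(key_q8.hex(), key_tq3.hex())
```

With the config folded into the key, a daemon restarted with a different KV type or window produces different keys for the same prompt, which is the stale-state protection the commit describes.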

Extends Phase A's single-point system-prompt cache to multi-slot LRU that snapshots at every chat-template role boundary, so multi-turn agent loops hit progressively deeper cached state on each new turn.

C++ side (B.1 + B.2)
--------------------
- PrefixSnapshot gains is_thin / kv_start / kv_end fields and two new primitives: snapshot_target_cache_thin and restore_target_cache_chain. Thin snapshots capture only KV slice [kv_start, kv_end); chain restore loads a thick base then layers thins. Implemented via per-strip H2D+D2H staging since ggml_backend_tensor_copy refuses views with mismatched layouts (verified by spike_thin_copy.cpp on Q8_0 / TQ3_0 / F16).
- Daemon protocol: SNAPSHOT_THIN N kv_start kv_end and RESTORE_CHAIN thick_slot thin_slots prompt_file n_gen. The thin/chain primitives remain unused by Phase B's actual flow (see "design pivot" below) but are kept for future block-chain extensions.

Design pivot
------------
Original plan called for a thick-anchor + thin-chain cache. On implementation it became clear that thin snapshots only capture KV; SSM/conv state can't be reconstructed from KV alone (DeltaNet recurrence is non-replayable without re-running prefill). A chain restore would land at the thick's cur_pos with valid SSM, then need DeltaNet replay through the thin range, defeating the savings.

Pivoted to a simpler "multi-slot THICK LRU" design that delivers the same user-visible win: cache full state at multiple block boundaries, restore the deepest matching THICK on lookup, prefill only the new suffix. Memory cost (4 thick slots × ~244 MB ≈ 1 GB) matches what the thick+thin chain would have used.

Python side (B.3 + B.4)
-----------------------
- find_all_boundaries enumerates every <|im_end|><|im_start|> boundary after the system marker (allows up to 2 intervening tokens to handle the newline separator Qwen emits).
- PrefixCache.lookup walks all candidate cuts and returns the deepest cached match (longest-prefix); LRU touched on every hit.
- PrefixCache.maybe_snapshot iterates ALL boundaries on cache miss and snapshots each that's not already cached, evicting LRU when over cap.
- Each snapshot still uses Phase A's n_gen=0 prefill + SNAPSHOT pattern to land at the exact boundary cur_pos. Multi-snapshot increases cold-turn latency proportionally (e.g. 5-turn test: turn 1 13.5 s vs Phase A's ~10 s), but turns 2-5 all benefit.
- server.py / server_tools.py: zero changes; the API surface stayed the same (lookup returns (slot, prefix_len) or None).

Tests (B.5)
-----------
- spike_thin_copy.cpp validates the per-strip staging-copy approach used by snapshot_target_cache_thin (works on Q8_0, TQ3_0, F16).
- test_multi_turn_prefix_cache.py: 5-turn agent loop, ~2K-token system prompt, growing history. RTX 3090 + Qwen3.6-27B-Q4_K_XL:
    turn 1 13.53 s (cold + multi-snapshot warm-up)
    turn 2 0.55 s ratio 0.04
    turn 3 0.70 s ratio 0.05
    turn 4 0.85 s ratio 0.06
    turn 5 1.23 s ratio 0.09
  All warm turns < 30 % of cold turn 1; turn 5 still 11x faster than turn 1.
- Existing test_server_prefix_cache.py (3-turn shared system prompt) remains green: turn 2/3 at 3 % of turn 1.

The codex review findings on Phase A's hardcoded hash inputs and slot-cap mismatch were addressed in the Phase A commit (e429894). Codex's third finding (boundary detector won't handle tool-definition preambles in server_tools.py) is still open and tracked as a follow-up; the new find_all_boundaries inherits that limitation.

Bench branch: feat/prefix-cache (cumulative Phase A + B). Plan files at ~/.claude/plans/yes-please-plan-for-luminous-pudding.md (Phase A) and ~/.claude/plans/phase-b-block-chain-cache.md (Phase B, including the design pivot rationale).
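The boundary enumeration and deepest-match lookup can be sketched as below. This is a simplified model, not the PrefixCache implementation: the token ids are hypothetical stand-ins, the cut is assumed to sit just before `<|im_start|>`, keys are raw token tuples rather than hashes, and it enumerates every boundary rather than only those after the system marker.

```python
from collections import OrderedDict

# Hypothetical token ids standing in for <|im_start|>, <|im_end|>, "\n".
IM_START, IM_END, NL = 1, 2, 3


def find_all_boundaries(tokens, max_gap=2):
    """Prefix lengths ending just before an im_start that follows an im_end."""
    cuts = []
    for i, tok in enumerate(tokens):
        if tok != IM_END:
            continue
        # Allow up to max_gap intervening tokens (e.g. a newline separator).
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] == IM_START:
                cuts.append(j)
                break
    return cuts


class ThickLRU:
    """Multi-slot LRU mapping a cached prefix to a daemon slot id."""

    def __init__(self, cap):
        self.cap = cap
        self.slots = OrderedDict()  # prefix tuple -> slot id, oldest first

    def insert(self, prefix):
        key = tuple(prefix)
        if key in self.slots:
            self.slots.move_to_end(key)
            return self.slots[key]
        if len(self.slots) >= self.cap:
            _old_key, slot = self.slots.popitem(last=False)  # reuse LRU slot
        else:
            slot = len(self.slots)
        self.slots[key] = slot
        return slot

    def lookup(self, tokens):
        """Return (slot, prefix_len) for the deepest cached cut, else None."""
        for cut in sorted(find_all_boundaries(tokens), reverse=True):
            key = tuple(tokens[:cut])
            if key in self.slots:
                self.slots.move_to_end(key)  # touch on hit
                return self.slots[key], cut
        return None
```

Walking cuts from deepest to shallowest means a later turn restores the longest cached prefix and only prefills the new suffix.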

prepare_inline_snap was popping the LRU entry up-front so the daemon could overwrite that slot. If the request aborted before confirm_inline_snap ran, the old entry was already gone AND the new one was never registered, stranding a daemon slot until process restart.

Reserve the slot via _pending_evict_key without removing the old entry; pop + insert atomically in confirm_inline_snap. Add abort_inline_snap for explicit cancellation.

Also adds bench_agent_loop.py, which replays real Claude Code session JSONL turns through the dflash server with prefix-cache off vs on. On 5 short real-session turns: turn-1 6.28x (page cache + warmup), turns 2-5 ~equal because real-session prompts are too short for prefix-cache to dominate. The synthetic 2K-system test (test_multi_turn_prefix_cache.py) is where the cache actually wins.

Both issues were raised in the codex review of the Phase B + B.7 + B.8 work; the High one (last_tok=-1 after no-op restore prefill) was already covered by the cache.last_tok bridge added earlier.
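The reserve-then-commit pattern described in this commit can be sketched as follows. Only the three method names come from the commit message; the `SlotTable` class, its fields and the slot-selection logic are assumptions for illustration.

```python
from collections import OrderedDict


class SlotTable:
    """Two-phase slot reservation: nothing is evicted until confirm."""

    def __init__(self, cap):
        self.cap = cap
        self.entries = OrderedDict()  # key -> slot id, oldest first
        self._pending = None          # (new_key, slot, evict_key)

    def prepare_inline_snap(self, new_key):
        """Pick a slot for new_key without touching existing entries."""
        if len(self.entries) < self.cap:
            used = set(self.entries.values())
            slot = next(s for s in range(self.cap) if s not in used)
            evict_key = None
        else:
            evict_key = next(iter(self.entries))  # LRU victim, still registered
            slot = self.entries[evict_key]
        self._pending = (new_key, slot, evict_key)
        return slot

    def confirm_inline_snap(self):
        """Pop the victim and insert the new entry in one step."""
        new_key, slot, evict_key = self._pending
        if evict_key is not None:
            self.entries.pop(evict_key, None)
        self.entries[new_key] = slot
        self._pending = None

    def abort_inline_snap(self):
        """Drop the reservation; the table is left as it was."""
        self._pending = None
```

Because the victim stays registered until confirm, an aborted request leaves every slot accounted for, which is the stranded-slot case the commit fixes.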

The current bench reads only `type=user` records where `content` is a `str`. In a real Claude Code transcript every user record after the first is a list of `tool_result` blocks with `content` as a `list`, all silently skipped. The bench replays only typed human prompts with bench-synthesised assistant replies in between, with no tool I/O. Tool I/O is the bulk of an agentic prefix: typical prefix grows from ~5K chars at turn 1 to 60-300K by turn 30.

Validated against a real session (32 assistant turns, ~95K chars at the last call): the old loader extracts 7 typed-user turns; the new loader walks all 32 call points with faithful prefix growth. The PR's own commit message ("Real-session prompts are too short for the cache to dominate; cold and warm turns 2-5 are within noise") is a measurement artefact of the loader, not a property of the cache.

Replace `extract_user_turns()` with a transcript loader that:

1. Coalesces consecutive same-role records into single turns (one logical LLM turn = N JSONL rows).
2. Converts Anthropic blocks → OpenAI messages: text → content, tool_use → assistant.tool_calls, tool_result → tool message, thinking dropped.
3. At each assistant index, sends the exact prefix that was sent at that point (system + everything before this assistant turn, tool I/O included) and advances state via the recorded assistant turn, not a bench-synthesised one.

Also switch the chat call to streaming SSE so we measure TTFT separately from total wall. TTFT is what the prefix cache accelerates; total wall mixes prefill speedup with decode rate.

Preserved: cold (slots=0) vs warm (slots=N) dual-server structure, --turns / --session interface, per-turn ratio table (now with TTFT and wall columns). Default session dir derives the workspace from CWD (replace `/` with `-`) instead of hardcoding `-home-peppi-Dev-lucebox-hub`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Four follow-ups after running v1 against the PR's server:

1. The server's ChatRequest/ChatMessage schema requires `content` and ignores `tool_calls`/role=tool, so emitting structured tool messages produces 422s on every call after the first. Flatten tool_use → `<tool_call name=X>{json args}</tool_call>` text in assistant content; tool_result → `<tool_response id=X>...</tool_response>` text in user content. One message per turn, role in {system,user,assistant}, content always a string. Same on-wire prefix bytes (which is what the cache cares about), runs cleanly against the PR's server.
2. Add a discarded warmup call before the timed loop. Without this the first cold call eats ~95s of CUDA graph capture / kernel JIT one-time cost and dominates the totals.
3. Restore PR Luce-Org#59's `"You are a precise coding assistant..."` system prompt at message[0]. Realistic shape, deterministic prefix.
4. Bump default --n-gen 8 → 64. Qwen 3.6 is a thinking model that spends tokens in `reasoning_content`; 8 was too tight to ever emit a completion token. (Headline metric is TTFT regardless, but a non-zero n_tok lets us report decode tok/s when present.)

Numbers from a 10-turn replay of a real session (15 → 11,350 chars, RTX 3090 Ti, Q4_K_M):

    total TTFT: 79.90s cold → 38.02s warm = 2.10x
    total wall: 97.62s cold → 93.30s warm = 1.05x (decode-bound)
    best call: 6.58s TTFT cold → 1.61s warm = 4.09x at 11K-char prefix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

easel · 2026-04-30T16:13:07Z

Ran the bench against PR Luce-Org#59's server. Real session, 10 turns, RTX 3090 Ti, Qwen 3.6 27B Q4_K_M, max-ctx=16384, slots=0 vs slots=4.

Headline

metric	cold	warm	speedup
total TTFT	79.90s	38.02s	2.10×
total wall	97.62s	93.30s	1.05×
best call (call 10, 11.3K chars)	6.58s TTFT	1.61s	4.09×

Wall barely moves because Qwen 3.6 is a thinking model: with --n-gen 64 every call burns the full budget in reasoning_content (n_tok=0 reported) and the wall is dominated by reasoning decode, which the prefix cache doesn't accelerate. TTFT — which is what the cache actually targets — halves overall and shows the expected "deeper prefix → bigger win" pattern at call 10.

Per-call

call  in_chars  cold ttft  warm ttft  ttft x   cold wall  warm wall  wall x
   1        93     9640ms      828ms  11.65x      11.19s      2.54s   4.41x
   2     1,236    11647ms     3393ms   3.43x      12.60s      5.12s   2.46x
   3     1,447     2231ms     4885ms   0.46x       3.23s      6.75s   0.48x
   4     1,633    13245ms     2128ms   6.23x      14.75s     11.84s   1.25x
   5     2,020     3136ms     2106ms   1.49x       4.60s     10.47s   0.44x
   6     4,431     7219ms     6615ms   1.09x       9.51s     15.32s   0.62x
   7     5,236     9580ms     6234ms   1.54x      11.51s     12.18s   0.94x
   8    10,504     6279ms     4969ms   1.26x       8.94s     10.20s   0.88x
   9    10,753    10347ms     5256ms   1.97x      12.67s     10.63s   1.19x
  10    11,350     6579ms     1607ms   4.09x       8.61s      8.23s   1.05x

Per-call is noisy (call 3 is slower warm than cold on TTFT, and calls 3 and 5-8 are slower warm on wall): single run, shared GPU with the host, decode variance from a thinking model. Aggregate is the meaningful signal.
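The headline TTFT totals can be recomputed from the per-call table. The snippet below sums the cold and warm TTFT columns as listed above and flags the calls where warm TTFT was slower than cold; the variable names are illustrative.

```python
# TTFT columns (ms) copied from the per-call table, calls 1..10.
cold_ms = [9640, 11647, 2231, 13245, 3136, 7219, 9580, 6279, 10347, 6579]
warm_ms = [828, 3393, 4885, 2128, 2106, 6615, 6234, 4969, 5256, 1607]

total_cold_s = sum(cold_ms) / 1000
total_warm_s = sum(warm_ms) / 1000
aggregate_speedup = total_cold_s / total_warm_s

# Calls (1-indexed) where the warm server had the slower TTFT.
ttft_regressions = [i + 1 for i, (c, w) in enumerate(zip(cold_ms, warm_ms))
                    if w > c]

print(f"total TTFT: {total_cold_s:.2f}s cold -> {total_warm_s:.2f}s warm "
      f"= {aggregate_speedup:.2f}x")
print("warm-slower TTFT calls:", ttft_regressions)
```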

Three changes pushed in `7c182c9`

While running it I hit three issues with v1; followed up with bench(prefix-cache): flatten tool I/O + warmup + system prompt:

Tool I/O flattening. Your ChatRequest/ChatMessage schema requires content and ignores tool_calls / role=tool. v1 emitted structured tool messages and got a 422 on every call after the first. v2 flattens: tool_use → <tool_call name=X>{args}</tool_call> text inside the assistant content, tool_result → <tool_response id=X>...</tool_response> text inside the user content. Single message per turn, role in {system,user,assistant}. Token counts on the wire stay close to the original (which is what the cache cares about) and the chat template wraps each turn the same way. When the server gains tool support upstream, this can cleanly switch back to structured messages — happy to do that follow-up.
Warmup call. First cold call was eating ~95s of one-time CUDA graph / kernel JIT cost. Now there's a discarded max_tokens=1 warmup before the timed loop on each server.
System prompt restored. v1 dropped your default "You are a precise coding assistant..." system message. Restored at messages[0] for realistic prefix shape.

Also bumped default --n-gen 8 → 64 so a thinking model has room to emit a completion token after reasoning. Doesn't change any TTFT numbers but unlocks a real decode tok/s reading on non-thinking models.

Still draft pending your review of the approach.

…k fallback

Three follow-ups after running v2 end-to-end:

1. Point at server_tools.py instead of server.py. server_tools is the production agent-CLI endpoint and has the prefix cache fully wired in (lookup / prepare_inline_snap / confirm_inline_snap at all four /v1 handlers). server.py doesn't accept tool_calls in its schema, which led v2 to flatten tool I/O into <tool_call>/<tool_response> text. That worked but obscured what the bench was actually measuring.
2. Revert the flattening hack in _to_openai_messages. Now emits proper structured tool messages:
     tool_use → assistant.tool_calls[].function.{name,arguments}
     tool_result → role="tool" message with tool_call_id
   server_tools accepts this directly (ChatMessage.content: Any | None, tool_calls + tool_call_id fields). What's on the wire matches what real OpenAI-compat agent CLIs send, so the bench measures the path that production traffic actually takes.
3. Token-count fallback in _stream_chat. PR Luce-Org#59's server does NOT honour stream_options.include_usage: no usage chunk is ever emitted on /v1/chat/completions. Without a fallback every call reports n_tok=0 even when 64 content deltas streamed. Now we prefer usage.completion_tokens when present, otherwise count content/reasoning/tool deltas as a proxy.

System prompt and warmup-call still in place from the previous commit; --n-gen default still 64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Self-contained reproducer for the multi-slot inline-snap regression in prefix_cache.py + qwen35_target_graph.cpp. No transcript dependency: generates a 6-call growing-prefix sequence with synthetic pylint-style tool results, runs against slots=0 (control) then slots=2 (repro), prints per-call content/reasoning/finish + a side-by-side table.

Trigger profile: starting at the second multi-turn call (~5K char prefix), warm responses become content_len=0 / comp_tok=0 / finish_reason=stop; subsequent calls return in <50 ms, also empty. Cold path on the same prompts produces 64 tokens per call.

Suppresses the GGML gdb-fork backtrace handler via GGML_NO_BACKTRACE=1 so the daemon log stays readable when ggml-cuda hits its "device not ready" error path during the lazy snap-buffer alloc.

Usage:

    python3 dflash/scripts/repro_empty_response.py \
        --target /path/to/Qwen3.6-27B*.gguf \
        --draft /path/to/qwen3.6-27b-dflash \
        --bin /path/to/dflash/build/test_dflash \
        --server /path/to/dflash/scripts/server_tools.py

Exits 0 on confirmed repro, 1 if every warm call produced output (threshold not reached; bump --n-turns / --tool-chars), 2 if cold itself returned empty (different problem).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
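The exit-code convention can be expressed as a small decision function. This is a sketch of the rule as stated in the commit message, not the script's actual code; `classify_exit_code` and its per-call token-count inputs are hypothetical.

```python
def classify_exit_code(cold_tokens, warm_tokens):
    """Map per-call completion-token counts to the reproducer's exit code.

    0 = repro confirmed (cold fine, at least one warm call empty)
    1 = did not repro (every warm call produced output)
    2 = cold itself returned empty (a different problem)
    """
    if any(n == 0 for n in cold_tokens):
        return 2
    if any(n == 0 for n in warm_tokens):
        return 0
    return 1


# Pre-fix shape from the trigger profile: call 1 fine, calls 2-6 empty.
print(classify_exit_code([64] * 6, [64, 0, 0, 0, 0, 0]))
```

Checking the cold run first keeps a broken control from being misreported as a successful repro.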

…odule

Bumps the llama.cpp submodule to a fix that addresses the prefix-cache empty-response bug at its root: ggml-cuda's VMM allocator's pool extension via cuMemSetAccess races with in-flight async work and returns CUDA_ERROR_NOT_READY. The CU_CHECK macro hits GGML_ABORT but the abort doesn't actually terminate, leaving the just-mapped region without access permissions. Every subsequent read/write into that region silently misbehaves. For the prefix cache, snapshots of KV state get stored into the broken region and restore as zeroed/garbled state, making the model emit 0 tokens with finish_reason=stop.

Manifests on PR Luce-Org#59's inline-snap path because it interleaves compute with allocations on the same backend (snapshot copies during prefill followed by gallocr / rollback / cache rebuild allocations).

The fix adds a cudaDeviceSynchronize before the cuMem* sequence in the pool extension branch. It only fires when the pool actually grows, so steady-state hot-path allocations are unaffected.

llama.cpp PR: Luce-Org/llama.cpp-dflash-ggml#4

Submodule URL temporarily pointed at easel's fork (branch fix/cuda-vmm-pool-extension-race) until the upstream PR merges. After merge, revert .gitmodules to Luce-Org/llama.cpp.git@luce-dflash and bump the submodule pointer to the merge commit.

Also bumps prefix_cache.startup_sync's await_reply timeout 10s → 60s for daemons with multi-slot snap pools at large max-ctx.

Validated on RTX 3090 Ti, CUDA 13.2, Qwen 3.6 27B, max-ctx=24576, slots=2:

    session       turns  TTFT cold  TTFT warm  TTFT x  wall x  empties
    lucebox-hub   10     40.0s      39.9s      1.00x   0.77x   0/10
    nexiq-small   6      55.9s      44.9s      1.24x   0.75x   0/6
    axon-med      10     133.8s     51.7s      2.59x   1.27x   0/10
    helix-large   10     242.1s     97.3s      2.49x   1.52x   0/10

36/36 warm calls produce real content. Cache delivers 2.49–2.59x TTFT speedup on long agentic prefixes (38K–70K chars), the headline win this PR set out to validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…s layers

Cross-attention with TQ3_0 KV cache produced accept_rate=0 because three separate issues compounded:

1. K/V views were cast from TQ3_0 to F16/F32 before ggml_flash_attn_ext. The CUDA FA kernels apply forward FWHT to Q (and inverse FWHT to the output) only when they observe K->type == GGML_TYPE_TQ3_0 (fattn-chunked.cu:228,394; fattn-vec.cuh:168). Casting stripped the type tag, FA picked a non-WHT kernel, and Q (real domain) dotted with K (FWHT domain, just unpacked into F16) produced meaningless scores. Removed the cast; Kfa/Vfa now reach FA with native TQ3_0.
2. TQ3_0 K is iterated in 128-element block strides; an unaligned ne[1] reads past the valid window into stale cache cells. Previously we only padded for head_dim>=512; SWA layers (head_dim=256) skipped padding and silently corrupted attention. Extended needs_kv_pad to fire for any TQ3_0 cache, mirroring gemma4_target_graph.cpp's need_256_pad policy.
3. Each layer created its own FA mask input tensor but only the last one was exposed via out.fa_mask. After fix #2 all four layers needed masks; the unfilled mask buffers contained uninitialised CUDA memory (cudaMalloc is not zeroed), causing NaN logits on subsequent steps. Hoisted a single shared mask out of the per-layer loop. The builder now asserts that all need-mask layers want the same (width, kv_seq_len) and fails loudly if a future long-context build wants per-layer masks (SWA cap < full attn_pos), instead of silently doing the wrong thing.

Trajectory:

    pre-fix:      accept_rate = 0.00 (varying garbage tokens)
    fix #1 only:  accept_rate = 0.00 (drafts pinned to a single token)
    fix #1+#2:    step 1 OK, step 2+ NaN
    fix #1+#2+#3: accept_rate = 0.22 (Q4_K_M target + Q8_0 assistant, TQ3_0 KV, 131-token prompt, 64 generation steps)

Adjacent infrastructure:
- create_gemma4_cache(): extra_q8_layers param to force Q8_0 on specific MTP donor layers when needed.
- get_mtp_swa_pattern(): lightweight helper reading MTP SWA layout from GGUF without loading tensors.
- MTP loader: load centroids/token_ordering whenever n_centroids>0 (graph builder decides whether to use them).
- Test caller: fills out.fa_mask before each compute; dropped the per-step diagnostic prints that are no longer needed.

Known follow-ups (not blocking):
- Long-context multi-mask: SWA cap < full attn_pos trips the assert.
- SWA-wrap branch concat-forces F32 on TQ3_0, losing the WHT path.
- Accept rate 0.22 is in expected range; remaining gap to spike's reference numbers may come from quantization, RoPE source, or attention scale.

…ync stream)

#1: recommit verify wrote accept_n+2 KV slots vs fast-path's accept_n+1; runner's base_pos += accept_n+2 then re-verified bonus at its own slot, skipping every Dth token at γ≥3 partial-accept (reproduced AR/spec divergence in new test_recommit_byte_identical_to_ar).

#2: capture_topology_for_chain() virtual; runner owns the call. verify_batch no longer mutates last_tree_*.

#3: dedicated rollback CUDA stream + cudaMemcpy2DAsync batching in restore_kv_at_dfs (4× fewer launches per layer).

Bug #5 (step_sg_cache O(n_ctx)) deferred; needs ggml_set_rows refactor.

…and_decode (step 3.1)

Rebase of the MTP-via-daemon work onto latest main (PRs Luce-Org#213, Luce-Org#210, Luce-Org#208, Luce-Org#207 already merged) plus the first slice of howard0su's PR Luce-Org#214 review request: move MTP orchestration into dflash/src/common/ behind a generic entry point any ModelBackend can call.

## What landed

### Foundation (rebase port, ~5k LOC)

- `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC): Qwen3.6 native-heads MTP module (Qwen36MtpModule, implements INativeMtp)
- `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}`: MTP head forward graph
- `dflash/src/qwen36/qwen36_mtp_loader.cpp`: NextN tensor loader from GGUF
- `dflash/src/common/mtp_interface.h`: abstract IMtpModule + flavor mixins
- `dflash/src/common/mtp_chain_runner.{cpp,h}`: generic γ-loop runner
- `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h` + `attn_masks.h` + `dflash_target.h` updates: shared infrastructure
- `dflash/src/qwen35/qwen35_backend.{cpp,h}`: extended with optional Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_, do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3)
- `dflash/src/qwen35/qwen35_daemon.{cpp,h}`: DaemonArgs carry MTP fields
- `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp`: hidden-sequence capture path for MTP head warming
- `dflash/test/test_dflash.cpp`: daemon dispatch routes `--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved)
- `dflash/scripts/server.py`: `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source` CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's thinking-default fixes (commit 998b280) without conflict

### Step 3.1: common::mtp::warm_and_decode entry point (TDD red→green)

Howard's review:

> "MTP should be simple as additional weights of modelbackend. If a model
> contains MTP support (gemma4 or qwen3.5), the logic can handle it. In
> other words, the logic should be in /common which can potentially
> leverage by any modelbackend if they support mtp."

Carved out the public surface for the future orchestrator:

    GenerateResult dflash27b::common::mtp::warm_and_decode(
        ModelBackend * backend,
        const GenerateRequest & req,
        const DaemonIO & io);

New files:

- `dflash/src/common/mtp_orchestrator.{cpp,h}`: header pins the signature, cpp is a minimal stub that only handles guard cases (null backend, no MTP support, empty prompt). Real warm + decode body lands in step 3.2, driven by additional red→green tests.
- `dflash/test/test_common_mtp_orchestrator.cpp`: three guard tests written and watched fail BEFORE the stub existed (compile-time RED: "common/mtp_orchestrator.h: No such file or directory"), then GREEN after the stub returned matching error strings.

Test results:

    T1 null_backend PASS
    T2 backend_without_mtp PASS
    T3 empty_prompt PASS
    ALL PASS

## Steps 3.2-3.5 (separate commits, this PR)

3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch + hidden capture + MtpChainRunner.run); red test = identical token IDs vs reference run_qwen36_mtp_harness on a fixed prompt.
3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to common::mtp::warm_and_decode; delete the qwen35-local helpers.
3.4 stub Gemma4Backend MTP override using the same common entry point to prove the interface is generic (not Qwen35-specific).
3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled CPU loops; replace with ggml primitives per howard's point #1. Then retest 24K baseline post-RoPE-fix (howard's other comment) and update PR description with current numbers.

Addresses:

- davide221 Luce-Org#214#issuecomment-4472910706 (merge conflicts): rebased
- howard0su Luce-Org#214#review (changes requested points 2, 3, 4): first slice


… (seeds #1, #2)

- StubServer: ThreadingHTTPServer recorder, zero new deps (mirrors llamacpp_compat_proxy.py pattern)
- Seed #2 green: proxy injects session_id on /v1/messages, preserves existing, passes through GET
- Seed #1 documented: chat/completions round-trip passes; injection assertion commented out pending commit 3
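For readers unfamiliar with the recorder pattern named in this commit, here is a minimal sketch of a stdlib-only stub server built on `ThreadingHTTPServer`. It is not the code from the commit: the class layout, the fixed `{"ok": true}` reply, and the `post_json` helper are assumptions made for illustration.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubServer:
    """Hypothetical upstream stub: records each request, replies with fixed JSON."""

    def __init__(self):
        self.requests = []  # (method, path, parsed JSON body or None)
        recorder = self

        class Handler(BaseHTTPRequestHandler):
            def _reply(self, body):
                # Record before responding so the client never sees a reply
                # for a request that has not been logged yet.
                recorder.requests.append((self.command, self.path, body))
                payload = json.dumps({"ok": True}).encode()
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(payload)))
                self.end_headers()
                self.wfile.write(payload)

            def do_GET(self):
                self._reply(None)

            def do_POST(self):
                length = int(self.headers.get("Content-Length", 0))
                raw = self.rfile.read(length)
                self._reply(json.loads(raw) if raw else None)

            def log_message(self, *args):
                pass  # keep test output quiet

        # Port 0 asks the OS for any free port.
        self._httpd = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
        self.port = self._httpd.server_address[1]
        self._thread = threading.Thread(
            target=self._httpd.serve_forever, daemon=True
        )

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._httpd.shutdown()
        self._httpd.server_close()


def post_json(port, path, obj):
    """Illustrative client helper: POST a JSON body, return (status, parsed reply)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST", path, body=json.dumps(obj),
            headers={"Content-Type": "application/json"},
        )
        resp = conn.getresponse()
        return resp.status, json.loads(resp.read())
    finally:
        conn.close()
```

A test would point the proxy at `stub.port`, send a request through the proxy, and then inspect `stub.requests` to see what actually reached the upstream.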

…#1)

- Add INJECT_ROUTES frozenset: /v1/messages, /v1/chat/completions, /v1/responses
- do_POST checks route_base in INJECT_ROUTES (query-string-safe)
- Seed #1 green: chat/completions round-trip injects session_id
- Add /v1/responses injection test (codex route)
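The route check described in this commit can be sketched as a pure function. The `INJECT_ROUTES` members come from the commit message; everything else is an assumption for illustration, including the helper names, the use of `urlsplit` to drop the query string, and the choice to place `session_id` as a top-level JSON field with a generated UUID.

```python
import json
import uuid
from urllib.parse import urlsplit

# Routes named in the commit message.
INJECT_ROUTES = frozenset(
    {"/v1/messages", "/v1/chat/completions", "/v1/responses"}
)


def route_base(path):
    """Strip the query string so '/v1/messages?beta=true' still matches."""
    return urlsplit(path).path


def maybe_inject_session_id(path, body, session_id=None):
    """Return the request body, adding session_id only where it is missing.

    Assumption: session_id is a top-level field of the JSON request body.
    Bodies on other routes, non-JSON bodies, and bodies that already carry
    a session_id are passed through unchanged.
    """
    if route_base(path) not in INJECT_ROUTES:
        return body
    try:
        payload = json.loads(body)
    except ValueError:
        return body  # not valid JSON: forward untouched
    if not isinstance(payload, dict) or "session_id" in payload:
        return body
    payload["session_id"] = session_id or uuid.uuid4().hex
    return json.dumps(payload).encode()
```

Matching on the path component rather than the raw request line is what makes the check query-string-safe: a membership test against the raw `self.path` would miss any request that carries query parameters.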

mraxai and others added 5 commits April 29, 2026 14:05

easel and others added 2 commits April 30, 2026 14:26

easel changed the title ~~bench(prefix-cache): faithful Claude Code transcript replay + TTFT~~ bench: faithful Claude Code transcript replay + repro & fix for multi-slot empty-response bug Apr 30, 2026

easel force-pushed the bench-agent-loop-faithful-replay branch from 6d74d18 to 81e12e4 Compare May 1, 2026 03:08

easel changed the title ~~bench: faithful Claude Code transcript replay + repro & fix for multi-slot empty-response bug~~ bench: faithful Claude Code transcript replay + repro & build-flag fix for empty-response bug May 1, 2026

easel force-pushed the bench-agent-loop-faithful-replay branch from 81e12e4 to dedaf69 Compare May 1, 2026 03:44

dusterbloom force-pushed the feat/prefix-cache branch from b597e8f to 7e143f9 Compare May 1, 2026 16:28

davide221 force-pushed the feat/prefix-cache branch from e216877 to cbdf9c8 Compare May 3, 2026 11:25

easel wants to merge 8 commits into dusterbloom:feat/prefix-cache from easel:bench-agent-loop-faithful-replay
