bench: faithful Claude Code transcript replay + repro & build-flag fix for empty-response bug #1

Draft
easel wants to merge 8 commits into dusterbloom:feat/prefix-cache from easel:bench-agent-loop-faithful-replay

@easel easel commented Apr 30, 2026

Targets feat/prefix-cache (PR Luce-Org#59). Drafted for review.

Five commits, fairly cleanly split: bench loader → bench tooling fixes → repro script → upstream cuda fix.

1. Faithful transcript replay (bench_agent_loop.py)

extract_user_turns reads only type=user records where content is a str. Real Claude Code transcripts have user records of tool_result blocks with content as a list — silently skipped. The bench replays only typed human prompts with bench-synthesised assistant replies, with no tool I/O.

The replacement loader (_load_transcript + _to_openai_messages) coalesces consecutive same-role records into single turns, converts Anthropic blocks → OpenAI structured messages (tool_use → assistant.tool_calls, tool_result → role=tool, thinking dropped), and at each assistant index sends the exact prefix that was sent at that point. Validated against a 65-record session: old loader extracts 7 typed-user turns; new loader walks all 32 call points with prefix at the last call = 95,536 chars. Targets dflash/scripts/server_tools.py (the production agent endpoint).
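
The block conversion described above could look roughly like this. It is a minimal sketch, not the PR's `_to_openai_messages`: the helper name `blocks_to_openai`, the assumption that a `tool_result` carries either a string or a list of text blocks, and the JSON encoding of arguments are illustrative choices based on the public Anthropic and OpenAI message shapes.

```python
import json

def blocks_to_openai(role, blocks):
    """Convert one coalesced Anthropic-style turn into OpenAI-style messages.

    Hypothetical helper; the real loader in bench_agent_loop.py may differ.
    """
    if isinstance(blocks, str):  # typed human prompt
        return [{"role": role, "content": blocks}]

    text_parts, tool_calls, tool_msgs = [], [], []
    for b in blocks:
        kind = b.get("type")
        if kind == "text":
            text_parts.append(b["text"])
        elif kind == "tool_use":
            tool_calls.append({
                "id": b["id"],
                "type": "function",
                "function": {"name": b["name"],
                             "arguments": json.dumps(b.get("input", {}))},
            })
        elif kind == "tool_result":
            content = b.get("content", "")
            if isinstance(content, list):  # assumed: list of text blocks
                content = "".join(c.get("text", "") for c in content)
            tool_msgs.append({"role": "tool",
                              "tool_call_id": b["tool_use_id"],
                              "content": content})
        # "thinking" blocks (and unknown block types) are dropped

    if role == "assistant":
        msg = {"role": "assistant", "content": "".join(text_parts) or None}
        if tool_calls:
            msg["tool_calls"] = tool_calls
        return [msg]

    out = list(tool_msgs)
    if text_parts:
        out.append({"role": "user", "content": "".join(text_parts)})
    return out
```

A user record made only of `tool_result` blocks therefore becomes one or more `role="tool"` messages rather than being skipped.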

Also: streaming SSE for separate TTFT measurement, and a token-count fallback that counts content/reasoning/tool deltas when usage.completion_tokens isn't emitted (PR Luce-Org#59's server doesn't honour stream_options.include_usage).

2. Standalone empty-response reproducer (repro_empty_response.py)

Self-contained, no transcript dependency. Generates 6 growing-prefix calls (130 → 5K → 10K → 15K → 20K → 26K chars) with synthetic pylint-style tool results, runs against slots=0 (control) then slots=2 (repro), prints a side-by-side table.

Pre-fix (slots=2): warm call 2 returns content=0 / comp_tok=0 / finish=stop after 9.7s of real prefill; calls 3-6 return empty in <50ms (cache "hits" on a corrupted slot). Reproduces deterministically.

3. Root cause + upstream fix (dflash/deps/llama.cpp)

The bug is in upstream ggml-cuda's VMM allocator, not in PR Luce-Org#59 itself. PR Luce-Org#59 just reliably triggers it because its inline-snap pattern interleaves compute with allocations on the same backend.

Mechanism:

  • The VMM pool's alloc() path extends the pool via cuMemCreate + cuMemMap + cuMemSetAccess when an allocation can't fit in the existing mapped region.
  • cuMemSetAccess races with previously-queued async work and returns CUDA_ERROR_NOT_READY.
  • CU_CHECK fires GGML_ABORT, but in some process configurations the abort doesn't terminate (gdb-fork backtrace path interferes, SIGABRT swallowed somewhere — exact mechanism unclear, but reproducible).
  • The mapping is registered but access permissions are never set.
  • Subsequent reads/writes silently return zeroed/garbled data — for the prefix cache, snapshots get stored into the broken region and restore as garbage, making the model emit 0 tokens.

Fix is one cudaDeviceSynchronize() before the cuMem* sequence inside the pool extension branch:

if (size > avail) {
    // ... compute reserve_size ...
    GGML_ASSERT(pool_size + reserve_size <= CUDA_POOL_VMM_MAX_SIZE);

    CUDA_CHECK(cudaDeviceSynchronize());   // ← new

    // allocate more physical memory
    CU_CHECK(cuMemCreate(...));
    ...
}

Only fires when the pool actually needs to grow, so steady-state hot-path allocations are unaffected.

llama.cpp PR: Luce-Org/llama.cpp-dflash-ggml#4

This PR's submodule URL is temporarily pointed at easel's fork (fix/cuda-vmm-pool-extension-race branch) until that PR merges; revert to Luce-Org/llama.cpp.git@luce-dflash after merge.

Also bumps prefix_cache.startup_sync's await_reply timeout 10s → 60s for daemons with multi-slot snap pools at large --max-ctx.

Validation

repro_empty_response.py exits 0 (repro confirmed) pre-fix, exits 1 ("DID NOT REPRO") post-fix.

4-session sweep on RTX 3090 Ti, CUDA 13.2, Qwen 3.6 27B, --max-ctx=24576 --warm-slots=2:

session                         calls  TTFT cold  TTFT warm  TTFT ×  wall ×  empties
lucebox-hub (≤11K char prefix)     10      40.0s      39.9s   1.00×   0.77×     0/10
nexiq-small (≤25K)                  6      55.9s      44.9s   1.24×   0.75×      0/6
axon-med (≤39K)                    10     133.8s      51.7s   2.59×   1.27×     0/10
helix-large (≤71K)                 10     242.1s      97.3s   2.49×   1.52×     0/10

Zero empty responses across 36 warm calls. Cache wins scale with prefix size — 2.49–2.59× TTFT and 1.27–1.52× wall on real long-prefix agentic sessions is the headline win this PR (and PR Luce-Org#59) set out to validate. Short sessions barely break even, as expected (cache-creation overhead vs. limited reuse).

Test plan

  • repro_empty_response.py — pre-fix repros, post-fix doesn't
  • 4-session sweep — 36/36 warm calls real
  • Build: clean nvcc, no warnings
  • Cross-check on Blackwell / sm_120 — only validated on RTX 3090 Ti / sm_86
  • Confirm test_multi_turn_prefix_cache.py still passes — should, since it's checking turn-2-5 ratios on a small fixture that probably didn't trigger the bug anyway

Status

Draft. Two PRs need review:

Happy to split this PR if reviewer prefers (bench-only vs submodule-bump), but the bench changes don't make sense without the underlying fix being available.

🤖 Generated with Claude Code

mraxai and others added 5 commits April 29, 2026 14:05
Adds a snapshot/restore mechanism so the C++ daemon can preserve target
KV + SSM/conv + target_feat state across HTTP requests. Subsequent turns
in an agent loop that share a system prompt skip the system-prefill cost
(previously paid in full on every turn since the daemon called
free_target_cache + create_target_cache between requests).

C++ side
--------
- New PrefixSnapshot struct (internal.h): owns its own ggml_context +
  backend buffer, holds slim KV per layer + SSM/conv/target_feat per
  layer + cur_pos + last_tok + kv_k_type + max_ctx for sanity checks.
  Skips ssm_intermediate / conv_input_cache (within-decode rollback
  buffers, regenerated on first decode step after restore).
- snapshot_target_cache, restore_target_cache, free_prefix_snapshot in
  qwen35_target_graph.cpp using ggml_backend_tensor_copy. Lazy alloc
  (first SNAPSHOT call), reuse on subsequent refreshes.
- TargetCache gains a last_tok field, used solely by the prefix-cache
  bridge: when restored cur_pos == prompt_len the prefill loop runs
  zero iterations and the decode seed comes from the restored last_tok.

Daemon protocol (test_dflash.cpp)
---------------------------------
- Adds 4 new commands on stdin, dispatched before the legacy bare prompt
  line: SNAPSHOT N, RESTORE N <prompt> <n_gen>, FREE_SNAPSHOT N,
  LIST_SLOTS. Replies on stdout: [snap] slot=N cur_pos=P /
  [snap] freed slot=N / [snap] slots=A,B,C.
- prefill loop reads from cache.cur_pos as start (0 for fresh, >0 after
  restore). Restored cache + matching-length prompt -> zero-iter prefill,
  decode seeds from cache.last_tok.
- Hard cap of PREFIX_CACHE_SLOTS = 8 in the daemon.
- End-of-iteration writes cache.cur_pos = out_all.size() and
  cache.last_tok so the next SNAPSHOT command captures correct boundary.
- Frees all snapshot slots on daemon exit.
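
A client-side parser for the three reply shapes listed above might look like the sketch below. The function name `parse_snap_reply`, the returned tuple layout, and the handling of an empty `slots=` list are assumptions for illustration; only the reply strings themselves come from this commit message.

```python
import re

# Reply formats taken from the protocol description above.
_SNAP = re.compile(r"^\[snap\] slot=(\d+) cur_pos=(\d+)$")
_FREED = re.compile(r"^\[snap\] freed slot=(\d+)$")
_SLOTS = re.compile(r"^\[snap\] slots=([\d,]*)$")

def parse_snap_reply(line):
    """Return (kind, fields) for a [snap] reply line, or None for other output."""
    line = line.strip()
    m = _SNAP.match(line)
    if m:
        return "snapshot", {"slot": int(m.group(1)), "cur_pos": int(m.group(2))}
    m = _FREED.match(line)
    if m:
        return "freed", {"slot": int(m.group(1))}
    m = _SLOTS.match(line)
    if m:
        # Assumed: an empty list is printed as "slots=" with nothing after it.
        return "slots", {"slots": [int(x) for x in m.group(1).split(",") if x]}
    return None
```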

Python side
-----------
- New scripts/prefix_cache.py:
  * DaemonStdoutBus owns the stdout read loop, routes [snap]-prefixed
    lines to waiting coroutines, suppresses noisy [step]/[timing] logs.
  * PrefixCache stores hash -> slot_id LRU. lookup() returns
    (slot_id, prefix_len) or None. maybe_snapshot() does a SECOND
    n_gen=0 prefill of the prefix-only tokens, then SNAPSHOT — this
    aligns the snapshot's cur_pos exactly with the cache key's prefix
    length (one extra system prefill on cold turns, recovered many
    times over on subsequent warm turns).
  * find_prefix_boundary auto-detects the FIRST end-of-system-message
    boundary in Qwen chat templates, allowing one intervening newline
    token between im_end and im_start.
  * hash_prefix uses SHA-1 truncated to 16 bytes over (token ids,
    kv_k_type, fa_window).
  * DAEMON_MAX_SLOTS = 8 clamp; cap > limit emits a warning.

- server.py + server_tools.py:
  * --prefix-cache-slots N CLI flag (default 4, 0 disables).
  * Daemon spawn now uses stdout=PIPE so DaemonStdoutBus can route
    protocol replies.
  * Resolve effective KV-K type + fa_window from DFLASH27B_* env vars
    at daemon spawn time (mirrors C++ daemon's env parsing) and pass
    into PrefixCache so they're part of the hash key — daemon restart
    with different flags can't return stale state.
  * 4 lookup/maybe_snapshot call sites per file (stream + non-stream
    for /v1/chat/completions and /v1/messages). On miss send the bare
    prompt line, then maybe_snapshot drains via _drain_pipe_to_sentinel
    helper so the next protocol command is clean.
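
To make the cache-key idea concrete, here is a minimal sketch of a truncated SHA-1 key plus a hash -> slot LRU. The byte layout fed to the hash, the class name `SlotLRU`, and the slot-reuse policy are assumptions; the commit only states that the key is SHA-1 truncated to 16 bytes over (token ids, kv_k_type, fa_window) and that the cache is an LRU.

```python
import hashlib
import struct
from collections import OrderedDict

def hash_prefix(token_ids, kv_k_type, fa_window):
    """16-byte key over the prefix tokens and the KV configuration."""
    h = hashlib.sha1()
    h.update(struct.pack("<I", len(token_ids)))               # length prefix
    h.update(struct.pack(f"<{len(token_ids)}i", *token_ids))  # int32 token ids
    h.update(kv_k_type.encode("utf-8") + b"\x00")
    h.update(struct.pack("<i", fa_window))
    return h.digest()[:16]  # SHA-1 yields 20 bytes; keep the first 16

class SlotLRU:
    """Maps cache keys to daemon slot ids, evicting the least recently used."""

    def __init__(self, cap):
        self.cap = cap
        self._slots = OrderedDict()  # key -> slot id, oldest first

    def get(self, key):
        if key not in self._slots:
            return None
        self._slots.move_to_end(key)  # touch on hit
        return self._slots[key]

    def put(self, key):
        if self.cap <= 0:
            return None  # caching disabled
        if key in self._slots:
            self._slots.move_to_end(key)
            return self._slots[key]
        if len(self._slots) >= self.cap:
            _, slot = self._slots.popitem(last=False)  # reuse the evicted slot
        else:
            slot = len(self._slots)
        self._slots[key] = slot
        return slot
```

Because `kv_k_type` and `fa_window` feed the hash, the same tokens under a different KV configuration produce a different key, which is the stale-state protection the commit describes.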

Verification
------------
- nm: new symbols snapshot_target_cache, restore_target_cache,
  free_prefix_snapshot in libdflash27b.a.
- C++ smoke (manual /tmp/smoke_restore.py):
  cold prompt n_gen=8 -> [a,b,c,d,e,f,g,h]
  cold same prompt n_gen=4 + SNAPSHOT 0 -> shared_4 = [a,b,c,d]
  RESTORE 0 + n_gen=4 -> warm_4 = [e,f,g,h]
  byte-equal continuation.
- End-to-end (test_server_prefix_cache.py): 5K-token system prompt,
  three turns at max_tokens=8.
  turn_1 9.87s (cold + snapshot warm-up)
  turn_2 0.48s  ratio_2/1 = 0.05
  turn_3 0.44s  ratio_3/1 = 0.04
  All replies non-empty and consistent. ~20x speedup on warm turns.

Reviewed by codex; this commit incorporates the two correctness
fixes flagged: hash inputs now use real env-var-derived values
instead of hardcoded "q8_0"/2048 literals, and Python cap is clamped
to the daemon's PREFIX_CACHE_SLOTS = 8 hard limit so configurations
above it can't cause silent SNAPSHOT failures. The third codex
finding (boundary detector won't handle tool-definition preambles
or multi-segment system messages) is documented as a follow-up
under server_tools.py — current detector covers the simple Qwen
system+user case; tool-using clients fall back to no-cache silently.

Plan file: ~/.claude/plans/yes-please-plan-for-luminous-pudding.md
Phase A (~1 week scope) of a 4-phase agentic-friendly KV/state plan.
Phase B (block-chain mid-conversation cache), Phase C (sliding KV
growth), Phase E (tool-loop incremental tokenization) are deferred
to follow-up commits.
Extends Phase A's single-point system-prompt cache to multi-slot LRU
that snapshots at every chat-template role boundary, so multi-turn
agent loops hit progressively deeper cached state on each new turn.

C++ side (B.1 + B.2)
--------------------
- PrefixSnapshot gains is_thin / kv_start / kv_end fields and two new
  primitives: snapshot_target_cache_thin and restore_target_cache_chain.
  Thin snapshots capture only KV slice [kv_start, kv_end); chain restore
  loads a thick base then layers thins. Implemented via per-strip H2D+D2H
  staging since ggml_backend_tensor_copy refuses views with mismatched
  layouts (verified by spike_thin_copy.cpp on Q8_0 / TQ3_0 / F16).
- Daemon protocol: SNAPSHOT_THIN N kv_start kv_end and RESTORE_CHAIN
  thick_slot thin_slots prompt_file n_gen. The thin/chain primitives
  remain unused by Phase B's actual flow (see "design pivot" below) but
  are kept for future block-chain extensions.

Design pivot
------------
Original plan called for a thick-anchor + thin-chain cache. On
implementation it became clear that thin snapshots only capture KV;
SSM/conv state can't be reconstructed from KV alone (DeltaNet recurrence
is non-replayable without re-running prefill). A chain restore would
land at the thick's cur_pos with valid SSM, then need DeltaNet replay
through the thin range — defeating the savings.

Pivoted to a simpler "multi-slot THICK LRU" design that delivers the
same user-visible win: cache full state at multiple block boundaries,
restore the deepest matching THICK on lookup, prefill only the new
suffix. Memory cost (4 thick slots × ~244 MB ≈ 1 GB) matches what the
thick+thin chain would have used.

Python side (B.3 + B.4)
-----------------------
- find_all_boundaries enumerates every <|im_end|><|im_start|> boundary
  after the system marker (allows up to 2 intervening tokens to handle
  the newline separator Qwen emits).
- PrefixCache.lookup walks all candidate cuts and returns the deepest
  cached match (longest-prefix); LRU touched on every hit.
- PrefixCache.maybe_snapshot iterates ALL boundaries on cache miss and
  snapshots each that's not already cached, evicting LRU when over cap.
- Each snapshot still uses Phase A's n_gen=0 prefill + SNAPSHOT pattern
  to land at the exact boundary cur_pos. Multi-snapshot increases
  cold-turn latency proportionally (e.g. 5-turn test: turn 1 13.5 s vs
  Phase A's ~10 s), but turns 2-5 all benefit.
- server.py / server_tools.py: zero changes — API surface stayed the
  same (lookup returns (slot, prefix_len) or None).
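
The boundary enumeration and deepest-match lookup can be sketched as follows. This is illustrative only: the token ids are made up, the cut position (just before the next im_start) is an assumption, the "after the system marker" restriction is omitted, and `lookup_deepest` takes a plain dict instead of the real PrefixCache.

```python
def find_all_boundaries(tokens, im_start, im_end, max_gap=2):
    """Cut positions where an im_end is followed by an im_start within
    max_gap intervening tokens (e.g. the newline separator)."""
    cuts = []
    for i, tok in enumerate(tokens):
        if tok != im_end:
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j] == im_start:
                cuts.append(j)  # prefix tokens[:j] ends right before im_start
                break
    return cuts

def lookup_deepest(tokens, cuts, cache, key_fn):
    """Return (slot, prefix_len) for the longest cached prefix, else None."""
    for cut in sorted(cuts, reverse=True):
        slot = cache.get(key_fn(tokens[:cut]))
        if slot is not None:
            return slot, cut
    return None
```

Walking the cuts from deepest to shallowest means the first hit is the longest cached prefix, so only the suffix after `prefix_len` needs prefill.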

Tests (B.5)
-----------
- spike_thin_copy.cpp validates the per-strip staging-copy approach
  used by snapshot_target_cache_thin (works on Q8_0, TQ3_0, F16).
- test_multi_turn_prefix_cache.py: 5-turn agent loop, ~2K-token system
  prompt, growing history. RTX 3090 + Qwen3.6-27B-Q4_K_XL:
    turn 1  13.53 s  (cold + multi-snapshot warm-up)
    turn 2   0.55 s  ratio 0.04
    turn 3   0.70 s  ratio 0.05
    turn 4   0.85 s  ratio 0.06
    turn 5   1.23 s  ratio 0.09
  All warm turns < 30 % of cold turn 1; turn 5 still 11x faster than
  turn 1.
- Existing test_server_prefix_cache.py (3-turn shared system prompt)
  remains green: turn 2/3 at 3 % of turn 1.

The codex review findings on Phase A's hardcoded hash inputs and
slot-cap mismatch were addressed in the Phase A commit (e429894).
Codex's third finding
(boundary detector won't handle tool-definition preambles in
server_tools.py) is still open and tracked as a follow-up; the new
find_all_boundaries inherits that limitation.

Bench branch: feat/prefix-cache (cumulative Phase A + B). Plan files at
~/.claude/plans/yes-please-plan-for-luminous-pudding.md (Phase A) and
~/.claude/plans/phase-b-block-chain-cache.md (Phase B, including the
design pivot rationale).
prepare_inline_snap was popping the LRU entry up-front so the daemon
could overwrite that slot.  If the request aborted before
confirm_inline_snap ran, the old entry was already gone AND the new
one was never registered, stranding a daemon slot until process
restart.  Reserve the slot via _pending_evict_key without removing
the old entry; pop + insert atomically in confirm_inline_snap.  Add
abort_inline_snap for explicit cancellation.
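
The reserve-then-commit pattern described above can be sketched like this. The class name `InlineSnapCache` and its fields are hypothetical (the real code tracks a `_pending_evict_key`), and the sketch assumes `cap >= 1` and at most one pending reservation at a time.

```python
from collections import OrderedDict

class InlineSnapCache:
    def __init__(self, cap):
        self.cap = cap
        self.entries = OrderedDict()  # key -> slot, oldest (LRU) first
        self._pending = None          # (new_key, slot, evict_key or None)

    def prepare_inline_snap(self, new_key):
        """Pick a slot for the daemon to write, without touching entries."""
        if len(self.entries) < self.cap:
            used = set(self.entries.values())
            slot = next(s for s in range(self.cap) if s not in used)
            evict_key = None
        else:
            evict_key, slot = next(iter(self.entries.items()))  # LRU stays registered
        self._pending = (new_key, slot, evict_key)
        return slot

    def confirm_inline_snap(self):
        """Swap the old entry for the new one in a single step."""
        new_key, slot, evict_key = self._pending
        if evict_key is not None:
            self.entries.pop(evict_key, None)
        self.entries[new_key] = slot
        self._pending = None

    def abort_inline_snap(self):
        """Drop the reservation; the old entry was never removed."""
        self._pending = None
```

If a request dies between prepare and confirm, calling `abort_inline_snap` leaves the table exactly as it was, so no slot is stranded.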

Also adds bench_agent_loop.py — replays real Claude Code session
JSONL turns through the dflash server with prefix-cache off vs on.
On 5 short real-session turns: turn-1 6.28x (page cache + warmup),
turns 2-5 ~equal because real-session prompts are too short for
prefix-cache to dominate.  The synthetic 2K-system test
(test_multi_turn_prefix_cache.py) is where the cache actually wins.

Both issues were raised in the codex review of the Phase B + B.7 + B.8
work; the High one (last_tok=-1 after no-op restore prefill) was
already covered by the cache.last_tok bridge added earlier.
The current bench reads only `type=user` records where `content` is a
`str`. In a real Claude Code transcript every user record after the
first is a list of `tool_result` blocks with `content` as a `list` —
all silently skipped. The bench replays only typed human prompts with
bench-synthesised assistant replies in between, with no tool I/O.

Tool I/O is the bulk of an agentic prefix: typical prefix grows from
~5K chars at turn 1 to 60-300K by turn 30. Validated against a real
session (32 assistant turns, ~95K chars at the last call): the old
loader extracts 7 typed-user turns; the new loader walks all 32 call
points with faithful prefix growth. The PR's own commit message
("Real-session prompts are too short for the cache to dominate; cold
and warm turns 2-5 are within noise") is a measurement artefact of
the loader, not a property of the cache.

Replace `extract_user_turns()` with a transcript loader that:

1. Coalesces consecutive same-role records into single turns (one
   logical LLM turn = N JSONL rows).
2. Converts Anthropic blocks → OpenAI messages: text → content,
   tool_use → assistant.tool_calls, tool_result → tool message,
   thinking dropped.
3. At each assistant index, sends the exact prefix that was sent at
   that point (system + everything before this assistant turn, tool
   I/O included) and advances state via the recorded assistant turn,
   not a bench-synthesised one.
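
Steps 1 and 3 can be sketched as below. This is a simplified illustration, not the bench's loader: it assumes each record has already been reduced to `{"role", "content"}` (real JSONL rows nest these differently), and the names `coalesce` and `call_points` are hypothetical.

```python
def coalesce(records):
    """Merge consecutive same-role records into one logical turn each."""
    turns = []
    for rec in records:
        role, content = rec["role"], rec["content"]
        if isinstance(content, str):
            blocks = [{"type": "text", "text": content}]
        else:
            blocks = list(content)
        if turns and turns[-1]["role"] == role:
            turns[-1]["blocks"].extend(blocks)
        else:
            turns.append({"role": role, "blocks": blocks})
    return turns

def call_points(turns):
    """Yield (prefix_turns, recorded_assistant_turn) at each assistant index.

    The prefix is everything before the assistant turn, tool I/O included;
    state advances with the recorded assistant turn, not a synthesised one.
    """
    for i, turn in enumerate(turns):
        if turn["role"] == "assistant":
            yield turns[:i], turn
```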

Also switch the chat call to streaming SSE so we measure TTFT
separately from total wall — TTFT is what the prefix cache
accelerates, total wall mixes prefill speedup with decode rate.
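
A minimal way to separate the two timings, assuming the response arrives as a lazy iterable of streamed chunks (so the request is effectively issued when iteration starts). The helper name `time_stream` and the injectable `clock` parameter are illustrative, not part of the bench.

```python
import time

def time_stream(chunks, clock=time.perf_counter):
    """Consume streamed chunks; return (ttft_s, wall_s, n_chunks).

    ttft_s is the time to the first chunk (None if nothing arrived);
    wall_s is the time until the stream is exhausted.
    """
    start = clock()
    ttft = None
    n = 0
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start
        n += 1
    wall = clock() - start
    return ttft, wall, n
```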

Preserved: cold (slots=0) vs warm (slots=N) dual-server structure,
--turns / --session interface, per-turn ratio table (now with TTFT
and wall columns).

Default session dir derives the workspace from CWD (replace `/` with
`-`) instead of hardcoding `-home-peppi-Dev-lucebox-hub`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four follow-ups after running v1 against the PR's server:

1. The server's ChatRequest/ChatMessage schema requires `content` and
   ignores `tool_calls`/role=tool, so emitting structured tool messages
   produces 422s on every call after the first. Flatten tool_use →
   `<tool_call name=X>{json args}</tool_call>` text in assistant
   content; tool_result → `<tool_response id=X>...</tool_response>`
   text in user content. One message per turn, role in
   {system,user,assistant}, content always a string. Same on-wire
   prefix bytes (which is what the cache cares about), runs cleanly
   against the PR's server.

2. Add a discarded warmup call before the timed loop. Without this
   the first cold call eats ~95s of CUDA graph capture / kernel JIT
   one-time cost and dominates the totals.

3. Restore PR Luce-Org#59's `"You are a precise coding assistant..."` system
   prompt at message[0]. Realistic shape, deterministic prefix.

4. Bump default --n-gen 8 → 64. Qwen 3.6 is a thinking model that
   spends tokens in `reasoning_content`; 8 was too tight to ever emit
   a completion token. (Headline metric is TTFT regardless, but a
   non-zero n_tok lets us report decode tok/s when present.)

Numbers from a 10-turn replay of a real session
(15 → 11,350 chars, RTX 3090 Ti, Q4_K_M):
  total TTFT: 79.90s cold → 38.02s warm = 2.10x
  total wall: 97.62s cold → 93.30s warm = 1.05x  (decode-bound)
  best call: 6.58s TTFT cold → 1.61s warm = 4.09x at 11K-char prefix

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

easel commented Apr 30, 2026

Ran the bench against PR Luce-Org#59's server. Real session, 10 turns, RTX 3090 Ti, Qwen 3.6 27B Q4_K_M, max-ctx=16384, slots=0 vs slots=4.

Headline

metric                            cold        warm    speedup
total TTFT                        79.90s      38.02s  2.10×
total wall                        97.62s      93.30s  1.05×
best call (call 10, 11.3K chars)  6.58s TTFT  1.61s   4.09×

Wall barely moves because Qwen 3.6 is a thinking model: with --n-gen 64 every call burns the full budget in reasoning_content (n_tok=0 reported) and the wall is dominated by reasoning decode, which the prefix cache doesn't accelerate. TTFT — which is what the cache actually targets — halves overall and shows the expected "deeper prefix → bigger win" pattern at call 10.

Per-call

call  in_chars  cold ttft  warm ttft  ttft x   cold wall  warm wall  wall x
   1        93     9640ms      828ms  11.65x      11.19s      2.54s   4.41x
   2     1,236    11647ms     3393ms   3.43x      12.60s      5.12s   2.46x
   3     1,447     2231ms     4885ms   0.46x       3.23s      6.75s   0.48x
   4     1,633    13245ms     2128ms   6.23x      14.75s     11.84s   1.25x
   5     2,020     3136ms     2106ms   1.49x       4.60s     10.47s   0.44x
   6     4,431     7219ms     6615ms   1.09x       9.51s     15.32s   0.62x
   7     5,236     9580ms     6234ms   1.54x      11.51s     12.18s   0.94x
   8    10,504     6279ms     4969ms   1.26x       8.94s     10.20s   0.88x
   9    10,753    10347ms     5256ms   1.97x      12.67s     10.63s   1.19x
  10    11,350     6579ms     1607ms   4.09x       8.61s      8.23s   1.05x

Per-call numbers are noisy: call 3 is slower warm than cold on TTFT, and calls 3 and 5 through 8 are slower warm on wall. This was a single run on a GPU shared with the host, with decode variance from a thinking model. The aggregate is the meaningful signal.
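
As a cross-check, the headline TTFT totals can be recomputed from the per-call table. The snippet only re-adds the numbers listed above.

```python
# Per-call TTFT in milliseconds, copied from the table above.
cold_ms = [9640, 11647, 2231, 13245, 3136, 7219, 9580, 6279, 10347, 6579]
warm_ms = [828, 3393, 4885, 2128, 2106, 6615, 6234, 4969, 5256, 1607]

total_cold = sum(cold_ms) / 1000  # seconds
total_warm = sum(warm_ms) / 1000
speedup = total_cold / total_warm

print(f"{total_cold:.2f}s cold, {total_warm:.2f}s warm, {speedup:.2f}x")
# prints: 79.90s cold, 38.02s warm, 2.10x
```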

Three changes pushed in 7c182c9

While running it I hit three issues with v1; followed up with bench(prefix-cache): flatten tool I/O + warmup + system prompt:

  1. Tool I/O flattening. Your ChatRequest/ChatMessage schema requires content and ignores tool_calls / role=tool. v1 emitted structured tool messages and got a 422 on every call after the first. v2 flattens: tool_use → <tool_call name=X>{args}</tool_call> text inside the assistant content, tool_result → <tool_response id=X>...</tool_response> text inside the user content. Single message per turn, role in {system,user,assistant}. Token counts on the wire stay close to the original (which is what the cache cares about) and the chat template wraps each turn the same way. When the server gains tool support upstream, this can cleanly switch back to structured messages; happy to do that follow-up.
  2. Warmup call. First cold call was eating ~95s of one-time CUDA graph / kernel JIT cost. Now there's a discarded max_tokens=1 warmup before the timed loop on each server.
  3. System prompt restored. v1 dropped your default "You are a precise coding assistant..." system message. Restored at messages[0] for realistic prefix shape.

Also bumped default --n-gen 8 → 64 so a thinking model has room to emit a completion token after reasoning. Doesn't change any TTFT numbers but unlocks a real decode tok/s reading on non-thinking models.

Still draft pending your review of the approach.

easel and others added 2 commits April 30, 2026 14:26
…k fallback

Three follow-ups after running v2 end-to-end:

1. Point at server_tools.py instead of server.py. server_tools is the
   production agent-CLI endpoint and has the prefix cache fully wired
   in (lookup / prepare_inline_snap / confirm_inline_snap at all four
   /v1 handlers). server.py doesn't accept tool_calls in its schema,
   which led v2 to flatten tool I/O into <tool_call>/<tool_response>
   text — that worked but obscured what the bench was actually
   measuring.

2. Revert the flattening hack in _to_openai_messages. Now emits
   proper structured tool messages:
     tool_use   → assistant.tool_calls[].function.{name,arguments}
     tool_result → role="tool" message with tool_call_id
   server_tools accepts this directly (ChatMessage.content: Any | None,
   tool_calls + tool_call_id fields). What's on the wire matches what
   real OpenAI-compat agent CLIs send, so the bench measures the path
   that production traffic actually takes.

3. Token-count fallback in _stream_chat. PR Luce-Org#59's server does NOT
   honour stream_options.include_usage — no usage chunk is ever
   emitted on /v1/chat/completions. Without a fallback every call
   reports n_tok=0 even when 64 content deltas streamed. Now we
   prefer usage.completion_tokens when present, otherwise count
   content/reasoning/tool deltas as a proxy.
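
The fallback could be sketched like this. It is an illustration over raw SSE `data:` lines in the OpenAI-compatible chunk shape, not the bench's `_stream_chat`; the function name `count_completion_tokens` is hypothetical, and counting deltas is only a proxy because one delta may carry more than one token.

```python
import json

def count_completion_tokens(sse_lines):
    """Prefer usage.completion_tokens; otherwise count non-empty deltas."""
    usage_tokens = None
    delta_count = 0
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage")
        if usage and usage.get("completion_tokens") is not None:
            usage_tokens = usage["completion_tokens"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if (delta.get("content") or delta.get("reasoning_content")
                    or delta.get("tool_calls")):
                delta_count += 1
    return usage_tokens if usage_tokens is not None else delta_count
```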

System prompt and warmup-call still in place from the previous
commit; --n-gen default still 64.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained reproducer for the multi-slot inline-snap regression
in prefix_cache.py + qwen35_target_graph.cpp. No transcript dependency:
generates a 6-call growing-prefix sequence with synthetic pylint-style
tool results, runs against slots=0 (control) then slots=2 (repro),
prints per-call content/reasoning/finish + a side-by-side table.

Trigger profile: starting at the second multi-turn call (~5K char
prefix), warm responses become content_len=0 / comp_tok=0 /
finish_reason=stop; subsequent calls return in <50 ms, also empty.
Cold path on the same prompts produces 64 tokens per call.

Suppresses the GGML gdb-fork backtrace handler via GGML_NO_BACKTRACE=1
so the daemon log stays readable when ggml-cuda hits its
"device not ready" error path during the lazy snap-buffer alloc.

Usage:
  python3 dflash/scripts/repro_empty_response.py \
    --target /path/to/Qwen3.6-27B*.gguf \
    --draft  /path/to/qwen3.6-27b-dflash \
    --bin    /path/to/dflash/build/test_dflash \
    --server /path/to/dflash/scripts/server_tools.py

Exits 0 on confirmed repro, 1 if every warm call produced output
(threshold not reached — bump --n-turns / --tool-chars), 2 if cold
itself returned empty (different problem).
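
The exit-code contract above amounts to a three-way classification. A minimal sketch, assuming the script has collected per-call completion-token counts for the cold and warm runs (the function name `classify` and its inputs are illustrative):

```python
def classify(cold_tokens, warm_tokens):
    """Map per-call completion-token counts to the reproducer's exit code."""
    if any(n == 0 for n in cold_tokens):
        return 2  # cold itself returned empty: a different problem
    if any(n == 0 for n in warm_tokens):
        return 0  # repro confirmed: at least one warm call was empty
    return 1      # every warm call produced output: did not repro
```

The cold check comes first so that a broken control run is never reported as a successful repro.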

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel easel changed the title from "bench(prefix-cache): faithful Claude Code transcript replay + TTFT" to "bench: faithful Claude Code transcript replay + repro & fix for multi-slot empty-response bug" Apr 30, 2026
@easel easel force-pushed the bench-agent-loop-faithful-replay branch from 6d74d18 to 81e12e4 May 1, 2026 03:08
@easel easel changed the title from "bench: faithful Claude Code transcript replay + repro & fix for multi-slot empty-response bug" to "bench: faithful Claude Code transcript replay + repro & build-flag fix for empty-response bug" May 1, 2026
…odule

Bumps the llama.cpp submodule to a fix that addresses the prefix-cache
empty-response bug at its root: ggml-cuda's VMM allocator's pool
extension via cuMemSetAccess races with in-flight async work and
returns CUDA_ERROR_NOT_READY. The CU_CHECK macro hits GGML_ABORT but
the abort doesn't actually terminate, leaving the just-mapped region
without access permissions. Every subsequent read/write into that
region silently misbehaves — for the prefix cache, snapshots of KV
state get stored into the broken region and restore as zeroed/garbled
state, making the model emit 0 tokens with finish_reason=stop.

Manifests on PR Luce-Org#59's inline-snap path because it interleaves compute
with allocations on the same backend (snapshot copies during prefill
followed by gallocr / rollback / cache rebuild allocations). The fix
adds a cudaDeviceSynchronize before the cuMem* sequence in the pool
extension branch — only fires when the pool actually grows, so
steady-state hot-path allocations are unaffected.

llama.cpp PR: Luce-Org/llama.cpp-dflash-ggml#4

Submodule URL temporarily pointed at easel's fork (branch
fix/cuda-vmm-pool-extension-race) until the upstream PR merges. After
merge, revert .gitmodules to Luce-Org/llama.cpp.git@luce-dflash and
bump the submodule pointer to the merge commit.

Also bumps prefix_cache.startup_sync's await_reply timeout 10s → 60s
for daemons with multi-slot snap pools at large max-ctx.

Validated on RTX 3090 Ti, CUDA 13.2, Qwen 3.6 27B, max-ctx=24576,
slots=2:

  session       turns   TTFT cold  TTFT warm  TTFT x  wall x  empties
  lucebox-hub      10      40.0s      39.9s    1.00x   0.77x   0/10
  nexiq-small       6      55.9s      44.9s    1.24x   0.75x    0/6
  axon-med         10     133.8s      51.7s    2.59x   1.27x   0/10
  helix-large      10     242.1s      97.3s    2.49x   1.52x   0/10

36/36 warm calls produce real content. Cache delivers 2.49–2.59x TTFT
speedup on long agentic prefixes (38K–70K chars) — the headline win
this PR set out to validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@easel easel force-pushed the bench-agent-loop-faithful-replay branch from 81e12e4 to dedaf69 May 1, 2026 03:44
@dusterbloom dusterbloom force-pushed the feat/prefix-cache branch from b597e8f to 7e143f9 May 1, 2026 16:28
@davide221 davide221 force-pushed the feat/prefix-cache branch from e216877 to cbdf9c8 May 3, 2026 11:25
dusterbloom added a commit that referenced this pull request May 9, 2026
…s layers

Cross-attention with TQ3_0 KV cache produced accept_rate=0 because
three separate issues compounded:

1. K/V views were cast from TQ3_0 to F16/F32 before ggml_flash_attn_ext.
   The CUDA FA kernels apply forward FWHT to Q (and inverse FWHT to
   the output) only when they observe K->type == GGML_TYPE_TQ3_0
   (fattn-chunked.cu:228,394; fattn-vec.cuh:168). Casting stripped the
   type tag, FA picked a non-WHT kernel, and Q (real domain) dotted
   with K (FWHT domain, just unpacked into F16) produced meaningless
   scores. Removed the cast; Kfa/Vfa now reach FA with native TQ3_0.

2. TQ3_0 K is iterated in 128-element block strides; an unaligned
   ne[1] reads past the valid window into stale cache cells.
   Previously we only padded for head_dim>=512; SWA layers
   (head_dim=256) skipped padding and silently corrupted attention.
   Extended needs_kv_pad to fire for any TQ3_0 cache, mirroring
   gemma4_target_graph.cpp's need_256_pad policy.

3. Each layer created its own FA mask input tensor but only the last
   one was exposed via out.fa_mask. After fix #2 all four layers
   needed masks; the unfilled mask buffers contained uninitialised
   CUDA memory (cudaMalloc is not zeroed), causing NaN logits on
   subsequent steps. Hoisted a single shared mask out of the
   per-layer loop. The builder now asserts that all need-mask layers
   want the same (width, kv_seq_len) and fails loudly if a future
   long-context build wants per-layer masks (SWA cap < full
   attn_pos), instead of silently doing the wrong thing.

Trajectory:
  pre-fix:      accept_rate = 0.00 (varying garbage tokens)
  fix #1 only:  accept_rate = 0.00 (drafts pinned to a single token)
  fix #1+#2:    step 1 OK, step 2+ NaN
  fix #1+#2+#3: accept_rate = 0.22 (Q4_K_M target + Q8_0 assistant,
                TQ3_0 KV, 131-token prompt, 64 generation steps)

Adjacent infrastructure:
- create_gemma4_cache(): extra_q8_layers param to force Q8_0 on
  specific MTP donor layers when needed.
- get_mtp_swa_pattern(): lightweight helper reading MTP SWA layout
  from GGUF without loading tensors.
- MTP loader: load centroids/token_ordering whenever n_centroids>0
  (graph builder decides whether to use them).
- Test caller: fills out.fa_mask before each compute; dropped the
  per-step diagnostic prints that are no longer needed.

Known follow-ups (not blocking):
- Long-context multi-mask: SWA cap < full attn_pos trips the assert.
- SWA-wrap branch concat-forces F32 on TQ3_0, losing the WHT path.
- Accept rate 0.22 is in expected range; remaining gap to spike's
  reference numbers may come from quantization, RoPE source, or
  attention scale.
dusterbloom added a commit that referenced this pull request May 16, 2026
…ync stream)

#1: recommit verify wrote accept_n+2 KV slots vs fast-path's accept_n+1;
    runner's base_pos += accept_n+2 then re-verified bonus at its own slot,
    skipping every Dth token at γ≥3 partial-accept (reproduced AR/spec
    divergence in new test_recommit_byte_identical_to_ar).
#2: capture_topology_for_chain() virtual; runner owns the call.
    verify_batch no longer mutates last_tree_*.
#3: dedicated rollback CUDA stream + cudaMemcpy2DAsync batching in
    restore_kv_at_dfs (4× fewer launches per layer).

Bug #5 (step_sg_cache O(n_ctx)) deferred; needs ggml_set_rows refactor.
dusterbloom added a commit that referenced this pull request May 18, 2026
…and_decode (step 3.1)

Rebase of the MTP-via-daemon work onto latest main (PRs Luce-Org#213, Luce-Org#210, Luce-Org#208,
Luce-Org#207 already merged) plus the first slice of howard0su's PR Luce-Org#214 review
request: move MTP orchestration into dflash/src/common/ behind a generic
entry point any ModelBackend can call.

## What landed

### Foundation (rebase port, ~5k LOC)

- `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC) — Qwen3.6 native-heads
  MTP module (Qwen36MtpModule, implements INativeMtp)
- `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}` — MTP head forward graph
- `dflash/src/qwen36/qwen36_mtp_loader.cpp` — NextN tensor loader from GGUF
- `dflash/src/common/mtp_interface.h` — abstract IMtpModule + flavor mixins
- `dflash/src/common/mtp_chain_runner.{cpp,h}` — generic γ-loop runner
- `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h`
  + `attn_masks.h` + `dflash_target.h` updates: shared infrastructure
- `dflash/src/qwen35/qwen35_backend.{cpp,h}` — extended with optional
  Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_,
  do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3)
- `dflash/src/qwen35/qwen35_daemon.{cpp,h}` — DaemonArgs carry MTP fields
- `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp`
  — hidden-sequence capture path for MTP head warming
- `dflash/test/test_dflash.cpp` — daemon dispatch routes
  `--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved)
- `dflash/scripts/server.py` — `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source`
  CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's
  thinking-default fixes (commit 998b280) without conflict

### Step 3.1 — common::mtp::warm_and_decode entry point (TDD red→green)

Howard's review:
> "MTP should be simple as additional weights of modelbackend. If a model
>  contains MTP support (gemma4 or qwen3.5), the logic can handle it. In
>  other words, the logic should be in /common which can potentially
>  leverage by any modelbackend if they support mtp."

Carved out the public surface for the future orchestrator:

  GenerateResult dflash27b::common::mtp::warm_and_decode(
      ModelBackend * backend, const GenerateRequest & req, const DaemonIO & io);

New files:
- `dflash/src/common/mtp_orchestrator.{cpp,h}` — header pins the signature,
  cpp is a minimal stub that only handles guard cases (null backend, no
  MTP support, empty prompt). Real warm + decode body lands in step 3.2,
  driven by additional red→green tests.
- `dflash/test/test_common_mtp_orchestrator.cpp` — three guard tests
  written and watched fail BEFORE the stub existed (compile-time RED:
  "common/mtp_orchestrator.h: No such file or directory"), then GREEN
  after the stub returned matching error strings.

Test results:
  T1 null_backend PASS
  T2 backend_without_mtp PASS
  T3 empty_prompt PASS
  ALL PASS
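
The guard-only control flow of the stub can be modelled in a few lines. The sketch below is a Python model of the idea, not the C++ in `mtp_orchestrator.cpp`: the `GenerateResult` fields, the `FakeBackend` class, the order of the checks, and every error string are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GenerateResult:
    # Hypothetical shape; the real GenerateResult is a C++ type.
    ok: bool
    error: str = ""
    tokens: List[int] = field(default_factory=list)


@dataclass
class FakeBackend:
    # Hypothetical stand-in for a ModelBackend that may or may not carry MTP weights.
    supports_mtp: bool


def warm_and_decode(backend: Optional[FakeBackend], prompt: List[int]) -> GenerateResult:
    """Guard-only stub: reject bad inputs, leave the real work for a later step."""
    if backend is None:
        return GenerateResult(ok=False, error="null backend")
    if not backend.supports_mtp:
        return GenerateResult(ok=False, error="backend has no MTP support")
    if not prompt:
        return GenerateResult(ok=False, error="empty prompt")
    # The warm + decode body would go here (step 3.2 in the plan).
    return GenerateResult(ok=False, error="not implemented")


t1 = warm_and_decode(None, [1, 2, 3])
t2 = warm_and_decode(FakeBackend(supports_mtp=False), [1, 2, 3])
t3 = warm_and_decode(FakeBackend(supports_mtp=True), [])
```

Each of `t1`, `t2`, and `t3` fails with a distinct error string, which is the property the three guard tests check.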

## Steps 3.2-3.5 (separate commits, this PR)

3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch
    + hidden capture + MtpChainRunner.run); red test = identical token IDs
    vs reference run_qwen36_mtp_harness on a fixed prompt.
3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to
    common::mtp::warm_and_decode; delete the qwen35-local helpers.
3.4 stub Gemma4Backend MTP override using the same common entry point to
    prove the interface is generic (not Qwen35-specific).
3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled
    CPU loops; replace with ggml primitives per howard's point #1.

Then retest 24K baseline post-RoPE-fix (howard's other comment) and update
PR description with current numbers.

Addresses:
- davide221 Luce-Org#214#issuecomment-4472910706 (merge conflicts) — rebased
- howard0su Luce-Org#214#review (changes requested points 2, 3, 4) — first slice
dusterbloom added a commit that referenced this pull request May 23, 2026
… (seeds #1, #2)

- StubServer: ThreadingHTTPServer recorder, zero new deps (mirrors llamacpp_compat_proxy.py pattern)
- Seed #2 green: proxy injects session_id on /v1/messages, preserves existing, passes through GET
- Seed #1 documented: chat/completions round-trip passes; injection assertion commented out pending commit 3
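
A recorder of this kind needs only the standard library. The sketch below is one possible shape, not the StubServer in this commit: the class layout, the `requests` attribute, and the empty-JSON reply are assumptions. It binds to port 0 so the OS picks a free port, and the client bypasses any proxy configured in the environment.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class StubServer:
    """Hypothetical recorder: stores every request and replies with an empty JSON object."""

    def __init__(self) -> None:
        self.requests = []  # (method, path, parsed JSON body or None)
        recorder = self

        class Handler(BaseHTTPRequestHandler):
            def _record(self) -> None:
                length = int(self.headers.get("Content-Length") or 0)
                raw = self.rfile.read(length) if length else b""
                recorder.requests.append(
                    (self.command, self.path, json.loads(raw) if raw else None)
                )
                payload = b"{}"
                self.send_response(200)
                self.send_header("Content-Type", "application/json")
                self.send_header("Content-Length", str(len(payload)))
                self.end_headers()
                self.wfile.write(payload)

            do_GET = do_POST = _record

            def log_message(self, *args) -> None:
                pass  # keep test output quiet

        # Port 0 asks the OS for any free port.
        self._httpd = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
        self.port = self._httpd.server_address[1]
        self._thread = threading.Thread(target=self._httpd.serve_forever, daemon=True)

    def __enter__(self) -> "StubServer":
        self._thread.start()
        return self

    def __exit__(self, *exc) -> None:
        self._httpd.shutdown()
        self._httpd.server_close()
        self._thread.join()


# Usage: send one POST through a proxy-free opener and inspect what was recorded.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
with StubServer() as stub:
    req = urllib.request.Request(
        f"http://127.0.0.1:{stub.port}/v1/messages",
        data=json.dumps({"model": "m"}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    opener.open(req).read()
    recorded = list(stub.requests)
```

Because the handler appends to `requests` before it writes the response, the recording is visible as soon as the client call returns.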
dusterbloom added a commit that referenced this pull request May 23, 2026
…#1)

- Add INJECT_ROUTES frozenset: /v1/messages, /v1/chat/completions, /v1/responses
- do_POST checks route_base in INJECT_ROUTES (query-string-safe)
- Seed #1 green: chat/completions round-trip injects session_id
- Add /v1/responses injection test (codex route)
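
The route check and injection rule can be sketched as a pure function. This is an illustration under stated assumptions, not the proxy's code: the helper name `maybe_inject_session_id`, the use of `urlsplit` to derive `route_base`, and the placement of `session_id` as a top-level JSON field are all hypothetical.

```python
import json
from urllib.parse import urlsplit

INJECT_ROUTES = frozenset({"/v1/messages", "/v1/chat/completions", "/v1/responses"})


def maybe_inject_session_id(method: str, raw_path: str, body: bytes, session_id: str) -> bytes:
    """Return the body to forward upstream, adding session_id where the rules allow."""
    if method != "POST":
        return body  # GET and other methods pass through untouched
    # Strip the query string so "/v1/messages?beta=1" still matches.
    route_base = urlsplit(raw_path).path
    if route_base not in INJECT_ROUTES:
        return body
    payload = json.loads(body)
    if not isinstance(payload, dict):
        return body  # nothing sensible to inject into
    # setdefault keeps a session_id the client already supplied.
    payload.setdefault("session_id", session_id)
    return json.dumps(payload).encode()


injected = json.loads(
    maybe_inject_session_id("POST", "/v1/chat/completions?beta=1", b'{"model": "m"}', "s1")
)
preserved = json.loads(
    maybe_inject_session_id("POST", "/v1/responses", b'{"session_id": "mine"}', "s1")
)
untouched = maybe_inject_session_id("POST", "/v1/models", b'{"model": "m"}', "s1")
```

Matching on the path component rather than the raw request line is what makes the check query-string-safe.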