bench: faithful Claude Code transcript replay + repro & build-flag fix for empty-response bug #1
Adds a snapshot/restore mechanism so the C++ daemon can preserve target
KV + SSM/conv + target_feat state across HTTP requests. Subsequent turns
in an agent loop that share a system prompt skip the system-prefill cost
(previously paid in full on every turn since the daemon called
free_target_cache + create_target_cache between requests).
C++ side
--------
- New PrefixSnapshot struct (internal.h): owns its own ggml_context +
backend buffer, holds slim KV per layer + SSM/conv/target_feat per
layer + cur_pos + last_tok + kv_k_type + max_ctx for sanity checks.
Skips ssm_intermediate / conv_input_cache (within-decode rollback
buffers, regenerated on first decode step after restore).
- snapshot_target_cache, restore_target_cache, free_prefix_snapshot in
qwen35_target_graph.cpp using ggml_backend_tensor_copy. Lazy alloc
(first SNAPSHOT call), reuse on subsequent refreshes.
- TargetCache gains a last_tok field, used solely by the prefix-cache
bridge: when restored cur_pos == prompt_len the prefill loop runs
zero iterations and the decode seed comes from the restored last_tok.
Daemon protocol (test_dflash.cpp)
---------------------------------
- Adds 4 new commands on stdin, dispatched before the legacy bare prompt
line: SNAPSHOT N, RESTORE N <prompt> <n_gen>, FREE_SNAPSHOT N,
LIST_SLOTS. Replies on stdout: [snap] slot=N cur_pos=P /
[snap] freed slot=N / [snap] slots=A,B,C.
- prefill loop reads from cache.cur_pos as start (0 for fresh, >0 after
restore). Restored cache + matching-length prompt -> zero-iter prefill,
decode seeds from cache.last_tok.
- Hard cap of PREFIX_CACHE_SLOTS = 8 in the daemon.
- End-of-iteration writes cache.cur_pos = out_all.size() and
cache.last_tok so the next SNAPSHOT command captures correct boundary.
- Frees all snapshot slots on daemon exit.
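
A minimal sketch of how a client could classify the three `[snap]` reply shapes listed above. The function name `parse_snap_reply` and the regexes are illustrative assumptions, not code from test_dflash.cpp or prefix_cache.py; only the reply text is taken from this commit message.

```python
import re

# Reply shapes quoted in the commit message; the patterns are an assumption.
_SNAP = re.compile(r"^\[snap\] slot=(\d+) cur_pos=(\d+)$")
_FREED = re.compile(r"^\[snap\] freed slot=(\d+)$")
_SLOTS = re.compile(r"^\[snap\] slots=([\d,]*)$")


def parse_snap_reply(line: str):
    """Classify one daemon stdout line; return None for non-protocol lines."""
    line = line.strip()
    m = _SNAP.match(line)
    if m:
        return ("snapshot", int(m.group(1)), int(m.group(2)))
    m = _FREED.match(line)
    if m:
        return ("freed", int(m.group(1)))
    m = _SLOTS.match(line)
    if m:
        # An empty list means no slots are currently occupied.
        return ("slots", [int(s) for s in m.group(1).split(",") if s])
    return None
```

Lines such as `[step]` or `[timing]` logs fall through to `None`, which is roughly what a stdout router needs in order to suppress them.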
Python side
-----------
- New scripts/prefix_cache.py:
* DaemonStdoutBus owns the stdout read loop, routes [snap]-prefixed
lines to waiting coroutines, suppresses noisy [step]/[timing] logs.
* PrefixCache stores hash -> slot_id LRU. lookup() returns
(slot_id, prefix_len) or None. maybe_snapshot() does a SECOND
n_gen=0 prefill of the prefix-only tokens, then SNAPSHOT — this
aligns the snapshot's cur_pos exactly with the cache key's prefix
length (one extra system prefill on cold turns, recovered many
times over on subsequent warm turns).
* find_prefix_boundary auto-detects the FIRST end-of-system-message
boundary in Qwen chat templates, allowing one intervening newline
token between im_end and im_start.
* hash_prefix uses SHA-1 truncated to 16 bytes over (token ids,
kv_k_type, fa_window).
* DAEMON_MAX_SLOTS = 8 clamp; cap > limit emits a warning.
- server.py + server_tools.py:
* --prefix-cache-slots N CLI flag (default 4, 0 disables).
* Daemon spawn now uses stdout=PIPE so DaemonStdoutBus can route
protocol replies.
* Resolve effective KV-K type + fa_window from DFLASH27B_* env vars
at daemon spawn time (mirrors C++ daemon's env parsing) and pass
into PrefixCache so they're part of the hash key — daemon restart
with different flags can't return stale state.
* 4 lookup/maybe_snapshot call sites per file (stream + non-stream
for /v1/chat/completions and /v1/messages). On miss send the bare
prompt line, then maybe_snapshot drains via _drain_pipe_to_sentinel
helper so the next protocol command is clean.
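
A rough sketch of the cache key and the hash -> slot_id LRU described above. The byte serialisation inside `hash_prefix` and the `SlotLRU` class are assumptions made for illustration; the commit message only states that SHA-1 is truncated to 16 bytes over (token ids, kv_k_type, fa_window) and that slots are tracked in an LRU.

```python
import hashlib
import struct
from collections import OrderedDict


def hash_prefix(token_ids, kv_k_type: str, fa_window: int) -> bytes:
    """SHA-1 over (token ids, kv_k_type, fa_window), truncated to 16 bytes."""
    h = hashlib.sha1()
    # Little-endian int32 per token id (assumed encoding).
    h.update(struct.pack(f"<{len(token_ids)}i", *token_ids))
    h.update(kv_k_type.encode("utf-8"))
    h.update(struct.pack("<i", fa_window))
    return h.digest()[:16]


class SlotLRU:
    """Hypothetical hash -> slot_id map that reuses the oldest slot when full."""

    def __init__(self, cap: int):
        self.cap = cap
        self._slots = OrderedDict()  # key -> slot id, least recently used first

    def lookup(self, key):
        if key not in self._slots:
            return None
        self._slots.move_to_end(key)  # touch on hit
        return self._slots[key]

    def insert(self, key) -> int:
        if key in self._slots:
            self._slots.move_to_end(key)
            return self._slots[key]
        if len(self._slots) >= self.cap:
            _, slot = self._slots.popitem(last=False)  # evict LRU, reuse its slot
        else:
            slot = len(self._slots)
        self._slots[key] = slot
        return slot
```

Including kv_k_type and fa_window in the key is what keeps a daemon restarted with different flags from matching a stale entry.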
Verification
------------
- nm: new symbols snapshot_target_cache, restore_target_cache,
free_prefix_snapshot in libdflash27b.a.
- C++ smoke (manual /tmp/smoke_restore.py):
cold prompt n_gen=8 -> [a,b,c,d,e,f,g,h]
cold same prompt n_gen=4 + SNAPSHOT 0 -> shared_4 = [a,b,c,d]
RESTORE 0 + n_gen=4 -> warm_4 = [e,f,g,h]
byte-equal continuation.
- End-to-end (test_server_prefix_cache.py): 5K-token system prompt,
three turns at max_tokens=8.
turn_1 9.87s (cold + snapshot warm-up)
turn_2 0.48s ratio_2/1 = 0.05
turn_3 0.44s ratio_3/1 = 0.04
All replies non-empty and consistent. ~20x speedup on warm turns.
Reviewed by codex; this commit incorporates the two correctness
fixes flagged: hash inputs now use real env-var-derived values
instead of hardcoded "q8_0"/2048 literals, and Python cap is clamped
to the daemon's PREFIX_CACHE_SLOTS = 8 hard limit so configurations
above it can't cause silent SNAPSHOT failures. The third codex
finding (boundary detector won't handle tool-definition preambles
or multi-segment system messages) is documented as a follow-up
under server_tools.py — current detector covers the simple Qwen
system+user case; tool-using clients fall back to no-cache silently.
Plan file: ~/.claude/plans/yes-please-plan-for-luminous-pudding.md
Phase A (~1 week scope) of a 4-phase agentic-friendly KV/state plan.
Phase B (block-chain mid-conversation cache), Phase C (sliding KV
growth), Phase E (tool-loop incremental tokenization) are deferred
to follow-up commits.
Extends Phase A's single-point system-prompt cache to multi-slot LRU
that snapshots at every chat-template role boundary, so multi-turn
agent loops hit progressively deeper cached state on each new turn.
C++ side (B.1 + B.2)
--------------------
- PrefixSnapshot gains is_thin / kv_start / kv_end fields and two new
primitives: snapshot_target_cache_thin and restore_target_cache_chain.
Thin snapshots capture only KV slice [kv_start, kv_end); chain restore
loads a thick base then layers thins. Implemented via per-strip H2D+D2H
staging since ggml_backend_tensor_copy refuses views with mismatched
layouts (verified by spike_thin_copy.cpp on Q8_0 / TQ3_0 / F16).
- Daemon protocol: SNAPSHOT_THIN N kv_start kv_end and RESTORE_CHAIN
thick_slot thin_slots prompt_file n_gen. The thin/chain primitives
remain unused by Phase B's actual flow (see "design pivot" below) but
are kept for future block-chain extensions.
Design pivot
------------
Original plan called for a thick-anchor + thin-chain cache. On
implementation it became clear that thin snapshots only capture KV;
SSM/conv state can't be reconstructed from KV alone (DeltaNet recurrence
is non-replayable without re-running prefill). A chain restore would
land at the thick's cur_pos with valid SSM, then need DeltaNet replay
through the thin range — defeating the savings.
Pivoted to a simpler "multi-slot THICK LRU" design that delivers the
same user-visible win: cache full state at multiple block boundaries,
restore the deepest matching THICK on lookup, prefill only the new
suffix. Memory cost (4 thick slots × ~244 MB ≈ 1 GB) matches what the
thick+thin chain would have used.
Python side (B.3 + B.4)
-----------------------
- find_all_boundaries enumerates every <|im_end|><|im_start|> boundary
after the system marker (allows up to 2 intervening tokens to handle
the newline separator Qwen emits).
- PrefixCache.lookup walks all candidate cuts and returns the deepest
cached match (longest-prefix); LRU touched on every hit.
- PrefixCache.maybe_snapshot iterates ALL boundaries on cache miss and
snapshots each that's not already cached, evicting LRU when over cap.
- Each snapshot still uses Phase A's n_gen=0 prefill + SNAPSHOT pattern
to land at the exact boundary cur_pos. Multi-snapshot increases
cold-turn latency proportionally (e.g. 5-turn test: turn 1 13.5 s vs
Phase A's ~10 s), but turns 2-5 all benefit.
- server.py / server_tools.py: zero changes — API surface stayed the
same (lookup returns (slot, prefix_len) or None).
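
The boundary walk and longest-prefix lookup can be sketched roughly as follows. The token ids, the choice to cut just before `<|im_start|>`, and the dict-based cache are assumptions for illustration; the sketch also treats the first `<|im_end|>` as the end of the system message rather than locating the system marker explicitly.

```python
def find_all_boundaries(tokens, im_end, im_start, max_gap=2):
    """Return prefix lengths ending just before each im_start that follows an im_end.

    Up to max_gap intervening tokens (e.g. a newline) are tolerated.
    """
    cuts = []
    for i, tok in enumerate(tokens):
        if tok != im_end:
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j] == im_start:
                if not cuts or cuts[-1] != j:  # avoid duplicate cuts
                    cuts.append(j)
                break
    return cuts


def lookup_deepest(tokens, cuts, cached):
    """Return (slot, prefix_len) for the longest cached prefix, else None."""
    for cut in sorted(cuts, reverse=True):
        key = tuple(tokens[:cut])
        if key in cached:
            return cached[key], cut
    return None
```

With hypothetical ids `IM_START=1`, `IM_END=2`, newline `3`, the sequence `[1, 10, 11, 2, 3, 1, 20, 2, 3, 1, 30]` yields cuts at 5 and 9, and a lookup prefers the cut at 9 when both are cached.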
Tests (B.5)
-----------
- spike_thin_copy.cpp validates the per-strip staging-copy approach
used by snapshot_target_cache_thin (works on Q8_0, TQ3_0, F16).
- test_multi_turn_prefix_cache.py: 5-turn agent loop, ~2K-token system
prompt, growing history. RTX 3090 + Qwen3.6-27B-Q4_K_XL:
turn 1 13.53 s (cold + multi-snapshot warm-up)
turn 2 0.55 s ratio 0.04
turn 3 0.70 s ratio 0.05
turn 4 0.85 s ratio 0.06
turn 5 1.23 s ratio 0.09
All warm turns < 30 % of cold turn 1; turn 5 still 11x faster than
turn 1.
- Existing test_server_prefix_cache.py (3-turn shared system prompt)
remains green: turn 2/3 at 3 % of turn 1.
The Codex review findings on Phase A's hardcoded hash inputs and slot-cap
mismatch were addressed in the Phase A commit (e429894). Codex's third
finding (boundary detector won't handle tool-definition preambles in
server_tools.py) is still open and tracked as a follow-up; the new
find_all_boundaries inherits that limitation.
Bench branch: feat/prefix-cache (cumulative Phase A + B). Plan files at
~/.claude/plans/yes-please-plan-for-luminous-pudding.md (Phase A) and
~/.claude/plans/phase-b-block-chain-cache.md (Phase B, including the
design pivot rationale).
prepare_inline_snap was popping the LRU entry up-front so the daemon
could overwrite that slot. If the request aborted before
confirm_inline_snap ran, the old entry was already gone AND the new one
was never registered, stranding a daemon slot until process restart.
Reserve the slot via _pending_evict_key without removing the old entry;
pop + insert atomically in confirm_inline_snap. Add abort_inline_snap
for explicit cancellation.

Also adds bench_agent_loop.py, which replays real Claude Code session
JSONL turns through the dflash server with prefix-cache off vs on. On 5
short real-session turns: turn-1 6.28x (page cache + warmup), turns 2-5
~equal because real-session prompts are too short for prefix-cache to
dominate. The synthetic 2K-system test
(test_multi_turn_prefix_cache.py) is where the cache actually wins.

Both issues were raised in the codex review of the Phase B + B.7 + B.8
work; the High one (last_tok=-1 after no-op restore prefill) was already
covered by the cache.last_tok bridge added earlier.
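
A minimal sketch of the reserve-then-commit bookkeeping described in this commit. The class `InlineSnapSlots` and its fields (other than `_pending_evict_key`, which the message names) are hypothetical; it models only the Python-side state, assumes `cap >= 1`, and assumes an aborted request leaves the daemon slot contents untouched.

```python
from collections import OrderedDict


class InlineSnapSlots:
    def __init__(self, cap: int):
        self.cap = cap
        self.entries = OrderedDict()  # key -> slot id, least recently used first
        self._pending_slot = None
        self._pending_evict_key = None

    def prepare_inline_snap(self) -> int:
        """Reserve a slot for the daemon without dropping any entry yet."""
        if len(self.entries) < self.cap:
            used = set(self.entries.values())
            self._pending_slot = next(s for s in range(self.cap) if s not in used)
            self._pending_evict_key = None
        else:
            # Remember which entry would be evicted, but keep it registered.
            self._pending_evict_key = next(iter(self.entries))
            self._pending_slot = self.entries[self._pending_evict_key]
        return self._pending_slot

    def confirm_inline_snap(self, new_key) -> None:
        """Pop the reserved victim and register the new key in one step."""
        if self._pending_slot is None:
            raise RuntimeError("confirm_inline_snap without prepare_inline_snap")
        if self._pending_evict_key is not None:
            self.entries.pop(self._pending_evict_key, None)
        self.entries[new_key] = self._pending_slot
        self.abort_inline_snap()

    def abort_inline_snap(self) -> None:
        """Cancel the reservation; existing entries stay as they were."""
        self._pending_slot = None
        self._pending_evict_key = None
```

The point of the split is that an abort between prepare and confirm leaves `entries` unchanged, so no slot is stranded.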
The current bench reads only `type=user` records where `content` is a
`str`. In a real Claude Code transcript every user record after the
first is a list of `tool_result` blocks with `content` as a `list` —
all silently skipped. The bench replays only typed human prompts with
bench-synthesised assistant replies in between, with no tool I/O.
Tool I/O is the bulk of an agentic prefix: typical prefix grows from
~5K chars at turn 1 to 60-300K by turn 30. Validated against a real
session (32 assistant turns, ~95K chars at the last call): the old
loader extracts 7 typed-user turns; the new loader walks all 32 call
points with faithful prefix growth. The PR's own commit message
("Real-session prompts are too short for the cache to dominate; cold
and warm turns 2-5 are within noise") is a measurement artefact of
the loader, not a property of the cache.
Replace `extract_user_turns()` with a transcript loader that:
1. Coalesces consecutive same-role records into single turns (one
logical LLM turn = N JSONL rows).
2. Converts Anthropic blocks → OpenAI messages: text → content,
tool_use → assistant.tool_calls, tool_result → tool message,
thinking dropped.
3. At each assistant index, sends the exact prefix that was sent at
that point (system + everything before this assistant turn, tool
I/O included) and advances state via the recorded assistant turn,
not a bench-synthesised one.
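
The three loader steps can be sketched as below. This is an illustrative approximation, not the bench's `_load_transcript` / `_to_openai_messages`: the function names are hypothetical, and records are assumed to have already been reduced to `(role, content)` pairs, where `content` is either a string or a list of Anthropic-style blocks.

```python
import json


def coalesce(records):
    """Step 1: merge consecutive same-role records into one turn of blocks."""
    turns = []
    for role, content in records:
        blocks = ([{"type": "text", "text": content}]
                  if isinstance(content, str) else list(content))
        if turns and turns[-1][0] == role:
            turns[-1][1].extend(blocks)
        else:
            turns.append((role, blocks))
    return turns


def to_openai_messages(turns):
    """Step 2: Anthropic blocks -> OpenAI messages; thinking blocks are dropped."""
    msgs = []
    for role, blocks in turns:
        text, calls = [], []
        for b in blocks:
            kind = b.get("type")
            if kind == "text":
                text.append(b["text"])
            elif kind == "tool_use":
                calls.append({"id": b["id"], "type": "function",
                              "function": {"name": b["name"],
                                           "arguments": json.dumps(b["input"])}})
            elif kind == "tool_result":
                body = b.get("content", "")
                if isinstance(body, list):
                    body = "".join(p.get("text", "") for p in body
                                   if isinstance(p, dict))
                msgs.append({"role": "tool",
                             "tool_call_id": b["tool_use_id"], "content": body})
        if role == "assistant" and (text or calls):
            msg = {"role": "assistant", "content": "\n".join(text) or None}
            if calls:
                msg["tool_calls"] = calls
            msgs.append(msg)
        elif role != "assistant" and text:
            msgs.append({"role": role, "content": "\n".join(text)})
    return msgs


def prefixes_at_assistant_turns(system, msgs):
    """Step 3: for each assistant message, yield everything sent before it."""
    for i, m in enumerate(msgs):
        if m["role"] == "assistant":
            yield i, [{"role": "system", "content": system}] + msgs[:i]
```

Replaying then means timing one request per yielded prefix and moving on using the recorded assistant message rather than the model's fresh reply.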
Also switch the chat call to streaming SSE so we measure TTFT
separately from total wall — TTFT is what the prefix cache
accelerates, total wall mixes prefill speedup with decode rate.
Preserved: cold (slots=0) vs warm (slots=N) dual-server structure,
--turns / --session interface, per-turn ratio table (now with TTFT
and wall columns).
Default session dir derives the workspace from CWD (replace `/` with
`-`) instead of hardcoding `-home-peppi-Dev-lucebox-hub`.
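
For illustration, the CWD-to-slug derivation amounts to something like the following. The function name and the `~/.claude/projects` root are assumptions; only the `/` to `-` replacement comes from this commit message.

```python
from pathlib import Path


def default_session_dir(cwd: str, root: str = "~/.claude/projects") -> Path:
    """Map a working directory to its session directory (root is assumed)."""
    slug = cwd.replace("/", "-")
    return Path(root).expanduser() / slug
```

Under this rule, `/home/peppi/Dev/lucebox-hub` maps to the slug `-home-peppi-Dev-lucebox-hub`, which is the value that used to be hardcoded.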
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four follow-ups after running v1 against the PR's server:
1. The server's ChatRequest/ChatMessage schema requires `content` and
ignores `tool_calls`/role=tool, so emitting structured tool messages
produces 422s on every call after the first. Flatten tool_use →
`<tool_call name=X>{json args}</tool_call>` text in assistant
content; tool_result → `<tool_response id=X>...</tool_response>`
text in user content. One message per turn, role in
{system,user,assistant}, content always a string. Same on-wire
prefix bytes (which is what the cache cares about), runs cleanly
against the PR's server.
2. Add a discarded warmup call before the timed loop. Without this
the first cold call eats ~95s of CUDA graph capture / kernel JIT
one-time cost and dominates the totals.
3. Restore PR Luce-Org#59's `"You are a precise coding assistant..."` system
prompt at message[0]. Realistic shape, deterministic prefix.
4. Bump default --n-gen 8 → 64. Qwen 3.6 is a thinking model that
spends tokens in `reasoning_content`; 8 was too tight to ever emit
a completion token. (Headline metric is TTFT regardless, but a
non-zero n_tok lets us report decode tok/s when present.)
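
A small sketch of the flattening from item 1, assuming Anthropic-style block dicts. The function name `flatten_blocks`, the newline join, and the use of `tool_use_id` as the `id=X` value are assumptions; the tag formats are the ones quoted above.

```python
import json


def flatten_blocks(blocks) -> str:
    """Render one turn's blocks as a single string; thinking blocks are dropped."""
    parts = []
    for b in blocks:
        kind = b.get("type")
        if kind == "text":
            parts.append(b["text"])
        elif kind == "tool_use":
            args = json.dumps(b["input"])
            parts.append(f"<tool_call name={b['name']}>{args}</tool_call>")
        elif kind == "tool_result":
            body = b.get("content", "")
            if isinstance(body, list):
                body = "".join(p.get("text", "") for p in body
                               if isinstance(p, dict))
            parts.append(
                f"<tool_response id={b['tool_use_id']}>{body}</tool_response>")
    return "\n".join(parts)
```

Because every message ends up with a plain string `content`, the output fits a schema that rejects `tool_calls` and role=tool.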
Numbers from a 10-turn replay of a real session
(15 → 11,350 chars, RTX 3090 Ti, Q4_K_M):
total TTFT: 79.90s cold → 38.02s warm = 2.10x
total wall: 97.62s cold → 93.30s warm = 1.05x (decode-bound)
best call: 6.58s TTFT cold → 1.61s warm = 4.09x at 11K-char prefix
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran the bench against PR Luce-Org#59's server. Real session, 10 turns, RTX 3090 Ti, Qwen 3.6 27B Q4_K_M,

Headline

Wall barely moves because Qwen 3.6 is a thinking model: with

Per-call

Per-call is noisy (calls 3 and 5 are slower warm than cold): single run, shared GPU with the host, decode variance from a thinking model. Aggregate is the meaningful signal.

Three changes pushed in
…k fallback
Three follow-ups after running v2 end-to-end:
1. Point at server_tools.py instead of server.py. server_tools is the
production agent-CLI endpoint and has the prefix cache fully wired
in (lookup / prepare_inline_snap / confirm_inline_snap at all four
/v1 handlers). server.py doesn't accept tool_calls in its schema,
which led v2 to flatten tool I/O into <tool_call>/<tool_response>
text — that worked but obscured what the bench was actually
measuring.
2. Revert the flattening hack in _to_openai_messages. Now emits
proper structured tool messages:
tool_use → assistant.tool_calls[].function.{name,arguments}
tool_result → role="tool" message with tool_call_id
server_tools accepts this directly (ChatMessage.content: Any | None,
tool_calls + tool_call_id fields). What's on the wire matches what
real OpenAI-compat agent CLIs send, so the bench measures the path
that production traffic actually takes.
3. Token-count fallback in _stream_chat. PR Luce-Org#59's server does NOT
honour stream_options.include_usage — no usage chunk is ever
emitted on /v1/chat/completions. Without a fallback every call
reports n_tok=0 even when 64 content deltas streamed. Now we
prefer usage.completion_tokens when present, otherwise count
content/reasoning/tool deltas as a proxy.
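
A hedged sketch of the measurement logic: TTFT is taken at the first non-empty delta, and the token count prefers `usage.completion_tokens` and falls back to counting deltas. The function name `consume_sse` and the injectable `clock` are assumptions; the chunk fields follow the OpenAI-compatible streaming shape, with `reasoning_content` as named earlier in this PR. This is not the bench's `_stream_chat`.

```python
import json
import time


def consume_sse(lines, clock=time.perf_counter):
    """Consume 'data: ...' SSE lines; return TTFT, wall time and a token count."""
    start = clock()
    ttft = None
    deltas = 0
    usage_tokens = None
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        usage = chunk.get("usage")
        if usage and usage.get("completion_tokens") is not None:
            usage_tokens = usage["completion_tokens"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            if (delta.get("content") or delta.get("reasoning_content")
                    or delta.get("tool_calls")):
                deltas += 1
                if ttft is None:
                    ttft = clock() - start  # first token of any kind
    # Prefer the server-reported count; otherwise use the delta count as a proxy.
    n_tok = usage_tokens if usage_tokens is not None else deltas
    return {"ttft": ttft, "wall": clock() - start, "n_tok": n_tok}
```

The delta count is only a proxy, since one delta may carry more or fewer than one token.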
System prompt and warmup-call still in place from the previous
commit; --n-gen default still 64.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Self-contained reproducer for the multi-slot inline-snap regression
in prefix_cache.py + qwen35_target_graph.cpp. No transcript dependency:
generates a 6-call growing-prefix sequence with synthetic pylint-style
tool results, runs against slots=0 (control) then slots=2 (repro),
prints per-call content/reasoning/finish + a side-by-side table.
Trigger profile: starting at the second multi-turn call (~5K char
prefix), warm responses become content_len=0 / comp_tok=0 /
finish_reason=stop; subsequent calls return in <50 ms, also empty.
Cold path on the same prompts produces 64 tokens per call.
Suppresses the GGML gdb-fork backtrace handler via GGML_NO_BACKTRACE=1
so the daemon log stays readable when ggml-cuda hits its
"device not ready" error path during the lazy snap-buffer alloc.
Usage:
python3 dflash/scripts/repro_empty_response.py \
--target /path/to/Qwen3.6-27B*.gguf \
--draft /path/to/qwen3.6-27b-dflash \
--bin /path/to/dflash/build/test_dflash \
--server /path/to/dflash/scripts/server_tools.py
Exits 0 on confirmed repro, 1 if every warm call produced output
(threshold not reached — bump --n-turns / --tool-chars), 2 if cold
itself returned empty (different problem).
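
The exit-code contract above can be summarised in a few lines. The function name and the use of per-call completion-token counts as the "empty" signal are assumptions for illustration, not the script's actual code.

```python
def repro_exit_code(cold_tokens, warm_tokens) -> int:
    """Map per-call token counts to the script's documented exit codes."""
    if any(n == 0 for n in cold_tokens):
        return 2  # cold path itself returned empty: a different problem
    if any(n == 0 for n in warm_tokens):
        return 0  # repro confirmed: at least one warm call came back empty
    return 1      # every warm call produced output: did not reproduce
```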
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…odule

Bumps the llama.cpp submodule to a fix that addresses the prefix-cache
empty-response bug at its root: ggml-cuda's VMM allocator's pool
extension via cuMemSetAccess races with in-flight async work and returns
CUDA_ERROR_NOT_READY. The CU_CHECK macro hits GGML_ABORT but the abort
doesn't actually terminate, leaving the just-mapped region without
access permissions. Every subsequent read/write into that region
silently misbehaves: for the prefix cache, snapshots of KV state get
stored into the broken region and restore as zeroed/garbled state,
making the model emit 0 tokens with finish_reason=stop.

Manifests on PR Luce-Org#59's inline-snap path because it interleaves
compute with allocations on the same backend (snapshot copies during
prefill followed by gallocr / rollback / cache rebuild allocations).

The fix adds a cudaDeviceSynchronize before the cuMem* sequence in the
pool extension branch. It only fires when the pool actually grows, so
steady-state hot-path allocations are unaffected.

llama.cpp PR: Luce-Org/llama.cpp-dflash-ggml#4

Submodule URL temporarily pointed at easel's fork (branch
fix/cuda-vmm-pool-extension-race) until the upstream PR merges. After
merge, revert .gitmodules to Luce-Org/llama.cpp.git@luce-dflash and bump
the submodule pointer to the merge commit.

Also bumps prefix_cache.startup_sync's await_reply timeout 10s → 60s for
daemons with multi-slot snap pools at large max-ctx.

Validated on RTX 3090 Ti, CUDA 13.2, Qwen 3.6 27B, max-ctx=24576,
slots=2:

  session       turns  TTFT cold  TTFT warm  TTFT x  wall x  empties
  lucebox-hub   10     40.0s      39.9s      1.00x   0.77x   0/10
  nexiq-small   6      55.9s      44.9s      1.24x   0.75x   0/6
  axon-med      10     133.8s     51.7s      2.59x   1.27x   0/10
  helix-large   10     242.1s     97.3s      2.49x   1.52x   0/10

36/36 warm calls produce real content. Cache delivers 2.49–2.59x TTFT
speedup on long agentic prefixes (38K–70K chars), the headline win this
PR set out to validate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…s layers

Cross-attention with TQ3_0 KV cache produced accept_rate=0 because three
separate issues compounded:

1. K/V views were cast from TQ3_0 to F16/F32 before ggml_flash_attn_ext.
   The CUDA FA kernels apply forward FWHT to Q (and inverse FWHT to the
   output) only when they observe K->type == GGML_TYPE_TQ3_0
   (fattn-chunked.cu:228,394; fattn-vec.cuh:168). Casting stripped the
   type tag, FA picked a non-WHT kernel, and Q (real domain) dotted with
   K (FWHT domain, just unpacked into F16) produced meaningless scores.
   Removed the cast; Kfa/Vfa now reach FA with native TQ3_0.
2. TQ3_0 K is iterated in 128-element block strides; an unaligned ne[1]
   reads past the valid window into stale cache cells. Previously we
   only padded for head_dim>=512; SWA layers (head_dim=256) skipped
   padding and silently corrupted attention. Extended needs_kv_pad to
   fire for any TQ3_0 cache, mirroring gemma4_target_graph.cpp's
   need_256_pad policy.
3. Each layer created its own FA mask input tensor but only the last one
   was exposed via out.fa_mask. After fix #2 all four layers needed
   masks; the unfilled mask buffers contained uninitialised CUDA memory
   (cudaMalloc is not zeroed), causing NaN logits on subsequent steps.
   Hoisted a single shared mask out of the per-layer loop. The builder
   now asserts that all need-mask layers want the same (width,
   kv_seq_len) and fails loudly if a future long-context build wants
   per-layer masks (SWA cap < full attn_pos), instead of silently doing
   the wrong thing.

Trajectory:
  pre-fix:        accept_rate = 0.00 (varying garbage tokens)
  fix #1 only:    accept_rate = 0.00 (drafts pinned to a single token)
  fix #1+#2:      step 1 OK, step 2+ NaN
  fix #1+#2+#3:   accept_rate = 0.22 (Q4_K_M target + Q8_0 assistant,
                  TQ3_0 KV, 131-token prompt, 64 generation steps)

Adjacent infrastructure:
- create_gemma4_cache(): extra_q8_layers param to force Q8_0 on specific
  MTP donor layers when needed.
- get_mtp_swa_pattern(): lightweight helper reading MTP SWA layout from
  GGUF without loading tensors.
- MTP loader: load centroids/token_ordering whenever n_centroids>0
  (graph builder decides whether to use them).
- Test caller: fills out.fa_mask before each compute; dropped the
  per-step diagnostic prints that are no longer needed.

Known follow-ups (not blocking):
- Long-context multi-mask: SWA cap < full attn_pos trips the assert.
- SWA-wrap branch concat-forces F32 on TQ3_0, losing the WHT path.
- Accept rate 0.22 is in expected range; remaining gap to spike's
  reference numbers may come from quantization, RoPE source, or
  attention scale.
…ync stream)

#1: recommit verify wrote accept_n+2 KV slots vs fast-path's accept_n+1;
    runner's base_pos += accept_n+2 then re-verified bonus at its own
    slot, skipping every Dth token at γ≥3 partial-accept (reproduced
    AR/spec divergence in new test_recommit_byte_identical_to_ar).
#2: capture_topology_for_chain() virtual; runner owns the call.
    verify_batch no longer mutates last_tree_*.
#3: dedicated rollback CUDA stream + cudaMemcpy2DAsync batching in
    restore_kv_at_dfs (4× fewer launches per layer).

Bug #5 (step_sg_cache O(n_ctx)) deferred; needs ggml_set_rows refactor.
…and_decode (step 3.1)

Rebase of the MTP-via-daemon work onto latest main (PRs Luce-Org#213, Luce-Org#210, Luce-Org#208, Luce-Org#207 already merged) plus the first slice of howard0su's PR Luce-Org#214 review request: move MTP orchestration into dflash/src/common/ behind a generic entry point any ModelBackend can call.

## What landed

### Foundation (rebase port, ~5k LOC)

- `dflash/src/qwen36/qwen36_mtp.{cpp,h}` (2.3k LOC): Qwen3.6 native-heads MTP module (Qwen36MtpModule, implements INativeMtp)
- `dflash/src/qwen36/qwen36_mtp_graph.{cpp,h}`: MTP head forward graph
- `dflash/src/qwen36/qwen36_mtp_loader.cpp`: NextN tensor loader from GGUF
- `dflash/src/common/mtp_interface.h`: abstract IMtpModule + flavor mixins
- `dflash/src/common/mtp_chain_runner.{cpp,h}`: generic γ-loop runner
- `dflash/src/common/{gguf_metadata,gguf_mmap,step_graph,model_backend}.h` + `attn_masks.h` + `dflash_target.h` updates: shared infrastructure
- `dflash/src/qwen35/qwen35_backend.{cpp,h}`: extended with optional Qwen36MtpModule, init_mtp_, warm_mtp_for_prompt_, do_mtp_prefill_, do_mtp_decode_ (will be slimmed once orchestrator absorbs them, step 3.3)
- `dflash/src/qwen35/qwen35_daemon.{cpp,h}`: DaemonArgs carry MTP fields
- `dflash/src/qwen35/qwen35_dflash_target.{cpp,h}` + `qwen35_target_graph.cpp`: hidden-sequence capture path for MTP head warming
- `dflash/test/test_dflash.cpp`: daemon dispatch routes `--daemon --mtp-gguf` to run_qwen35_daemon (file-mode harness preserved)
- `dflash/scripts/server.py`: `--mtp-gguf`/`--mtp-gamma`/`--mtp-draft-source` CLI flags, MTP-mode spawn-cmd branch, layered on top of mrciffa's thinking-default fixes (commit 998b280) without conflict

### Step 3.1: common::mtp::warm_and_decode entry point (TDD red→green)

Howard's review:

> "MTP should be simple as additional weights of modelbackend. If a model
> contains MTP support (gemma4 or qwen3.5), the logic can handle it. In
> other words, the logic should be in /common which can potentially
> leverage by any modelbackend if they support mtp."

Carved out the public surface for the future orchestrator:

    GenerateResult dflash27b::common::mtp::warm_and_decode(
        ModelBackend * backend,
        const GenerateRequest & req,
        const DaemonIO & io);

New files:

- `dflash/src/common/mtp_orchestrator.{cpp,h}`: header pins the signature, cpp is a minimal stub that only handles guard cases (null backend, no MTP support, empty prompt). Real warm + decode body lands in step 3.2, driven by additional red→green tests.
- `dflash/test/test_common_mtp_orchestrator.cpp`: three guard tests written and watched fail BEFORE the stub existed (compile-time RED: "common/mtp_orchestrator.h: No such file or directory"), then GREEN after the stub returned matching error strings.

Test results:

    T1 null_backend          PASS
    T2 backend_without_mtp   PASS
    T3 empty_prompt          PASS
    ALL PASS

## Steps 3.2-3.5 (separate commits, this PR)

- 3.2 fill warm_and_decode body (chunked prefill via DFlashTarget::verify_batch + hidden capture + MtpChainRunner.run); red test = identical token IDs vs reference run_qwen36_mtp_harness on a fixed prompt.
- 3.3 replace Qwen35Backend::do_mtp_decode_/do_mtp_prefill_ with calls to common::mtp::warm_and_decode; delete the qwen35-local helpers.
- 3.4 stub Gemma4Backend MTP override using the same common entry point to prove the interface is generic (not Qwen35-specific).
- 3.5 audit common/mtp_orchestrator + mtp_chain_runner for any hand-rolled CPU loops; replace with ggml primitives per howard's point #1.

Then retest 24K baseline post-RoPE-fix (howard's other comment) and update PR description with current numbers.

Addresses:

- davide221 Luce-Org#214#issuecomment-4472910706 (merge conflicts): rebased
- howard0su Luce-Org#214#review (changes requested points 2, 3, 4): first slice
… (seeds #1, #2)

- StubServer: ThreadingHTTPServer recorder, zero new deps (mirrors llamacpp_compat_proxy.py pattern)
- Seed #2 green: proxy injects session_id on /v1/messages, preserves existing, passes through GET
- Seed #1 documented: chat/completions round-trip passes; injection assertion commented out pending commit 3
Targets
feat/prefix-cache(PR Luce-Org#59). Drafted for review.Five commits, fairly cleanly split: bench loader → bench tooling fixes → repro script → upstream cuda fix.
1. Faithful transcript replay (
bench_agent_loop.py)extract_user_turnsreads onlytype=userrecords wherecontentis astr. Real Claude Code transcripts have user records oftool_resultblocks withcontentas alist— silently skipped. The bench replays only typed human prompts with bench-synthesised assistant replies, with no tool I/O.The replacement loader (
_load_transcript+_to_openai_messages) coalesces consecutive same-role records into single turns, converts Anthropic blocks → OpenAI structured messages (tool_use → assistant.tool_calls, tool_result → role=tool, thinking dropped), and at each assistant index sends the exact prefix that was sent at that point. Validated against a 65-record session: old loader extracts 7 typed-user turns; new loader walks all 32 call points with prefix at the last call = 95,536 chars. Targetsdflash/scripts/server_tools.py(the production agent endpoint).Also: streaming SSE for separate TTFT measurement, and a token-count fallback that counts content/reasoning/tool deltas when
usage.completion_tokensisn't emitted (PR Luce-Org#59's server doesn't honourstream_options.include_usage).2. Standalone empty-response reproducer (
repro_empty_response.py)Self-contained, no transcript dependency. Generates 6 growing-prefix calls (130 → 5K → 10K → 15K → 20K → 26K chars) with synthetic pylint-style tool results, runs against slots=0 (control) then slots=2 (repro), prints a side-by-side table.
Pre-fix (slots=2): warm call 2 returns content=0 / comp_tok=0 / finish=stop after 9.7s of real prefill; calls 3-6 return empty in <50ms (cache "hits" on a corrupted slot). Reproduces deterministically.
3. Root cause + upstream fix (
dflash/deps/llama.cpp)The bug is in upstream ggml-cuda's VMM allocator, not in PR Luce-Org#59 itself. PR Luce-Org#59 just reliably triggers it because its inline-snap pattern interleaves compute with allocations on the same backend.
Mechanism:
alloc()path extends the pool viacuMemCreate+cuMemMap+cuMemSetAccesswhen an allocation can't fit in the existing mapped region.cuMemSetAccessraces with previously-queued async work and returnsCUDA_ERROR_NOT_READY.CU_CHECKfiresGGML_ABORT, but in some process configurations the abort doesn't terminate (gdb-fork backtrace path interferes, SIGABRT swallowed somewhere — exact mechanism unclear, but reproducible).Fix is one
cudaDeviceSynchronize()before thecuMem*sequence inside the pool extension branch:Only fires when the pool actually needs to grow, so steady-state hot-path allocations are unaffected.
llama.cpp PR: Luce-Org/llama.cpp-dflash-ggml#4
This PR's submodule URL is temporarily pointed at easel's fork (
fix/cuda-vmm-pool-extension-racebranch) until that PR merges; revert toLuce-Org/llama.cpp.git@luce-dflashafter merge.Also bumps
prefix_cache.startup_sync'sawait_replytimeout 10s → 60s for daemons with multi-slot snap pools at large--max-ctx.Validation
repro_empty_response.pyexits 0 (repro confirmed) pre-fix, exits 1 ("DID NOT REPRO") post-fix.4-session sweep on RTX 3090 Ti, CUDA 13.2, Qwen 3.6 27B,
--max-ctx=24576 --warm-slots=2:Zero empty responses across 36 warm calls. Cache wins scale with prefix size — 2.49–2.59× TTFT and 1.27–1.52× wall on real long-prefix agentic sessions is the headline win this PR (and PR Luce-Org#59) set out to validate. Short sessions barely break even, as expected (cache-creation overhead vs. limited reuse).
Test plan
repro_empty_response.py— pre-fix repros, post-fix doesn'ttest_multi_turn_prefix_cache.pystill passes — should, since it's checking turn-2-5 ratios on a small fixture that probably didn't trigger the bug anywayStatus
Draft. Two PRs need review:
Happy to split this PR if reviewer prefers (bench-only vs submodule-bump), but the bench changes don't make sense without the underlying fix being available.
🤖 Generated with Claude Code