[codex] Recover dflash spec-decode agent stalls #315

Open
OmarB97 wants to merge 3 commits into Luce-Org:main from OmarB97:codex/dflash-spec-tool-recovery

@OmarB97 OmarB97 commented May 31, 2026

Summary

  • Add env-gated recovery for Qwen35 dflash spec-decode stalls where the model emits an action preamble and then EOS before a tool call.
  • Inject a minimal tool-call XML prefix for tool-present turns, replay KV to the right boundary, then continue in AR decode.
  • Make the recovery prefix value-aware and tool-choice/schema-aware so forced non-terminal tools do not receive terminal XML.
  • Tail off to AR when spec-decode produces an invalid next draft seed instead of returning a decode error.
  • Add an env-gated repeated-token guard for the residual malformed-tool-buffer case so it becomes bounded and retryable instead of burning the whole token budget.

Root Cause

The original EOS floor lived in the AR path, but most agentic stalls exit through the spec-decode replay/emit loop. That let Q4 accept short action preambles like "let me check ...:" and stop before emitting tool XML.

The remaining captured failure, req_0031, is different: dflash begins a plausible execute_code call, then degenerates mid-code into a repeated punctuation run before closing XML. This PR intentionally does not salvage or execute that incomplete code. It marks the decode as a bounded length-class failure so the Hermes Q6 retry can recover safely.
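The repeated-token guard described above can be pictured with a minimal sketch. The PR text does not show its implementation, so the function name `tail_is_repeated_run` and the `max_run` threshold are hypothetical, and the real guard may count runs differently (for example over a small set of punctuation tokens rather than a single token id).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not the PR's code: returns true when the last
// `max_run` tokens of `tokens` all share the same token id.
bool tail_is_repeated_run(const std::vector<int32_t>& tokens, std::size_t max_run) {
    if (max_run == 0 || tokens.size() < max_run) return false;
    const int32_t last = tokens.back();
    for (std::size_t i = tokens.size() - max_run; i < tokens.size(); ++i) {
        if (tokens[i] != last) return false;  // run is broken inside the window
    }
    return true;
}
```

A decode loop could call this after each emitted token and, on a hit, stop with a length-class finish reason rather than continuing to the full token budget.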

Note

This upstream PR is opened from the same fork head as the mirror PR. It includes the prerequisite empty-output spec-decode fallback commit because upstream main does not yet contain that fork-side fix.

Validation

  • cmake --build server/build -j$(nproc)
  • ./server/build/test_server_unit: 1608 assertions, 0 failures
  • Captured corpus through llama-swap with DFLASH_MIN_TOKENS=64 and DFLASH_STALL_TOOL_PREFIX=1: 16/17 stall turns produce real tool calls, 0/2 legit controls produce tool calls
  • req_0031 now terminates as bounded finish_reason=length at 321 completion tokens instead of returning a decode error or running to the full hidden tool-buffer budget
  • Cache oracle with env off remains covered by the prior PR validation; this follow-up is active only behind the same recovery envs

OmarB97 and others added 2 commits May 30, 2026 11:17
dflash speculative decoding emits EOS as the very first token on certain
agentic "decision" turns, returning an empty completion (0 tokens,
finish=stop, no tool_call). The agent loop then stalls with nothing to
run -- the user-visible "dflash stops the moment it needs to do
something." At temperature 0 spec-decode must equal AR greedy, so this is
a spec-decode correctness bug (the batched target verify diverges from AR
on the first emitted position for these contexts).

Root cause isolated with a reproducible eval (stateless dflash-nocache
lane, two full passes byte-identical, jaccard 1.0): 9 of 55 real captured
agentic turns produce empty output under spec-decode, deterministically.
The SAME turns produce correct non-empty output on the AR path (no
draft/ddtree) and on stock llama.cpp Q6 -- so it is specific to the
dflash spec-decode path, not the prompt or the model weights.

Fix: at the two do_spec_decode call sites, if it returns success but
emitted zero tokens, fall back to do_ar_decode. The trigger is strictly
"0 tokens emitted", so healthy turns (which emit >=1 token) never reach
the fallback -- blast radius is exactly the currently-100%-failing turns.
do_ar_decode is the existing autoregressive path, verified correct here.

Validated (all turns replayed in isolation on the reproducible lane):
- 9/9 empty turns now produce non-empty output
- those 9 outputs are BYTE-IDENTICAL to the AR lane (fallback resumes
  from correct state)
- full 55-turn sweep: empty count 9 -> 0, zero regressions to empty,
  tool-call set unchanged
- prefix-cache oracle still 6/6 bit-identical
- 12-turn consecutive sequence on the stateful cache-on lane: 0 empties
  (no cross-turn state poisoning)

Spec-decode speed is retained for every non-degenerate turn; the slower
AR path runs only on the rare empty case (which was already failing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
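The fallback rule in the commit message above (success with zero emitted tokens triggers AR) can be sketched as follows. This is an illustration only: `DecodeResult`, `DecodeFn`, and `decode_with_empty_fallback` are hypothetical stand-ins, and the real do_spec_decode and do_ar_decode signatures are not shown in this PR text.

```cpp
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the backend's decode result type.
struct DecodeResult {
    std::vector<int32_t> tokens;
};

// Hypothetical callable shape for a decode path; returns false on a decode error.
using DecodeFn = std::function<bool(DecodeResult&)>;

// Runs spec-decode first and falls back to AR only when spec-decode reports
// success but emitted zero tokens. A spec-decode error is returned unchanged.
bool decode_with_empty_fallback(const DecodeFn& spec_decode,
                                const DecodeFn& ar_decode,
                                DecodeResult& result) {
    if (!spec_decode(result)) return false;
    if (!result.tokens.empty()) return true;  // healthy turn: fallback never runs
    return ar_decode(result);
}
```

Because the trigger is exactly "success and empty", turns that emit at least one token never pay the AR cost.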
@OmarB97 OmarB97 marked this pull request as ready for review May 31, 2026 01:03

OmarB97 commented May 31, 2026

Code Review: dflash spec-decode agent stalls

Verdict: approve (with suggestions)

The implementation is functionally correct and aligned with the task goal. The env-gated approach (DFLASH_MIN_TOKENS, DFLASH_STALL_TOOL_PREFIX) ensures zero behavioral change in the default production lane. The stall detection and recovery flow — detect preamble→EOS in spec-decode replay, inject tool-call XML prefix, replay KV, fall back to AR — is sound.

Non-blocking findings

1. tokens_contain for skip detection is full-sequence, not suffix-anchored (qwen35_backend.cpp ~L1461)

tokens_contain(out_tokens, *stall_skip_tokens) searches for the exact sequence " done" anywhere in out_tokens, not just at the end. If " done" appeared earlier in the conversation (e.g., in a previous assistant turn or the system prompt), the skip guard would fire on a stale match and suppress stall recovery. Consider switching to a suffix-only check (e.g., searching only the last N+len(skip) positions) so that only a recent match can skip recovery.
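One way to express the suffix-only check is a windowed search. The sketch below is an assumption about shape, not the PR's code: `tokens_contain_near_end` and its `slack` parameter (the N above) are hypothetical names, and the token ids are plain integers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical variant of a full-sequence search: true only when `needle`
// occurs within the last `needle.size() + slack` tokens of `haystack`.
bool tokens_contain_near_end(const std::vector<int32_t>& haystack,
                             const std::vector<int32_t>& needle,
                             std::size_t slack) {
    if (needle.empty() || haystack.size() < needle.size()) return false;
    const std::size_t window = std::min(haystack.size(), needle.size() + slack);
    const auto first = haystack.end() - static_cast<std::ptrdiff_t>(window);
    return std::search(first, haystack.end(), needle.begin(), needle.end()) != haystack.end();
}
```

With `slack` set to 0 this degenerates to a strict ends-with test; a small positive value tolerates a few trailing tokens after the skip phrase.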

2. tokens_have_recent_any matches individual sub-tokens, not multi-token suffixes (http_server.cpp + qwen35_backend.cpp)

The stall_action_suffix_tokens vector is constructed by taking only the last token of each encoded suffix variant (":", "`:", "):", "\":"). The tokens_have_recent_any check then looks for any of these individual tokens in the last 4 positions. This works when the relevant tokenizer always emits ":" as a single token, but if tokenization splits a suffix (e.g., "`:" → two tokens), the individual sub-token ":" could appear coincidentally in unrelated output and match. Consider adding a comment explaining why this flat-individual-token approach is safe for this tokenizer.
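For comparison, a multi-token alternative would keep each encoded suffix whole and match it as a sequence. The sketch below is hypothetical: `ends_with_tokens` and `ends_with_any_suffix` are illustrative names, and it anchors strictly at the end of the output, which is narrower than the last-4-positions window the current check uses.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using Tokens = std::vector<int32_t>;

// True when `out` ends with the complete token sequence `suffix`.
bool ends_with_tokens(const Tokens& out, const Tokens& suffix) {
    if (suffix.empty() || out.size() < suffix.size()) return false;
    return std::equal(suffix.rbegin(), suffix.rend(), out.rbegin());
}

// True when `out` ends with any of the fully encoded suffix variants.
bool ends_with_any_suffix(const Tokens& out, const std::vector<Tokens>& suffixes) {
    return std::any_of(suffixes.begin(), suffixes.end(),
                       [&out](const Tokens& s) { return ends_with_tokens(out, s); });
}
```

The trade-off is one extra comparison per variant in exchange for not matching a lone sub-token that happens to appear on its own.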

3. empty_result → AR fallback has no logging (qwen35_backend.cpp ~L608, ~L721)

When do_spec_decode returns true with empty result.tokens, the code silently falls back to do_ar_decode. This path is hit without logging to stdout/stderr or /tmp/dflash_floor.log. Consider adding a diagnostic fprintf here to distinguish "floor recovered" from "silent fallback" when debugging.

4. Duplicated _min_floor static init (qwen35_backend.cpp ~L1051 and ~L1211)

Both do_ar_decode and do_spec_decode initialize a static _min_floor via the same lambda reading getenv("DFLASH_MIN_TOKENS"). A shared helper (e.g., get_dflash_min_tokens()) would prevent drift.
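A shared helper along these lines might look like the sketch below. The name get_dflash_min_tokens() comes from the suggestion above; `parse_min_tokens` and the choice to treat unset, negative, or malformed values as 0 are assumptions, since the PR text does not show how the existing lambda parses the variable.

```cpp
#include <climits>
#include <cstdlib>

// Hypothetical parser: returns the floor as a non-negative int, or 0 when the
// value is unset, empty, negative, out of range, or has trailing characters.
int parse_min_tokens(const char* raw) {
    if (raw == nullptr || *raw == '\0') return 0;
    char* end = nullptr;
    const long value = std::strtol(raw, &end, 10);
    if (*end != '\0' || value < 0 || value > INT_MAX) return 0;
    return static_cast<int>(value);
}

// Reads DFLASH_MIN_TOKENS once and caches it for both decode paths.
int get_dflash_min_tokens() {
    static const int min_floor = parse_min_tokens(std::getenv("DFLASH_MIN_TOKENS"));
    return min_floor;
}
```

Both decode paths would then call get_dflash_min_tokens() instead of each holding its own static, so the parsing rules cannot drift apart.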

5. Unbounded /tmp/dflash_floor.log growth

Every stall/recovery event appends to /tmp/dflash_floor.log without rotation or size cap. On a production server with frequent agentic turns, this could fill /tmp. Consider a max-size guard or documented log rotation expectation.
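A max-size guard could be as small as the sketch below. It is an assumption, not the PR's logging code: `append_log_capped` is a hypothetical name, and dropping events once the cap is reached (rather than rotating the file) is just one possible policy.

```cpp
#include <cstdio>

// Hypothetical helper: appends `line` plus a newline to `path` unless the file
// already holds `max_bytes` or more. Returns true only when a line was written.
bool append_log_capped(const char* path, const char* line, long max_bytes) {
    std::FILE* f = std::fopen(path, "a");
    if (f == nullptr) return false;
    bool wrote = false;
    if (std::fseek(f, 0, SEEK_END) == 0) {
        const long size = std::ftell(f);
        if (size >= 0 && size < max_bytes) {
            wrote = std::fprintf(f, "%s\n", line) >= 0;
        }
    }
    std::fclose(f);
    return wrote;
}
```

The cap is checked before the write, so the file can exceed `max_bytes` by at most one line.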

6. last_tok scoping change (qwen35_backend.cpp ~L1420)

last_tok = replay_last_tok was moved from its original unconditional position into an else block, and the floor_to_ar path sets cache_.last_tok directly before returning. I verified this is correct — the floor_to_ar path always returns via step_graph_destroy + return ok rather than falling through to the next loop iteration where last_tok would be needed. However, the control flow is now harder to follow. Consider a brief comment above the else block explaining why last_tok is conditionally updated.

@cubic-dev-ai cubic-dev-ai Bot left a comment

2 issues found across 4 files

Comment thread server/src/server/http_server.cpp
Comment thread server/src/server/http_server.cpp Outdated
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026