[codex] Recover dflash spec-decode agent stalls#315
Conversation
dflash speculative decoding emits EOS as the very first token on certain agentic "decision" turns, returning an empty completion (0 tokens, finish=stop, no tool_call). The agent loop then stalls with nothing to run -- the user-visible "dflash stops the moment it needs to do something." At temperature 0 spec-decode must equal AR greedy, so this is a spec-decode correctness bug (the batched target verify diverges from AR on the first emitted position for these contexts). Root cause isolated with a reproducible eval (stateless dflash-nocache lane, two full passes byte-identical, jaccard 1.0): 9 of 55 real captured agentic turns produce empty output under spec-decode, deterministically. The SAME turns produce correct non-empty output on the AR path (no draft/ddtree) and on stock llama.cpp Q6 -- so it is specific to the dflash spec-decode path, not the prompt or the model weights. Fix: at the two do_spec_decode call sites, if it returns success but emitted zero tokens, fall back to do_ar_decode. The trigger is strictly "0 tokens emitted", so healthy turns (which emit >=1 token) never reach the fallback -- blast radius is exactly the currently-100%-failing turns. do_ar_decode is the existing autoregressive path, verified correct here. Validated (all turns replayed in isolation on the reproducible lane): - 9/9 empty turns now produce non-empty output - those 9 outputs are BYTE-IDENTICAL to the AR lane (fallback resumes from correct state) - full 55-turn sweep: empty count 9 -> 0, zero regressions to empty, tool-call set unchanged - prefix-cache oracle still 6/6 bit-identical - 12-turn consecutive sequence on the stateful cache-on lane: 0 empties (no cross-turn state poisoning) Spec-decode speed is retained for every non-degenerate turn; the slower AR path runs only on the rare empty case (which was already failing). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Code Review: dflash spec-decode agent stallsVerdict: approve (with suggestions)The implementation is functionally correct and aligned with the task goal. The env-gated approach ( Non-blocking findings1.
2. The 3. When 4. Duplicated Both 5. Unbounded Every stall/recovery event appends to 6.
|
There was a problem hiding this comment.
2 issues found across 4 files
Reply with feedback, questions, or to request a fix.
Re-trigger cubic
Summary
Root Cause
The original EOS floor lived in the AR path, but most agentic stalls exit through the spec-decode replay/emit loop. That let Q4 accept short action preambles like "let me check ...:" and stop before emitting tool XML.
The remaining captured failure, req_0031, is different: dflash begins a plausible execute_code call, then degenerates mid-code into a repeated punctuation run before closing XML. This PR intentionally does not salvage or execute that incomplete code. It marks the decode as a bounded length-class failure so the Hermes Q6 retry can recover safely.
Note
This upstream PR is opened from the same fork head as the mirror PR. It includes the prerequisite empty-output spec-decode fallback commit because upstream main does not yet contain that fork-side fix.
Validation