[codex] Recover dflash spec-decode agent stalls by OmarB97 · Pull Request #315 · Luce-Org/lucebox-hub

OmarB97 · 2026-05-31T01:00:10Z

Summary

Add env-gated recovery for Qwen35 dflash spec-decode stalls where the model emits an action preamble and then EOS before a tool call.
Inject a minimal tool-call XML prefix for tool-present turns, replay KV to the right boundary, then continue in AR decode.
Make the recovery prefix value-aware and tool-choice/schema-aware so forced non-terminal tools do not receive terminal XML.
Tail off to AR when spec-decode produces an invalid next draft seed instead of returning a decode error.
Add an env-gated repeated-token guard for the residual malformed-tool-buffer case so it becomes bounded and retryable instead of burning the whole token budget.

Root Cause

The original EOS floor lived in the AR path, but most agentic stalls exit through the spec-decode replay/emit loop. That let Q4 accept short action preambles like "let me check ...:" and stop before emitting tool XML.

The remaining captured failure, req_0031, is different: dflash begins a plausible execute_code call, then degenerates mid-code into a repeated punctuation run before closing XML. This PR intentionally does not salvage or execute that incomplete code. It marks the decode as a bounded length-class failure so the Hermes Q6 retry can recover safely.
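A repeated-token guard of the kind described above can be sketched as follows. This is a minimal illustration only, not the PR's code: the helper name `tail_is_repeated_run`, the choice to compare raw token ids, and the `max_run` threshold are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): returns true when the last
// `max_run` tokens of `out` are all the same token id. A caller could use
// this to stop decoding early and report a bounded, retryable failure
// instead of spending the remaining token budget on a degenerate run.
bool tail_is_repeated_run(const std::vector<std::int32_t>& out,
                          std::size_t max_run) {
    if (max_run == 0 || out.size() < max_run) return false;
    const std::int32_t last = out.back();
    for (std::size_t i = out.size() - max_run; i < out.size(); ++i) {
        if (out[i] != last) return false;
    }
    return true;
}
```

The guard only detects the run; it does not try to repair or close the partial tool call, which matches the PR's stated intent of not salvaging incomplete code.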

Note

This upstream PR is opened from the same fork head as the mirror PR. It includes the prerequisite empty-output spec-decode fallback commit because upstream main does not yet contain that fork-side fix.

Validation

cmake --build server/build -j$(nproc)
./server/build/test_server_unit: 1608 assertions, 0 failures
Captured corpus through llama-swap with DFLASH_MIN_TOKENS=64 and DFLASH_STALL_TOOL_PREFIX=1: 16/17 stall turns produce real tool calls, 0/2 legit controls produce tool calls
req_0031 now terminates as bounded finish_reason=length at 321 completion tokens instead of returning a decode error or running to the full hidden tool-buffer budget
Cache oracle with env off remains covered by the prior PR validation; this follow-up is active only behind the same recovery envs

dflash speculative decoding emits EOS as the very first token on certain agentic "decision" turns, returning an empty completion (0 tokens, finish=stop, no tool_call). The agent loop then stalls with nothing to run -- the user-visible "dflash stops the moment it needs to do something."

At temperature 0 spec-decode must equal AR greedy, so this is a spec-decode correctness bug (the batched target verify diverges from AR on the first emitted position for these contexts).

Root cause isolated with a reproducible eval (stateless dflash-nocache lane, two full passes byte-identical, jaccard 1.0): 9 of 55 real captured agentic turns produce empty output under spec-decode, deterministically. The SAME turns produce correct non-empty output on the AR path (no draft/ddtree) and on stock llama.cpp Q6 -- so it is specific to the dflash spec-decode path, not the prompt or the model weights.

Fix: at the two do_spec_decode call sites, if it returns success but emitted zero tokens, fall back to do_ar_decode. The trigger is strictly "0 tokens emitted", so healthy turns (which emit >=1 token) never reach the fallback -- blast radius is exactly the currently-100%-failing turns. do_ar_decode is the existing autoregressive path, verified correct here.

Validated (all turns replayed in isolation on the reproducible lane):
- 9/9 empty turns now produce non-empty output
- those 9 outputs are BYTE-IDENTICAL to the AR lane (fallback resumes from correct state)
- full 55-turn sweep: empty count 9 -> 0, zero regressions to empty, tool-call set unchanged
- prefix-cache oracle still 6/6 bit-identical
- 12-turn consecutive sequence on the stateful cache-on lane: 0 empties (no cross-turn state poisoning)

Spec-decode speed is retained for every non-degenerate turn; the slower AR path runs only on the rare empty case (which was already failing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
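The fallback rule in the commit message ("success but zero tokens emitted, then run AR") can be sketched roughly like this. The `DecodeResult` struct, the `DecodeFn` alias, and the wrapper name are hypothetical stand-ins; the real do_spec_decode and do_ar_decode signatures are not shown in this PR page.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the backend's decode result type.
struct DecodeResult {
    std::vector<std::int32_t> tokens;
};

// Hypothetical callable shape; the real decode functions take more state.
using DecodeFn = std::function<bool(DecodeResult&)>;

// Runs spec-decode first and falls back to AR only when spec-decode
// reports success but emitted zero tokens.
bool decode_with_empty_fallback(const DecodeFn& spec_decode,
                                const DecodeFn& ar_decode,
                                DecodeResult& result) {
    if (!spec_decode(result)) return false;   // real errors still propagate
    if (!result.tokens.empty()) return true;  // healthy turn: no fallback
    return ar_decode(result);                 // degenerate empty turn
}
```

Because the trigger is strictly an empty token list, any turn that emits at least one token never reaches the AR call.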

OmarB97 · 2026-05-31T01:04:30Z

Code Review: dflash spec-decode agent stalls

Verdict: approve (with suggestions)

The implementation is functionally correct and aligned with the task goal. The env-gated approach (DFLASH_MIN_TOKENS, DFLASH_STALL_TOOL_PREFIX) ensures zero behavioral change in the default production lane. The stall detection and recovery flow — detect preamble→EOS in spec-decode replay, inject tool-call XML prefix, replay KV, fall back to AR — is sound.

Non-blocking findings

1. tokens_contain for skip detection is full-sequence, not suffix-anchored (qwen35_backend.cpp ~L1461)

tokens_contain(out_tokens, *stall_skip_tokens) searches for the exact sequence " done" anywhere in out_tokens, not just at the end. If " done" appeared earlier in the conversation (e.g., in a previous assistant turn or the system prompt), the skip guard would fire on a stale match and suppress stall recovery for a turn that needs it. Consider switching to a suffix-only check (e.g., searching only the last N+len(skip) positions) so that only a recent " done" can skip recovery.
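One possible shape for the suffix-only check suggested in finding 1, as a sketch under assumptions: the helper name `tokens_contain_near_end` and the `window` parameter are hypothetical, and the real tokens_contain signature may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: true only if `needle` occurs inside the last
// `window + needle.size()` tokens of `haystack`, so an older occurrence
// earlier in the sequence cannot match.
bool tokens_contain_near_end(const std::vector<std::int32_t>& haystack,
                             const std::vector<std::int32_t>& needle,
                             std::size_t window) {
    if (needle.empty() || haystack.size() < needle.size()) return false;
    const std::size_t span =
        std::min(haystack.size(), window + needle.size());
    const auto first = haystack.end() - static_cast<std::ptrdiff_t>(span);
    return std::search(first, haystack.end(),
                       needle.begin(), needle.end()) != haystack.end();
}
```

With `window` set to 0 this degenerates to a strict ends-with test; a small positive window tolerates trailing whitespace or punctuation tokens after the skip phrase.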

2. tokens_have_recent_any matches individual sub-tokens, not multi-token suffixes (http_server.cpp + qwen35_backend.cpp)

The stall_action_suffix_tokens vector is constructed by taking only the last token of each encoded suffix variant (":", "`:", "):", "":"). The tokens_have_recent_any check then looks for any of these individual tokens in the last 4 positions. This works when the relevant tokenizer always emits ":" as a single token, but if tokenization splits a suffix (e.g., "`:" → two tokens), the individual sub-token ":" could appear coincidentally. Consider adding a comment explaining why this flat-individual-token approach is safe for this tokenizer.
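To make the behavior in finding 2 concrete, here is a sketch of a flat "any candidate token in the last few positions" check. It is written from the description above, not copied from the PR; the name `recent_any_sketch` and the `lookback` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical re-creation of the described check: true if any single
// token id from `candidates` appears in the last `lookback` positions of
// `out`. It treats every candidate as an independent one-token match, so
// it cannot tell a full multi-token suffix from a lone trailing sub-token.
bool recent_any_sketch(const std::vector<std::int32_t>& out,
                       const std::vector<std::int32_t>& candidates,
                       std::size_t lookback) {
    const std::size_t n = std::min(out.size(), lookback);
    const auto first = out.end() - static_cast<std::ptrdiff_t>(n);
    return std::find_first_of(first, out.end(),
                              candidates.begin(), candidates.end()) != out.end();
}
```

If the tokenizer splits a suffix such as "`:" into two tokens, only the id of the final token ends up in `candidates`, which is why the reviewer asks for a comment documenting the single-token assumption.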

3. empty_result → AR fallback has no logging (qwen35_backend.cpp ~L608, ~L721)

When do_spec_decode returns true with empty result.tokens, the code silently falls back to do_ar_decode. This path is hit without logging to stdout/stderr or /tmp/dflash_floor.log. Consider adding a diagnostic fprintf here to distinguish "floor recovered" from "silent fallback" when debugging.

4. Duplicated _min_floor static init (qwen35_backend.cpp ~L1051 and ~L1211)

Both do_ar_decode and do_spec_decode initialize a static _min_floor via the same lambda reading getenv("DFLASH_MIN_TOKENS"). A shared helper (e.g., get_dflash_min_tokens()) would prevent drift.
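A shared helper along the lines suggested in finding 4 might look like the sketch below. The parsing rules (non-negative decimal integer, anything else treated as 0 meaning "disabled", and the upper bound) are assumptions; the PR's lambda may parse the variable differently.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical parser: returns 0 (assumed to mean "floor disabled") for a
// missing, empty, malformed, negative, or implausibly large value.
int parse_min_tokens(const char* raw) {
    if (raw == nullptr || *raw == '\0') return 0;
    char* end = nullptr;
    const long v = std::strtol(raw, &end, 10);
    if (end == raw || *end != '\0' || v < 0 || v > 1000000) return 0;
    return static_cast<int>(v);
}

// Reads DFLASH_MIN_TOKENS once and caches it, so do_ar_decode and
// do_spec_decode would share one definition instead of two static lambdas.
int get_dflash_min_tokens() {
    static const int cached = parse_min_tokens(std::getenv("DFLASH_MIN_TOKENS"));
    return cached;
}
```

Splitting the parser from the cached getter keeps the parsing rules unit-testable without touching the process environment.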

5. Unbounded /tmp/dflash_floor.log growth

Every stall/recovery event appends to /tmp/dflash_floor.log without rotation or size cap. On a production server with frequent agentic turns, this could fill /tmp. Consider a max-size guard or documented log rotation expectation.
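A max-size guard of the kind finding 5 mentions could be as small as the following sketch. The helper name `append_capped` and the policy of silently dropping lines once the cap is reached are assumptions; rotation or truncation would be equally valid choices.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical helper: appends `line` plus a newline to `path` unless the
// file has already reached `max_bytes`. Returns true only if it wrote.
bool append_capped(const char* path, const char* line, long max_bytes) {
    std::FILE* f = std::fopen(path, "a");
    if (f == nullptr) return false;
    bool wrote = false;
    // Seek to the end explicitly so ftell reports the current file size.
    if (std::fseek(f, 0, SEEK_END) == 0) {
        const long size = std::ftell(f);
        if (size >= 0 && size < max_bytes) {
            std::fprintf(f, "%s\n", line);
            wrote = true;
        }
    }
    std::fclose(f);
    return wrote;
}
```

The cap is checked before each write, so the file can exceed `max_bytes` by at most one line.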

6. last_tok scoping change (qwen35_backend.cpp ~L1420)

last_tok = replay_last_tok was moved from its original unconditional position into an else block, and the floor_to_ar path sets cache_.last_tok directly before returning. I verified this is correct — the floor_to_ar path always returns via step_graph_destroy + return ok rather than falling through to the next loop iteration where last_tok would be needed. However, the control flow is now harder to follow. Consider a brief comment above the else block explaining why last_tok is conditionally updated.

cubic-dev-ai

2 issues found across 4 files


OmarB97 and others added 2 commits May 30, 2026 11:17

fix(qwen35): recover spec-decode agent stalls

6c8db53

OmarB97 marked this pull request as ready for review May 31, 2026 01:03

cubic-dev-ai Bot reviewed May 31, 2026


Comment thread server/src/server/http_server.cpp

Comment thread server/src/server/http_server.cpp Outdated

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026

Merge pull request Luce-Org#315 from Luce-Org/codex/dflash-spec-tool-…

cb6dde1

…recovery

fix(qwen35): bound residual dflash stall loops

3ba401f

OmarB97 wants to merge 3 commits into Luce-Org:main from OmarB97:codex/dflash-spec-tool-recovery


2 participants
