[hermes-dflash-phone-cli][2/n] Preserve compute failures during cancellation by OmarB97 · Pull Request #9 · OmarB97/lucebox-hub

OmarB97 · 2026-05-31T17:43:27Z

Mirror follow-up for Luce-Org#324 after #8 had already merged.\n\nThis covers the review-fix commit 47fd712 only: preserve completed ggml graph_compute failures ahead of cancellation checks so disconnect handling cannot mask a backend compute failure.\n\nVerification:\n- taro clean clone: cmake configure/build test_server_unit\n- taro clean clone: server/build/test_server_unit => 1647 assertions, 0 failures\n\nUpstream review context: cubic identified the masking issue on Luce-Org#324.

dflash speculative decoding emits EOS as the very first token on certain agentic "decision" turns, returning an empty completion (0 tokens, finish=stop, no tool_call). The agent loop then stalls with nothing to run -- the user-visible "dflash stops the moment it needs to do something." At temperature 0 spec-decode must equal AR greedy, so this is a spec-decode correctness bug (the batched target verify diverges from AR on the first emitted position for these contexts). Root cause isolated with a reproducible eval (stateless dflash-nocache lane, two full passes byte-identical, jaccard 1.0): 9 of 55 real captured agentic turns produce empty output under spec-decode, deterministically. The SAME turns produce correct non-empty output on the AR path (no draft/ddtree) and on stock llama.cpp Q6 -- so it is specific to the dflash spec-decode path, not the prompt or the model weights. Fix: at the two do_spec_decode call sites, if it returns success but emitted zero tokens, fall back to do_ar_decode. The trigger is strictly "0 tokens emitted", so healthy turns (which emit >=1 token) never reach the fallback -- blast radius is exactly the currently-100%-failing turns. do_ar_decode is the existing autoregressive path, verified correct here. Validated (all turns replayed in isolation on the reproducible lane): - 9/9 empty turns now produce non-empty output - those 9 outputs are BYTE-IDENTICAL to the AR lane (fallback resumes from correct state) - full 55-turn sweep: empty count 9 -> 0, zero regressions to empty, tool-call set unchanged - prefix-cache oracle still 6/6 bit-identical - 12-turn consecutive sequence on the stateful cache-on lane: 0 empties (no cross-turn state poisoning) Spec-decode speed is retained for every non-degenerate turn; the slower AR path runs only on the rare empty case (which was already failing). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fixes Luce-Org#102 Adds --target-gpu / --draft-gpu CLI flags and DFLASH_TARGET_GPU / DFLASH_DRAFT_GPU environment variables to the README server configuration table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…fallback

OmarB97 · 2026-05-31T17:53:02Z

This is the replacement fork mirror for the current head of Luce-Org#324.

Why this exists:

fix: cancel disconnected dflash requests #8 already merged before review-fix commit 47fd712.
The upstream PR remains open and now points at 47fd712, so the old mirror only covers the earlier commit range.
This PR mirrors the current upstream head for fork-side review/settlement.

Verification:

Clean taro clone /tmp/lucebox-pr324-review-fix: CMake configure/build for test_server_unit succeeded.
server/build/test_server_unit passed: 1647 assertions, 0 failures.

OmarB97 · 2026-06-02T01:29:14Z

Post-merge reviewer dispatch (hermes-local-longctx-ko-mac)

Verdict: APPROVE (with non-blocking suggestions below)

Reviewed the full diff across 17 files (~757 additions). This is a substantial and well-structured change. Summary of findings:

What works well

Consistent cancellation pattern: should_cancel() is uniformly applied across all backends (Gemma4, Laguna, Qwen3, Qwen3.5, Qwen3.5 MoE) and both spec-decode and AR paths. No backends were missed.
Clean separation of concerns: DaemonComputeResult enum correctly prioritizes compute failures over cancellation (a graph error is a backend bug, not a client hang-up).
Empty spec fallback: The retry mechanism for spec-decode producing only EOS/thinking tokens is well-designed. force_ar_decode flag cleanly bypasses spec without duplicating retry policy in each backend.
Visible output tracking: visible_output_seen prevents caching snapshots/checkpoints when generation produced no client-visible output. Good optimization.
Callback-based cancellation: CancelCallback on DaemonIO cleanly decouples HTTP server from backend layer.

Non-blocking suggestions

Thread-per-request overhead: RequestDisconnectWatcher spawns one std::thread per request. Under high concurrency this could be expensive. Consider a shared poller or edge-triggered epoll approach for production workloads.
Truncated tokenizer diff: The tokenizer.cpp changes were truncated by preflight. The encode(rendered, request_cancelled) call in http_server.cpp implies a new overload accepting a cancel callback — verify this compiles and the cancel callback is actually checked during tokenization.
README scope creep: The GPU selection flag documentation (--target-gpu, --draft-gpu, env vars) appears unrelated to the cancellation/empty-output-retry goal. Consider landing that in a separate PR.
Double-compute edge case: When spec-decode produces only EOS tokens and AR-decode also produces only EOS, both paths ran full compute. Rare but wasteful. Could add a check: if spec-decode hit immediate EOS, skip the AR retry since the model genuinely wants to stop.

Minor nit

POLLRDHUP falls back to 0 on macOS. The recv+MSG_PEEK fallback still works, but macOS won't get the optimized early-half-close detection. Acceptable for correctness.

OmarB97 and others added 8 commits May 30, 2026 11:17

docs: document multi-GPU flags and env vars for PFlash/DFlash

7d3f873

Fixes Luce-Org#102 Adds --target-gpu / --draft-gpu CLI flags and DFLASH_TARGET_GPU / DFLASH_DRAFT_GPU environment variables to the README server configuration table. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix(common): retry empty spec decode through AR

f4a0d8b

Merge pull request Luce-Org#314 from OmarB97/fix/specdecode-empty-ar-…

17bbff9

…fallback

Merge pull request Luce-Org#317 from Oxygen56/docs/issue-102

8305b6c

fix: retry dflash generations with no visible output

34da50f

fix: cancel disconnected dflash requests

f540b29

fix: preserve compute failures during cancellation

47fd712

OmarB97 mentioned this pull request May 31, 2026

[codex] Retry dflash generations with no visible output Luce-Org/lucebox-hub#324

Closed

OmarB97 merged commit d9eb16b into main May 31, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[hermes-dflash-phone-cli][2/n] Preserve compute failures during cancellation#9

[hermes-dflash-phone-cli][2/n] Preserve compute failures during cancellation#9
OmarB97 merged 8 commits into
mainfrom
codex/visible-empty-dflash-retry-upstream

OmarB97 commented May 31, 2026

Uh oh!

OmarB97 commented May 31, 2026

Uh oh!

OmarB97 commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

OmarB97 commented May 31, 2026

Uh oh!

OmarB97 commented May 31, 2026

Uh oh!

OmarB97 commented Jun 2, 2026

Post-merge reviewer dispatch (hermes-local-longctx-ko-mac)

What works well

Non-blocking suggestions

Minor nit

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants