Skip to content

[hermes-dflash-phone-cli][2/n] Preserve compute failures during cancellation#9

Merged
OmarB97 merged 8 commits into
mainfrom
codex/visible-empty-dflash-retry-upstream
May 31, 2026
Merged

[hermes-dflash-phone-cli][2/n] Preserve compute failures during cancellation#9
OmarB97 merged 8 commits into
mainfrom
codex/visible-empty-dflash-retry-upstream

Conversation

@OmarB97

@OmarB97 OmarB97 commented May 31, 2026

Copy link
Copy Markdown
Owner

Mirror follow-up for Luce-Org#324 after #8 had already merged.\n\nThis covers the review-fix commit 47fd712 only: preserve completed ggml graph_compute failures ahead of cancellation checks so disconnect handling cannot mask a backend compute failure.\n\nVerification:\n- taro clean clone: cmake configure/build test_server_unit\n- taro clean clone: server/build/test_server_unit => 1647 assertions, 0 failures\n\nUpstream review context: cubic identified the masking issue on Luce-Org#324.

OmarB97 and others added 8 commits May 30, 2026 11:17
dflash speculative decoding emits EOS as the very first token on certain
agentic "decision" turns, returning an empty completion (0 tokens,
finish=stop, no tool_call). The agent loop then stalls with nothing to
run -- the user-visible "dflash stops the moment it needs to do
something." At temperature 0 spec-decode must equal AR greedy, so this is
a spec-decode correctness bug (the batched target verify diverges from AR
on the first emitted position for these contexts).

Root cause isolated with a reproducible eval (stateless dflash-nocache
lane, two full passes byte-identical, jaccard 1.0): 9 of 55 real captured
agentic turns produce empty output under spec-decode, deterministically.
The SAME turns produce correct non-empty output on the AR path (no
draft/ddtree) and on stock llama.cpp Q6 -- so it is specific to the
dflash spec-decode path, not the prompt or the model weights.

Fix: at the two do_spec_decode call sites, if it returns success but
emitted zero tokens, fall back to do_ar_decode. The trigger is strictly
"0 tokens emitted", so healthy turns (which emit >=1 token) never reach
the fallback -- blast radius is exactly the currently-100%-failing turns.
do_ar_decode is the existing autoregressive path, verified correct here.

Validated (all turns replayed in isolation on the reproducible lane):
- 9/9 empty turns now produce non-empty output
- those 9 outputs are BYTE-IDENTICAL to the AR lane (fallback resumes
  from correct state)
- full 55-turn sweep: empty count 9 -> 0, zero regressions to empty,
  tool-call set unchanged
- prefix-cache oracle still 6/6 bit-identical
- 12-turn consecutive sequence on the stateful cache-on lane: 0 empties
  (no cross-turn state poisoning)

Spec-decode speed is retained for every non-degenerate turn; the slower
AR path runs only on the rare empty case (which was already failing).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes Luce-Org#102

Adds --target-gpu / --draft-gpu CLI flags and DFLASH_TARGET_GPU /
DFLASH_DRAFT_GPU environment variables to the README server
configuration table.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@OmarB97

OmarB97 commented May 31, 2026

Copy link
Copy Markdown
Owner Author

This is the replacement fork mirror for the current head of Luce-Org#324.

Why this exists:

  • fix: cancel disconnected dflash requests #8 already merged before review-fix commit 47fd712.
  • The upstream PR remains open and now points at 47fd712, so the old mirror only covers the earlier commit range.
  • This PR mirrors the current upstream head for fork-side review/settlement.

Verification:

  • Clean taro clone /tmp/lucebox-pr324-review-fix: CMake configure/build for test_server_unit succeeded.
  • server/build/test_server_unit passed: 1647 assertions, 0 failures.

@OmarB97 OmarB97 merged commit d9eb16b into main May 31, 2026
@OmarB97

OmarB97 commented Jun 2, 2026

Copy link
Copy Markdown
Owner Author

Post-merge reviewer dispatch (hermes-local-longctx-ko-mac)

Verdict: APPROVE (with non-blocking suggestions below)

Reviewed the full diff across 17 files (~757 additions). This is a substantial and well-structured change. Summary of findings:

What works well

  • Consistent cancellation pattern: should_cancel() is uniformly applied across all backends (Gemma4, Laguna, Qwen3, Qwen3.5, Qwen3.5 MoE) and both spec-decode and AR paths. No backends were missed.
  • Clean separation of concerns: DaemonComputeResult enum correctly prioritizes compute failures over cancellation (a graph error is a backend bug, not a client hang-up).
  • Empty spec fallback: The retry mechanism for spec-decode producing only EOS/thinking tokens is well-designed. force_ar_decode flag cleanly bypasses spec without duplicating retry policy in each backend.
  • Visible output tracking: visible_output_seen prevents caching snapshots/checkpoints when generation produced no client-visible output. Good optimization.
  • Callback-based cancellation: CancelCallback on DaemonIO cleanly decouples HTTP server from backend layer.

Non-blocking suggestions

  1. Thread-per-request overhead: RequestDisconnectWatcher spawns one std::thread per request. Under high concurrency this could be expensive. Consider a shared poller or edge-triggered epoll approach for production workloads.
  2. Truncated tokenizer diff: The tokenizer.cpp changes were truncated by preflight. The encode(rendered, request_cancelled) call in http_server.cpp implies a new overload accepting a cancel callback — verify this compiles and the cancel callback is actually checked during tokenization.
  3. README scope creep: The GPU selection flag documentation (--target-gpu, --draft-gpu, env vars) appears unrelated to the cancellation/empty-output-retry goal. Consider landing that in a separate PR.
  4. Double-compute edge case: When spec-decode produces only EOS tokens and AR-decode also produces only EOS, both paths ran full compute. Rare but wasteful. Could add a check: if spec-decode hit immediate EOS, skip the AR retry since the model genuinely wants to stop.

Minor nit

  • POLLRDHUP falls back to 0 on macOS. The recv+MSG_PEEK fallback still works, but macOS won't get the optimized early-half-close detection. Acceptable for correctness.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants