[codex] Retry dflash generations with no visible output by OmarB97 · Pull Request #324 · Luce-Org/lucebox-hub

OmarB97 · 2026-05-31T15:46:28Z

Summary

The cached dflash path could count an EOS/EOT-only spec-decode result as a successful completion. That produced no visible streamed or non-streamed text while still allowing prefix-cache confirmation/continuation to treat the decode as valid.

This adds an explicit GenerateResult::empty_visible_output signal, sets it for Qwen35 EOS/EOT-only spec-decode results, retries those results through AR decode in the common backend wrapper, and gates HTTP prefix-cache confirmation on visible emitted output rather than raw completion token count.

This upstream branch is based directly on Luce-Org/main and contains only the visible-output fix. The fork-side integration PR remains OmarB97#7.

Validation

Built test_server_unit on taro with CUDA 13.3
/home/omar/ai/lucebox-worktrees/visible-empty-dflash-retry/server/build/test_server_unit (1634 assertions, 0 failures)
Built and installed patched dflash_server to /home/omar/ai/lucebox-hub/server/build/dflash_server on taro
Direct live smoke through llama-swap: dflash returned OK with visible content

cubic-dev-ai

No issues found across 4 files

_{Re-trigger cubic}

OmarB97 · 2026-05-31T17:02:08Z

Updated this PR with the dflash disconnect root-cause fix found during the 2026-05-31 Hermes hang investigation.

Root cause: the server only observed client disconnects when SSE writes happened. If Hermes/client disconnected while dflash was still in pre-header prompt handling, prefill, or first-token work, the backend kept CPU/GPU work alive and the single worker stayed unavailable.

Changes:

added request-level socket watching before render/tokenize/enqueue
made tokenizer encode cooperatively cancellable
propagated DaemonIO cancellation through backend prefill/spec-decode/AR paths
fixed the empty-visible retry test to call the retry wrapper

Verification on taro:

/tmp/lucebox-dflash-cancel-test: cmake --build server/build --target test_server_unit -j 8 && ./test_server_unit
result: 1641 assertions, 0 failures
/tmp/lucebox-dflash-cancel-test: cmake --build server/build --target dflash_server -j 8

cubic-dev-ai

1 issue found across 15 files (changes from recent commits).

_{Reply with feedback, questions, or to request a fix.

Re-trigger cubic}

Merge advanced PR Luce-Org#324 head 47fd712 after the draft-residency refresh. Preserve the auto-integration Qwen35MoE gallocr cleanup path while adding Luce-Org#324's cancellation-aware one-token processing helper and empty-output retry updates.

Record the post-push Luce-Org#324 head advance and conflict resolution after integrating Luce-Org#291/Luce-Org#290.

OmarB97 · 2026-05-31T17:52:52Z

Addressed the unresolved review issue from cubic about cancellation masking draft backend compute failures.

What changed:

Added a shared completed-compute classifier so ggml_backend_graph_compute() failures win over pending cancellation.
Applied it to the common DFlash spec-decode path and sibling qwen35/qwen35moe cancellation-after-compute sites.
Added unit coverage for failure-vs-cancel ordering.

Verification:

Clean taro clone /tmp/lucebox-pr324-review-fix: CMake configure/build for test_server_unit succeeded.
server/build/test_server_unit passed: 1647 assertions, 0 failures.

Mirror tracking:

The prior fork mirror PR fix: cancel disconnected dflash requests OmarB97/lucebox-hub#8 had already merged before this review-fix commit, so I opened replacement mirror PR [hermes-dflash-phone-cli][2/n] Preserve compute failures during cancellation OmarB97/lucebox-hub#9 for the current head.

Mirror follow-up for Luce-Org#324. Keep codex/visible-empty-dflash-retry-upstream available for the upstream PR.

OmarB97 · 2026-06-01T18:40:58Z

Reviewer notes (Hermes dispatch review)

Overall: Good scope and approach. The cooperative cancellation infrastructure, disconnect detection, and empty-visible-output retry all address real failure modes from the "sudden death" reports.

Concerns (non-blocking)

1. `qwen35_empty_visible_output` returns `false` for empty token vector (`qwen35_backend.cpp:43-49`)

The function returns `false` when `tokens.empty()`, meaning `empty_visible_output` is never set when zero tokens are generated. The broader retry condition handles this via `result.tokens.empty() || result.empty_visible_output`, so it is not broken in practice — but the flag is semantically wrong for the empty case. Consider returning `true` for empty tokens.

2. Potential data race on `DaemonIO::cancelled` (`daemon_loop.cpp:32-39`)

`should_cancel()` is `const` and writes to `mutable bool cancelled`. The `cancelled` field is also written from the generation thread (via `on_token` returning false) while `should_cancel()` itself may be called from the same thread or from the disconnect watcher thread. If two threads call `should_cancel()` concurrently after `cancelled=false`, both can observe the same value and both can write `cancelled = true` — classic data race on non-atomic bool. Consider making `cancelled` a `std::atomic` or adding a mutex.

3. `RequestDisconnectWatcher` performance under load (`http_server.cpp:84-124`)

Every request spawns a dedicated thread that polls the socket every 100ms. Under high concurrency (many concurrent SSE streams), this adds significant thread overhead. For short requests (count_tokens, small prompts), the thread may spin only once or twice before being joined. Consider a shared poller or edge-triggered epoll/kqueue approach for production deployments.

4. ``POLLRDHUP`` fallback to 0 on macOS (`http_server.cpp:27-29`)

On macOS (which lacks POLLRDHUP), the poll flags silently drop the clean-disconnect detection. The fallback to recv(MSG_PEEK) still works, but the semantics are different: POLLRDHUP fires immediately on FIN, while MSG_PEEK only detects it after the peer has actually sent 0 bytes. This means macOS has slightly slower disconnect detection. Consider documenting this or using kqueue-style detection on macOS.

5. Test: `test_tokenizer_encode_honors_cancel_callback` (`test_server_unit.cpp:2469-2485`)

The test uses an empty `Tokenizer()` with no vocabulary loaded. The callback always returns `true` (always cancelled). The test expects a `TokenizationCancelled` exception, but an empty tokenizer may return early with an empty vector before the cancellation check fires (depending on the fast path in `encode`). The test may pass because of the early check in `encode`, but it is fragile — it depends on the specific code path taken. Consider loading a minimal vocabulary or using a mock tokenizer.

Suggestions

The `visible_output_seen` gate on snapshot/continued/disk cache confirmation is a good change — it prevents caching responses where the client saw no visible output. However, ensure the thinking-only case (model emits only \``and`````` with no text after) is handled as intended: thinking tags DO set `visible_output_seen = true`, so a thinking-only response WILL still confirm the snapshot. This may or may not be desired.
The `classify_daemon_compute_result` priority (failure > cancel) is correct and well-documented.
The speculative decode cancellation points are well-placed — before draft compute, after verify, after restore_kv. This prevents wasted GPU time when the client has already disconnected.

OmarB97 · 2026-06-01T22:53:54Z

Addressed the Hermes dispatch-review follow-ups in 3f44fc8.

What changed:

qwen35_empty_visible_output() now treats an empty token vector as empty visible output, matching the common retry semantics.
DaemonIO::cancelled is now an atomic latch with copy/assignment support, and cancellation writes now use explicit atomic load/store.
Replaced per-request disconnect watcher threads with a shared disconnect poller for generation routes; /v1/messages/count_tokens avoids watcher registration on its short path.
Documented the macOS/BSD POLLRDHUP fallback behavior next to the poll flags.
Made test_tokenizer_encode_honors_cancel_callback cancel on the second poll so it exercises an inner tokenizer cancellation checkpoint rather than only the first guard.
Documented that thinking delimiters intentionally count as visible stream output for cache confirmation.

Verification:

Local hygiene: git diff --check passed.
Local syntax-only check passed for server/src/common/daemon_loop.cpp and server/src/server/http_server.cpp with repo include paths.
Clean taro scratch clone /tmp/lucebox-pr324-codex-reviewfix: configured CUDA explicitly with /usr/local/cuda, built test_server_unit, and ran it successfully: 1647 assertions, 0 failures.
GitHub after push: uv workspace and cubic checks are green; full build job is still running as of this comment.

fix: retry dflash generations with no visible output

34da50f

OmarB97 marked this pull request as ready for review May 31, 2026 15:50

cubic-dev-ai Bot reviewed May 31, 2026

View reviewed changes

fix: cancel disconnected dflash requests

f540b29

cubic-dev-ai Bot reviewed May 31, 2026

View reviewed changes

Comment thread server/src/common/dflash_spec_decode.cpp

fix: preserve compute failures during cancellation

47fd712

OmarB97 mentioned this pull request May 31, 2026

[hermes-dflash-phone-cli][2/n] Preserve compute failures during cancellation OmarB97/lucebox-hub#9

Merged

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026

docs: refresh auto-integration manifest

19a19e9

Record the post-push Luce-Org#324 head advance and conflict resolution after integrating Luce-Org#291/Luce-Org#290.

OmarB97 added a commit to OmarB97/lucebox-hub that referenced this pull request May 31, 2026

fix: preserve compute failures during cancellation

d9eb16b

Mirror follow-up for Luce-Org#324. Keep codex/visible-empty-dflash-retry-upstream available for the upstream PR.

fix: address dflash review follow-ups

3f44fc8

OmarB97 mentioned this pull request Jun 2, 2026

[codex] Recover dflash spec-decode agent stalls #315

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[codex] Retry dflash generations with no visible output#324

[codex] Retry dflash generations with no visible output#324
OmarB97 wants to merge 4 commits into
Luce-Org:mainfrom
OmarB97:codex/visible-empty-dflash-retry-upstream

OmarB97 commented May 31, 2026

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

OmarB97 commented May 31, 2026

Uh oh!

cubic-dev-ai Bot left a comment •

edited

Loading

Uh oh!

Uh oh!

OmarB97 commented May 31, 2026

Uh oh!

OmarB97 commented Jun 1, 2026

Uh oh!

OmarB97 commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

OmarB97 commented May 31, 2026

Summary

Validation

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

OmarB97 commented May 31, 2026

Uh oh!

cubic-dev-ai Bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

OmarB97 commented May 31, 2026

Uh oh!

OmarB97 commented Jun 1, 2026

Reviewer notes (Hermes dispatch review)

Concerns (non-blocking)

Suggestions

Uh oh!

OmarB97 commented Jun 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cubic-dev-ai Bot left a comment •

edited

Loading