[codex] Retry dflash generations with no visible output #324
Updated this PR with the dflash disconnect root-cause fix found during the 2026-05-31 Hermes hang investigation. Root cause: the server only observed client disconnects when SSE writes happened. If Hermes/client disconnected while dflash was still in pre-header prompt handling, prefill, or first-token work, the backend kept CPU/GPU work alive and the single worker stayed unavailable. Changes:
Verification on taro:
|
1 issue found across 15 files (changes from recent commits).
Merge advanced PR Luce-Org#324 head 47fd712 after the draft-residency refresh. Preserve the auto-integration Qwen35MoE gallocr cleanup path while adding Luce-Org#324's cancellation-aware one-token processing helper and empty-output retry updates.
Record the post-push Luce-Org#324 head advance and conflict resolution after integrating Luce-Org#291/Luce-Org#290.
|
Addressed the unresolved review issue from cubic about cancellation masking draft backend compute failures. What changed:
Verification:
Mirror tracking:
|
Mirror follow-up for Luce-Org#324. Keep codex/visible-empty-dflash-retry-upstream available for the upstream PR.
Reviewer notes (Hermes dispatch review)

Overall: Good scope and approach. The cooperative cancellation infrastructure, disconnect detection, and empty-visible-output retry all address real failure modes from the "sudden death" reports.

Concerns (non-blocking)

1. `qwen35_empty_visible_output` returns `false` for empty token vector (`qwen35_backend.cpp:43-49`)
   The function returns `false` when `tokens.empty()`, meaning `empty_visible_output` is never set when zero tokens are generated. The broader retry condition handles this via `result.tokens.empty() || result.empty_visible_output`, so it is not broken in practice, but the flag is semantically wrong for the empty case. Consider returning `true` for empty tokens.

2. Potential data race on `DaemonIO::cancelled` (`daemon_loop.cpp:32-39`)
   `should_cancel()` is `const` and writes to `mutable bool cancelled`. The `cancelled` field is also written from the generation thread (via `on_token` returning false) while `should_cancel()` itself may be called from the same thread or from the disconnect watcher thread. If two threads call `should_cancel()` concurrently after `cancelled=false`, both can observe the same value and both can write `cancelled = true`: a classic data race on a non-atomic bool. Consider making `cancelled` a `std::atomic<bool>` or adding a mutex.

3. `RequestDisconnectWatcher` performance under load (`http_server.cpp:84-124`)
   Every request spawns a dedicated thread that polls the socket every 100ms. Under high concurrency (many concurrent SSE streams), this adds significant thread overhead. For short requests (count_tokens, small prompts), the thread may spin only once or twice before being joined. Consider a shared poller or edge-triggered epoll/kqueue approach for production deployments.

4. `POLLRDHUP` fallback to 0 on macOS (`http_server.cpp:27-29`)
   On macOS (which lacks `POLLRDHUP`), the poll flags silently drop the clean-disconnect detection. The fallback to `recv(MSG_PEEK)` still works, but the semantics are different: `POLLRDHUP` fires immediately on FIN, while `MSG_PEEK` only detects it after the peer has actually sent 0 bytes. This means macOS has slightly slower disconnect detection. Consider documenting this or using kqueue-style detection on macOS.

5. Test: `test_tokenizer_encode_honors_cancel_callback` (`test_server_unit.cpp:2469-2485`)
   The test uses an empty `Tokenizer()` with no vocabulary loaded. The callback always returns `true` (always cancelled). The test expects a `TokenizationCancelled` exception, but an empty tokenizer may return early with an empty vector before the cancellation check fires (depending on the fast path in `encode`). The test may pass because of the early check in `encode`, but it is fragile because it depends on the specific code path taken. Consider loading a minimal vocabulary or using a mock tokenizer.

Suggestions
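For concern 2, one possible shape of the atomic-flag option is sketched below. This is a minimal illustration, not the code in `daemon_loop.cpp`: `CancelFlag` and `request_cancel()` are hypothetical names, and only `should_cancel()` and the `cancelled` field come from the notes above.

```cpp
#include <atomic>

// Hypothetical stand-in for the cancellation state in DaemonIO.
// Only the flag is modeled; the real struct is assumed to hold more.
struct CancelFlag {
    // std::atomic<bool> makes concurrent reads and writes well-defined,
    // so a watcher thread and the generation thread can share it.
    mutable std::atomic<bool> cancelled{false};

    // Assumed entry point for whoever detects the disconnect
    // (for example a watcher thread, or on_token returning false).
    void request_cancel() const noexcept {
        cancelled.store(true, std::memory_order_release);
    }

    // Read-only check; safe to call from any thread.
    bool should_cancel() const noexcept {
        return cancelled.load(std::memory_order_acquire);
    }
};
```

Keeping `should_cancel()` a pure read and moving the write into a separate call avoids a `const` method that mutates state; whether that split fits the existing callers is an open question for the PR author.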
|
|
Addressed the Hermes dispatch-review follow-ups in What changed:
Verification:
|
Summary
The cached dflash path could count an EOS/EOT-only spec-decode result as a successful completion. That produced no visible streamed or non-streamed text while still allowing prefix-cache confirmation/continuation to treat the decode as valid.
This adds an explicit `GenerateResult::empty_visible_output` signal, sets it for Qwen35 EOS/EOT-only spec-decode results, retries those results through AR decode in the common backend wrapper, and gates HTTP prefix-cache confirmation on visible emitted output rather than raw completion token count.

This upstream branch is based directly on `Luce-Org/main` and contains only the visible-output fix. The fork-side integration PR remains OmarB97#7.

Validation

- `test_server_unit` on `taro` with CUDA 13.3
- `/home/omar/ai/lucebox-worktrees/visible-empty-dflash-retry/server/build/test_server_unit` (1634 assertions, 0 failures)
- `dflash_server` to `/home/omar/ai/lucebox-hub/server/build/dflash_server` on `taro`
- `llama-swap:dflash` returned `OK` with visible content
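The retry flow described in the summary can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: the `GenerateResult` here is a cut-down stand-in, and `kEosToken`, `is_empty_visible`, `generate_with_retry`, and `may_confirm_prefix_cache` are hypothetical names. The helper treats an empty token vector as empty visible output, which follows the reviewer's first concern rather than the behavior the review describes.

```cpp
#include <functional>
#include <vector>

// Simplified stand-in for the PR's GenerateResult (assumed fields only).
struct GenerateResult {
    std::vector<int> tokens;
    bool empty_visible_output = false;
};

// Assumed end-of-sequence token id, for illustration only.
constexpr int kEosToken = 2;

// True when nothing would be shown to the client:
// either no tokens at all, or only EOS tokens.
inline bool is_empty_visible(const std::vector<int>& tokens) {
    for (int t : tokens) {
        if (t != kEosToken) return false;
    }
    return true;
}

using DecodeFn = std::function<GenerateResult()>;

// Try the speculative path first; fall back to autoregressive (AR)
// decode when the speculative result has no visible output.
inline GenerateResult generate_with_retry(const DecodeFn& spec_decode,
                                          const DecodeFn& ar_decode) {
    GenerateResult result = spec_decode();
    result.empty_visible_output = is_empty_visible(result.tokens);
    if (result.tokens.empty() || result.empty_visible_output) {
        result = ar_decode();
        result.empty_visible_output = is_empty_visible(result.tokens);
    }
    return result;
}

// Gate prefix-cache confirmation on visible output, not on token count.
inline bool may_confirm_prefix_cache(const GenerateResult& r) {
    return !r.empty_visible_output;
}
```

In this sketch an EOS-only speculative result triggers exactly one AR retry, and the cache gate consults the flag on whichever result is returned, so a retry that is also empty still blocks confirmation.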