Skip to content

[codex] Retry dflash generations with no visible output#324

Open
OmarB97 wants to merge 4 commits into
Luce-Org:mainfrom
OmarB97:codex/visible-empty-dflash-retry-upstream
Open

[codex] Retry dflash generations with no visible output#324
OmarB97 wants to merge 4 commits into
Luce-Org:mainfrom
OmarB97:codex/visible-empty-dflash-retry-upstream

Conversation

@OmarB97
Copy link
Copy Markdown
Contributor

@OmarB97 OmarB97 commented May 31, 2026

Summary

The cached dflash path could count an EOS/EOT-only spec-decode result as a successful completion. That produced no visible streamed or non-streamed text while still allowing prefix-cache confirmation/continuation to treat the decode as valid.

This adds an explicit GenerateResult::empty_visible_output signal, sets it for Qwen35 EOS/EOT-only spec-decode results, retries those results through AR decode in the common backend wrapper, and gates HTTP prefix-cache confirmation on visible emitted output rather than raw completion token count.

This upstream branch is based directly on Luce-Org/main and contains only the visible-output fix. The fork-side integration PR remains OmarB97#7.

Validation

  • Built test_server_unit on taro with CUDA 13.3
  • /home/omar/ai/lucebox-worktrees/visible-empty-dflash-retry/server/build/test_server_unit (1634 assertions, 0 failures)
  • Built and installed patched dflash_server to /home/omar/ai/lucebox-hub/server/build/dflash_server on taro
  • Direct live smoke through llama-swap: dflash returned OK with visible content

@OmarB97 OmarB97 marked this pull request as ready for review May 31, 2026 15:50
Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 4 files

Re-trigger cubic

@OmarB97
Copy link
Copy Markdown
Contributor Author

OmarB97 commented May 31, 2026

Updated this PR with the dflash disconnect root-cause fix found during the 2026-05-31 Hermes hang investigation.

Root cause: the server only observed client disconnects when SSE writes happened. If Hermes/client disconnected while dflash was still in pre-header prompt handling, prefill, or first-token work, the backend kept CPU/GPU work alive and the single worker stayed unavailable.

Changes:

  • added request-level socket watching before render/tokenize/enqueue
  • made tokenizer encode cooperatively cancellable
  • propagated DaemonIO cancellation through backend prefill/spec-decode/AR paths
  • fixed the empty-visible retry test to call the retry wrapper

Verification on taro:

  • /tmp/lucebox-dflash-cancel-test: cmake --build server/build --target test_server_unit -j 8 && ./test_server_unit
  • result: 1641 assertions, 0 failures
  • /tmp/lucebox-dflash-cancel-test: cmake --build server/build --target dflash_server -j 8

Copy link
Copy Markdown
Contributor

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 15 files (changes from recent commits).

Reply with feedback, questions, or to request a fix.

Re-trigger cubic

Comment thread server/src/common/dflash_spec_decode.cpp
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Merge advanced PR Luce-Org#324 head 47fd712 after the draft-residency refresh. Preserve the auto-integration Qwen35MoE gallocr cleanup path while adding Luce-Org#324's cancellation-aware one-token processing helper and empty-output retry updates.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 31, 2026
Record the post-push Luce-Org#324 head advance and conflict resolution after integrating Luce-Org#291/Luce-Org#290.
@OmarB97
Copy link
Copy Markdown
Contributor Author

OmarB97 commented May 31, 2026

Addressed the unresolved review issue from cubic about cancellation masking draft backend compute failures.

What changed:

  • Added a shared completed-compute classifier so ggml_backend_graph_compute() failures win over pending cancellation.
  • Applied it to the common DFlash spec-decode path and sibling qwen35/qwen35moe cancellation-after-compute sites.
  • Added unit coverage for failure-vs-cancel ordering.

Verification:

  • Clean taro clone /tmp/lucebox-pr324-review-fix: CMake configure/build for test_server_unit succeeded.
  • server/build/test_server_unit passed: 1647 assertions, 0 failures.

Mirror tracking:

OmarB97 added a commit to OmarB97/lucebox-hub that referenced this pull request May 31, 2026
Mirror follow-up for Luce-Org#324. Keep codex/visible-empty-dflash-retry-upstream available for the upstream PR.
@OmarB97
Copy link
Copy Markdown
Contributor Author

OmarB97 commented Jun 1, 2026

Reviewer notes (Hermes dispatch review)

Overall: Good scope and approach. The cooperative cancellation infrastructure, disconnect detection, and empty-visible-output retry all address real failure modes from the "sudden death" reports.

Concerns (non-blocking)

1. `qwen35_empty_visible_output` returns `false` for empty token vector (`qwen35_backend.cpp:43-49`)

The function returns `false` when `tokens.empty()`, meaning `empty_visible_output` is never set when zero tokens are generated. The broader retry condition handles this via `result.tokens.empty() || result.empty_visible_output`, so it is not broken in practice — but the flag is semantically wrong for the empty case. Consider returning `true` for empty tokens.

2. Potential data race on `DaemonIO::cancelled` (`daemon_loop.cpp:32-39`)

`should_cancel()` is `const` and writes to `mutable bool cancelled`. The `cancelled` field is also written from the generation thread (via `on_token` returning false) while `should_cancel()` itself may be called from the same thread or from the disconnect watcher thread. If two threads call `should_cancel()` concurrently after `cancelled=false`, both can observe the same value and both can write `cancelled = true` — classic data race on non-atomic bool. Consider making `cancelled` a `std::atomic` or adding a mutex.

3. `RequestDisconnectWatcher` performance under load (`http_server.cpp:84-124`)

Every request spawns a dedicated thread that polls the socket every 100ms. Under high concurrency (many concurrent SSE streams), this adds significant thread overhead. For short requests (count_tokens, small prompts), the thread may spin only once or twice before being joined. Consider a shared poller or edge-triggered epoll/kqueue approach for production deployments.

4. ``POLLRDHUP`` fallback to 0 on macOS (`http_server.cpp:27-29`)

On macOS (which lacks POLLRDHUP), the poll flags silently drop the clean-disconnect detection. The fallback to recv(MSG_PEEK) still works, but the semantics are different: POLLRDHUP fires immediately on FIN, while MSG_PEEK only detects it after the peer has actually sent 0 bytes. This means macOS has slightly slower disconnect detection. Consider documenting this or using kqueue-style detection on macOS.

5. Test: `test_tokenizer_encode_honors_cancel_callback` (`test_server_unit.cpp:2469-2485`)

The test uses an empty `Tokenizer()` with no vocabulary loaded. The callback always returns `true` (always cancelled). The test expects a `TokenizationCancelled` exception, but an empty tokenizer may return early with an empty vector before the cancellation check fires (depending on the fast path in `encode`). The test may pass because of the early check in `encode`, but it is fragile — it depends on the specific code path taken. Consider loading a minimal vocabulary or using a mock tokenizer.

Suggestions

  • The `visible_output_seen` gate on snapshot/continued/disk cache confirmation is a good change — it prevents caching responses where the client saw no visible output. However, ensure the thinking-only case (model emits only \``and`````` with no text after) is handled as intended: thinking tags DO set `visible_output_seen = true`, so a thinking-only response WILL still confirm the snapshot. This may or may not be desired.
  • The `classify_daemon_compute_result` priority (failure > cancel) is correct and well-documented.
  • The speculative decode cancellation points are well-placed — before draft compute, after verify, after restore_kv. This prevents wasted GPU time when the client has already disconnected.

@OmarB97
Copy link
Copy Markdown
Contributor Author

OmarB97 commented Jun 1, 2026

Addressed the Hermes dispatch-review follow-ups in 3f44fc8.

What changed:

  • qwen35_empty_visible_output() now treats an empty token vector as empty visible output, matching the common retry semantics.
  • DaemonIO::cancelled is now an atomic latch with copy/assignment support, and cancellation writes now use explicit atomic load/store.
  • Replaced per-request disconnect watcher threads with a shared disconnect poller for generation routes; /v1/messages/count_tokens avoids watcher registration on its short path.
  • Documented the macOS/BSD POLLRDHUP fallback behavior next to the poll flags.
  • Made test_tokenizer_encode_honors_cancel_callback cancel on the second poll so it exercises an inner tokenizer cancellation checkpoint rather than only the first guard.
  • Documented that thinking delimiters intentionally count as visible stream output for cache confirmation.

Verification:

  • Local hygiene: git diff --check passed.
  • Local syntax-only check passed for server/src/common/daemon_loop.cpp and server/src/server/http_server.cpp with repo include paths.
  • Clean taro scratch clone /tmp/lucebox-pr324-codex-reviewfix: configured CUDA explicitly with /usr/local/cuda, built test_server_unit, and ran it successfully: 1647 assertions, 0 failures.
  • GitHub after push: uv workspace and cubic checks are green; full build job is still running as of this comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant