Add real-time /status dashboard with SSE push #322
Open
howard0su wants to merge 7 commits into
Add a real-time server status dashboard accessible at GET /status:
- Serves standalone HTML from server/share/status.html (editable without recompile)
- GET /status/events SSE endpoint pushes live JSON updates per spec-decode step
- GET /status/json provides a snapshot for non-SSE clients

Status tracking (server_status.h):
- Current phase (idle/prefill/decode) with prompt excerpt and token counts
- Draft tokens being verified (updated each spec-decode step)
- Performance history (last 50 requests): prefill tok/s, decode tok/s, accept rate
- RAII StatusGuard ensures status resets to idle on all exit paths

Backend instrumentation (InferenceObserver on DaemonIO):
- Observer callback in model_backend.h, called at each draft/verify step
- Instrumented in qwen35_backend.cpp and generic dflash_spec_decode.cpp
- Zero overhead when no SSE clients are connected (empty std::function check)

Dashboard features (status.html):
- Dark-themed responsive UI with phase badges and live counters
- Draft token display updated per spec-decode step
- SVG-based performance charts (prefill tok/s, decode tok/s, accept rate)
- Auto-reconnecting EventSource with connection status indicator
- No external CDN dependencies

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
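The RAII guard and the empty-observer check described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `Phase`, `ServerStatus`, the observer signature, and `notify_step` are hypothetical stand-ins, and the real `server_status.h` and `model_backend.h` may be shaped differently.

```cpp
#include <functional>
#include <vector>

// Hypothetical stand-ins for the types in server_status.h.
enum class Phase { Idle, Prefill, Decode };

struct ServerStatus {
    Phase phase = Phase::Idle;
};

// RAII guard: the destructor runs on every exit path (normal return,
// early return, exception), so the status always falls back to Idle.
class StatusGuard {
public:
    StatusGuard(ServerStatus & status, Phase phase) : status_(status) {
        status_.phase = phase;
    }
    ~StatusGuard() { status_.phase = Phase::Idle; }

    StatusGuard(const StatusGuard &) = delete;
    StatusGuard & operator=(const StatusGuard &) = delete;

private:
    ServerStatus & status_;
};

// Hypothetical observer signature: receives the draft tokens of one step.
using InferenceObserver = std::function<void(const std::vector<int> &)>;

// Calls the observer only if one is installed. An empty std::function
// converts to false, so the disabled path costs a single branch.
// Returns true if the observer was invoked.
bool notify_step(const InferenceObserver & observer,
                 const std::vector<int> & draft_tokens) {
    if (!observer) {
        return false;
    }
    observer(draft_tokens);
    return true;
}
```

With this shape, a request handler would construct a `StatusGuard` at the top of its scope and never reset the phase by hand.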
- Add CMake POST_BUILD rule to copy share/status.html into build/share/
- Add exe_dir/share/status.html as a search path (build dir layout)
- Keeps existing ../share/ and ./share/ fallbacks

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Add request params to status: model, format, temperature, top_p/k, max_output, thinking_enabled, session_id, cache/pflash/spec_decode flags
- Add incremental 'event: token' SSE events (browser accumulates output)
- Add messages JSON to status event (sent once per request)
- Redesigned HTML: two-column request/response view, params grid, feature tags (cache hit, pflash, spec decode, stream, thinking), live tok/s
- All state accumulated client-side; server stays stateless for output text

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
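For readers unfamiliar with the wire format behind the 'event: token' events, the sketch below builds one Server-Sent Events frame. It follows the standard SSE text format (an optional `event:` line, one `data:` line per payload line, and a blank line as terminator). The helper name `sse_frame` is hypothetical and is not taken from the PR.

```cpp
#include <cstddef>
#include <string>

// Builds a single SSE frame. A payload containing newlines is split into
// several "data:" lines, which the browser's EventSource joins back with
// '\n'. The trailing blank line marks the end of the event.
std::string sse_frame(const std::string & event, const std::string & data) {
    std::string out;
    if (!event.empty()) {
        out += "event: " + event + "\n";
    }
    std::size_t start = 0;
    while (true) {
        const std::size_t nl = data.find('\n', start);
        out += "data: ";
        out.append(data, start,
                   nl == std::string::npos ? std::string::npos : nl - start);
        out += "\n";
        if (nl == std::string::npos) {
            break;
        }
        start = nl + 1;
    }
    out += "\n";
    return out;
}
```

On the browser side, an `EventSource` listener registered for the `token` event type would receive each payload and append it to the output, which matches the client-side accumulation described above.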
Token text can contain partial UTF-8 sequences (tokens split multi-byte codepoints). Use json::error_handler_t::replace in all dump() calls on status paths so invalid bytes become U+FFFD instead of throwing type_error 316. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
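To illustrate the failure mode described here (not the fix the commit applies), the sketch below detects a multi-byte UTF-8 sequence that is cut off at the end of a token's text. The helper `incomplete_utf8_tail` is hypothetical. The commit itself relies on the JSON library's replace error handler; a detector like this would only matter for a different approach, such as holding bytes back until the codepoint is complete.

```cpp
#include <cstddef>
#include <string>

// Returns how many trailing bytes of `s` belong to a multi-byte UTF-8
// sequence whose continuation bytes have not arrived yet (0 if the tail
// is complete). Only the last three bytes need inspecting, because the
// longest UTF-8 sequence is four bytes.
std::size_t incomplete_utf8_tail(const std::string & s) {
    const std::size_t n = s.size();
    for (std::size_t back = 1; back <= 3 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue;  // continuation byte: keep scanning for the lead byte
        }
        // Expected sequence length implied by the lead byte.
        const std::size_t need = (c & 0xE0) == 0xC0 ? 2
                               : (c & 0xF0) == 0xE0 ? 3
                               : (c & 0xF8) == 0xF0 ? 4
                                                    : 1;
        return need > back ? back : 0;
    }
    return 0;
}
```

For example, a token ending in the single byte `0xC3` (the first half of "é") reports one pending byte, which is exactly the kind of input that makes a strict JSON serializer reject the string.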
Contributor
8 issues found across 8 files
easel pushed a commit to easel/lucebox-hub that referenced this pull request on May 31, 2026:
Record the post-push discovery and merge of PR Luce-Org#322, the conflict resolution with the MTP hook, and updated validation/classification.
- Add /status/json to kApiEndpoints registry
- Replace raw ::send() with sse_try_send() helper that handles partial writes via poll loop with a short 1s timeout (avoids stalling worker)
- Add sse_heartbeat() to prune disconnected SSE clients during idle periods (worker dequeue uses timed wait, sends heartbeat every 30s)
- Use $<TARGET_FILE_DIR:dflash_server> in CMake POST_BUILD copy rule for correct output path with multi-config generators
- Add install(FILES) rule for status.html
- Clear messages panel when a new request starts in the browser
- Use incremental DOM append (createTextNode) for token events instead of re-rendering full output text on each token

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a prefix cache hit occurs, the backend only prefills the delta tokens beyond the cached prefix. The previous calculation divided the full prompt token count by the delta prefill time, giving either 0 (full cache hit, no delta) or a wildly inflated number (partial hit).

Now uses the actual number of tokens that were prefilled:
- Full cache hit: 0 tok/s (correct — no prefill work done)
- Partial cache hit: delta_tokens / prefill_time
- No cache hit: effective_prompt.size() / prefill_time

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
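The three cases above reduce to one rule: divide the number of tokens actually prefilled by the prefill time. A minimal sketch, assuming the cached prefix length is known at this point; the function name and parameters are hypothetical and not taken from the PR:

```cpp
#include <cstddef>

// Prefill throughput in tokens per second, counting only the tokens that
// were prefilled beyond the cached prefix.
double prefill_tokens_per_sec(std::size_t prompt_tokens,
                              std::size_t cached_prefix_tokens,
                              double prefill_seconds) {
    // Delta tokens: the part of the prompt not covered by the cache.
    const std::size_t delta = prompt_tokens > cached_prefix_tokens
                                  ? prompt_tokens - cached_prefix_tokens
                                  : 0;
    // Full cache hit (no work done) or no measurable time: report 0.
    if (delta == 0 || prefill_seconds <= 0.0) {
        return 0.0;
    }
    return static_cast<double>(delta) / prefill_seconds;
}
```

As a worked case, a 1000-token prompt with an 800-token cached prefix that prefills in 0.5 s gives 200 / 0.5 = 400 tok/s, whereas dividing the full prompt by the same time would have reported 2000 tok/s.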
Contributor
1 issue found across 4 files (changes from recent commits).
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026:
Record the 2026-06-01 06:09 auto-integration refresh, including exact integration of current PR Luce-Org#322 and Luce-Org#294 heads, fresh direct-merge probes for remaining selective-port candidates, and validation results.
Replace blocking sse_try_send() (1s timeout per client) with MSG_DONTWAIT send in sse_heartbeat(). The 12-byte heartbeat ping will succeed instantly for any healthy client; slow clients with full buffers are pruned immediately instead of stalling the worker thread. This eliminates up to N×1s latency on idle-to-active transitions when slow SSE clients are connected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026:
Merge advanced status dashboard head 4b40aa1 after post-push enumeration detected the PR moved during the cron run.
easel pushed a commit to easel/lucebox-hub that referenced this pull request on Jun 1, 2026:
Record post-push detection and integration of the advanced PR Luce-Org#322 head plus validation for the refreshed stack.
Contributor
1 issue found across 1 file (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="server/src/server/http_server.cpp">
<violation number="1" location="server/src/server/http_server.cpp:591">
P1: Heartbeat non-blocking send does not handle partial writes, risking SSE stream corruption</violation>
</file>
    for (int fd : sse_fds_) {
        // Non-blocking send: if the socket buffer can't accept 12 bytes
        // immediately, the client is too far behind — treat as dead.
        ssize_t n = ::send(fd, ping, sizeof(ping) - 1, MSG_NOSIGNAL | MSG_DONTWAIT);
Contributor
P1: Heartbeat non-blocking send does not handle partial writes, risking SSE stream corruption
Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At server/src/server/http_server.cpp, line 591:
<comment>Heartbeat non-blocking send does not handle partial writes, risking SSE stream corruption</comment>
<file context>
@@ -580,12 +580,16 @@ void HttpServer::broadcast_token(const std::string & text) {
- if (!sse_try_send(fd, ping, sizeof(ping) - 1)) {
+ // Non-blocking send: if the socket buffer can't accept 12 bytes
+ // immediately, the client is too far behind — treat as dead.
+ ssize_t n = ::send(fd, ping, sizeof(ping) - 1, MSG_NOSIGNAL | MSG_DONTWAIT);
+ if (n <= 0) {
dead.push_back(fd);
</file context>
Summary
Adds a live server status page at GET /status that shows the current inference state, request details, generated output, and performance history — all pushed in real time via Server-Sent Events.
What it does
Dashboard shows