fix: classify no-usage-stats log by stream completion state #519

Merged
Evrard-Nil merged 2 commits into main from
fix/no-usage-stats-error-classification
Mar 31, 2026

Conversation

@Evrard-Nil
Contributor

Summary

  • Downgrade "Stream ended but no usage stats" from ERROR to WARN when stream was interrupted (client disconnect or provider error)
  • Keep ERROR only for streams that completed normally (stream_completed = true) but still lack usage — indicating a real provider bug
  • Reduces ~12,500 false-positive ERROR logs/day in prod (the #1 error by volume)

Root cause

External providers (OpenAI, Anthropic) only send usage stats in the final SSE chunk. When a client disconnects mid-stream (closes browser, network timeout), the InterceptStream is dropped before receiving the final chunk. The existing code logged ERROR for all cases without checking stream_completed.
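
A minimal sketch of that drop path, assuming a synchronous iterator in place of the real async stream. `InterceptSketch`, `Chunk`, `Level`, and the in-memory `LogSink` are hypothetical names used only for illustration; `stream_completed`, `last_error`, and the `stream_error` field are the only names taken from this PR, and the real code logs through `tracing` rather than a vector.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Severity chosen when the wrapper is dropped without usage stats.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Warn,
    Error,
}

/// One item from the simulated provider stream (payloads omitted).
pub enum Chunk {
    Content,
    /// Final chunk: the only one that carries usage.
    Usage { total_tokens: u32 },
}

/// Stand-in for `tracing`: collects (level, stream_error) pairs.
pub type LogSink = Rc<RefCell<Vec<(Level, bool)>>>;

/// Hypothetical, synchronous stand-in for the real `InterceptStream`.
pub struct InterceptSketch<I: Iterator<Item = Result<Chunk, String>>> {
    inner: I,
    usage: Option<u32>,
    last_error: Option<String>,
    stream_completed: bool,
    log: LogSink,
}

impl<I: Iterator<Item = Result<Chunk, String>>> InterceptSketch<I> {
    pub fn new(inner: I, log: LogSink) -> Self {
        Self { inner, usage: None, last_error: None, stream_completed: false, log }
    }

    /// Pull one item, mirroring what `poll_next` would observe.
    pub fn next_chunk(&mut self) -> Option<Result<Chunk, String>> {
        let item = self.inner.next();
        match &item {
            Some(Ok(Chunk::Usage { total_tokens })) => self.usage = Some(*total_tokens),
            Some(Ok(Chunk::Content)) => {}
            Some(Err(e)) => self.last_error = Some(e.clone()),
            // Only a clean end of stream marks completion.
            None => self.stream_completed = true,
        }
        item
    }
}

impl<I: Iterator<Item = Result<Chunk, String>>> Drop for InterceptSketch<I> {
    fn drop(&mut self) {
        if self.usage.is_some() {
            return; // usage was received; nothing to classify
        }
        // Interrupted streams are expected to lack usage: WARN.
        // A completed stream without usage points at the provider: ERROR.
        let level = if self.stream_completed { Level::Error } else { Level::Warn };
        self.log.borrow_mut().push((level, self.last_error.is_some()));
    }
}
```

Dropping the wrapper after a single `next_chunk()` call models a client disconnect: the usage chunk is never pulled, `stream_completed` stays `false`, and the entry is recorded at WARN with `stream_error = false`.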

Reproduction steps

# 1. Send a streaming request but disconnect after 1 second (simulates browser close)
timeout 1 curl -s -X POST "https://cloud-api.near.ai/v1/chat/completions" \
  -H "Authorization: Bearer $API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-5.2",
    "messages": [{"role": "user", "content": "Write a long essay about mathematics"}],
    "max_tokens": 500,
    "stream": true
  }'

# 2. Check Datadog: service:cloud-api env:prod @level:ERROR "Stream ended but no usage"
#    Before fix: ERROR log
#    After fix: WARN log with stream_error=false (client disconnect)

See repro_no_usage_stats.sh (gitignored) for the full reproduction script.

Test plan

  • cargo check compiles cleanly
  • All 188 unit tests pass (cargo test --lib --bins)
  • Deploy to staging and verify:
    • Client disconnects produce WARN (not ERROR) in Datadog
    • Completed streams with no usage still produce ERROR
    • Normal completions still record usage correctly

🤖 Generated with Claude Code

Client disconnects and provider errors are expected to lack usage
stats since external providers (OpenAI, Anthropic) only send usage
in the final SSE chunk. These now log at WARN level instead of ERROR.

Only truly completed streams that lack usage stats remain ERROR,
which indicates an actual provider bug worth investigating.

Reduces ~12k false-positive ERROR logs/day in prod.
Copilot AI review requested due to automatic review settings March 31, 2026 03:39
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:39 — with GitHub Actions Inactive
@claude

claude bot commented Mar 31, 2026

Review

The fix is correct and well-scoped. The stream_completed flag is set only on clean stream termination (Poll::Ready(None)), and last_error is set on provider-side errors — so the conditional correctly separates expected interruptions (client disconnect, provider error mid-stream) from genuine provider bugs (stream fully completed but no usage stats emitted).

One minor observation:

The (Some(usage), None) arm at line ~194 still logs tracing::error!(... "Stream ended but no chat_id available") unconditionally, without checking stream_completed. If a client disconnects after usage stats are received but before chat_id is set, this would still produce a false-positive ERROR. Consider applying the same pattern there for consistency:

(Some(usage), None) => {
    if !self.stream_completed {
        tracing::warn!(..., "Stream interrupted after receiving usage stats but before chat_id");
    } else {
        tracing::error!(..., "Stream completed but no chat_id available");
    }
    return;
}

This is minor and may be less common in practice, but worth considering for full consistency.
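
The decision table implied by this suggestion, combined with the no-usage case the PR already handles, can be sketched as a pure function. This is an illustration under stated assumptions, not the code in `crates/services/src/completions/mod.rs`: `classify_drop` and `DropOutcome` are hypothetical names, usage is reduced to a token count, the WARN message for the no-usage case is invented, and attaching `stream_error` to the chat_id WARN is an assumption.

```rust
/// What the drop handler would do (hypothetical type for illustration).
#[derive(Debug, PartialEq, Eq)]
pub enum DropOutcome {
    /// Usage and chat_id are both present: record usage as normal.
    RecordUsage,
    /// Expected interruption; `stream_error` separates provider errors
    /// (true) from client disconnects (false).
    Warn { message: &'static str, stream_error: bool },
    /// Stream completed cleanly but data is missing.
    Error { message: &'static str },
}

/// Classify the state seen at drop time.
/// `usage` is simplified to a token count; `last_error` to a message.
pub fn classify_drop(
    usage: Option<u32>,
    chat_id: Option<&str>,
    stream_completed: bool,
    last_error: Option<&str>,
) -> DropOutcome {
    let stream_error = last_error.is_some();
    match (usage, chat_id) {
        (Some(_), Some(_)) => DropOutcome::RecordUsage,
        (Some(_), None) => {
            if stream_completed {
                DropOutcome::Error { message: "Stream completed but no chat_id available" }
            } else {
                DropOutcome::Warn {
                    message: "Stream interrupted after receiving usage stats but before chat_id",
                    stream_error,
                }
            }
        }
        (None, _) => {
            if stream_completed {
                DropOutcome::Error { message: "Stream ended but no usage stats" }
            } else {
                DropOutcome::Warn {
                    // Assumed wording; the PR does not quote the WARN message.
                    message: "Stream interrupted before usage stats",
                    stream_error,
                }
            }
        }
    }
}
```

Keeping the classification free of I/O makes each of the four outcomes easy to cover in a unit test without constructing a stream.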

Otherwise the change is clean — good use of structured logging with stream_error = self.last_error.is_some() to distinguish clean disconnects from provider errors within WARN logs.

✅ Approved (minor suggestion above is non-blocking)

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request updates the .gitignore file to exclude reproduction scripts and refines the logging logic within the completions service. The changes distinguish between expected stream interruptions, such as client disconnects or provider errors, and cases where a stream completes normally but lacks required usage statistics or chat IDs. Interruptions are now logged as warnings instead of errors to reduce noise. I have no feedback to provide as there were no review comments.

Contributor

Copilot AI left a comment

Pull request overview

Adjusts streaming drop logging to reduce false-positive ERROR volume by classifying “no usage stats” cases based on whether the stream completed normally, aligning operational alerting with expected client disconnect/provider interruption behavior.

Changes:

  • Downgrade “no usage stats” logs to WARN when stream_completed = false; keep ERROR only when the stream completed but usage is still missing.
  • Add additional structured field (stream_error) to WARN logs to distinguish provider error vs. client disconnect.
  • Ignore local reproduction shell scripts via .gitignore.

Reviewed changes

Copilot reviewed 1 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| crates/services/src/completions/mod.rs | Classifies missing-usage logs by stream_completed and logs WARN vs. ERROR accordingly. |
| .gitignore | Adds an ignore pattern for repro shell scripts. |

@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive
@PierreLeGuen PierreLeGuen self-requested a review March 31, 2026 04:17
@Evrard-Nil Evrard-Nil merged commit 43b8b0d into main Mar 31, 2026
2 checks passed

3 participants