fix: retry once on transient connection errors before failing #520

Open
Evrard-Nil wants to merge 5 commits into main from fix/retry-transient-connection-errors

Conversation

@Evrard-Nil

Summary

  • Add a single retry, after a 500ms delay, when all providers fail with connection errors or server errors (5xx)
  • Most models have only 1 provider (via model-proxy), so the existing provider fallback had nothing to fall back to
  • 4xx client errors are still not retried (unchanged behavior)
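
The behavior described in the summary can be sketched roughly as below. This is a minimal, blocking illustration and not the code in this PR: `AttemptError`, `call_with_retry`, and the closure-based provider call are hypothetical names, and the real pool is presumably async rather than using `std::thread::sleep`.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type for this sketch; the real pool's error enum may differ.
#[derive(Debug, Clone, PartialEq)]
enum AttemptError {
    /// TCP-level failure (refused, reset, timed out).
    Connection(String),
    /// HTTP error carrying the upstream status code.
    Http(u16),
}

impl AttemptError {
    /// Connection failures and 5xx responses count as transient; 4xx does not.
    fn is_retryable(&self) -> bool {
        match self {
            AttemptError::Connection(_) => true,
            AttemptError::Http(status) => (500..=599).contains(status),
        }
    }
}

/// One normal round plus a single retry round.
const MAX_ROUNDS: usize = 2;

/// Tries every provider in order. If the whole round fails and the last
/// error is retryable, waits `delay` and runs one more round.
fn call_with_retry<T, F>(
    providers: &[&str],
    delay: Duration,
    mut call: F,
) -> Result<T, AttemptError>
where
    F: FnMut(&str) -> Result<T, AttemptError>,
{
    let mut last_error = None;
    for round in 0..MAX_ROUNDS {
        for &provider in providers {
            match call(provider) {
                Ok(value) => return Ok(value),
                Err(err) => last_error = Some(err),
            }
        }
        // Stop after the final round, or as soon as the error is not transient.
        let retryable = last_error.as_ref().map_or(false, |e| e.is_retryable());
        if round + 1 == MAX_ROUNDS || !retryable {
            break;
        }
        sleep(delay);
    }
    Err(last_error
        .unwrap_or_else(|| AttemptError::Connection("no providers configured".into())))
}
```

With a single provider, a first call that fails with `AttemptError::Connection` followed by a successful second call returns `Ok` after two attempts, while a call that fails with `AttemptError::Http(429)` returns after one attempt with no delay.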

Root cause

Three things cause transient TCP connection failures: QEMU SLIRP's hardcoded listen backlog of 1, brief nginx reloads during config updates, and Docker bridge churn during container restarts. With only 1 provider per model, affected requests fail immediately with no recovery.

Top affected models (24h):

  • openai/gpt-oss-120b: ~6,000 errors
  • zai-org/GLM-5-FP8: ~2,600 errors
  • Qwen/Qwen3.5-122B-A10B: ~580 errors

Reproduction steps

# Send 20 concurrent requests to stress the model-proxy connection path
# QEMU SLIRP backlog=1 means only 1 pending connect() at a time
for i in $(seq 1 20); do
  curl -s --max-time 30 -X POST "https://cloud-api.near.ai/v1/chat/completions" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "zai-org/GLM-5-FP8",
      "messages": [{"role": "user", "content": "hi"}],
      "max_tokens": 5
    }' &
done
wait

# Check Datadog for "All providers failed" errors
# Query: service:cloud-api env:prod @level:ERROR "All providers failed"

See repro_connection_retry.sh (gitignored) for the full reproduction script.

Test plan

  • cargo check compiles cleanly
  • All 188 unit tests pass (cargo test --lib --bins)
  • Deploy to staging:
    • Verify retry log messages appear: "Retrying after transient connection failure"
    • Verify 4xx errors are NOT retried
    • Verify successful retries show round=2 in success log
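    • Monitor latency: the retry adds a 500ms delay, plus the time of the second attempt, only to requests whose first round failed (requests that succeed on the first round are unaffected)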

🤖 Generated with Claude Code

Most models route through a single provider (model-proxy), so provider
fallback alone doesn't help. Add a single retry with 500ms delay for
connection failures and 5xx errors.

This handles transient issues like QEMU SLIRP listen backlog=1,
brief nginx reloads, and Docker bridge churn that cause ~9.5k
"All providers failed" errors/day in prod.

4xx client errors are still not retried (unchanged behavior).
Copilot AI review requested due to automatic review settings March 31, 2026 03:42

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request implements a retry mechanism for transient connection errors in the inference provider pool by wrapping the provider selection logic in a retry loop. Feedback highlights that the calculation of total_attempts in the error logs is inaccurate, as it fails to account for early exits from the retry loop when encountering non-retryable errors.

@claude

claude bot commented Mar 31, 2026

Code Review

The retry logic is well-motivated. Two issues worth addressing before merge:

Issue 1 (functional): Failure counter double-counted across retry rounds

A provider that fails in both round 0 and round 1 has its failure counter incremented twice for a single client request. With MAX_CONSECUTIVE_FAILURES = 10, a provider reaches demotion after only 5 failing requests instead of 10.

For the common case (1 provider per model), each failed request under load now increments the counter twice, halving the effective demotion threshold. Fix: track already-counted provider keys in a local HashSet within try_with_providers, and only increment the failure counter once per provider per request regardless of retry rounds.
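
One way to read the suggested fix, as a minimal sketch: `FailureTracker` and `record_failure_once` are hypothetical names standing in for whatever health state the pool keeps, and the `counted` set is assumed to be created once per client request and shared across retry rounds.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for the pool's per-provider health state.
#[derive(Default)]
struct FailureTracker {
    consecutive_failures: HashMap<String, u32>,
}

impl FailureTracker {
    /// Records a failure for `provider_key` unless this request already
    /// counted one. `counted` lives for a single client request.
    fn record_failure_once(&mut self, counted: &mut HashSet<String>, provider_key: &str) {
        // `HashSet::insert` returns false when the key was already present,
        // so a second failing round for the same provider is ignored.
        if counted.insert(provider_key.to_string()) {
            *self
                .consecutive_failures
                .entry(provider_key.to_string())
                .or_insert(0) += 1;
        }
    }

    /// Current failure count for a provider (0 if never recorded).
    fn failures(&self, provider_key: &str) -> u32 {
        self.consecutive_failures
            .get(provider_key)
            .copied()
            .unwrap_or(0)
    }
}
```

Under this scheme a request that fails in both rounds moves the counter by one, so the demotion threshold keeps its original meaning of failing requests rather than failing attempts.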

Issue 2 (minor): total_attempts log value is incorrect when retry does not occur

The expression providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 }) always evaluates to providers.len() * MAX_ROUNDS since last_error is always Some at this point. If round 0 fails with a non-retryable error (e.g., 429) and we break early, total_attempts still logs providers.len() * 2 instead of the actual providers.len() * 1. Minor but misleading in error logs.
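
To make the arithmetic concrete, here is a small sketch that mirrors the quoted expression next to a plain counter. `MAX_ROUNDS = 2` is assumed from the two-round design, and both function names are hypothetical.

```rust
/// Assumed value: one normal round plus one retry round.
const MAX_ROUNDS: usize = 2;

/// Mirrors the expression quoted above. On the failure path `last_error`
/// is always `Some`, so the inner `if` always yields `MAX_ROUNDS` and the
/// `min` has no effect.
fn computed_total_attempts(provider_count: usize, last_error: Option<&str>) -> usize {
    provider_count * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 })
}

/// Alternative: increment a counter for every provider call actually made.
fn counted_total_attempts(provider_count: usize, rounds_run: usize) -> usize {
    let mut attempts = 0;
    for _round in 0..rounds_run {
        for _provider in 0..provider_count {
            attempts += 1;
        }
    }
    attempts
}
```

For one provider and a non-retryable failure in round 0, `computed_total_attempts(1, Some("429"))` returns 2, whereas `counted_total_attempts(1, 1)` returns 1, which matches what actually happened.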


No other critical issues. The is_retryable check using the last provider error is an acceptable simplification given most models have a single provider.

Warning: Issue 1 should be fixed before merge to avoid distorting provider health demotion tracking.

@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive

Copilot AI left a comment

Pull request overview

Adds a single “second round” retry (after a 500ms delay) to InferenceProviderPool::retry_with_fallback to mitigate transient connection/5xx failures when a model effectively has only one usable provider (e.g., via model-proxy), improving resilience before surfacing “All providers failed”.

Changes:

  • Introduce up to 2 retry rounds with a fixed 500ms delay between rounds.
  • Extend tracing fields/logs to include round and total_attempts when all providers fail.

- Narrow retry gate: only retry CompletionError::CompletionError when
  the message contains connection-related keywords (connection, timeout,
  reset, broken pipe). Non-transient errors like JSON parse failures
  are no longer retried.
- Track actual attempts with a counter instead of computing from
  MAX_ROUNDS (which was wrong when the loop broke early).
- Add 4 unit tests: connection error retries, non-connection error
  doesn't retry, 5xx retries, retry succeeds on second attempt.
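
The narrowed gate described in this commit message might look roughly like the following. This is a sketch under assumptions: the enum here is a simplified, hypothetical version of `CompletionError` (the `HttpError` variant and its `status_code` field are invented for illustration), and case-insensitive matching is a guess rather than something the commit states.

```rust
/// Keywords named in the commit message.
const TRANSIENT_KEYWORDS: [&str; 4] = ["connection", "timeout", "reset", "broken pipe"];

/// Hypothetical, simplified error enum; the real `CompletionError` may have
/// different variants and fields.
#[derive(Debug)]
enum CompletionError {
    /// Generic failure carrying a free-form message.
    CompletionError(String),
    /// HTTP failure carrying the upstream status code.
    HttpError { status_code: u16 },
}

/// Returns true only for errors that look transient.
fn is_retryable(err: &CompletionError) -> bool {
    match err {
        CompletionError::CompletionError(message) => {
            // Assumption: match keywords case-insensitively.
            let lower = message.to_lowercase();
            TRANSIENT_KEYWORDS.iter().any(|kw| lower.contains(*kw))
        }
        // 5xx is retried; 4xx (including 429) is not.
        CompletionError::HttpError { status_code } => (500..=599).contains(status_code),
    }
}
```

Substring matching on error messages is fragile if upstream wording changes, so a structured error kind would be a sturdier gate where the underlying client exposes one.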
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 19:56 — with GitHub Actions Inactive
@Evrard-Nil Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 21:49 — with GitHub Actions Inactive