fix: retry once on transient connection errors before failing by Evrard-Nil · Pull Request #520 · nearai/cloud-api

Evrard-Nil · 2026-03-31T03:42:46Z

Summary

Add a single retry with 500ms delay when all providers fail with connection or server errors (5xx)
Most models have only 1 provider (via model-proxy), so the existing provider fallback was ineffective
4xx client errors are still not retried (unchanged behavior)
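
The behavior summarized above can be sketched as a small round-based loop. This is a minimal illustration, not the code from the PR: `ProviderError`, `try_with_rounds`, and the closure signature are hypothetical names, the sketch is synchronous (`std::thread::sleep`) rather than async, and it assumes every provider is tried once per round.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type for illustration; not the crate's real error enum.
#[derive(Debug, Clone, PartialEq)]
pub enum ProviderError {
    /// TCP-level failure (refused, reset, timed out).
    Connection(String),
    /// The provider answered with an HTTP error status.
    Http(u16),
}

impl ProviderError {
    /// Connection failures and 5xx responses are retryable; 4xx is not.
    pub fn is_retryable(&self) -> bool {
        match self {
            ProviderError::Connection(_) => true,
            ProviderError::Http(status) => (500..=599).contains(status),
        }
    }
}

/// One initial round plus a single retry round.
pub const MAX_ROUNDS: usize = 2;

/// Tries every provider once per round. Returns the value and the zero-based
/// round that succeeded, or the last error seen (None if there were no providers).
pub fn try_with_rounds<T, F>(
    provider_count: usize,
    delay: Duration,
    mut call: F,
) -> Result<(T, usize), Option<ProviderError>>
where
    F: FnMut(usize, usize) -> Result<T, ProviderError>,
{
    let mut last_error = None;
    for round in 0..MAX_ROUNDS {
        for provider in 0..provider_count {
            match call(round, provider) {
                Ok(value) => return Ok((value, round)),
                Err(err) => last_error = Some(err),
            }
        }
        // Only the last provider error decides whether another round runs.
        let retryable = last_error.as_ref().map_or(false, ProviderError::is_retryable);
        if !retryable || round + 1 == MAX_ROUNDS {
            break;
        }
        sleep(delay);
    }
    Err(last_error)
}
```

With a single provider and a 500ms `delay`, a request that fails once with a connection error costs one extra call and one sleep, while a 4xx response returns after the first round without sleeping.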

Root cause

Three conditions cause transient TCP connection failures: QEMU SLIRP's hardcoded listen backlog of 1, brief nginx reloads during config updates, and Docker bridge churn during container restarts. With only 1 provider per model, affected requests fail immediately with no recovery.

Top affected models (24h):

openai/gpt-oss-120b: ~6,000 errors
zai-org/GLM-5-FP8: ~2,600 errors
Qwen/Qwen3.5-122B-A10B: ~580 errors

Reproduction steps

# Send 20 concurrent requests to stress the model-proxy connection path
# QEMU SLIRP backlog=1 means only 1 pending connect() at a time
for i in $(seq 1 20); do
  curl -s --max-time 30 -X POST "https://cloud-api.near.ai/v1/chat/completions" \
    -H "Authorization: Bearer $API_KEY" \
    -H "Content-Type: application/json" \
    -d '{
      "model": "zai-org/GLM-5-FP8",
      "messages": [{"role": "user", "content": "hi"}],
      "max_tokens": 5
    }' &
done
wait

# Check Datadog for "All providers failed" errors
# Query: service:cloud-api env:prod @level:ERROR "All providers failed"

See repro_connection_retry.sh (gitignored) for the full reproduction script.

Test plan

cargo check compiles cleanly
All 188 unit tests pass (cargo test --lib --bins)
Deploy to staging:
- Verify retry log messages appear: "Retrying after transient connection failure"
- Verify 4xx errors are NOT retried
- Verify successful retries show round=2 in success log
- Monitor latency: retry adds max 500ms to failed requests (not to successful ones)

🤖 Generated with Claude Code

Most models route through a single provider (model-proxy), so provider fallback alone doesn't help. Add a single retry with 500ms delay for connection failures and 5xx errors. This handles transient issues like QEMU SLIRP listen backlog=1, brief nginx reloads, and Docker bridge churn that cause ~9.5k "All providers failed" errors/day in prod. 4xx client errors are still not retried (unchanged behavior).

gemini-code-assist

Code Review

This pull request implements a retry mechanism for transient connection errors in the inference provider pool by wrapping the provider selection logic in a retry loop. Feedback highlights that the calculation of total_attempts in the error logs is inaccurate, as it fails to account for early exits from the retry loop when encountering non-retryable errors.

crates/services/src/inference_provider_pool/mod.rs

claude · 2026-03-31T03:46:12Z

Code Review

The retry logic is well-motivated. Two issues worth addressing before merge:

Issue 1 (functional): Failure counter double-counted across retry rounds

A provider that fails in both round 0 and round 1 has its failure counter incremented twice for a single client request. With MAX_CONSECUTIVE_FAILURES = 10, a provider reaches demotion after only 5 failing requests instead of 10.

For the common case (1 provider per model), each failed request under load now increments the counter twice, halving the effective demotion threshold. Fix: track already-counted provider keys in a local HashSet within try_with_providers, and only increment the failure counter once per provider per request regardless of retry rounds.
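
The suggested fix might look like the following sketch. `FailureTracker` and `fail_request` are hypothetical stand-ins for the pool's health tracking, and the `dedupe` flag exists only so both behaviors can be compared side by side; only `MAX_CONSECUTIVE_FAILURES = 10` and the per-request `HashSet` idea come from the review.

```rust
use std::collections::{HashMap, HashSet};

pub const MAX_CONSECUTIVE_FAILURES: u32 = 10;

/// Hypothetical stand-in for the pool's per-provider failure counters.
#[derive(Default)]
pub struct FailureTracker {
    consecutive: HashMap<String, u32>,
}

impl FailureTracker {
    pub fn record_failure(&mut self, key: &str) {
        *self.consecutive.entry(key.to_string()).or_insert(0) += 1;
    }

    pub fn failures(&self, key: &str) -> u32 {
        self.consecutive.get(key).copied().unwrap_or(0)
    }

    pub fn is_demoted(&self, key: &str) -> bool {
        self.failures(key) >= MAX_CONSECUTIVE_FAILURES
    }
}

/// Simulates one client request in which every provider fails in every round.
/// With `dedupe` set, each provider is counted at most once per request.
pub fn fail_request(tracker: &mut FailureTracker, providers: &[&str], rounds: usize, dedupe: bool) {
    // Provider keys already counted for this request; local to the request.
    let mut counted: HashSet<&str> = HashSet::new();
    for _round in 0..rounds {
        for &key in providers {
            // HashSet::insert returns true only the first time a key is added.
            if !dedupe || counted.insert(key) {
                tracker.record_failure(key);
            }
        }
    }
}
```

Without deduplication, five failing requests over two rounds push a single provider to ten recorded failures; with it, the same five requests record five.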

Issue 2 (minor): total_attempts log value is incorrect when retry does not occur

The expression providers.len() * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 }) always evaluates to providers.len() * MAX_ROUNDS since last_error is always Some at this point. If round 0 fails with a non-retryable error (e.g., 429) and we break early, total_attempts still logs providers.len() * 2 instead of the actual providers.len() * 1. Minor but misleading in error logs.
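
To make the arithmetic concrete, the sketch below lifts the quoted expression into a function and contrasts it with a plain counter. Both function names and the `first_round_retryable` parameter are hypothetical; `MAX_ROUNDS = 2` is taken from the review's `providers.len() * 2` remark.

```rust
pub const MAX_ROUNDS: usize = 2;

/// The expression quoted in the review, with `providers.len()` passed in.
/// Whenever `last_error` is `Some`, the `min` collapses to `MAX_ROUNDS`.
pub fn logged_total_attempts(provider_count: usize, last_error: Option<&str>) -> usize {
    provider_count * MAX_ROUNDS.min(if last_error.is_some() { MAX_ROUNDS } else { 1 })
}

/// Counter-based alternative: count each provider call as it happens, so an
/// early break after a non-retryable first round is reflected in the total.
pub fn counted_total_attempts(provider_count: usize, first_round_retryable: bool) -> usize {
    let mut attempts = 0;
    for round in 0..MAX_ROUNDS {
        for _provider in 0..provider_count {
            attempts += 1; // one increment per call actually made
        }
        if round == 0 && !first_round_retryable {
            break; // e.g. a 429: no second round
        }
    }
    attempts
}
```

For one provider and a non-retryable first-round error, the expression reports 2 attempts while the counter reports the 1 call that was actually made.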

No other critical issues. The is_retryable check using the last provider error is an acceptable simplification given most models have a single provider.

Warning: Issue 1 should be fixed before merge to avoid distorting provider health demotion tracking.

Copilot

Pull request overview

Adds a single “second round” retry (after a 500ms delay) to InferenceProviderPool::retry_with_fallback to mitigate transient connection/5xx failures when a model effectively has only one usable provider (e.g., via model-proxy), improving resilience before surfacing “All providers failed”.

Changes:

Introduce up to 2 retry rounds with a fixed 500ms delay between rounds.
Extend tracing fields/logs to include round and total_attempts when all providers fail.


crates/services/src/inference_provider_pool/mod.rs

- Narrow retry gate: only retry CompletionError::CompletionError when the message contains connection-related keywords (connection, timeout, reset, broken pipe). Non-transient errors like JSON parse failures are no longer retried.
- Track actual attempts with a counter instead of computing from MAX_ROUNDS (which was wrong when the loop broke early).
- Add 4 unit tests: connection error retries, non-connection error doesn't retry, 5xx retries, retry succeeds on second attempt.
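
A keyword gate like the one described in this commit message could be sketched as follows. The function name, the constant, and the case-insensitive matching are assumptions; only the four keywords come from the commit message, and the real check operates on `CompletionError::CompletionError` rather than a bare string.

```rust
/// Keywords listed in the commit message.
const TRANSIENT_KEYWORDS: [&str; 4] = ["connection", "timeout", "reset", "broken pipe"];

/// Returns true when an error message looks like a transient connection
/// failure. Lowercasing before matching is an assumption of this sketch.
pub fn is_transient_message(message: &str) -> bool {
    let lower = message.to_lowercase();
    TRANSIENT_KEYWORDS.iter().any(|kw| lower.contains(*kw))
}
```

Substring matching is simple but brittle: under this sketch a message such as "operation timed out" would not match `timeout`, so the keyword list would need to follow the exact wording the underlying HTTP client produces.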

Copilot AI review requested due to automatic review settings March 31, 2026 03:42

Evrard-Nil had a problem deploying to Cloud API test env March 31, 2026 03:42 — with GitHub Actions Failure

Copilot started reviewing on behalf of Evrard-Nil March 31, 2026 03:43

gemini-code-assist bot reviewed Mar 31, 2026


fix: cargo fmt

6e83915

Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive

chore: gitignore repro scripts

ad01dab

Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 03:47 — with GitHub Actions Inactive

Copilot AI reviewed Mar 31, 2026


Evrard-Nil mentioned this pull request Mar 31, 2026

fix: return specific error for stale E2EE attestation keys #521

Open

3 tasks

PierreLeGuen approved these changes Mar 31, 2026


Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 19:56 — with GitHub Actions Inactive

Merge branch 'main' into fix/retry-transient-connection-errors

2f960c2

Evrard-Nil temporarily deployed to Cloud API test env March 31, 2026 21:49 — with GitHub Actions Inactive

Evrard-Nil wants to merge 5 commits into main from fix/retry-transient-connection-errors
3 participants