fix: retry once on transient connection errors before failing #520
Evrard-Nil wants to merge 5 commits into main
Conversation
Most models route through a single provider (model-proxy), so provider fallback alone doesn't help. Add a single retry with 500ms delay for connection failures and 5xx errors. This handles transient issues like QEMU SLIRP listen backlog=1, brief nginx reloads, and Docker bridge churn that cause ~9.5k "All providers failed" errors/day in prod. 4xx client errors are still not retried (unchanged behavior).
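The retry shape described above can be sketched as a small helper: up to two rounds, a fixed 500ms delay between them, and a caller-supplied predicate deciding whether the error is transient. This is a minimal sketch with illustrative names (`retry_rounds` is hypothetical, not the pool's actual API):

```rust
use std::thread::sleep;
use std::time::Duration;

const MAX_ROUNDS: usize = 2; // original attempt + one retry
const RETRY_DELAY: Duration = Duration::from_millis(500);

/// Run `attempt` for up to MAX_ROUNDS rounds, sleeping between rounds.
/// `retryable` decides whether an error is transient (connection-level or 5xx);
/// non-retryable errors (e.g. 4xx) are returned immediately, as before.
fn retry_rounds<T, E>(
    mut attempt: impl FnMut() -> Result<T, E>,
    retryable: impl Fn(&E) -> bool,
) -> Result<T, E> {
    let mut round = 0;
    loop {
        match attempt() {
            Ok(v) => return Ok(v),
            Err(e) if retryable(&e) && round + 1 < MAX_ROUNDS => {
                round += 1;
                sleep(RETRY_DELAY);
            }
            Err(e) => return Err(e),
        }
    }
}
```

A first attempt that fails with a connection error gets one more round after the delay; a 4xx-style error exits on the spot.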
Code Review
This pull request implements a retry mechanism for transient connection errors in the inference provider pool by wrapping the provider selection logic in a retry loop. Feedback highlights that the calculation of total_attempts in the error logs is inaccurate, as it fails to account for early exits from the retry loop when encountering non-retryable errors.
Code Review

The retry logic is well-motivated. Two issues worth addressing before merge:

Issue 1 (functional): failure counter double-counted across retry rounds. A provider that fails in both round 0 and round 1 has its failure counter incremented twice for a single client request. With MAX_CONSECUTIVE_FAILURES = 10, a provider reaches demotion after only 5 failing requests instead of 10. For the common case (one provider per model), each failed request under load now increments the counter twice, halving the effective demotion threshold. Fix: track already-counted provider keys in a local HashSet within try_with_providers, and increment the failure counter only once per provider per request, regardless of retry rounds.

Issue 2 (minor): the total_attempts log value is incorrect when no retry occurs, because it is computed from MAX_ROUNDS rather than counted, so it overstates the attempt count when the loop exits early.

No other critical issues. The is_retryable check using the last provider error is an acceptable simplification given that most models have a single provider.

Warning: Issue 1 should be fixed before merge to avoid distorting provider health demotion tracking.
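The HashSet fix proposed for Issue 1 could look roughly like this; the struct and field names are illustrative stand-ins for the pool's actual health tracking, not its real API:

```rust
use std::collections::{HashMap, HashSet};

/// Per-provider consecutive-failure counters (illustrative stand-in).
#[derive(Default)]
struct ProviderHealth {
    failure_counts: HashMap<String, u32>,
}

impl ProviderHealth {
    /// Charge `provider` with a failure, but at most once per client request:
    /// `counted_this_request` is a HashSet local to one try_with_providers
    /// call, so a provider that fails in both retry rounds is counted once.
    fn record_failure(
        &mut self,
        counted_this_request: &mut HashSet<String>,
        provider: &str,
    ) {
        // HashSet::insert returns false if the key was already present.
        if counted_this_request.insert(provider.to_string()) {
            *self.failure_counts.entry(provider.to_string()).or_insert(0) += 1;
        }
    }
}
```

Because the set lives only for the duration of one request, the demotion threshold stays at 10 distinct failing requests rather than 5.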
Pull request overview
Adds a single “second round” retry (after a 500ms delay) to InferenceProviderPool::retry_with_fallback to mitigate transient connection/5xx failures when a model effectively has only one usable provider (e.g., via model-proxy), improving resilience before surfacing “All providers failed”.
Changes:
- Introduce up to 2 retry rounds with a fixed 500ms delay between rounds.
- Extend tracing fields/logs to include `round` and `total_attempts` when all providers fail.
- Narrow retry gate: only retry CompletionError::CompletionError when the message contains connection-related keywords (connection, timeout, reset, broken pipe). Non-transient errors like JSON parse failures are no longer retried.
- Track actual attempts with a counter instead of computing from MAX_ROUNDS (which was wrong when the loop broke early).
- Add 4 unit tests: connection error retries, non-connection error doesn't retry, 5xx retries, retry succeeds on second attempt.
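The narrowed gate amounts to a keyword check on the error message. A sketch, assuming the caller has already matched the CompletionError::CompletionError variant and extracted its message (function name is illustrative):

```rust
/// Treat a completion error as transient only if its message looks like a
/// connection-level failure; JSON parse errors and the like fall through
/// and are surfaced to the client without a retry.
fn is_transient_message(message: &str) -> bool {
    const KEYWORDS: [&str; 4] = ["connection", "timeout", "reset", "broken pipe"];
    let msg = message.to_lowercase();
    KEYWORDS.iter().any(|kw| msg.contains(*kw))
}
```

Lowercasing first keeps the match case-insensitive, so "Connection reset by peer" qualifies while "invalid JSON in response body" does not.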
Summary
Root cause
QEMU SLIRP's hardcoded listen backlog of 1, brief nginx reloads during config updates, and Docker bridge churn during container restarts all cause transient TCP connection failures. With only 1 provider per model, affected requests fail immediately with no recovery.
Top affected models (24h):
- openai/gpt-oss-120b: ~6,000 errors
- zai-org/GLM-5-FP8: ~2,600 errors
- Qwen/Qwen3.5-122B-A10B: ~580 errors

Reproduction steps
See `repro_connection_retry.sh` (gitignored) for the full reproduction script.

Test plan
- `cargo check` compiles cleanly
- Tests pass (`cargo test --lib --bins`)
- Verified the "Retrying after transient connection failure" log line
- Verified `round=2` in the success log

🤖 Generated with Claude Code