feat: add exponential backoff retry for transient provider errors #9

Open
ramparte wants to merge 1 commit into microsoft:main from ramparte:feat/retry-on-rate-limit

@ramparte

Problem

Provider rate limit errors (HTTP 429) and other transient failures currently crash the session immediately, even though these errors are usually temporary and often succeed when the call is retried.

Solution

Added a _call_provider_with_retry() method to the streaming orchestrator that wraps all 3 provider call sites with configurable exponential backoff:

  • Retries only retryable errors: Checks the LLMError.retryable flag (True for RateLimitError, ProviderUnavailableError, and LLMTimeoutError)
  • Exponential backoff: base_delay * 2^attempt, capped at max_delay
  • Honors retry_after: Uses server-provided delay when available (e.g., from 429 responses)
  • Observable: Emits provider:retry events on each retry attempt
  • Configurable: Three config knobs with sensible defaults
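
For reviewers who want to see the shape of the behavior listed above, here is a minimal sketch. It is not the code in this PR: `call_with_retry`, `compute_delay`, the `emit` callback, the event payload, and the stand-in `LLMError` class are all hypothetical names for illustration. It also assumes that `attempt` starts at 0 and that a server-provided `retry_after` is used as-is rather than capped at the maximum delay, neither of which is stated in the description.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class LLMError(Exception):
    """Stand-in for the real error type (assumption); models only the
    `retryable` and `retry_after` attributes mentioned in the description."""

    def __init__(self, message: str, retryable: bool = False,
                 retry_after: Optional[float] = None) -> None:
        super().__init__(message)
        self.retryable = retryable
        self.retry_after = retry_after


def compute_delay(attempt: int, base_delay: float, max_delay: float,
                  retry_after: Optional[float] = None) -> float:
    """Delay in seconds before retry number `attempt` (0-based, an assumption)."""
    if retry_after is not None:
        # Assumption: the server hint wins and is not capped.
        return retry_after
    return min(base_delay * (2 ** attempt), max_delay)


async def call_with_retry(
    call: Callable[[], Awaitable[Any]],
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    emit: Optional[Callable[[str, dict], Awaitable[None]]] = None,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> Any:
    """Await `call()`, retrying only on LLMError with retryable=True."""
    attempt = 0
    while True:
        try:
            return await call()
        except LLMError as err:
            # Non-retryable errors and exhausted budgets propagate unchanged.
            # With max_attempts=0 this raises on the first failure.
            if not err.retryable or attempt >= max_attempts:
                raise
            delay = compute_delay(attempt, base_delay, max_delay, err.retry_after)
            if emit is not None:
                # Hypothetical payload; the real event fields are not shown here.
                await emit("provider:retry", {"attempt": attempt + 1, "delay": delay})
            await sleep(delay)
            attempt += 1
```

Passing `sleep` as a parameter is only a convenience for this sketch: it lets a test substitute a no-op coroutine so the backoff schedule can be checked without waiting.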

Configuration

| Config Key | Default | Description |
| --- | --- | --- |
| retry_max_attempts | 3 | Maximum retry attempts (0 to disable) |
| retry_base_delay_seconds | 1.0 | Base delay for exponential backoff |
| retry_max_delay_seconds | 30.0 | Maximum delay cap |
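
To make the defaults concrete, the sketch below reads the three keys from a plain mapping and prints the resulting backoff schedule. `RetryConfig`, `from_mapping`, and `schedule` are hypothetical helpers, not part of this PR, and the schedule assumes the exponent starts at 0 for the first retry and that no `retry_after` hint is present.

```python
from dataclasses import dataclass
from typing import Any, List, Mapping


@dataclass(frozen=True)
class RetryConfig:
    """Hypothetical holder for the three retry keys and their documented defaults."""

    max_attempts: int = 3
    base_delay_seconds: float = 1.0
    max_delay_seconds: float = 30.0

    @classmethod
    def from_mapping(cls, config: Mapping[str, Any]) -> "RetryConfig":
        # Missing keys fall back to the defaults from the table.
        return cls(
            max_attempts=int(config.get("retry_max_attempts", 3)),
            base_delay_seconds=float(config.get("retry_base_delay_seconds", 1.0)),
            max_delay_seconds=float(config.get("retry_max_delay_seconds", 30.0)),
        )

    def schedule(self) -> List[float]:
        """Delays before each retry, assuming the exponent starts at 0."""
        return [
            min(self.base_delay_seconds * 2 ** i, self.max_delay_seconds)
            for i in range(self.max_attempts)
        ]


print(RetryConfig.from_mapping({}).schedule())
# → [1.0, 2.0, 4.0]
print(RetryConfig.from_mapping({"retry_max_attempts": 8}).schedule())
# → [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0, 30.0]
```

Under these assumptions the defaults add at most 7 seconds of waiting before the error is surfaced, and the 30-second cap only takes effect from the sixth retry onward.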

Call sites modified

  1. Non-streaming fallback (provider.complete()) - main execution loop
  2. Max-iterations fallback (provider.complete()) - graceful degradation path
  3. Streaming (provider.stream()) - primary streaming path

Test coverage

18 new tests covering all retry behaviors. All 53 tests pass (18 new + 35 existing, zero regressions).

🤖 Generated with Amplifier

Wrap all 3 provider call sites in the streaming orchestrator with
configurable exponential backoff retry logic. Retries only on
LLMError with retryable=True (RateLimitError, ProviderUnavailableError,
LLMTimeoutError). Honors retry_after from provider responses and emits
provider:retry events for observability.

Config: retry_max_attempts (default 3), retry_base_delay_seconds (1.0),
retry_max_delay_seconds (30.0).

18 new tests covering all retry behaviors, zero regressions.

🤖 Generated with [Amplifier](https://github.com/microsoft/amplifier)

Co-Authored-By: Amplifier <240397093+microsoft-amplifier@users.noreply.github.com>