Model fallback & graceful degradation for token/rate limit exhaustion #67
Description
Problem
Overstory currently has no fallback or retry mechanism when an agent's underlying LLM provider hits token limits, rate limits (HTTP 429), or quota exhaustion. The system operates as a process-level orchestrator and delegates all API-level concerns to the CLI runtime (e.g., Claude Code). This creates several failure modes:
| Scenario | Current Behavior | Impact |
|---|---|---|
| API rate limit (429) | CLI retries internally; if stalled >5min, watchdog nudges → kills | Agent work lost, no respawn |
| Context window exhaustion | CLI does internal compaction; Overstory unaware | Agent may lose context without checkpointing |
| Quota/billing limit hit | Agent process dies | Watchdog marks zombie → no respawn, no model switch |
| Provider outage | Agent stalls indefinitely | Killed by watchdog after zombieThresholdMs |
When running at scale (10-25 concurrent agents), a single provider rate limit can cascade — multiple agents hit the ceiling simultaneously, stall, and get killed by the watchdog with no recovery.
Proposed Solution
1. Model Fallback Chain in config.yaml
Allow defining ordered fallback models per role:
```yaml
models:
  coordinator:
    primary: opus
    fallback: [sonnet, haiku]
  lead:
    primary: opus
    fallback: [sonnet]
  scout:
    primary: haiku
    fallback: [gemini-2.5-flash]  # cross-runtime fallback
  builder:
    primary: sonnet
    fallback: [minimax-m2.5]
```

When `resolveModel()` in `src/agents/manifest.ts` is called, it would return the primary model. On detected failure, the system retries with the next model in the chain.
2. Failure Detection Layer
Add API-level awareness between the runtime adapter and the watchdog:
- Runtime adapters (`src/runtimes/*.ts`) could expose a `detectFailure()` method (similar to the existing `detectReady()`) that parses tmux pane output for known error patterns: `rate_limit`, `429`, `overloaded`, `quota exceeded`, `context_length_exceeded`
- Provider-specific error signatures
- Watchdog integration: Before escalating to kill, check if the failure is recoverable via model fallback
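A sketch of such a classifier, assuming the `detectFailure()` method name proposed above (mirroring the existing `detectReady()`). The regex patterns are examples only, not an exhaustive catalog of provider error signatures.

```typescript
// Hypothetical pane-output failure classifier for runtime adapters.
type Failure = "rate_limit" | "quota" | "context" | "overloaded" | null;

const PATTERNS: Array<[RegExp, Failure]> = [
  [/rate[_ ]limit|429/i, "rate_limit"],
  [/quota exceeded|billing/i, "quota"],
  [/context_length_exceeded/i, "context"],
  [/overloaded/i, "overloaded"],
];

function detectFailure(paneOutput: string): Failure {
  for (const [re, kind] of PATTERNS) {
    if (re.test(paneOutput)) return kind;
  }
  return null;
}

// Recoverable failures can be routed to model fallback before the
// watchdog escalates to a kill; context exhaustion needs compaction
// or checkpointing instead of a different model.
function isRecoverable(kind: Failure): boolean {
  return kind === "rate_limit" || kind === "overloaded" || kind === "quota";
}
```

Scraping tmux output is inherently fragile, so the pattern table should be per-provider and easy to extend without code changes (e.g. loaded from config).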
3. Automatic Agent Respawn with Degraded Model
When the watchdog detects a token/rate limit failure:
- Save agent checkpoint (if not already saved)
- Kill the current session
- Respawn with the next fallback model from the chain
- Restore from checkpoint
- Log the degradation event for observability
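The five steps above can be sketched as a single routine. The `AgentOps` interface and every method on it (`checkpoint`, `kill`, `spawn`, `restore`) are stand-ins for Overstory internals, not real APIs.

```typescript
// Hypothetical degrade-and-respawn loop; AgentOps abstracts Overstory internals.
interface AgentOps {
  checkpoint(id: string): Promise<void>;
  kill(id: string): Promise<void>;
  spawn(id: string, model: string): Promise<void>;
  restore(id: string): Promise<void>;
}

async function respawnDegraded(
  ops: AgentOps,
  agentId: string,
  fallbackChain: string[],
  attempt: number,
): Promise<boolean> {
  const next = fallbackChain[attempt];
  if (next === undefined) return false; // chain exhausted; let the watchdog escalate

  await ops.checkpoint(agentId);                 // 1. save state first
  await ops.kill(agentId);                       // 2. tear down the stalled session
  await ops.spawn(agentId, next);                // 3. respawn on the degraded model
  await ops.restore(agentId);                    // 4. reload the checkpoint
  console.log(`degraded ${agentId} -> ${next}`); // 5. log for observability
  return true;
}
```

Checkpointing before the kill is the important ordering constraint: it turns the current "work lost, no respawn" failure mode into a resumable one.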
4. Per-Agent Token Budget (Optional)
Allow setting soft/hard token limits per agent role:
```yaml
budgets:
  scout:
    softLimit: 50000    # warn at 50k tokens
    hardLimit: 100000   # force checkpoint + stop at 100k
  builder:
    softLimit: 200000
    hardLimit: 500000
```

This would require real-time token counting from transcript parsing (building on the existing `src/metrics/transcript.ts`).
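The enforcement side is simple once a running token count exists; the counting itself would come from transcript parsing. A sketch, with illustrative types:

```typescript
// Hypothetical soft/hard budget check over a running token count.
interface Budget {
  softLimit: number;
  hardLimit: number;
}

type BudgetAction = "ok" | "warn" | "checkpoint_and_stop";

function checkBudget(tokensUsed: number, budget: Budget): BudgetAction {
  if (tokensUsed >= budget.hardLimit) return "checkpoint_and_stop";
  if (tokensUsed >= budget.softLimit) return "warn";
  return "ok";
}

const scoutBudget: Budget = { softLimit: 50_000, hardLimit: 100_000 };
console.log(checkBudget(30_000, scoutBudget));  // "ok"
console.log(checkBudget(60_000, scoutBudget));  // "warn"
console.log(checkBudget(120_000, scoutBudget)); // "checkpoint_and_stop"
```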
Alternatives Considered
- Do nothing: Rely on CLI-level retry. Works for transient 429s but fails for quota exhaustion and provides no cross-provider fallback.
- External proxy (LiteLLM/OpenRouter): Route all API calls through a proxy that handles fallback. Adds infrastructure complexity but works without Overstory changes. However, this doesn't solve the checkpoint/respawn problem.
- Prompt-only checkpointing: Current approach — agents are instructed to checkpoint. Unreliable since it depends on the LLM following instructions.
Context
Analysis was done by reading the Overstory CLI source code (`src/watchdog/`, `src/runtimes/`, `src/agents/`, `src/config.ts`). Confirmed that:
- `resolveModel()` is a static one-time lookup at spawn
- The watchdog operates on OS signals only (tmux alive/dead, PID alive/dead)
- No `fallback`, `retry`, `budget`, or `degradation` keys exist in the config schema
- The checkpoint system is prompt-level only, not code-enforced
This becomes especially important for multi-provider setups (e.g., Anthropic for coordinators, Fireworks for scouts, OpenAI for builders) where different providers have different rate limits and failure modes.