Model fallback & graceful degradation for token/rate limit exhaustion #67

@tob-as

Problem

Overstory currently has no fallback or retry mechanism when an agent's underlying LLM provider hits token limits, rate limits (HTTP 429), or quota exhaustion. The system operates as a process-level orchestrator and delegates all API-level concerns to the CLI runtime (e.g., Claude Code). This creates several failure modes:

| Scenario | Current Behavior | Impact |
|---|---|---|
| API rate limit (429) | CLI retries internally; if stalled >5min, watchdog nudges → kills | Agent work lost, no respawn |
| Context window exhaustion | CLI does internal compaction; Overstory unaware | Agent may lose context without checkpointing |
| Quota/billing limit hit | Agent process dies | Watchdog marks zombie → no respawn, no model switch |
| Provider outage | Agent stalls indefinitely | Killed by watchdog after zombieThresholdMs |

When running at scale (10-25 concurrent agents), a single provider rate limit can cascade — multiple agents hit the ceiling simultaneously, stall, and get killed by the watchdog with no recovery.

Proposed Solution

1. Model Fallback Chain in config.yaml

Allow defining ordered fallback models per role:

models:
  coordinator:
    primary: opus
    fallback: [sonnet, haiku]
  lead:
    primary: opus
    fallback: [sonnet]
  scout:
    primary: haiku
    fallback: [gemini-2.5-flash]  # cross-runtime fallback
  builder:
    primary: sonnet
    fallback: [minimax-m2.5]

resolveModel() in src/agents/manifest.ts would continue to return the primary model at spawn time. On a detected failure, the system would retry with the next model in the chain, in the order listed.
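A minimal sketch of how the chain lookup could work, assuming the config shape above. The names `ModelChain`, `ModelsConfig`, and `resolveModelWithFallback` are hypothetical and do not exist in Overstory today; the real resolveModel() signature may differ.

```typescript
// Hypothetical shape of one role's entry under `models:` in config.yaml.
interface ModelChain {
  primary: string;
  fallback?: string[];
}

type ModelsConfig = Record<string, ModelChain>;

/**
 * Returns the model for `role` after `failures` detected failures:
 * 0 -> primary, 1 -> first fallback, and so on.
 * Returns undefined for an unknown role or once the chain is exhausted,
 * so the caller can decide whether to stop the agent or escalate.
 */
function resolveModelWithFallback(
  models: ModelsConfig,
  role: string,
  failures: number,
): string | undefined {
  const chain = models[role];
  if (chain === undefined) return undefined;
  const ordered = [chain.primary, ...(chain.fallback ?? [])];
  return ordered[failures];
}

// Example config mirroring two of the roles above.
const models: ModelsConfig = {
  coordinator: { primary: "opus", fallback: ["sonnet", "haiku"] },
  scout: { primary: "haiku", fallback: ["gemini-2.5-flash"] },
};
```

Keeping the failure count outside the lookup means the function stays a pure read of the config, and the per-agent count can live wherever session state is already tracked.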

2. Failure Detection Layer

Add API-level awareness between the runtime adapter and the watchdog:

  • Runtime adapters (src/runtimes/*.ts) could expose a detectFailure() method (similar to existing detectReady()) that parses tmux pane output for known error patterns:
    • rate_limit, 429, overloaded, quota exceeded, context_length_exceeded
    • Provider-specific error signatures
  • Watchdog integration: Before escalating to kill, check if the failure is recoverable via model fallback
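As a rough illustration of the detection idea, the sketch below matches pane text against the error patterns listed above. Everything here is an assumption: `detectFailure` is shown as a standalone function rather than an adapter method, the `FailureKind` names are invented, and the `recoverable` flags are guesses (for example, that switching models would not help with a context-length error).

```typescript
type FailureKind = "rate_limit" | "quota" | "context_length" | "overloaded";

interface DetectedFailure {
  kind: FailureKind;
  recoverable: boolean; // assumption: whether a model fallback could plausibly help
  matched: string; // the text that triggered the match
}

// Patterns taken from the list above; real provider signatures would be added per runtime.
const FAILURE_PATTERNS: Array<{ kind: FailureKind; pattern: RegExp; recoverable: boolean }> = [
  { kind: "context_length", pattern: /context_length_exceeded/i, recoverable: false },
  { kind: "quota", pattern: /quota exceeded/i, recoverable: true },
  { kind: "rate_limit", pattern: /rate[_ ]limit|\b429\b/i, recoverable: true },
  { kind: "overloaded", pattern: /overloaded/i, recoverable: true },
];

/** Scans captured pane output and returns the first matching failure, or null. */
function detectFailure(paneOutput: string): DetectedFailure | null {
  for (const { kind, pattern, recoverable } of FAILURE_PATTERNS) {
    const match = pattern.exec(paneOutput);
    if (match !== null) {
      return { kind, recoverable, matched: match[0] };
    }
  }
  return null;
}
```

Plain substring matching like this can produce false positives, for instance when an agent is itself editing code that mentions "429", so a real implementation would likely need to restrict the scan to recent lines or to the runtime's own error format.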

3. Automatic Agent Respawn with Degraded Model

When the watchdog detects a token/rate limit failure:

  1. Save agent checkpoint (if not already saved)
  2. Kill the current session
  3. Respawn with the next fallback model from the chain
  4. Restore from checkpoint
  5. Log the degradation event for observability
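The steps above can be sketched as a pure planning function that returns the ordered actions without executing them. This is only an illustration: `RecoveryStep` and `planDegradedRespawn` are hypothetical names, and the actual checkpoint, kill, and spawn calls would be whatever the watchdog and runtime adapters already provide.

```typescript
type RecoveryStep =
  | { action: "saveCheckpoint" }
  | { action: "killSession" }
  | { action: "spawn"; model: string }
  | { action: "restoreCheckpoint" }
  | { action: "logDegradation"; from: string; to: string };

/**
 * Builds the recovery plan for an agent whose current model failed.
 * `chain` is [primary, ...fallbacks]. Returns null when the current model
 * is not in the chain or no further fallback remains.
 */
function planDegradedRespawn(
  chain: string[],
  currentModel: string,
  checkpointSaved: boolean,
): RecoveryStep[] | null {
  const index = chain.indexOf(currentModel);
  if (index === -1) return null;
  const next = chain[index + 1];
  if (next === undefined) return null;

  const steps: RecoveryStep[] = [];
  // Step 1 is skipped when a checkpoint already exists.
  if (!checkpointSaved) steps.push({ action: "saveCheckpoint" });
  steps.push(
    { action: "killSession" },
    { action: "spawn", model: next },
    { action: "restoreCheckpoint" },
    { action: "logDegradation", from: currentModel, to: next },
  );
  return steps;
}
```

Separating the plan from its execution makes the ordering easy to unit test, and a null result gives the watchdog a clear signal to fall back to its existing kill path.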

4. Per-Agent Token Budget (Optional)

Allow setting soft/hard token limits per agent role:

budgets:
  scout:
    softLimit: 50000    # warn at 50k tokens
    hardLimit: 100000   # force checkpoint + stop at 100k
  builder:
    softLimit: 200000
    hardLimit: 500000

This would require real-time token counting from transcript parsing (building on existing src/metrics/transcript.ts).
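A small sketch of the soft/hard limit check, assuming a running token count is already available from transcript parsing. `TokenBudget`, `BudgetStatus`, and `checkBudget` are hypothetical names, and treating a limit as reached at exactly the configured value is an assumption.

```typescript
interface TokenBudget {
  softLimit: number;
  hardLimit: number;
}

type BudgetStatus = "ok" | "warn" | "stop";

/**
 * Classifies current usage against a role's budget.
 * "warn" at or above softLimit, "stop" (force checkpoint + stop) at or above hardLimit.
 * A role without a configured budget is never limited.
 */
function checkBudget(tokensUsed: number, budget: TokenBudget | undefined): BudgetStatus {
  if (budget === undefined) return "ok";
  if (tokensUsed >= budget.hardLimit) return "stop";
  if (tokensUsed >= budget.softLimit) return "warn";
  return "ok";
}

// Mirrors the `budgets:` example above.
const budgets: Record<string, TokenBudget> = {
  scout: { softLimit: 50000, hardLimit: 100000 },
  builder: { softLimit: 200000, hardLimit: 500000 },
};
```

The hard limit is checked first so that a misconfigured budget with softLimit above hardLimit still stops the agent rather than only warning.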

Alternatives Considered

  • Do nothing: Rely on CLI-level retry. Works for transient 429s but fails for quota exhaustion and provides no cross-provider fallback.
  • External proxy (LiteLLM/OpenRouter): Route all API calls through a proxy that handles fallback. Adds infrastructure complexity but works without Overstory changes. However, this doesn't solve the checkpoint/respawn problem.
  • Prompt-only checkpointing: Current approach — agents are instructed to checkpoint. Unreliable since it depends on the LLM following instructions.

Context

Analysis was done by reading Overstory CLI source code (src/watchdog/, src/runtimes/, src/agents/, src/config.ts). Confirmed that:

  • resolveModel() is a static one-time lookup at spawn
  • Watchdog operates on OS signals only (tmux alive/dead, PID alive/dead)
  • No fallback, retry, budget, or degradation keys exist in the config schema
  • Checkpoint system is prompt-level only, not code-enforced

This becomes especially important for multi-provider setups (e.g., Anthropic for coordinators, Fireworks for scouts, OpenAI for builders) where different providers have different rate limits and failure modes.

Metadata

Labels

  • area:agent-lifecycle (Agent identity, sessions, checkpoint/resume, sling)
  • difficulty:complex (Cross-cutting, architectural, or high-risk)
  • focus:reliability (Stall detection, recovery, cleanup, session DB hygiene)
  • priority:medium (Useful but not urgent, well-scoped)
