feat: Multi-Key API Key Rotation with Automatic Failover (KeyPool) #61

Open
ariel42 wants to merge 3 commits into MadAppGang:main from ariel42:feat/key-pool-rotatable-status-codes

@ariel42 commented Feb 16, 2026

Closes #60

Overview

Introduces KeyPool — a transparent multi-key rotation and failover system. Users supply multiple comma-separated API keys per provider, and Claudish automatically distributes requests across them with intelligent failover on rate limits, authentication errors, and transient failures.

export GEMINI_API_KEY="key1,key2,key3"
export OPENAI_API_KEY="sk-abc, sk-def, sk-ghi"

Zero configuration — works with any existing single-key setup, no CLI changes needed.
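
As an illustration of how a comma-separated value like the ones above could be split into individual keys, here is a minimal sketch. The helper name `parseApiKeys` is hypothetical and is not taken from the PR; the trimming behavior is an assumption based on the `OPENAI_API_KEY` example, which has spaces after the commas.

```typescript
// Hypothetical helper (not from the PR): split a comma-separated
// environment value into individual API keys.
function parseApiKeys(raw: string | undefined): string[] {
  if (!raw) return [];
  return raw
    .split(",")
    .map((key) => key.trim()) // tolerate "sk-abc, sk-def" style spacing
    .filter((key) => key.length > 0); // ignore empty segments such as a trailing comma
}

console.log(parseApiKeys("sk-abc, sk-def, sk-ghi")); // [ 'sk-abc', 'sk-def', 'sk-ghi' ]
console.log(parseApiKeys("single-key")); // [ 'single-key' ]
```

A single key with no commas yields a one-element list, which is consistent with the claim that existing single-key setups keep working.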

Key Design Decisions

Failover Strategy

executeWithFailover(fetchFn) wraps each handler's fetch call:

  • Rotatable errors (401, 402, 403, 408, 429, 500, 502, 503, 504): rotate to next key, retry immediately
  • Body-detected invalid keys: inspects response body for provider-specific patterns (e.g., Gemini returns 400 + API_KEY_INVALID — not normally retryable, but the body reveals it's a key issue)
  • Network errors (thrown exceptions): rotate and retry
  • Non-retryable errors (e.g., 400 validation): return immediately without wasting remaining keys
  • All keys exhausted: return last response (preserving body for error details) or re-throw last error
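
The rules above can be sketched roughly as follows. This is not the PR's implementation: `failoverSketch` and `ResponseLike` are hypothetical names, the status codes are copied from the commit messages, and the body-based invalid-key check is left out to keep the sketch short. `ResponseLike` is a structural stand-in for a fetch `Response` so the example runs without a network.

```typescript
// Minimal structural view of a fetch Response (assumption for this sketch).
interface ResponseLike {
  status: number;
  body?: { cancel(): Promise<void> } | null;
}

// Rotatable codes as listed in the commit messages.
const ROTATABLE_STATUS_CODES = new Set([401, 402, 403, 408, 429, 500, 502, 503, 504]);

async function failoverSketch<R extends ResponseLike>(
  keys: string[],
  fetchFn: (apiKey: string) => Promise<R>,
): Promise<R> {
  let lastResponse: R | undefined;
  let lastError: unknown;

  for (const key of keys) {
    // Drain the previous failed response before the next attempt.
    if (lastResponse) {
      await lastResponse.body?.cancel().catch(() => {});
    }
    try {
      const res = await fetchFn(key);
      // Success and non-retryable errors are returned immediately.
      if (!ROTATABLE_STATUS_CODES.has(res.status)) return res;
      lastResponse = res;
      lastError = undefined;
    } catch (err) {
      // Network error: remember it and rotate to the next key.
      lastResponse = undefined;
      lastError = err;
    }
  }

  // All keys exhausted: return the last response, or re-throw the last error.
  if (lastResponse) return lastResponse;
  throw lastError ?? new Error("no API keys configured");
}

// Simulated usage: the first key is rate-limited, the second succeeds.
const statusByKey: Record<string, number> = { key1: 429, key2: 200 };
failoverSketch(["key1", "key2", "key3"], async (apiKey) => ({
  status: statusByKey[apiKey] ?? 500,
})).then((res) => console.log(res.status)); // prints 200
```

Note that the final response is never drained, so the caller can still read its body.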

Response Body-Based Invalid Key Detection

isInvalidApiKeyResponse() clones the response and checks for known patterns:

| Provider  | Status | Pattern |
| --------- | ------ | ------- |
| Gemini    | 400    | error.details[].reason === "API_KEY_INVALID" |
| OpenAI    | 401    | error.code === "invalid_api_key" |
| Anthropic | 401    | error.type === "authentication_error" |
| Generic   | varies | Message-based matching ("API key not valid", "invalid credentials", etc.) |

Non-JSON bodies return false instead of throwing, so an unparseable response never triggers rotation on its own.
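
To make the patterns concrete, here is a rough sketch of the matching step only. It works on the status code and the already-read body text rather than on a cloned response. The function name `bodyIndicatesInvalidKey` is hypothetical, and reading the generic message from `error.message` is an assumption; the patterns themselves come from the table above.

```typescript
// Hypothetical matcher (not the PR's code): decide from status + body text
// whether a response looks like an invalid-API-key error.
function bodyIndicatesInvalidKey(status: number, bodyText: string): boolean {
  let parsed: unknown;
  try {
    parsed = JSON.parse(bodyText);
  } catch {
    return false; // non-JSON body: never treated as an invalid key
  }

  const error = (parsed as { error?: Record<string, unknown> } | null)?.error;
  if (!error || typeof error !== "object") return false;

  // Gemini: 400 with a detail entry whose reason is API_KEY_INVALID.
  const details: unknown[] = Array.isArray(error.details) ? error.details : [];
  const hasGeminiReason = details.some(
    (d) => (d as { reason?: unknown } | null)?.reason === "API_KEY_INVALID",
  );
  if (status === 400 && hasGeminiReason) return true;

  // OpenAI and Anthropic: 401 with a provider-specific marker.
  if (status === 401 && error.code === "invalid_api_key") return true;
  if (status === 401 && error.type === "authentication_error") return true;

  // Generic fallback: message-based matching (field location is an assumption).
  const message = typeof error.message === "string" ? error.message.toLowerCase() : "";
  return ["api key not valid", "invalid credentials"].some((p) => message.includes(p));
}

const geminiBody = JSON.stringify({
  error: { code: 400, details: [{ reason: "API_KEY_INVALID" }] },
});
console.log(bodyIndicatesInvalidKey(400, geminiBody)); // true
console.log(bodyIndicatesInvalidKey(502, "<html>Bad Gateway</html>")); // false
```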

Resource Safety

Previous response bodies are drained between attempts to prevent connection and memory leaks. The last response body is preserved for the caller to read error details (quota info, retry-after hints, etc.).

Handler Changes

All 7 provider handlers updated to use KeyPool:

  • base-gemini-handler.ts / gemini-handler.ts
  • openai-handler.ts
  • anthropic-compat-handler.ts
  • openrouter-handler.ts
  • ollamacloud-handler.ts
  • litellm-handler.ts
  • remote-provider-handler.ts

Each handler injects the API key per-attempt via the failover callback rather than once at construction time.
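
A small sketch of what per-attempt injection could look like in a handler. The names `buildRequestInit` and `attemptWithKey`, the `Bearer` header scheme, and the payload fields are illustrative assumptions; individual providers may pass the key differently.

```typescript
// Hypothetical handler fragment: the request is built from whichever key
// the current attempt receives, instead of a key captured at construction time.
interface RequestInitSketch {
  method: string;
  headers: Record<string, string>;
  body: string;
}

function buildRequestInit(apiKey: string, payload: unknown): RequestInitSketch {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`, // assumed auth scheme for illustration
    },
    body: JSON.stringify(payload),
  };
}

// Callback shape a failover wrapper could invoke once per key.
const attemptWithKey = (apiKey: string): RequestInitSketch =>
  buildRequestInit(apiKey, { model: "example-model", messages: [] });

console.log(attemptWithKey("key1").headers.Authorization); // Bearer key1
console.log(attemptWithKey("key2").headers.Authorization); // Bearer key2
```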

Test Coverage (116 tests, 1,990 lines)

  • Core KeyPool unit tests: Initialization, round-robin rotation, index management, reset
  • executeWithFailover integration: Success/failure paths, mixed error types, concurrent calls, large pools (10+ keys), non-retryable passthrough
  • Status code coverage: Every rotatable code verified, non-rotatable codes confirmed as passthrough
  • Invalid key detection: All provider-specific patterns, negative cases, body preservation
  • 5 live canary tests: Real API calls to Gemini, OpenAI, Anthropic, OpenRouter, and OllamaCloud with invalid keys — responses piped through isInvalidApiKeyResponse() to catch provider API contract changes

Stats

37 files changed, 4,337 insertions, 321 deletions

ariel4200 and others added 3 commits February 16, 2026 09:45
…e status code handling

Implement KeyPool class providing round-robin rotation across multiple
comma-separated API keys with transparent failover. Integrate into all
provider handlers (Gemini, OpenAI, OpenRouter, Anthropic-compat,
OllamaCloud, LiteLLM, remote-provider).

Rotatable status codes (401, 402, 403, 408, 429, 500, 502, 503, 504)
trigger key rotation; all other error codes propagate immediately.

Includes 99 unit tests covering failover logic, body drain, index
advancement, provider auth patterns, and edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e status code handling

Implement KeyPool class providing sticky-key rotation across multiple
comma-separated API keys with transparent failover. Keys are reused
until they fail — only errors trigger advancement to the next key.

Integrate into all provider handlers (Gemini, OpenAI, OpenRouter,
Anthropic-compat, OllamaCloud, LiteLLM, remote-provider).

Rotatable status codes (401, 402, 403, 408, 429, 500, 502, 503, 504)
trigger key rotation; all other error codes propagate immediately.

Includes 99 unit tests covering failover logic, sticky-key behavior,
body drain, index advancement, provider auth patterns, and edge cases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add isInvalidApiKeyResponse() to KeyPool that inspects response bodies
for provider-specific invalid key patterns (Gemini API_KEY_INVALID,
OpenAI invalid_api_key, Anthropic authentication_error, and generic
message matching). This enables key rotation even when the HTTP status
code alone wouldn't trigger it (e.g., Gemini returns 400 for invalid keys).

- Make isInvalidApiKeyResponse public for testability
- Expand executeWithFailover to check body after status-code check
- Add 11 mocked unit tests for all detection patterns and edge cases
- Add 5 live canary tests hitting real provider APIs with invalid keys
  to detect API contract changes automatically
- Sync key-pool.ts to packages/core and src/
@erudenko
Member

The KeyPool core is well-designed. Round-robin with failover, body draining between attempts, and preserving the last response for error details are all sound choices, and the ROTATABLE_STATUS_CODES set is well chosen. The body-based invalid key detection for Gemini's unusual 400 response is smart.

But the integration into handlers needs rework:

Every handler has the same if (keyPool.keyCount() > 1) { ... failover ... } else { ... existing ... } copy-pasted. executeWithFailover already handles single-key correctly (one loop iteration), so the branching is unnecessary. It doubles the code in each handler for zero benefit. Just always use executeWithFailover.
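
To illustrate why the branch is redundant, here is a toy loop, assuming a failover loop shaped like the one described in the PR. `tryEachKey` is a hypothetical stand-in, not the actual `executeWithFailover`. With a one-element key list it makes exactly one attempt and returns whatever that attempt produced, which is what the single-key `else` branch does.

```typescript
// Hypothetical stand-in for the failover loop.
async function tryEachKey<T>(
  keys: string[],
  attempt: (key: string) => Promise<T>,
  shouldRotate: (result: T) => boolean,
): Promise<T> {
  if (keys.length === 0) throw new Error("no API keys configured");
  let last!: T; // assigned on the first iteration, since keys is non-empty
  for (const key of keys) {
    last = await attempt(key);
    if (!shouldRotate(last)) return last;
  }
  return last; // every key was tried; hand back the final result
}

// Single-key pool: one attempt, result returned as-is, no special casing.
let attempts = 0;
tryEachKey(
  ["only-key"],
  async () => {
    attempts += 1;
    return { status: 429 };
  },
  (res) => res.status === 429,
).then((res) => console.log(attempts, res.status)); // prints 1 429
```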

The changes are tripled across src/, packages/cli/src/, and packages/core/src/. Only src/ is the source of truth. This inflates the PR from ~1,400 meaningful lines to 4,247.

In base-gemini-handler.ts, multi-key mode uses fetchWithRetry with maxRetries: 1, skipRetryOn429: true while single-key gets maxRetries: 5. So multi-key users get significantly less resilience on transient errors. The two retry strategies should compose, not replace each other.
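
One possible way to compose the two strategies, sketched under assumptions: an inner loop retries the same key on transient statuses, and an outer loop rotates keys only once the inner loop gives up. `retrySameKey` and `rotateWithRetry` are hypothetical names, and the split of codes into `TRANSIENT` is an assumption. Backoff delays and thrown network errors are omitted for brevity.

```typescript
interface StatusResult {
  status: number;
}

// Rotatable codes from the commit messages; the TRANSIENT subset is an assumption.
const ROTATABLE = new Set([401, 402, 403, 408, 429, 500, 502, 503, 504]);
const TRANSIENT = new Set([408, 500, 502, 503, 504]);

// Inner layer: retry the same key while the failure looks transient.
async function retrySameKey<R extends StatusResult>(
  attempt: () => Promise<R>,
  maxRetries: number,
): Promise<R> {
  let res = await attempt();
  for (let i = 0; i < maxRetries && TRANSIENT.has(res.status); i++) {
    res = await attempt(); // a real implementation would back off before retrying
  }
  return res;
}

// Outer layer: move to the next key only after the inner retries are spent.
async function rotateWithRetry<R extends StatusResult>(
  keys: string[],
  attempt: (key: string) => Promise<R>,
  maxRetries: number,
): Promise<R> {
  let last: R | undefined;
  for (const key of keys) {
    last = await retrySameKey(() => attempt(key), maxRetries);
    if (!ROTATABLE.has(last.status)) return last;
  }
  if (last === undefined) throw new Error("no API keys configured");
  return last;
}

// Simulated usage: key1 fails twice with 503, then succeeds on its third try.
const queue = [503, 503, 200];
rotateWithRetry(["key1", "key2"], async () => ({ status: queue.shift() ?? 200 }), 5)
  .then((res) => console.log(res.status)); // prints 200
```

In this arrangement a multi-key user keeps the same per-key retry budget as a single-key user, and rotation only adds to it.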

The Kimi OAuth fallback is inside the single-key else branch in anthropic-compat-handler.ts. Multi-key users of Kimi silently lose OAuth support. That's a feature regression.

Also note that PR #70 (provider refactor) rewrites most of the same files. If that merges first, this PR will need a full rewrite against the new architecture. Might be worth waiting to see how #70 lands.

The feature itself is valuable though. Happy to help figure out the right integration approach.

@erudenko
Member

Putting this on hold until PR #70 (provider refactor) lands. That PR rewrites every handler file this touches, so integrating KeyPool against the current architecture would just be throwaway work.

The KeyPool class itself is solid and we'll use it. Once #70 gives us the new 3-layer architecture, the integration will be cleaner since there's one handler (ProviderHandler) instead of six separate ones with copy-pasted fetch logic.

I'll ping you when #70 is in and we're ready to wire KeyPool into the new transport layer.
