[Feature] Prompt Caching & Provider-Specific Caching Support #813
Description
Summary
Add intelligent prompt caching at the proxy level with provider-specific cache adapters that leverage each provider's native caching mechanisms (Anthropic cache_control, Gemini cachedContent, OpenAI prompt_cache_key, etc.) — transparently, without client-side changes.
Problem
OmniRoute currently has a response-level semantic cache (SHA-256 signature of full request → cached response) and a basic cache_control passthrough layer that preserves client-provided markers. But significant gaps remain:
- No automatic prompt caching: The proxy doesn't intelligently inject `cache_control` markers; it only preserves them if the client (e.g., Claude Code) already sends them. Clients that don't send markers get zero caching benefit.
- No provider-native prefix caching: Providers like Anthropic, Gemini, OpenAI, and DeepSeek support server-side prompt prefix caching (the provider caches the KV/state for repeated prefixes). OmniRoute doesn't leverage this; every request pays full input token cost even when 90% of the prompt is identical across requests.
- No cache analytics: There's no dashboard or API to track cache hit rates, token savings, or cost reduction from caching. Users have no visibility into caching effectiveness.
- No cross-provider cache awareness: When a request falls back from Claude → DeepSeek, the cache context is lost. Each provider starts from scratch even if the system prompt hasn't changed.
- No Gemini `cachedContent` support: Gemini supports explicit `cachedContent` IDs for long-context reuse, but the translator doesn't create or reference them.
- No OpenAI `prompt_cache_key` support: OpenAI's automatic prompt caching (`prompt_cache_key`) isn't exposed through OmniRoute.
Current State
| Component | Location | What It Does |
|---|---|---|
| Semantic Cache | `src/lib/semanticCache.ts` | Caches full responses (model + messages + temp + top_p → response). Two-tier: LRU + SQLite. |
| Cache Control Policy | `open-sse/utils/cacheControlPolicy.ts` | Decides whether to preserve client `cache_control` markers. Supports Claude + Qwen. |
| Cache Control Settings | `src/lib/cacheControlSettings.ts` | Cached DB access for `alwaysPreserveClientCache` mode (auto/always/never). |
| Claude Translator | `open-sse/translator/helpers/claudeHelper.ts` | Injects `cache_control: { type: "ephemeral", ttl: "1h" }` on the last system, assistant, and tool blocks. |
| Search Cache | `open-sse/services/searchCache.ts` | TTL cache for web search results with request coalescing. |
| chatCore Cache Tracking | `open-sse/handlers/chatCore.ts` | Extracts `cache_read_input_tokens`, `cache_creation_input_tokens` from response usage. |
Proposed Architecture
1. Prompt Cache Layer (src/lib/promptCache/)
A provider-agnostic prompt caching layer that intelligently manages prompt prefix caching across all providers.
```
src/lib/promptCache/
├── index.ts            // Public API: getCacheHint(), reportCacheHit()
├── prefixAnalyzer.ts   // Analyzes message arrays to find stable prefixes
├── markerInjector.ts   // Injects provider-specific cache markers
├── providerAdapters.ts // Provider-specific cache strategies
├── store.ts            // SQLite store for cache metadata (not content)
└── analytics.ts        // Cache hit/miss/savings tracking
```
How it works:
```
Client Request (no cache awareness)
         │
         ▼
┌─────────────────────────────────────────────────────────┐
│ OmniRoute Proxy (chatCore)                              │
│                                                         │
│ 1. prefixAnalyzer.analyze(messages)                     │
│    → Identifies stable prefix (system + tools + history)│
│    → Returns { prefixEndIdx, prefixHash, prefixTokens } │
│                                                         │
│ 2. Check provider adapter for caching support           │
│    → Claude: inject cache_control on prefix boundary    │
│    → Gemini: create/reference cachedContent ID          │
│    → OpenAI: set prompt_cache_key                       │
│    → DeepSeek: inject cache_control (Claude-compat)     │
│                                                         │
│ 3. Forward enriched request upstream                    │
│                                                         │
│ 4. On response:                                         │
│    → Extract cache metrics from usage                   │
│    → Store cache metadata (prefix hash, provider, hit)  │
│    → Update analytics counters                          │
└─────────────────────────────────────────────────────────┘
```
2. Prefix Analyzer
Detects stable prefix boundaries in message arrays — the portion of the prompt that's likely to repeat across requests.
```typescript
interface PrefixAnalysis {
  prefixEndIdx: number;  // Last message index in stable prefix
  prefixHash: string;    // SHA-256 of the prefix content
  prefixTokens: number;  // Estimated token count of prefix
  prefixType: 'system_only' | 'system_and_tools' | 'system_tools_history';
  confidence: number;    // 0-1 confidence this prefix will repeat
}
```
Detection heuristics:
| Signal | Weight | Description |
|---|---|---|
| System message present | High | System prompt is almost always stable |
| Tools defined | High | Tool schemas don't change between requests |
| Message count > 4 | Medium | Longer conversations have more stable prefix |
| First N messages identical to previous request | Very High | Direct prefix match from session history |
| User message is short (follow-up) | Medium | Short user messages imply context continuation |
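The heuristics above can be sketched as follows. This is an illustrative assumption of how `prefixAnalyzer.ts` might combine the signals; the function name, weights, and chars/4 token estimate are not the final API:

```typescript
import { createHash } from "node:crypto";

interface Message { role: string; content: string; }

interface PrefixAnalysis {
  prefixEndIdx: number;
  prefixHash: string;
  prefixTokens: number;
  confidence: number;
}

// Sketch: treat leading system messages (plus any history that matched the
// previous request) as the stable prefix, hash it, and estimate tokens with
// a rough chars/4 heuristic.
function analyzePrefix(messages: Message[], prevMessages: Message[] = []): PrefixAnalysis {
  let prefixEndIdx = -1;
  let confidence = 0;

  // Signal: leading system messages are almost always stable.
  while (messages[prefixEndIdx + 1]?.role === "system") prefixEndIdx++;
  if (prefixEndIdx >= 0) confidence += 0.5;

  // Signal: messages identical to the previous request extend the prefix
  // (never include the newest message, which is the fresh user turn).
  for (let i = prefixEndIdx + 1; i < messages.length - 1; i++) {
    const prev = prevMessages[i];
    if (prev && prev.role === messages[i].role && prev.content === messages[i].content) {
      prefixEndIdx = i;
      confidence = Math.min(1, confidence + 0.1);
    } else break;
  }

  const prefixText = messages.slice(0, prefixEndIdx + 1)
    .map(m => `${m.role}:${m.content}`).join("\n");
  return {
    prefixEndIdx,
    prefixHash: createHash("sha256").update(prefixText).digest("hex"),
    prefixTokens: Math.ceil(prefixText.length / 4), // crude estimate
    confidence,
  };
}
```

Because the same prefix always hashes to the same value, `prefixHash` can key cache metadata across requests and providers.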
3. Provider Cache Adapters
Each provider has a different caching mechanism. The adapter layer normalizes these.
Anthropic (Claude)
```
// Current: Manual cache_control injection on last block
// Proposed: Intelligent multi-point caching with prefix awareness
{
  type: "text",
  text: systemPrompt,
  cache_control: { type: "ephemeral", ttl: "5m" } // System prefix
},
// ... user messages ...
{
  type: "text",
  text: "...",
  cache_control: { type: "ephemeral", ttl: "5m" } // History checkpoint
}
```
Enhancements:
- Support multiple cache breakpoints (not just the last block)
- Configurable TTL (`5m` default, `1h` for long sessions)
- Cache breakpoint strategy: `auto` (prefix analyzer), `system-only`, `every-message`, `manual`
- Respect Anthropic's 4 cache breakpoint limit
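A minimal sketch of multi-breakpoint injection under these constraints. The block shape follows Anthropic's content-block format; `injectBreakpoints` and the "keep the newest boundaries" policy are hypothetical:

```typescript
interface ContentBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral"; ttl?: "5m" | "1h" };
}

// Sketch: given candidate boundary indices (e.g. end of system blocks, end of
// tool definitions, end of stable history), mark at most `maxBreakpoints` of
// them, preferring the newest boundaries so the longest prefix stays cached.
function injectBreakpoints(
  blocks: ContentBlock[],
  boundaryIdxs: number[],          // from the prefix analyzer (hypothetical)
  ttl: "5m" | "1h" = "5m",
  maxBreakpoints = 4,              // Anthropic's documented limit
): ContentBlock[] {
  const chosen = new Set(boundaryIdxs.slice(-maxBreakpoints)); // keep newest
  return blocks.map((b, i) =>
    chosen.has(i) ? { ...b, cache_control: { type: "ephemeral", ttl } } : b,
  );
}
```

Dropping the oldest breakpoints first is one plausible policy; the `auto` strategy could instead weight boundaries by the analyzer's confidence score.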
Google (Gemini)
```
// Gemini uses cachedContent IDs for explicit cache reuse
// Request:
{
  contents: [...],
  cachedContent: "cachedContents/abc123" // Reference to cached prefix
}
// Or inline caching with cache markers:
{
  contents: [
    { role: "user", parts: [{ text: "..." }] } // Cached prefix
  ],
  generationConfig: { ... }
}
```
Implementation:
- Create `cachedContent` via `POST /cachedContents` for long prompts (>32K tokens)
- Use implicit caching (automatic) for shorter prompts
- Track `cachedContentTokenCount` in usage metadata
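A sketch of the explicit-vs-implicit decision and the create payload. The `model`/`contents`/`ttl` fields follow Gemini's public `cachedContents` API; the helper name, threshold, and TTL default are assumptions mirroring the config schema below:

```typescript
interface GeminiContent { role: string; parts: { text: string }[]; }

// Sketch: only long prefixes justify an explicit cachedContent; shorter
// prompts fall back to Gemini's implicit (automatic) caching.
function buildCachedContentRequest(
  model: string,
  prefix: GeminiContent[],
  estimatedTokens: number,
  minTokens = 32768,           // assumed threshold (minTokensForCachedContent)
  ttlSeconds = 1800,           // assumed default: 30m
): { model: string; contents: GeminiContent[]; ttl: string } | null {
  if (estimatedTokens < minTokens) return null; // use implicit caching
  return { model, contents: prefix, ttl: `${ttlSeconds}s` };
}
```

On a non-null result, the adapter would `POST` the payload to the `cachedContents` endpoint and reference the returned resource name as `cachedContent` on subsequent requests.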
OpenAI
```
// OpenAI has automatic prompt caching (no markers needed)
// But exposes prompt_cache_key for explicit control:
{
  model: "gpt-4o",
  messages: [...],
  prompt_cache_key: "my-session-prefix-abc123" // Optional: explicit cache key
}
```
Implementation:
- Expose `prompt_cache_key` in request params
- Auto-generate from prefix hash for consistent caching
- Track `prompt_tokens_details.cached_tokens` in response
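Auto-generation could be as simple as hashing the analyzed prefix; the helper name and key format here are illustrative:

```typescript
import { createHash } from "node:crypto";

// Sketch: derive a stable prompt_cache_key from the analyzed prefix so
// identical prefixes from any client map to the same key, optionally scoped
// to a session so unrelated sessions don't collide.
function makePromptCacheKey(prefixHash: string, sessionId?: string): string {
  const scope = sessionId ?? "global";
  const digest = createHash("sha256").update(`${scope}:${prefixHash}`).digest("hex");
  return `omni-${digest.slice(0, 32)}`; // short, readable, deterministic
}
```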
DeepSeek
```
// DeepSeek uses Anthropic-compatible cache_control:
{
  role: "system",
  content: [
    { type: "text", text: "...", cache_control: { type: "ephemeral" } }
  ]
}
```
Implementation:
- Reuse Claude adapter logic
- DeepSeek caches system prompt automatically (prefix caching)
- Track `prompt_cache_hit_tokens` / `prompt_cache_miss_tokens`
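Since each provider reports cache usage under different field names, the adapters could share a small normalizer. A sketch — the field names are the ones the provider usage payloads mentioned in this proposal actually use, while the function itself is hypothetical:

```typescript
interface NormalizedCacheUsage { cachedTokens: number; cacheCreationTokens: number; }

// Sketch: map each provider's cache usage fields onto one shape for
// analytics. `any` stands in for the real per-provider usage types.
function normalizeCacheUsage(provider: string, usage: any): NormalizedCacheUsage {
  switch (provider) {
    case "anthropic":
      return {
        cachedTokens: usage.cache_read_input_tokens ?? 0,
        cacheCreationTokens: usage.cache_creation_input_tokens ?? 0,
      };
    case "openai":
      return {
        cachedTokens: usage.prompt_tokens_details?.cached_tokens ?? 0,
        cacheCreationTokens: 0, // OpenAI doesn't report cache writes separately
      };
    case "deepseek":
      return {
        cachedTokens: usage.prompt_cache_hit_tokens ?? 0,
        cacheCreationTokens: 0,
      };
    default:
      return { cachedTokens: 0, cacheCreationTokens: 0 };
  }
}
```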
Alibaba Qwen / Z.AI (GLM)
```
// Qwen Coding Plan supports Anthropic-compatible prompt caching
// Already listed in CACHING_PROVIDERS
```
4. Cache Analytics
New API endpoints:
| Endpoint | Description |
|---|---|
| `GET /api/cache/prompt/stats` | Prompt cache hit rate, token savings, cost savings |
| `GET /api/cache/prompt/breakdown` | Per-provider, per-model cache breakdown |
| `GET /api/cache/prompt/sessions` | Top cached sessions by savings |
| `POST /api/cache/prompt/flush` | Clear prompt cache metadata |
Metrics tracked:
```typescript
interface PromptCacheMetrics {
  // Totals
  totalRequests: number;
  requestsWithCacheHit: number;
  hitRate: number; // %

  // Tokens
  totalInputTokens: number;
  totalCachedTokens: number;
  totalCacheCreationTokens: number;
  tokenSavingsPercent: number;

  // Cost
  estimatedFullCost: number;
  estimatedCachedCost: number;
  costSaved: number;

  // Per-provider
  byProvider: Record<string, {
    requests: number;
    hits: number;
    cachedTokens: number;
    costSaved: number;
  }>;

  // Per-model
  byModel: Record<string, {
    requests: number;
    hits: number;
    cachedTokens: number;
  }>;

  lastUpdated: string;
}
```
5. Dashboard Integration
Extend: /dashboard/analytics — New "Cache" tab
| Section | Content |
|---|---|
| Cache Overview | Hit rate gauge, total tokens saved, cost saved |
| Provider Breakdown | Bar chart: cache hits by provider |
| Trend Chart | Line chart: cache hit rate over time |
| Top Sessions | Table: sessions with highest cache savings |
| Cache Health | Active cache entries, memory usage, eviction rate |
Extend: /dashboard/settings — New "Caching" tab
| Setting | Description | Default |
|---|---|---|
| `promptCacheEnabled` | Enable prompt caching globally | `true` |
| `promptCacheStrategy` | `auto` / `system-only` / `manual` | `auto` |
| `promptCacheDefaultTTL` | Default cache TTL for Anthropic | `5m` |
| `promptCacheMaxBreakpoints` | Max cache breakpoints (Anthropic: 4) | `4` |
| `geminiCachedContentEnabled` | Enable Gemini `cachedContent` creation | `true` |
| `openaiPromptCacheKeyEnabled` | Enable OpenAI `prompt_cache_key` | `true` |
| `semanticCacheEnabled` | Existing semantic cache toggle | `true` |
6. Configuration Schema
```typescript
interface PromptCacheConfig {
  enabled: boolean;
  strategy: 'auto' | 'system-only' | 'every-message' | 'manual';

  // Anthropic-specific
  anthropic: {
    enabled: boolean;
    defaultTTL: '5m' | '1h';
    maxBreakpoints: 1 | 2 | 3 | 4;
    breakpointPlacement: 'system' | 'tools' | 'last-user' | 'auto';
  };

  // Gemini-specific
  gemini: {
    enabled: boolean;
    minTokensForCachedContent: number; // Default: 32768
    cachedContentTTL: string;          // Default: "30m"
  };

  // OpenAI-specific
  openai: {
    enabled: boolean;
    autoGenerateCacheKey: boolean; // Default: true
  };

  // DeepSeek-specific
  deepseek: {
    enabled: boolean;
    // Reuses Anthropic adapter
  };

  // Analytics
  analytics: {
    enabled: boolean;
    retentionDays: number; // Default: 30
  };
}
```
7. MCP Tools
| Tool | Scope | Description |
|---|---|---|
| `omniroute_cache_stats` | `read:cache` | Prompt + semantic cache statistics |
| `omniroute_cache_flush` | `write:cache` | Flush prompt cache metadata |
| `omniroute_cache_configure` | `write:cache` | Update cache settings |
8. Integration with Existing Systems
Semantic Cache (src/lib/semanticCache.ts):
- Prompt cache and semantic cache are complementary:
- Semantic cache: "Have I seen this exact request before?" → return cached response
- Prompt cache: "Has the provider seen this prefix before?" → reduced input token cost
- Both can fire on the same request (semantic cache checks first, prompt cache applies if miss)
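A minimal sketch of that layering, with stubs standing in for the real semantic cache and marker injector:

```typescript
// Sketch: the semantic cache can answer the whole request; the prompt cache
// only enriches it before forwarding. Types and stubs are illustrative.
type Req = { signature: string; body: string };
type Res = { body: string; fromSemanticCache: boolean };

const semanticCache = new Map<string, string>(); // stub for the LRU+SQLite tiers

async function handle(req: Req, forward: (body: string) => Promise<string>): Promise<Res> {
  // 1. Exact-request hit: return the cached response, skip the provider entirely.
  const hit = semanticCache.get(req.signature);
  if (hit !== undefined) return { body: hit, fromSemanticCache: true };

  // 2. Miss: enrich with provider cache markers (markerInjector would run
  // here), forward upstream, and populate the semantic cache on the way out.
  const enriched = req.body;
  const res = await forward(enriched);
  semanticCache.set(req.signature, res);
  return { body: res, fromSemanticCache: false };
}
```

The ordering matters: checking the semantic cache first means a full-response hit costs nothing upstream, while prompt-cache enrichment only spends effort on requests that actually reach a provider.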
Combo Agent Middleware (open-sse/services/comboAgentMiddleware.ts):
- `system_message` override from combo config feeds into the prefix analyzer
- Cache breakpoints adjust when the system prompt changes per combo
Cache Control Policy (open-sse/utils/cacheControlPolicy.ts):
- Existing `shouldPreserveCacheControl()` logic is extended, not replaced
- New: `shouldInjectCacheControl()` for automatic injection when the client doesn't send markers
- `CACHING_PROVIDERS` set expanded: add `deepseek`, `openai`, `gemini`
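A sketch of the new helper; the signature, option names, and provider list are assumptions:

```typescript
// Assumed expanded provider set per this proposal.
const CACHING_PROVIDERS = new Set(["claude", "qwen", "deepseek", "openai", "gemini"]);

// Sketch: inject only when the provider supports caching, the feature is
// enabled, and the client didn't already send its own markers (that case
// stays with the existing shouldPreserveCacheControl() path).
function shouldInjectCacheControl(opts: {
  provider: string;
  promptCacheEnabled: boolean;
  clientSentMarkers: boolean;
}): boolean {
  return (
    opts.promptCacheEnabled &&
    CACHING_PROVIDERS.has(opts.provider) &&
    !opts.clientSentMarkers
  );
}
```

Keeping injection strictly disjoint from preservation is what makes the feature additive: clients that already manage `cache_control` see no behavior change.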
Implementation Phases
Phase 1: Prefix Analyzer + Anthropic Enhancement (1-2 weeks)
- `prefixAnalyzer.ts` with stable prefix detection
- Multi-breakpoint `cache_control` injection for Claude
- Configurable TTL and breakpoint strategy
- Basic cache metrics extraction in chatCore
Phase 2: Provider Adapters (2-3 weeks)
- Gemini `cachedContent` creation + reference
- OpenAI `prompt_cache_key` generation
- DeepSeek adapter (reuse Claude logic)
- Unified `providerAdapters.ts` interface
Phase 3: Analytics + Dashboard (1-2 weeks)
- Prompt cache metrics API endpoints
- Dashboard "Cache" tab in Analytics
- Settings page for cache configuration
- Per-provider/per-model breakdowns
Phase 4: Optimization (1 week)
- Cross-session prefix matching (session manager integration)
- Cache warming for common prefixes
- Adaptive TTL based on session length
- MCP tools for cache management
Benefits
- Cost reduction: 50-90% input token savings for repeated prefixes (system prompt + tools + history)
- Latency reduction: Cached prefixes return faster (provider-side optimization)
- Zero client changes: Transparent injection — works with Claude Code, Codex, Gemini CLI, any client
- Provider-agnostic: One config controls caching across all providers
- Full observability: Dashboard + API for cache effectiveness tracking
- Backward compatible: Existing `cache_control` passthrough preserved; new injection is additive
Use Cases
| Use Case | How Prompt Caching Helps |
|---|---|
| Long coding sessions | System prompt + tool schemas cached → 70%+ token savings on follow-ups |
| Multi-turn conversations | History prefix cached → later turns cost less |
| RAG pipelines | System + retrieved context cached → only new query is billed |
| Agent loops | Tool definitions cached → each iteration cheaper |
| Cost-optimized combos | Cache-aware routing → prefer providers with active cache |
Notes
This feature will be implemented by us. This issue is created for backlog tracking and architectural discussion before implementation begins.
Backward compatible: All features are opt-in via settings. Existing `cache_control` passthrough behavior is unchanged when prompt caching is disabled.
Complementary to Issue #811: Memory & Skill injection will benefit from prompt caching — injected memories become part of the cached prefix.
Related Issues
- fix: UI fallbacks and Electron release workflow #811 — Memory & Skill Injection from Proxy (memory context benefits from prompt caching)
- [Feature Request]: System Message Modification and Regex Tool Filtering for OmniRoute Combo #399 — Combo Agent Features (system_message override affects prefix caching)
- [Feature] Context Caching Protection (Stateless Compatible) #401 — Context Caching Protection (related `<omniModel>` tag system)
Technical Debt / Open Questions
- Gemini `cachedContent` lifecycle: How to handle expired IDs? Auto-recreate vs. client-managed?
- Cache invalidation: When the system prompt changes, how to invalidate stale prefix caches?
- Multi-breakpoint limits: Anthropic limits to 4 breakpoints — optimal placement strategy?
- Cost tracking accuracy: Provider cache pricing varies (Anthropic: 10% of input cost for cached tokens) — how to calculate exact savings?
- Streaming cache metrics: Cache usage often arrives in final SSE chunk — how to handle in streaming mode?
- Cross-provider prefix matching: Is it worth caching prefix hashes per provider to enable "cache-aware routing" (prefer provider with active cache)?