[Feature] Prompt Caching & Provider-Specific Caching Support #813

@oyi77

Description

Summary

Add intelligent prompt caching at the proxy level with provider-specific cache adapters that leverage each provider's native caching mechanisms (Anthropic cache_control, Gemini cachedContent, OpenAI prompt_cache_key, etc.) — transparently, without client-side changes.

Problem

OmniRoute currently has a response-level semantic cache (SHA-256 signature of full request → cached response) and a basic cache_control passthrough layer that preserves client-provided markers. But significant gaps remain:

  1. No automatic prompt caching: The proxy doesn't intelligently inject cache_control markers — it only preserves them if the client (e.g., Claude Code) already sends them. Clients that don't send markers get zero caching benefit.

  2. No provider-native prefix caching: Providers like Anthropic, Gemini, OpenAI, and DeepSeek support server-side prompt prefix caching (where the provider caches the KV/state for repeated prefixes). OmniRoute doesn't leverage this — every request pays full input token cost even when 90% of the prompt is identical across requests.

  3. No cache analytics: There's no dashboard or API to track cache hit rates, token savings, or cost reduction from caching. Users have no visibility into caching effectiveness.

  4. No cross-provider cache awareness: When a request falls back from Claude → DeepSeek, the cache context is lost. Each provider starts from scratch even if the system prompt hasn't changed.

  5. No Gemini cachedContent support: Gemini supports explicit cachedContent IDs for long-context reuse, but the translator doesn't create or reference them.

  6. No OpenAI prompt_cache_key support: OpenAI's automatic prompt caching (prompt_cache_key) isn't exposed through OmniRoute.

Current State

| Component | Location | What It Does |
|---|---|---|
| Semantic Cache | `src/lib/semanticCache.ts` | Caches full responses (model + messages + temp + top_p → response). Two-tier: LRU + SQLite. |
| Cache Control Policy | `open-sse/utils/cacheControlPolicy.ts` | Decides whether to preserve client `cache_control` markers. Supports Claude + Qwen. |
| Cache Control Settings | `src/lib/cacheControlSettings.ts` | Cached DB access for `alwaysPreserveClientCache` mode (auto/always/never). |
| Claude Translator | `open-sse/translator/helpers/claudeHelper.ts` | Injects `cache_control: { type: "ephemeral", ttl: "1h" }` on last system, assistant, and tool blocks. |
| Search Cache | `open-sse/services/searchCache.ts` | TTL cache for web search results with request coalescing. |
| chatCore Cache Tracking | `open-sse/handlers/chatCore.ts` | Extracts `cache_read_input_tokens` and `cache_creation_input_tokens` from response usage. |

Proposed Architecture

1. Prompt Cache Layer (src/lib/promptCache/)

A provider-agnostic prompt caching layer that intelligently manages prompt prefix caching across all providers.

```
src/lib/promptCache/
├── index.ts              // Public API: getCacheHint(), reportCacheHit()
├── prefixAnalyzer.ts     // Analyzes message arrays to find stable prefixes
├── markerInjector.ts     // Injects provider-specific cache markers
├── providerAdapters.ts   // Provider-specific cache strategies
├── store.ts              // SQLite store for cache metadata (not content)
└── analytics.ts          // Cache hit/miss/savings tracking
```

How it works:

```
Client Request (no cache awareness)
       │
       ▼
┌─────────────────────────────────────────────────────────┐
│                 OmniRoute Proxy (chatCore)               │
│                                                          │
│  1. prefixAnalyzer.analyze(messages)                     │
│     → Identifies stable prefix (system + tools + history)│
│     → Returns { prefixEndIdx, prefixHash, prefixTokens } │
│                                                          │
│  2. Check provider adapter for caching support           │
│     → Claude: inject cache_control on prefix boundary    │
│     → Gemini: create/reference cachedContent ID          │
│     → OpenAI: set prompt_cache_key                       │
│     → DeepSeek: inject cache_control (Claude-compat)     │
│                                                          │
│  3. Forward enriched request upstream                    │
│                                                          │
│  4. On response:                                         │
│     → Extract cache metrics from usage                   │
│     → Store cache metadata (prefix hash, provider, hit)  │
│     → Update analytics counters                          │
└─────────────────────────────────────────────────────────┘
```

2. Prefix Analyzer

Detects stable prefix boundaries in message arrays — the portion of the prompt that's likely to repeat across requests.

```typescript
interface PrefixAnalysis {
  prefixEndIdx: number;      // Last message index in stable prefix
  prefixHash: string;        // SHA-256 of the prefix content
  prefixTokens: number;      // Estimated token count of prefix
  prefixType: 'system_only' | 'system_and_tools' | 'system_tools_history';
  confidence: number;        // 0-1 confidence this prefix will repeat
}
```

Detection heuristics:

| Signal | Weight | Description |
|---|---|---|
| System message present | High | System prompt is almost always stable |
| Tools defined | High | Tool schemas don't change between requests |
| Message count > 4 | Medium | Longer conversations have a more stable prefix |
| First N messages identical to previous request | Very High | Direct prefix match from session history |
| User message is short (follow-up) | Medium | A short user message implies context continuation |
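To make the heuristics above concrete, here is a minimal sketch of what `prefixAnalyzer.analyze()` could look like. The function name matches the proposal; the weights, the ~4-chars-per-token estimate, and the "everything except the last user turn" prefix rule are illustrative assumptions, not the final implementation.

```typescript
// Hypothetical sketch of prefixAnalyzer.analyze() — weights and boundary
// rule are assumptions for illustration.
import { createHash } from "node:crypto";

interface Message { role: string; content: string; }

interface PrefixAnalysis {
  prefixEndIdx: number;
  prefixHash: string;
  prefixTokens: number;
  prefixType: "system_only" | "system_and_tools" | "system_tools_history";
  confidence: number;
}

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token.
  return Math.ceil(text.length / 4);
}

function analyze(messages: Message[], hasTools = false): PrefixAnalysis {
  // Treat everything except the final (new) user turn as the stable prefix.
  const lastIdx = messages.length - 1;
  const prefixEndIdx = Math.max(0, lastIdx - 1);
  const prefix = messages.slice(0, prefixEndIdx + 1);

  const hash = createHash("sha256");
  for (const m of prefix) hash.update(`${m.role}\n${m.content}\n`);

  let confidence = 0;
  if (messages[0]?.role === "system") confidence += 0.4; // system prompt is stable
  if (hasTools) confidence += 0.3;                       // tool schemas rarely change
  if (messages.length > 4) confidence += 0.2;            // long sessions repeat prefixes
  if ((messages[lastIdx]?.content.length ?? 0) < 200) confidence += 0.1; // short follow-up

  const prefixType = hasTools
    ? (prefix.length > 1 ? "system_tools_history" as const : "system_and_tools" as const)
    : "system_only" as const;

  return {
    prefixEndIdx,
    prefixHash: hash.digest("hex"),
    prefixTokens: prefix.reduce((n, m) => n + estimateTokens(m.content), 0),
    prefixType,
    confidence: Math.min(confidence, 1),
  };
}
```

The real analyzer would additionally compare against the previous request in the session (the "Very High" signal), which requires session-manager state not shown here.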

3. Provider Cache Adapters

Each provider has a different caching mechanism. The adapter layer normalizes these.

Anthropic (Claude)

```js
// Current: Manual cache_control injection on last block
// Proposed: Intelligent multi-point caching with prefix awareness

{
  type: "text",
  text: systemPrompt,
  cache_control: { type: "ephemeral", ttl: "5m" }  // System prefix
},
// ... user messages ...
{
  type: "text",
  text: "...",
  cache_control: { type: "ephemeral", ttl: "5m" }  // History checkpoint
}
```

Enhancements:

  • Support multiple cache breakpoints (not just last block)
  • Configurable TTL (5m default, 1h for long sessions)
  • Cache breakpoint strategy: auto (prefix analyzer), system-only, every-message, manual
  • Respect Anthropic's 4 cache breakpoint limit
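A minimal sketch of the breakpoint-capping logic from the list above, assuming the prefix analyzer supplies candidate indices ordered by importance. `injectBreakpoints` is a hypothetical helper name; the block shape follows Anthropic's Messages API content blocks.

```typescript
// Illustrative sketch: mark up to N content blocks with cache_control,
// respecting Anthropic's documented 4-breakpoint limit.
type TTL = "5m" | "1h";

interface ContentBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral"; ttl?: TTL };
}

const MAX_BREAKPOINTS = 4; // Anthropic's limit

function injectBreakpoints(
  blocks: ContentBlock[],
  breakpointIdxs: number[],        // candidate indices, best first
  ttl: TTL = "5m",
  maxBreakpoints = MAX_BREAKPOINTS,
): ContentBlock[] {
  // Keep only the top-priority candidates within the provider limit.
  const chosen = new Set(breakpointIdxs.slice(0, maxBreakpoints));
  return blocks.map((b, i) =>
    chosen.has(i) ? { ...b, cache_control: { type: "ephemeral", ttl } } : b,
  );
}
```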

Google (Gemini)

```js
// Gemini uses cachedContent IDs for explicit cache reuse
// Request:
{
  contents: [...],
  cachedContent: "cachedContents/abc123"  // Reference to cached prefix
}

// Or inline caching with cache markers:
{
  contents: [
    { role: "user", parts: [{ text: "..." }] }  // Cached prefix
  ],
  generationConfig: { ... }
}
```

Implementation:

  • Create cachedContent via POST /cachedContents for long prompts (>32K tokens)
  • Use implicit caching (automatic) for shorter prompts
  • Track cachedContentTokenCount in usage metadata
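The explicit-vs-implicit decision above can be sketched as a small pure function. `planGeminiCaching` and the return shape are hypothetical; the real adapter would follow the "explicit" branch with a `POST /cachedContents` call.

```typescript
// Sketch of the Gemini adapter's caching decision. Threshold mirrors the
// proposed default; names are assumptions, not the final API.
interface GeminiCachePlan {
  mode: "explicit" | "implicit";
  // For "explicit", the adapter creates (or reuses) a cachedContent resource
  // and sets cachedContent: "cachedContents/<id>" on the outgoing request.
  ttl?: string;
}

const MIN_TOKENS_FOR_CACHED_CONTENT = 32_768; // configurable default

function planGeminiCaching(prefixTokens: number, ttl = "30m"): GeminiCachePlan {
  if (prefixTokens >= MIN_TOKENS_FOR_CACHED_CONTENT) {
    return { mode: "explicit", ttl };
  }
  // Below the threshold, rely on Gemini's automatic (implicit) caching.
  return { mode: "implicit" };
}
```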

OpenAI

```js
// OpenAI has automatic prompt caching (no markers needed)
// But exposes prompt_cache_key for explicit control:
{
  model: "gpt-4o",
  messages: [...],
  prompt_cache_key: "my-session-prefix-abc123"  // Optional: explicit cache key
}
```

Implementation:

  • Expose prompt_cache_key in request params
  • Auto-generate from prefix hash for consistent caching
  • Track prompt_tokens_details.cached_tokens in response
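Auto-generation from the prefix hash could be as simple as the sketch below. The `omr-` key format is an assumption; OpenAI treats the key as an opaque routing hint, so any stable string derived from the prefix works.

```typescript
// Sketch: derive a stable prompt_cache_key from the prefix content so
// identical prefixes always carry the same key. Format is an assumption.
import { createHash } from "node:crypto";

function promptCacheKey(prefixContent: string): string {
  const hash = createHash("sha256").update(prefixContent).digest("hex");
  return `omr-${hash.slice(0, 16)}`; // short, stable, collisions unlikely
}
```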

DeepSeek

```js
// DeepSeek uses Anthropic-compatible cache_control:
{
  role: "system",
  content: [
    { type: "text", text: "...", cache_control: { type: "ephemeral" } }
  ]
}
```

Implementation:

  • Reuse Claude adapter logic
  • DeepSeek caches system prompt automatically (prefix caching)
  • Track prompt_cache_hit_tokens / prompt_cache_miss_tokens
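Extracting those two counters from a DeepSeek response usage object is a one-liner; this sketch assumes the fields may be absent and defaults them to zero.

```typescript
// Sketch: pull DeepSeek's prefix-cache counters out of response usage.
// Zero fallbacks are an assumption for responses that omit the fields.
function deepseekCacheTokens(usage: Record<string, number>) {
  return {
    cachedTokens: usage.prompt_cache_hit_tokens ?? 0,
    uncachedTokens: usage.prompt_cache_miss_tokens ?? 0,
  };
}
```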

Alibaba Qwen / Z.AI (GLM)

```js
// Qwen Coding Plan supports Anthropic-compatible prompt caching
// Already listed in CACHING_PROVIDERS
```

4. Cache Analytics

New API endpoints:

| Endpoint | Description |
|---|---|
| `GET /api/cache/prompt/stats` | Prompt cache hit rate, token savings, cost savings |
| `GET /api/cache/prompt/breakdown` | Per-provider, per-model cache breakdown |
| `GET /api/cache/prompt/sessions` | Top cached sessions by savings |
| `POST /api/cache/prompt/flush` | Clear prompt cache metadata |

Metrics tracked:

```typescript
interface PromptCacheMetrics {
  // Totals
  totalRequests: number;
  requestsWithCacheHit: number;
  hitRate: number;  // %

  // Tokens
  totalInputTokens: number;
  totalCachedTokens: number;
  totalCacheCreationTokens: number;
  tokenSavingsPercent: number;

  // Cost
  estimatedFullCost: number;
  estimatedCachedCost: number;
  costSaved: number;

  // Per-provider
  byProvider: Record<string, {
    requests: number;
    hits: number;
    cachedTokens: number;
    costSaved: number;
  }>;

  // Per-model
  byModel: Record<string, {
    requests: number;
    hits: number;
    cachedTokens: number;
  }>;

  lastUpdated: string;
}
```
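The derived fields (`hitRate`, `tokenSavingsPercent`, the cost trio) follow directly from the raw counters; a sketch of the arithmetic, with a per-provider cached-token price multiplier. The 0.1 default mirrors Anthropic's cache-read pricing and is an assumption for other providers.

```typescript
// Sketch: compute PromptCacheMetrics derived fields from raw counters.
// cachedPriceMultiplier is per-provider; 0.1 (Anthropic cache reads) is the
// assumed default here.
function deriveMetrics(raw: {
  totalRequests: number;
  requestsWithCacheHit: number;
  totalInputTokens: number;
  totalCachedTokens: number;
  inputPricePerToken: number;
  cachedPriceMultiplier?: number;
}) {
  const mult = raw.cachedPriceMultiplier ?? 0.1;
  const hitRate =
    raw.totalRequests === 0 ? 0 : (100 * raw.requestsWithCacheHit) / raw.totalRequests;
  const tokenSavingsPercent =
    raw.totalInputTokens === 0 ? 0 : (100 * raw.totalCachedTokens) / raw.totalInputTokens;
  const estimatedFullCost = raw.totalInputTokens * raw.inputPricePerToken;
  // Uncached tokens at full price + cached tokens at the discounted price.
  const estimatedCachedCost =
    (raw.totalInputTokens - raw.totalCachedTokens) * raw.inputPricePerToken +
    raw.totalCachedTokens * raw.inputPricePerToken * mult;
  return {
    hitRate,
    tokenSavingsPercent,
    estimatedFullCost,
    estimatedCachedCost,
    costSaved: estimatedFullCost - estimatedCachedCost,
  };
}
```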

5. Dashboard Integration

Extend: /dashboard/analytics — New "Cache" tab

| Section | Content |
|---|---|
| Cache Overview | Hit rate gauge, total tokens saved, cost saved |
| Provider Breakdown | Bar chart: cache hits by provider |
| Trend Chart | Line chart: cache hit rate over time |
| Top Sessions | Table: sessions with highest cache savings |
| Cache Health | Active cache entries, memory usage, eviction rate |

Extend: /dashboard/settings — New "Caching" tab

| Setting | Description | Default |
|---|---|---|
| `promptCacheEnabled` | Enable prompt caching globally | `true` |
| `promptCacheStrategy` | `auto` / `system-only` / `manual` | `auto` |
| `promptCacheDefaultTTL` | Default cache TTL for Anthropic | `5m` |
| `promptCacheMaxBreakpoints` | Max cache breakpoints (Anthropic: 4) | `4` |
| `geminiCachedContentEnabled` | Enable Gemini cachedContent creation | `true` |
| `openaiPromptCacheKeyEnabled` | Enable OpenAI `prompt_cache_key` | `true` |
| `semanticCacheEnabled` | Existing semantic cache toggle | `true` |

6. Configuration Schema

```typescript
interface PromptCacheConfig {
  enabled: boolean;
  strategy: 'auto' | 'system-only' | 'every-message' | 'manual';

  // Anthropic-specific
  anthropic: {
    enabled: boolean;
    defaultTTL: '5m' | '1h';
    maxBreakpoints: 1 | 2 | 3 | 4;
    breakpointPlacement: 'system' | 'tools' | 'last-user' | 'auto';
  };

  // Gemini-specific
  gemini: {
    enabled: boolean;
    minTokensForCachedContent: number;  // Default: 32768
    cachedContentTTL: string;  // Default: "30m"
  };

  // OpenAI-specific
  openai: {
    enabled: boolean;
    autoGenerateCacheKey: boolean;  // Default: true
  };

  // DeepSeek-specific
  deepseek: {
    enabled: boolean;
    // Reuses Anthropic adapter
  };

  // Analytics
  analytics: {
    enabled: boolean;
    retentionDays: number;  // Default: 30
  };
}
```

7. MCP Tools

| Tool | Scope | Description |
|---|---|---|
| `omniroute_cache_stats` | `read:cache` | Prompt + semantic cache statistics |
| `omniroute_cache_flush` | `write:cache` | Flush prompt cache metadata |
| `omniroute_cache_configure` | `write:cache` | Update cache settings |

8. Integration with Existing Systems

Semantic Cache (src/lib/semanticCache.ts):

  • Prompt cache and semantic cache are complementary:
    • Semantic cache: "Have I seen this exact request before?" → return cached response
    • Prompt cache: "Has the provider seen this prefix before?" → reduced input token cost
  • Both can fire on the same request (semantic cache checks first, prompt cache applies if miss)
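The check-semantic-first, enrich-on-miss ordering can be sketched as follows. All interfaces here are hypothetical stand-ins for the real `semanticCache` and `promptCache` modules.

```typescript
// Sketch of how the two caches compose in chatCore: semantic cache answers
// exact-request hits outright; on a miss, the prompt cache enriches the
// outgoing request so the provider's prefix cache can still fire.
interface SemanticCache { get(key: string): string | undefined; }
interface PromptCache { enrich(request: object): object; }

function handleRequest(
  requestKey: string,
  request: object,
  semantic: SemanticCache,
  prompt: PromptCache,
  forward: (req: object) => string,
): string {
  const hit = semantic.get(requestKey);
  if (hit !== undefined) return hit;        // full-response hit: no upstream call
  const enriched = prompt.enrich(request);  // inject provider cache markers
  return forward(enriched);                 // upstream still benefits from prefix cache
}
```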

Combo Agent Middleware (open-sse/services/comboAgentMiddleware.ts):

  • system_message override from combo config feeds into prefix analyzer
  • Cache breakpoints adjust when system prompt changes per combo

Cache Control Policy (open-sse/utils/cacheControlPolicy.ts):

  • Existing shouldPreserveCacheControl() logic is extended, not replaced
  • New: shouldInjectCacheControl() for automatic injection when client doesn't send markers
  • CACHING_PROVIDERS set expanded: add deepseek, openai, gemini

Implementation Phases

Phase 1: Prefix Analyzer + Anthropic Enhancement (1-2 weeks)

  • prefixAnalyzer.ts with stable prefix detection
  • Multi-breakpoint cache_control injection for Claude
  • Configurable TTL and breakpoint strategy
  • Basic cache metrics extraction in chatCore

Phase 2: Provider Adapters (2-3 weeks)

  • Gemini cachedContent creation + reference
  • OpenAI prompt_cache_key generation
  • DeepSeek adapter (reuse Claude logic)
  • Unified providerAdapters.ts interface

Phase 3: Analytics + Dashboard (1-2 weeks)

  • Prompt cache metrics API endpoints
  • Dashboard "Cache" tab in Analytics
  • Settings page for cache configuration
  • Per-provider/per-model breakdowns

Phase 4: Optimization (1 week)

  • Cross-session prefix matching (session manager integration)
  • Cache warming for common prefixes
  • Adaptive TTL based on session length
  • MCP tools for cache management

Benefits

  1. Cost reduction: 50-90% input token savings for repeated prefixes (system prompt + tools + history)
  2. Latency reduction: Cached prefixes return faster (provider-side optimization)
  3. Zero client changes: Transparent injection — works with Claude Code, Codex, Gemini CLI, any client
  4. Provider-agnostic: One config controls caching across all providers
  5. Full observability: Dashboard + API for cache effectiveness tracking
  6. Backward compatible: Existing cache_control passthrough preserved, new injection is additive

Use Cases

| Use Case | How Prompt Caching Helps |
|---|---|
| Long coding sessions | System prompt + tool schemas cached → 70%+ token savings on follow-ups |
| Multi-turn conversations | History prefix cached → later turns cost less |
| RAG pipelines | System + retrieved context cached → only the new query is billed |
| Agent loops | Tool definitions cached → each iteration is cheaper |
| Cost-optimized combos | Cache-aware routing → prefer providers with an active cache |

Notes

We will implement this feature ourselves; this issue exists for backlog tracking and architectural discussion before implementation begins.

Backward compatible: All features are opt-in via settings. Existing cache_control passthrough behavior is unchanged when prompt caching is disabled.

Complementary to Issue #811: Memory & Skill injection will benefit from prompt caching — injected memories become part of the cached prefix.

Technical Debt / Open Questions

  1. Gemini cachedContent lifecycle: How to handle expired IDs? Auto-recreate vs. client-managed?
  2. Cache invalidation: When system prompt changes, how to invalidate stale prefix caches?
  3. Multi-breakpoint limits: Anthropic limits to 4 breakpoints — optimal placement strategy?
  4. Cost tracking accuracy: Provider cache pricing varies (Anthropic: 10% of input cost for cached tokens) — how to calculate exact savings?
  5. Streaming cache metrics: Cache usage often arrives in final SSE chunk — how to handle in streaming mode?
  6. Cross-provider prefix matching: Is it worth caching prefix hashes per provider to enable "cache-aware routing" (prefer provider with active cache)?
