[Feature] Prompt Caching & Provider-Specific Caching Support #813

@oyi77

Description

Summary

Add intelligent prompt caching at the proxy level with provider-specific cache adapters that leverage each provider's native caching mechanisms (Anthropic cache_control, Gemini cachedContent, OpenAI prompt_cache_key, etc.) — transparently, without client-side changes.

Problem

OmniRoute currently has a response-level semantic cache (SHA-256 signature of full request → cached response) and a basic cache_control passthrough layer that preserves client-provided markers. But significant gaps remain:

  1. No automatic prompt caching: The proxy doesn't intelligently inject cache_control markers — it only preserves them if the client (e.g., Claude Code) already sends them. Clients that don't send markers get zero caching benefit.

  2. No provider-native prefix caching: Providers like Anthropic, Gemini, OpenAI, and DeepSeek support server-side prompt prefix caching (where the provider caches the KV/state for repeated prefixes). OmniRoute doesn't leverage this — every request pays full input token cost even when 90% of the prompt is identical across requests.

  3. No cache analytics: There's no dashboard or API to track cache hit rates, token savings, or cost reduction from caching. Users have no visibility into caching effectiveness.

  4. No cross-provider cache awareness: When a request falls back from Claude → DeepSeek, the cache context is lost. Each provider starts from scratch even if the system prompt hasn't changed.

  5. No Gemini cachedContent support: Gemini supports explicit cachedContent IDs for long-context reuse, but the translator doesn't create or reference them.

  6. No OpenAI prompt_cache_key support: OpenAI's automatic prompt caching (prompt_cache_key) isn't exposed through OmniRoute.

Current State

| Component | Location | What It Does |
|---|---|---|
| Semantic Cache | `src/lib/semanticCache.ts` | Caches full responses (model + messages + temp + top_p → response). Two-tier: LRU + SQLite. |
| Cache Control Policy | `open-sse/utils/cacheControlPolicy.ts` | Decides whether to preserve client `cache_control` markers. Supports Claude + Qwen. |
| Cache Control Settings | `src/lib/cacheControlSettings.ts` | Cached DB access for `alwaysPreserveClientCache` mode (auto/always/never). |
| Claude Translator | `open-sse/translator/helpers/claudeHelper.ts` | Injects `cache_control: { type: "ephemeral", ttl: "1h" }` on last system, assistant, and tool blocks. |
| Search Cache | `open-sse/services/searchCache.ts` | TTL cache for web search results with request coalescing. |
| chatCore Cache Tracking | `open-sse/handlers/chatCore.ts` | Extracts `cache_read_input_tokens` and `cache_creation_input_tokens` from response usage. |

Proposed Architecture

1. Prompt Cache Layer (src/lib/promptCache/)

A provider-agnostic prompt caching layer that intelligently manages prompt prefix caching across all providers.

```
src/lib/promptCache/
├── index.ts              // Public API: getCacheHint(), reportCacheHit()
├── prefixAnalyzer.ts     // Analyzes message arrays to find stable prefixes
├── markerInjector.ts     // Injects provider-specific cache markers
├── providerAdapters.ts   // Provider-specific cache strategies
├── store.ts              // SQLite store for cache metadata (not content)
└── analytics.ts          // Cache hit/miss/savings tracking
```

How it works:

```
Client Request (no cache awareness)
       │
       ▼
┌─────────────────────────────────────────────────────────┐
│                 OmniRoute Proxy (chatCore)               │
│                                                          │
│  1. prefixAnalyzer.analyze(messages)                     │
│     → Identifies stable prefix (system + tools + history)│
│     → Returns { prefixEndIdx, prefixHash, prefixTokens } │
│                                                          │
│  2. Check provider adapter for caching support           │
│     → Claude: inject cache_control on prefix boundary    │
│     → Gemini: create/reference cachedContent ID          │
│     → OpenAI: set prompt_cache_key                       │
│     → DeepSeek: inject cache_control (Claude-compat)     │
│                                                          │
│  3. Forward enriched request upstream                    │
│                                                          │
│  4. On response:                                         │
│     → Extract cache metrics from usage                   │
│     → Store cache metadata (prefix hash, provider, hit)  │
│     → Update analytics counters                          │
└─────────────────────────────────────────────────────────┘
```

2. Prefix Analyzer

Detects stable prefix boundaries in message arrays — the portion of the prompt that's likely to repeat across requests.

```typescript
interface PrefixAnalysis {
  prefixEndIdx: number;      // Last message index in stable prefix
  prefixHash: string;        // SHA-256 of the prefix content
  prefixTokens: number;      // Estimated token count of prefix
  prefixType: 'system_only' | 'system_and_tools' | 'system_tools_history';
  confidence: number;        // 0-1 confidence this prefix will repeat
}
```

Detection heuristics:

| Signal | Weight | Description |
|---|---|---|
| System message present | High | System prompt is almost always stable |
| Tools defined | High | Tool schemas don't change between requests |
| Message count > 4 | Medium | Longer conversations have a more stable prefix |
| First N messages identical to previous request | Very High | Direct prefix match from session history |
| User message is short (follow-up) | Medium | A short user message implies context continuation |
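To make the heuristics above concrete, here is a minimal sketch of what `prefixAnalyzer.analyze()` could look like. The function name matches the proposal; the weights, the ~4-chars-per-token estimate, and the "everything except the last user turn" prefix rule are illustrative assumptions, not the final implementation.

```typescript
// Hypothetical sketch of prefixAnalyzer.analyze() — weights and boundary
// rule are assumptions for illustration.
import { createHash } from "node:crypto";

interface Message { role: string; content: string; }

interface PrefixAnalysis {
  prefixEndIdx: number;
  prefixHash: string;
  prefixTokens: number;
  prefixType: "system_only" | "system_and_tools" | "system_tools_history";
  confidence: number;
}

function estimateTokens(text: string): number {
  // Rough heuristic: ~4 characters per token.
  return Math.ceil(text.length / 4);
}

function analyze(messages: Message[], hasTools = false): PrefixAnalysis {
  // Treat everything except the final (new) user turn as the stable prefix.
  const lastIdx = messages.length - 1;
  const prefixEndIdx = Math.max(0, lastIdx - 1);
  const prefix = messages.slice(0, prefixEndIdx + 1);

  const hash = createHash("sha256");
  for (const m of prefix) hash.update(`${m.role}\n${m.content}\n`);

  let confidence = 0;
  if (messages[0]?.role === "system") confidence += 0.4; // system prompt is stable
  if (hasTools) confidence += 0.3;                       // tool schemas rarely change
  if (messages.length > 4) confidence += 0.2;            // long sessions repeat prefixes
  if ((messages[lastIdx]?.content.length ?? 0) < 200) confidence += 0.1; // short follow-up

  const prefixType = hasTools
    ? (prefix.length > 1 ? "system_tools_history" as const : "system_and_tools" as const)
    : "system_only" as const;

  return {
    prefixEndIdx,
    prefixHash: hash.digest("hex"),
    prefixTokens: prefix.reduce((n, m) => n + estimateTokens(m.content), 0),
    prefixType,
    confidence: Math.min(confidence, 1),
  };
}
```

The real analyzer would additionally compare against the previous request in the session (the "Very High" signal), which requires session-manager state not shown here.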

3. Provider Cache Adapters

Each provider has a different caching mechanism. The adapter layer normalizes these.

Anthropic (Claude)

```js
// Current: Manual cache_control injection on last block
// Proposed: Intelligent multi-point caching with prefix awareness

{
  type: "text",
  text: systemPrompt,
  cache_control: { type: "ephemeral", ttl: "5m" }  // System prefix
},
// ... user messages ...
{
  type: "text",
  text: "...",
  cache_control: { type: "ephemeral", ttl: "5m" }  // History checkpoint
}
```

Enhancements:

  • Support multiple cache breakpoints (not just last block)
  • Configurable TTL (5m default, 1h for long sessions)
  • Cache breakpoint strategy: auto (prefix analyzer), system-only, every-message, manual
  • Respect Anthropic's 4 cache breakpoint limit
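A minimal sketch of the breakpoint-capping logic from the list above, assuming the prefix analyzer supplies candidate indices ordered by importance. `injectBreakpoints` is a hypothetical helper name; the block shape follows Anthropic's Messages API content blocks.

```typescript
// Illustrative sketch: mark up to N content blocks with cache_control,
// respecting Anthropic's documented 4-breakpoint limit.
type TTL = "5m" | "1h";

interface ContentBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral"; ttl?: TTL };
}

const MAX_BREAKPOINTS = 4; // Anthropic's limit

function injectBreakpoints(
  blocks: ContentBlock[],
  breakpointIdxs: number[],        // candidate indices, best first
  ttl: TTL = "5m",
  maxBreakpoints = MAX_BREAKPOINTS,
): ContentBlock[] {
  // Keep only the top-priority candidates within the provider limit.
  const chosen = new Set(breakpointIdxs.slice(0, maxBreakpoints));
  return blocks.map((b, i) =>
    chosen.has(i) ? { ...b, cache_control: { type: "ephemeral", ttl } } : b,
  );
}
```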

Google (Gemini)

```js
// Gemini uses cachedContent IDs for explicit cache reuse
// Request:
{
  contents: [...],
  cachedContent: "cachedContents/abc123"  // Reference to cached prefix
}

// Or inline caching with cache markers:
{
  contents: [
    { role: "user", parts: [{ text: "..." }] }  // Cached prefix
  ],
  generationConfig: { ... }
}
```

Implementation:

  • Create cachedContent via POST /cachedContents for long prompts (>32K tokens)
  • Use implicit caching (automatic) for shorter prompts
  • Track cachedContentTokenCount in usage metadata
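The explicit-vs-implicit decision above can be sketched as a small pure function. `planGeminiCaching` and the return shape are hypothetical; the real adapter would follow the "explicit" branch with a `POST /cachedContents` call.

```typescript
// Sketch of the Gemini adapter's caching decision. Threshold mirrors the
// proposed default; names are assumptions, not the final API.
interface GeminiCachePlan {
  mode: "explicit" | "implicit";
  // For "explicit", the adapter creates (or reuses) a cachedContent resource
  // and sets cachedContent: "cachedContents/<id>" on the outgoing request.
  ttl?: string;
}

const MIN_TOKENS_FOR_CACHED_CONTENT = 32_768; // configurable default

function planGeminiCaching(prefixTokens: number, ttl = "30m"): GeminiCachePlan {
  if (prefixTokens >= MIN_TOKENS_FOR_CACHED_CONTENT) {
    return { mode: "explicit", ttl };
  }
  // Below the threshold, rely on Gemini's automatic (implicit) caching.
  return { mode: "implicit" };
}
```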

OpenAI

```js
// OpenAI has automatic prompt caching (no markers needed)
// But exposes prompt_cache_key for explicit control:
{
  model: "gpt-4o",
  messages: [...],
  prompt_cache_key: "my-session-prefix-abc123"  // Optional: explicit cache key
}
```

Implementation:

  • Expose prompt_cache_key in request params
  • Auto-generate from prefix hash for consistent caching
  • Track prompt_tokens_details.cached_tokens in response
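Auto-generation from the prefix hash could be as simple as the sketch below. The `omr-` key format is an assumption; OpenAI treats the key as an opaque routing hint, so any stable string derived from the prefix works.

```typescript
// Sketch: derive a stable prompt_cache_key from the prefix content so
// identical prefixes always carry the same key. Format is an assumption.
import { createHash } from "node:crypto";

function promptCacheKey(prefixContent: string): string {
  const hash = createHash("sha256").update(prefixContent).digest("hex");
  return `omr-${hash.slice(0, 16)}`; // short, stable, collisions unlikely
}
```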

DeepSeek

```js
// DeepSeek uses Anthropic-compatible cache_control:
{
  role: "system",
  content: [
    { type: "text", text: "...", cache_control: { type: "ephemeral" } }
  ]
}
```

Implementation:

  • Reuse Claude adapter logic
  • DeepSeek caches system prompt automatically (prefix caching)
  • Track prompt_cache_hit_tokens / prompt_cache_miss_tokens
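Extracting those two counters from a DeepSeek response usage object is a one-liner; this sketch assumes the fields may be absent and defaults them to zero.

```typescript
// Sketch: pull DeepSeek's prefix-cache counters out of response usage.
// Zero fallbacks are an assumption for responses that omit the fields.
function deepseekCacheTokens(usage: Record<string, number>) {
  return {
    cachedTokens: usage.prompt_cache_hit_tokens ?? 0,
    uncachedTokens: usage.prompt_cache_miss_tokens ?? 0,
  };
}
```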

Alibaba Qwen / Z.AI (GLM)

```js
// Qwen Coding Plan supports Anthropic-compatible prompt caching
// Already listed in CACHING_PROVIDERS
```

4. Cache Analytics

New API endpoints:

| Endpoint | Description |
|---|---|
| `GET /api/cache/prompt/stats` | Prompt cache hit rate, token savings, cost savings |
| `GET /api/cache/prompt/breakdown` | Per-provider, per-model cache breakdown |
| `GET /api/cache/prompt/sessions` | Top cached sessions by savings |
| `POST /api/cache/prompt/flush` | Clear prompt cache metadata |

Metrics tracked:

```typescript
interface PromptCacheMetrics {
  // Totals
  totalRequests: number;
  requestsWithCacheHit: number;
  hitRate: number;  // %

  // Tokens
  totalInputTokens: number;
  totalCachedTokens: number;
  totalCacheCreationTokens: number;
  tokenSavingsPercent: number;

  // Cost
  estimatedFullCost: number;
  estimatedCachedCost: number;
  costSaved: number;

  // Per-provider
  byProvider: Record<string, {
    requests: number;
    hits: number;
    cachedTokens: number;
    costSaved: number;
  }>;

  // Per-model
  byModel: Record<string, {
    requests: number;
    hits: number;
    cachedTokens: number;
  }>;

  lastUpdated: string;
}
```
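The derived fields (`hitRate`, `tokenSavingsPercent`, the cost trio) follow directly from the raw counters; a sketch of the arithmetic, with a per-provider cached-token price multiplier. The 0.1 default mirrors Anthropic's cache-read pricing and is an assumption for other providers.

```typescript
// Sketch: compute PromptCacheMetrics derived fields from raw counters.
// cachedPriceMultiplier is per-provider; 0.1 (Anthropic cache reads) is the
// assumed default here.
function deriveMetrics(raw: {
  totalRequests: number;
  requestsWithCacheHit: number;
  totalInputTokens: number;
  totalCachedTokens: number;
  inputPricePerToken: number;
  cachedPriceMultiplier?: number;
}) {
  const mult = raw.cachedPriceMultiplier ?? 0.1;
  const hitRate =
    raw.totalRequests === 0 ? 0 : (100 * raw.requestsWithCacheHit) / raw.totalRequests;
  const tokenSavingsPercent =
    raw.totalInputTokens === 0 ? 0 : (100 * raw.totalCachedTokens) / raw.totalInputTokens;
  const estimatedFullCost = raw.totalInputTokens * raw.inputPricePerToken;
  // Uncached tokens at full price + cached tokens at the discounted price.
  const estimatedCachedCost =
    (raw.totalInputTokens - raw.totalCachedTokens) * raw.inputPricePerToken +
    raw.totalCachedTokens * raw.inputPricePerToken * mult;
  return {
    hitRate,
    tokenSavingsPercent,
    estimatedFullCost,
    estimatedCachedCost,
    costSaved: estimatedFullCost - estimatedCachedCost,
  };
}
```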

5. Dashboard Integration

Extend: /dashboard/analytics — New "Cache" tab

| Section | Content |
|---|---|
| Cache Overview | Hit rate gauge, total tokens saved, cost saved |
| Provider Breakdown | Bar chart: cache hits by provider |
| Trend Chart | Line chart: cache hit rate over time |
| Top Sessions | Table: sessions with highest cache savings |
| Cache Health | Active cache entries, memory usage, eviction rate |

Extend: /dashboard/settings — New "Caching" tab

| Setting | Description | Default |
|---|---|---|
| `promptCacheEnabled` | Enable prompt caching globally | `true` |
| `promptCacheStrategy` | `auto` / `system-only` / `manual` | `auto` |
| `promptCacheDefaultTTL` | Default cache TTL for Anthropic | `5m` |
| `promptCacheMaxBreakpoints` | Max cache breakpoints (Anthropic: 4) | `4` |
| `geminiCachedContentEnabled` | Enable Gemini cachedContent creation | `true` |
| `openaiPromptCacheKeyEnabled` | Enable OpenAI `prompt_cache_key` | `true` |
| `semanticCacheEnabled` | Existing semantic cache toggle | `true` |

6. Configuration Schema

```typescript
interface PromptCacheConfig {
  enabled: boolean;
  strategy: 'auto' | 'system-only' | 'every-message' | 'manual';

  // Anthropic-specific
  anthropic: {
    enabled: boolean;
    defaultTTL: '5m' | '1h';
    maxBreakpoints: 1 | 2 | 3 | 4;
    breakpointPlacement: 'system' | 'tools' | 'last-user' | 'auto';
  };

  // Gemini-specific
  gemini: {
    enabled: boolean;
    minTokensForCachedContent: number;  // Default: 32768
    cachedContentTTL: string;  // Default: "30m"
  };

  // OpenAI-specific
  openai: {
    enabled: boolean;
    autoGenerateCacheKey: boolean;  // Default: true
  };

  // DeepSeek-specific
  deepseek: {
    enabled: boolean;
    // Reuses Anthropic adapter
  };

  // Analytics
  analytics: {
    enabled: boolean;
    retentionDays: number;  // Default: 30
  };
}
```

7. MCP Tools

| Tool | Scope | Description |
|---|---|---|
| `omniroute_cache_stats` | `read:cache` | Prompt + semantic cache statistics |
| `omniroute_cache_flush` | `write:cache` | Flush prompt cache metadata |
| `omniroute_cache_configure` | `write:cache` | Update cache settings |

8. Integration with Existing Systems

Semantic Cache (src/lib/semanticCache.ts):

  • Prompt cache and semantic cache are complementary:
    • Semantic cache: "Have I seen this exact request before?" → return cached response
    • Prompt cache: "Has the provider seen this prefix before?" → reduced input token cost
  • Both can fire on the same request (semantic cache checks first, prompt cache applies if miss)
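The check-semantic-first, enrich-on-miss ordering can be sketched as follows. All interfaces here are hypothetical stand-ins for the real `semanticCache` and `promptCache` modules.

```typescript
// Sketch of how the two caches compose in chatCore: semantic cache answers
// exact-request hits outright; on a miss, the prompt cache enriches the
// outgoing request so the provider's prefix cache can still fire.
interface SemanticCache { get(key: string): string | undefined; }
interface PromptCache { enrich(request: object): object; }

function handleRequest(
  requestKey: string,
  request: object,
  semantic: SemanticCache,
  prompt: PromptCache,
  forward: (req: object) => string,
): string {
  const hit = semantic.get(requestKey);
  if (hit !== undefined) return hit;        // full-response hit: no upstream call
  const enriched = prompt.enrich(request);  // inject provider cache markers
  return forward(enriched);                 // upstream still benefits from prefix cache
}
```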

Combo Agent Middleware (open-sse/services/comboAgentMiddleware.ts):

  • system_message override from combo config feeds into prefix analyzer
  • Cache breakpoints adjust when system prompt changes per combo

Cache Control Policy (open-sse/utils/cacheControlPolicy.ts):

  • Existing shouldPreserveCacheControl() logic is extended, not replaced
  • New: shouldInjectCacheControl() for automatic injection when client doesn't send markers
  • CACHING_PROVIDERS set expanded: add deepseek, openai, gemini

Implementation Phases

Phase 1: Prefix Analyzer + Anthropic Enhancement (1-2 weeks)

  • prefixAnalyzer.ts with stable prefix detection
  • Multi-breakpoint cache_control injection for Claude
  • Configurable TTL and breakpoint strategy
  • Basic cache metrics extraction in chatCore

Phase 2: Provider Adapters (2-3 weeks)

  • Gemini cachedContent creation + reference
  • OpenAI prompt_cache_key generation
  • DeepSeek adapter (reuse Claude logic)
  • Unified providerAdapters.ts interface

Phase 3: Analytics + Dashboard (1-2 weeks)

  • Prompt cache metrics API endpoints
  • Dashboard "Cache" tab in Analytics
  • Settings page for cache configuration
  • Per-provider/per-model breakdowns

Phase 4: Optimization (1 week)

  • Cross-session prefix matching (session manager integration)
  • Cache warming for common prefixes
  • Adaptive TTL based on session length
  • MCP tools for cache management

Benefits

  1. Cost reduction: 50-90% input token savings for repeated prefixes (system prompt + tools + history)
  2. Latency reduction: Cached prefixes return faster (provider-side optimization)
  3. Zero client changes: Transparent injection — works with Claude Code, Codex, Gemini CLI, any client
  4. Provider-agnostic: One config controls caching across all providers
  5. Full observability: Dashboard + API for cache effectiveness tracking
  6. Backward compatible: Existing cache_control passthrough preserved, new injection is additive

Use Cases

| Use Case | How Prompt Caching Helps |
|---|---|
| Long coding sessions | System prompt + tool schemas cached → 70%+ token savings on follow-ups |
| Multi-turn conversations | History prefix cached → later turns cost less |
| RAG pipelines | System + retrieved context cached → only the new query is billed |
| Agent loops | Tool definitions cached → each iteration is cheaper |
| Cost-optimized combos | Cache-aware routing → prefer providers with an active cache |

Notes

We will implement this feature ourselves; this issue exists for backlog tracking and architectural discussion before implementation begins.

Backward compatible: All features are opt-in via settings. Existing cache_control passthrough behavior is unchanged when prompt caching is disabled.

Complementary to Issue #811: Memory & Skill injection will benefit from prompt caching — injected memories become part of the cached prefix.

Technical Debt / Open Questions

  1. Gemini cachedContent lifecycle: How to handle expired IDs? Auto-recreate vs. client-managed?
  2. Cache invalidation: When system prompt changes, how to invalidate stale prefix caches?
  3. Multi-breakpoint limits: Anthropic limits to 4 breakpoints — optimal placement strategy?
  4. Cost tracking accuracy: Provider cache pricing varies (Anthropic: 10% of input cost for cached tokens) — how to calculate exact savings?
  5. Streaming cache metrics: Cache usage often arrives in final SSE chunk — how to handle in streaming mode?
  6. Cross-provider prefix matching: Is it worth caching prefix hashes per provider to enable "cache-aware routing" (prefer provider with active cache)?
