Problem
Profiling shows that 88% of conversion time is spent on two operations that produce identical results across consecutive requests in the same conversation:
| Hotspot | % of conversion time | What it does |
|---|---|---|
| IR validation (_vendor/validate.py) | 63% | Recursively validates IRRequest TypedDict: every tool definition, every message, every content part |
| Schema sanitization (base/schema.py) | 25% | Recursively strips unsupported JSON Schema keywords from every tool's parameter schema |
| Actual conversion logic | ~5% | Already fast |
This makes Rosetta ~6.3x slower than LiteLLM on real-world payloads (64-msg: 2.4ms vs 0.4ms; 218-msg: 4.9ms vs 0.8ms).
Why this is wasteful
In a multi-turn agent conversation (e.g. Claude Code with 41 tools, 218 messages):
- Tool definitions are identical across all turns. The same 31-41 tools get validated + sanitized on every single request.
- Messages are append-only. Turn N's messages are a strict prefix of turn N+1's. Previously-validated messages get re-validated every turn.
Approach: process-level LRU caching
Phase 1: Tool list caching (highest impact, smallest change)
Cache at two levels using content-hash of the tools list:
- _convert_tools_from_p(): cache the provider_tools → IR tools conversion result
- _apply_tool_config() / ir_tool_definition_to_p(): cache the IR tools → provider tools conversion result (includes sanitize_schema)
This skips both validation and sanitization for tools on cache hit. Expected to eliminate 60-70% of total conversion cost for tool-heavy payloads.
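A minimal sketch of the content-hash tool cache, assuming the tools list is JSON-serializable. The names cached_tool_conversion, tools_fingerprint, and the namespace argument are hypothetical and not part of the existing converters. The namespace keeps the two conversion directions from sharing entries.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Callable, Dict, List

_TOOL_CACHE_MAX = 256  # tool cache size suggested in the implementation notes
_TOOL_CACHE = OrderedDict()  # (namespace, fingerprint) -> converted tools


def tools_fingerprint(tools: List[Dict[str, Any]]) -> str:
    """Content hash of a tools list; dict key order does not affect it."""
    payload = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_tool_conversion(
    namespace: str,
    tools: List[Dict[str, Any]],
    convert: Callable[[List[Dict[str, Any]]], Any],
) -> Any:
    """Run convert(tools) once per distinct tools content (hypothetical helper)."""
    key = (namespace, tools_fingerprint(tools))
    if key in _TOOL_CACHE:
        _TOOL_CACHE.move_to_end(key)  # mark as most recently used
        return _TOOL_CACHE[key]
    result = convert(tools)  # validation + sanitization happen only on a miss
    _TOOL_CACHE[key] = result
    if len(_TOOL_CACHE) > _TOOL_CACHE_MAX:
        _TOOL_CACHE.popitem(last=False)  # evict the least recently used entry
    return result
```

The cached result is returned by reference, so callers would need to treat it as read-only or copy it before mutating.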
Phase 2: Incremental message validation (second priority)
Per-message hash-based validation cache:
- Hash each message dict individually
- On validate_ir_request, only validate messages not seen in the LRU cache
- Growing conversations only pay validation cost for new messages
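The per-message scheme above could look roughly like the following. ValidatedMessageCache and validate_one are hypothetical names; the real per-message validator in _vendor/validate.py may have a different shape. The sketch assumes the validator raises on invalid input, so a failing message is never recorded as validated.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Callable, Dict, List


class ValidatedMessageCache:
    """Bounded LRU set of fingerprints for messages that already passed validation."""

    def __init__(self, max_entries: int = 4096) -> None:
        self._seen = OrderedDict()  # fingerprint -> None, oldest first
        self._max = max_entries

    @staticmethod
    def _key(message: Dict[str, Any]) -> str:
        payload = json.dumps(message, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def validate_new(
        self,
        messages: List[Dict[str, Any]],
        validate_one: Callable[[Dict[str, Any]], None],
    ) -> int:
        """Validate only unseen messages; return how many were validated."""
        validated = 0
        for message in messages:
            key = self._key(message)
            if key in self._seen:
                self._seen.move_to_end(key)  # refresh recency on a hit
                continue
            validate_one(message)  # raises on invalid input, so nothing is cached
            self._seen[key] = None
            validated += 1
            if len(self._seen) > self._max:
                self._seen.popitem(last=False)  # drop the least recently used
        return validated
```

With an append-only conversation, turn N+1 then pays hashing cost for every message but validation cost only for the messages added since turn N.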
Implementation notes
- Use process-level functools.lru_cache or a simple dict with bounded size
- Hash strategy: hash(json.dumps(obj, sort_keys=True)) or structural fingerprint; need to benchmark hash cost vs validation cost
- No persistence needed initially — process restart clears cache, which is fine
- Thread safety: converters are used from async handlers but in a single-threaded event loop, so no locking needed
- Cache size: bounded LRU (e.g. 256 entries for tools, 4096 for messages) to prevent unbounded growth
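One way to start the hash-cost benchmark mentioned in the notes is a small timeit harness like the one below. The function names and the sample payload are hypothetical, and the payload is far simpler than real tool definitions, so the numbers are only a starting point. It compares the built-in hash() of the JSON string with a SHA-256 digest. The built-in hash is only machine-word sized and is salted per process, which is acceptable for a process-level cache but makes collisions more plausible than with a cryptographic digest.

```python
import hashlib
import json
import timeit
from typing import Any, Dict


def builtin_hash_key(obj: Any) -> int:
    """Key strategy from the notes: built-in hash of the canonical JSON string."""
    return hash(json.dumps(obj, sort_keys=True))


def sha256_key(obj: Any) -> str:
    """Alternative key: SHA-256 digest of the same canonical JSON string."""
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode("utf-8")).hexdigest()


def time_key_strategies(obj: Any, number: int = 1000) -> Dict[str, float]:
    """Return mean seconds per call for each keying strategy."""
    return {
        "builtin_hash": timeit.timeit(lambda: builtin_hash_key(obj), number=number) / number,
        "sha256": timeit.timeit(lambda: sha256_key(obj), number=number) / number,
    }


# Hypothetical stand-in for a 41-tool payload; real schemas are larger.
sample_tools = [
    {
        "name": f"tool_{i}",
        "parameters": {"type": "object", "properties": {"path": {"type": "string"}}},
    }
    for i in range(41)
]

if __name__ == "__main__":
    for name, seconds in time_key_strategies(sample_tools).items():
        print(f"{name}: {seconds * 1e6:.1f} us per call")
```

Comparing these figures against the measured validation and sanitization time for the same payload indicates whether hashing is cheap enough for the cache to pay off.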
Measurement
Profile data from benchmarks/bench_real_payload.py and benchmarks/bench_real_litellm_comparison.py. Key command:
conda activate llm-rosetta && cd benchmarks && python bench_real_litellm_comparison.py
Target: bring the 6.3x gap vs LiteLLM down to <2x on real payloads.
Non-goals (for now)
- Persistent (disk-backed) cache — only if process-level cache proves effective
- Cross-process shared cache — not needed for single-process gateway
- Disabling validation entirely — cache approach preserves correctness guarantees