Cache tool conversion and IR validation to eliminate repeated work #276

@Oaklight

Problem

Profiling shows that 88% of conversion time is spent on two operations that produce identical results across consecutive requests in the same conversation:

| Hotspot | % of conversion time | What it does |
| --- | --- | --- |
| IR validation (_vendor/validate.py) | 63% | Recursively validates IRRequest TypedDict: every tool definition, every message, every content part |
| Schema sanitization (base/schema.py) | 25% | Recursively strips unsupported JSON Schema keywords from every tool's parameter schema |
| Actual conversion logic | ~5% | Already fast |

This makes Rosetta ~6.3x slower than LiteLLM on real-world payloads (64-msg: 2.4ms vs 0.4ms; 218-msg: 4.9ms vs 0.8ms).

Why this is wasteful

In a multi-turn agent conversation (e.g. Claude Code with 41 tools, 218 messages):

  • Tool definitions are identical across all turns. The same 31-41 tools get validated + sanitized on every single request.
  • Messages are append-only. Turn N's messages are a strict prefix of turn N+1's. Previously-validated messages get re-validated every turn.

Approach: process-level LRU caching

Phase 1: Tool list caching (highest impact, smallest change)

Cache at two levels using content-hash of the tools list:

  1. _convert_tools_from_p() — cache provider_tools → IR tools conversion result
  2. _apply_tool_config() / ir_tool_definition_to_p() — cache IR tools → provider tools conversion result (includes sanitize_schema)

This skips both validation and sanitization for tools on cache hit. Expected to eliminate 60-70% of total conversion cost for tool-heavy payloads.
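A minimal sketch of the content-hash cache described above, under some assumptions: `tools_fingerprint` and `convert_tools_cached` are hypothetical names, the `convert` callable stands in for whichever real conversion function gets wrapped, and tool definitions are assumed to be JSON-serializable. The 256-entry bound is the figure suggested in the implementation notes.

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Callable

# Hypothetical process-level cache: fingerprint -> converted tools.
_TOOLS_CACHE: "OrderedDict[str, list[dict[str, Any]]]" = OrderedDict()
_TOOLS_CACHE_MAX = 256


def tools_fingerprint(tools: list[dict[str, Any]]) -> str:
    """Content hash of a tools list; key order inside each dict does not matter."""
    payload = json.dumps(tools, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def convert_tools_cached(
    tools: list[dict[str, Any]],
    convert: Callable[[list[dict[str, Any]]], list[dict[str, Any]]],
) -> list[dict[str, Any]]:
    """Return convert(tools), reusing the earlier result for identical content."""
    key = tools_fingerprint(tools)
    cached = _TOOLS_CACHE.get(key)
    if cached is not None:
        _TOOLS_CACHE.move_to_end(key)  # mark as most recently used
        return cached
    result = convert(tools)
    _TOOLS_CACHE[key] = result
    if len(_TOOLS_CACHE) > _TOOLS_CACHE_MAX:
        _TOOLS_CACHE.popitem(last=False)  # evict the least recently used entry
    return result
```

On a hit the same list object is returned, so callers would need to treat the result as read-only (or copy it) to avoid corrupting the cached entry. The order of tools in the list is part of the fingerprint, so a reordered list counts as a miss.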

Phase 2: Incremental message validation (second priority)

Per-message hash-based validation cache:

  • Hash each message dict individually
  • On validate_ir_request, only validate messages not seen in the LRU cache
  • Growing conversations only pay validation cost for new messages
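The per-message cache could look roughly like the sketch below. `ValidatedMessageCache` and `validate_one` are hypothetical names, not part of the current codebase, and the sketch assumes a message's validity does not depend on its neighbours (if validation checks cross-message references, those checks would still have to run on every request).

```python
import hashlib
import json
from collections import OrderedDict
from typing import Any, Callable


class ValidatedMessageCache:
    """Bounded LRU set of fingerprints for messages that already passed validation."""

    def __init__(self, max_entries: int = 4096) -> None:
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max_entries = max_entries

    @staticmethod
    def fingerprint(message: dict[str, Any]) -> str:
        payload = json.dumps(message, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def validate_new(
        self,
        messages: list[dict[str, Any]],
        validate_one: Callable[[dict[str, Any]], None],
    ) -> int:
        """Validate only unseen messages; return how many were actually validated."""
        validated = 0
        for message in messages:
            key = self.fingerprint(message)
            if key in self._seen:
                self._seen.move_to_end(key)  # refresh recency on a hit
                continue
            # validate_one is expected to raise on invalid input,
            # so a failing message is never recorded as seen.
            validate_one(message)
            self._seen[key] = None
            validated += 1
            if len(self._seen) > self._max_entries:
                self._seen.popitem(last=False)
        return validated
```

With an append-only conversation, turn N+1 hashes every message but validates only the ones appended since turn N, so the saving depends on hashing being cheaper than validation.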

Implementation notes

  • Use process-level functools.lru_cache or a simple dict with bounded size
  • Hash strategy: hash(json.dumps(obj, sort_keys=True)) or structural fingerprint — need to benchmark hash cost vs validation cost
  • No persistence needed initially — process restart clears cache, which is fine
  • Thread safety: converters are called from async handlers, but these run on a single-threaded event loop, so no locking is needed
  • Cache size: bounded LRU (e.g. 256 entries for tools, 4096 for messages) to prevent unbounded growth
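To check the hash-cost question raised above, a small timing harness along these lines might help. It is only a sketch: `walk` is a hypothetical stand-in for recursive validation and should be swapped for the real validator, and `fingerprint` uses a SHA-256 digest of the canonical JSON rather than the built-in `hash()`, which is one possible choice among several.

```python
import hashlib
import json
import timeit
from typing import Any


def fingerprint(obj: Any) -> str:
    """Digest of the canonical JSON form (sorted keys, compact separators)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def walk(obj: Any) -> int:
    """Hypothetical stand-in for recursive validation: visits every node once."""
    if isinstance(obj, dict):
        return 1 + sum(walk(v) for v in obj.values())
    if isinstance(obj, list):
        return 1 + sum(walk(v) for v in obj)
    return 1


def compare(obj: Any, repeat: int = 200) -> dict[str, float]:
    """Return average seconds per call for hashing and for the stand-in walk."""
    hash_s = timeit.timeit(lambda: fingerprint(obj), number=repeat) / repeat
    walk_s = timeit.timeit(lambda: walk(obj), number=repeat) / repeat
    return {"hash_s": hash_s, "walk_s": walk_s}
```

Running `compare` on a captured real payload, with the actual validation function in place of `walk`, would show whether a cache lookup is meaningfully cheaper than revalidating.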

Measurement

Profile data from benchmarks/bench_real_payload.py and benchmarks/bench_real_litellm_comparison.py. Key command:

conda activate llm-rosetta && cd benchmarks && python bench_real_litellm_comparison.py

Target: bring the 6.3x gap vs LiteLLM down to <2x on real payloads.

Non-goals (for now)

  • Persistent (disk-backed) cache — only if process-level cache proves effective
  • Cross-process shared cache — not needed for single-process gateway
  • Disabling validation entirely — cache approach preserves correctness guarantees

Labels: P1 (Priority 1: High), enhancement, performance
