fix: honor repl --model (#30) + send prompt_cache_key on Responses API (#24) #31
Draft
akrentsel wants to merge 2 commits into
When the target agent already exists, the REPL reused it and silently dropped `--model`. Now an explicit `--model` that differs from the agent's stored model updates it and prints a notice (`updated agent 'X' model: A -> B`); a no-op when it already matches.
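The update-or-no-op rule described above can be sketched in isolation. This is a minimal illustration, not the PR's code: `AgentConfig` is a hypothetical stand-in with only a `model` field, and `ModelOverride`, `apply_model_override`, and `notice` are names invented for this example. Persisting the config (`put_config` in the diff) is deliberately left out.

```rust
/// Outcome of reconciling an explicit `--model` flag with the stored model.
#[derive(Debug, PartialEq)]
enum ModelOverride {
    /// No flag was given, or the flag already matches the stored model.
    Unchanged,
    /// The stored model was replaced; carries the previous value for the notice.
    Updated { previous: String },
}

/// Hypothetical stand-in for the agent config (assumption: only `model` matters here).
struct AgentConfig {
    model: String,
}

/// Apply an explicit `--model` value, mutating the config only when it differs.
fn apply_model_override(config: &mut AgentConfig, requested: Option<&str>) -> ModelOverride {
    match requested {
        Some(model) if model != config.model => {
            // Swap in the new model and keep the old one for the notice.
            let previous = std::mem::replace(&mut config.model, model.to_string());
            ModelOverride::Updated { previous }
        }
        _ => ModelOverride::Unchanged,
    }
}

/// Format the notice in the shape quoted in the commit message.
fn notice(agent_slug: &str, previous: &str, current: &str) -> String {
    format!("updated agent '{agent_slug}' model: {previous} -> {current}")
}
```

Returning an enum rather than printing inside the function keeps the decision testable; the caller decides whether to persist the change and print the notice.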
The Responses endpoint sees ~0% prompt-cache hit rate at low request volume unless a stable `prompt_cache_key` is sent. exo never sent one, so cached input was billed at full rate (and PR16's UsageRecord always showed prompt_cached_tokens: 0). Add a per-conversation cache key: `ModelRequest.prompt_cache_key` is set to the conversation id in `build_model_request`, and `build_universal_request` injects it into lingua's per-format extras as `prompt_cache_key` for the Responses API. Per-conversation so each turn reuses the growing shared prefix (system prompt + tools + prior history) while staying well under OpenAI's ~15 req/min-per-key guidance. RLM paths pass None (different caching dynamics, out of scope). Verified live: two turns sharing a ~2.5k-token prefix on gpt-4o-mini report cached tokens 0 -> 2304 on the second call. Unit tests assert the key lands in (and stays out of) the serialized Responses body.
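The flow from conversation id to per-format extras might look roughly like the sketch below. It uses only the standard library, so the types are assumptions rather than lingua's real ones: `Extras` is a plain nested `BTreeMap` with string values, the `ChatCompletions` variant is invented to show a second format, and `inject_cache_key` is an illustrative name for the step that `build_universal_request` performs.

```rust
use std::collections::BTreeMap;

/// Provider wire formats. `ChatCompletions` is a hypothetical second variant.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum ProviderFormat {
    Responses,
    ChatCompletions,
}

/// Simplified request: only the field this sketch cares about.
struct ModelRequest {
    prompt_cache_key: Option<String>,
}

/// Assumed shape of the per-format extras: format -> (field name -> value).
type Extras = BTreeMap<ProviderFormat, BTreeMap<String, String>>;

/// Use the conversation id as the cache key; paths without one pass `None`.
fn build_model_request(conversation_id: Option<&str>) -> ModelRequest {
    ModelRequest {
        prompt_cache_key: conversation_id.map(str::to_string),
    }
}

/// Add `prompt_cache_key` to the Responses extras only when a key is set.
fn inject_cache_key(request: &ModelRequest, extras: &mut Extras) {
    if let Some(key) = &request.prompt_cache_key {
        extras
            .entry(ProviderFormat::Responses)
            .or_default()
            .insert("prompt_cache_key".to_string(), key.clone());
    }
}
```

One design difference from the diff under review: this sketch merges into any existing Responses extras via `entry(..).or_default()`, whereas a plain `insert` of a fresh map would, if `extras` behaves like a standard map, replace whatever was already stored for that format.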
a23f228 to
b684a04
Compare
ankrgyl
reviewed
May 30, 2026
Comment on lines +520 to +526
if config.model != resolved {
    let previous = std::mem::replace(&mut config.model, resolved.clone());
    agent.put_config(config).await?;
    println!(
        "updated agent '{agent_slug}' model: {previous} -> {resolved}"
    );
}
Owner
i don't think it should side-effect and change the agent config
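One possible reading of this comment, sketched under assumptions: treat `--model` as a per-session override and leave the stored config untouched. `AgentConfig` is a hypothetical stand-in with only a `model` field, and `effective_model` is an illustrative name, not a function from the PR.

```rust
/// Hypothetical stand-in for the stored agent config.
struct AgentConfig {
    model: String,
}

/// Pick the model for this REPL session without mutating the stored config:
/// an explicit flag wins, otherwise fall back to the agent's stored model.
fn effective_model<'a>(stored: &'a AgentConfig, flag: Option<&'a str>) -> &'a str {
    flag.unwrap_or(stored.model.as_str())
}
```

Because `stored` is borrowed immutably, the compiler guarantees the agent's persisted model cannot change on this path.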
Comment on lines +217 to +227
if let Some(cache_key) = &request.prompt_cache_key {
    let mut responses_extras = lingua_json::Map::new();
    responses_extras.insert(
        "prompt_cache_key".to_string(),
        lingua_json::Value::String(cache_key.clone()),
    );
    params
        .extras
        .insert(ProviderFormat::Responses, responses_extras);
}
Owner
oh hm i think we should probably support this in lingua. but we don't need to block this PR on that.
Fixes two bugs found while building the cost-tracking stack. Stacked on #29 (`usage-command`), since the before/after is demonstrated via that PR's `/usage` command. Two independent commits.

#30: `exo repl --model X` silently ignored on an existing agent

`exo repl` reused an existing agent and dropped `--model` with no feedback, so you'd think you'd switched models but hadn't. Now an explicit `--model` that differs from the agent's stored model updates it and prints a notice (`updated agent 'X' model: A -> B`). No-op when it already matches. (`crates/cli/src/main.rs`, the `Repl` arm.)

#24: no `prompt_cache_key` on the OpenAI Responses API → ~0% cache hits

The Responses endpoint needs a stable `prompt_cache_key` to reliably route prompt-cache hits at low request volume; exo never sent one, so cached input was billed at full rate and `UsageRecord.prompt_cached_tokens` was always 0.

Fix: a per-conversation cache key.

- `ModelRequest` gains a `prompt_cache_key` field.
- `build_model_request` sets it to the conversation id.
- `build_universal_request` injects it into lingua's per-format extras (`prompt_cache_key`) for the Responses API.

Per-conversation so each turn reuses the growing shared prefix (system prompt + tools + prior history), and each conversation gets its own key, staying well under OpenAI's ~15 req/min-per-key guidance. RLM paths pass `None` (different caching dynamics, out of scope). No lingua fork needed: the passthrough already exists.

Verified live (gpt-4o-mini, ~2.5k-token shared prefix, two turns in one conversation)

`/usage` after: `input : 5,054 tokens (2,304 cached)`, the savings PR16's cost tracking was built to surface.

Test plan

- `cargo test --workspace` (84 tests): adds `responses_request_includes_prompt_cache_key_when_set` / `..._omits_..._when_unset`, which assert the key lands in (and stays out of) the serialized Responses body.
- `cargo fmt --all -- --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- #30 (`exo repl --model X` is silently ignored when the agent already exists): model-switch notice + persisted model change; #24 (OpenAI Responses API: cache misses on every request, no `prompt_cache_key` sent): cache-hit jump shown above.

Stacking note

Base is `usage-command` (#29), which is itself on `feature/message-cost-tracking` (#16). Merge order: #16 → #29 → this. Bases auto-retarget as each lands.

🤖 Generated with Claude Code