fix: honor repl --model (#30) + send prompt_cache_key on Responses API (#24) #31

Draft
akrentsel wants to merge 2 commits into usage-command from fix-model-and-cache

Conversation

@akrentsel
Collaborator

Fixes two bugs found while building the cost-tracking stack. Stacked on #29 (usage-command), since the before/after is demonstrated via that PR's /usage command. Two independent commits.

#30: exo repl --model X silently ignored on an existing agent

exo repl reused an existing agent and dropped --model with no feedback, so you'd think you'd switched models but hadn't. Now an explicit --model that differs from the agent's stored model updates it and prints a notice:

updated agent 'demo' model: gpt-4o-mini -> gpt-5.5

No-op when it already matches. (crates/cli/src/main.rs, the Repl arm.)
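The behavior described above can be sketched as a small pure function. This is only an illustration: `AgentConfig` and `apply_model_override` are hypothetical names, not the types in crates/cli/src/main.rs, and persisting the config (the PR calls `agent.put_config`) is left out.

```rust
/// Hypothetical stand-in for the agent's stored configuration.
#[derive(Debug, Clone, PartialEq)]
struct AgentConfig {
    model: String,
}

/// Applies an explicit `--model` value to `config`.
/// Returns the notice to print when the model changed, or `None` when
/// no flag was given or the stored model already matches.
fn apply_model_override(
    agent_slug: &str,
    config: &mut AgentConfig,
    requested: Option<&str>,
) -> Option<String> {
    // No `--model` flag: keep the stored model untouched.
    let resolved = requested?;
    if config.model == resolved {
        return None;
    }
    // Swap in the new model and keep the old one for the notice.
    let previous = std::mem::replace(&mut config.model, resolved.to_string());
    Some(format!(
        "updated agent '{agent_slug}' model: {previous} -> {resolved}"
    ))
}

fn main() {
    let mut config = AgentConfig {
        model: "gpt-4o-mini".to_string(),
    };
    // First call changes the model and yields a notice.
    if let Some(notice) = apply_model_override("demo", &mut config, Some("gpt-5.5")) {
        println!("{notice}");
    }
    // Repeating the same flag is a no-op.
    assert!(apply_model_override("demo", &mut config, Some("gpt-5.5")).is_none());
}
```

Returning the notice instead of printing it keeps the decision testable; the caller decides whether to persist the change and what to show.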

#24 — no prompt_cache_key on the OpenAI Responses API → ~0% cache hits

The Responses endpoint needs a stable prompt_cache_key to reliably route prompt-cache hits at low request volume; exo never sent one, so cached input was billed at full rate and UsageRecord.prompt_cached_tokens was always 0.

Fix: a per-conversation cache key.

  • ModelRequest gains a prompt_cache_key field.
  • build_model_request sets it to the conversation id.
  • build_universal_request injects it into lingua's per-format extras (prompt_cache_key) for the Responses API.

The key is per-conversation so that each turn reuses the growing shared prefix (system prompt + tools + prior history) while each conversation gets its own key, which stays well under OpenAI's ~15 req/min-per-key guidance. RLM paths pass None (different caching dynamics, out of scope). No lingua fork is needed: the passthrough already exists.
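A rough sketch of the injection step, using only the standard library. The types here (`ProviderFormat`, `ModelRequest`, `Extras`) are simplified, hypothetical stand-ins and do not reproduce lingua's actual API; the point is only that the key is attached to the Responses format when present and omitted entirely when the request carries `None`.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the provider-format enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ProviderFormat {
    Responses,
    ChatCompletions,
}

/// Hypothetical stand-in for the request type; only the relevant field is shown.
struct ModelRequest {
    /// Set to the conversation id for normal turns, `None` for RLM paths.
    prompt_cache_key: Option<String>,
}

/// Extra body fields, grouped by the provider format they apply to.
type Extras = HashMap<ProviderFormat, HashMap<String, String>>;

/// Builds the per-format extras for a request.
fn build_extras(request: &ModelRequest) -> Extras {
    let mut extras = Extras::new();
    if let Some(cache_key) = &request.prompt_cache_key {
        // Only the Responses format receives the cache key.
        extras
            .entry(ProviderFormat::Responses)
            .or_default()
            .insert("prompt_cache_key".to_string(), cache_key.clone());
    }
    extras
}

fn main() {
    let chat_turn = ModelRequest {
        prompt_cache_key: Some("conv-123".to_string()),
    };
    let rlm_call = ModelRequest {
        prompt_cache_key: None,
    };

    let extras = build_extras(&chat_turn);
    println!("{extras:?}");
    // Other formats are left untouched.
    assert!(!extras.contains_key(&ProviderFormat::ChatCompletions));
    // Without a key, nothing is injected at all.
    assert!(build_extras(&rlm_call).is_empty());
}
```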

Verified live (gpt-4o-mini, ~2.5k-token shared prefix, two turns in one conversation)

| call | prompt tokens | cached tokens |
| --- | --- | --- |
| 1 | 2513 | 0 |
| 2 | 2541 | 2304 |

/usage after: input : 5,054 tokens (2,304 cached) — the savings PR16's cost tracking was built to surface.
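For reference, the totals follow directly from the two calls reported above. The snippet below is a small arithmetic check, not code from the PR; `Call` and `cached_fraction` are hypothetical names introduced for illustration.

```rust
/// Token counts for one call, copied from the verification run above.
struct Call {
    prompt: u64,
    cached: u64,
}

/// Fraction of prompt tokens served from the cache across `calls`.
fn cached_fraction(calls: &[Call]) -> f64 {
    let prompt: u64 = calls.iter().map(|c| c.prompt).sum();
    let cached: u64 = calls.iter().map(|c| c.cached).sum();
    if prompt == 0 {
        0.0
    } else {
        cached as f64 / prompt as f64
    }
}

fn main() {
    let calls = [
        Call { prompt: 2513, cached: 0 },
        Call { prompt: 2541, cached: 2304 },
    ];
    let total_input: u64 = calls.iter().map(|c| c.prompt).sum();
    let total_cached: u64 = calls.iter().map(|c| c.cached).sum();

    println!("input: {total_input} tokens ({total_cached} cached)");
    println!("overall cached share: {:.1}%", 100.0 * cached_fraction(&calls));
    println!("second-call cached share: {:.1}%", 100.0 * cached_fraction(&calls[1..]));
}
```

With these numbers the second call reads roughly 91% of its prompt from the cache, or about 46% of the input across both calls.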

Test plan

Stacking note

Base is usage-command (#29), which is itself on feature/message-cost-tracking (#16). Merge order: #16 → #29 → this. Bases auto-retarget as each lands.

🤖 Generated with Claude Code

akrentsel added 2 commits May 27, 2026 20:00
When the target agent already exists, the REPL reused it and silently
dropped `--model`. Now an explicit `--model` that differs from the agent's
stored model updates it and prints a notice
(`updated agent 'X' model: A -> B`); a no-op when it already matches.

The Responses endpoint sees ~0% prompt-cache hit rate at low request volume
unless a stable `prompt_cache_key` is sent. exo never sent one, so cached
input was billed at full rate (and PR16's UsageRecord always showed
prompt_cached_tokens: 0).

Add a per-conversation cache key: `ModelRequest.prompt_cache_key` is set to
the conversation id in `build_model_request`, and `build_universal_request`
injects it into lingua's per-format extras as `prompt_cache_key` for the
Responses API. Per-conversation so each turn reuses the growing shared
prefix (system prompt + tools + prior history) while staying well under
OpenAI's ~15 req/min-per-key guidance. RLM paths pass None (different
caching dynamics, out of scope).

Verified live: two turns sharing a ~2.5k-token prefix on gpt-4o-mini report
cached tokens 0 -> 2304 on the second call. Unit tests assert the key lands
in the serialized Responses body when set and stays out of it otherwise.

Comment thread crates/cli/src/main.rs
Comment on lines +520 to +526
    if config.model != resolved {
        let previous = std::mem::replace(&mut config.model, resolved.clone());
        agent.put_config(config).await?;
        println!(
            "updated agent '{agent_slug}' model: {previous} -> {resolved}"
        );
    }
Owner

i don't think it should side-effect and change the agent config

Comment on lines +217 to +227
    if let Some(cache_key) = &request.prompt_cache_key {
        let mut responses_extras = lingua_json::Map::new();
        responses_extras.insert(
            "prompt_cache_key".to_string(),
            lingua_json::Value::String(cache_key.clone()),
        );
        params
            .extras
            .insert(ProviderFormat::Responses, responses_extras);
    }

Owner

oh hm i think we should probably support this in lingua. but we don't need to block this PR on that.
