fix: honor repl --model (#30) + send prompt_cache_key on Responses API (#24) by akrentsel · Pull Request #31 · ankrgyl/exo

akrentsel · 2026-05-27T18:25:14Z

Fixes two bugs found while building the cost-tracking stack. Stacked on #29 (usage-command), since the before/after is demonstrated via that PR's /usage command. Two independent commits.

#30 — `exo repl --model X` silently ignored on an existing agent

exo repl reused an existing agent and dropped --model with no feedback, so you'd think you'd switched models but hadn't. Now an explicit --model that differs from the agent's stored model updates it and prints a notice:

updated agent 'demo' model: gpt-4o-mini -> gpt-5.5

No-op when it already matches. (crates/cli/src/main.rs, the Repl arm.)
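A minimal sketch of the comparison described above, kept free of the CLI and storage layers. `ModelChange`, `reconcile_model`, and `notice` are hypothetical names for illustration, not the code in crates/cli/src/main.rs; only the notice format is taken from this PR.

```rust
/// Outcome of reconciling an explicit `--model` flag with an agent's stored model.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
pub enum ModelChange {
    /// No flag was given, or the flag matches the stored model.
    Unchanged,
    /// The stored model was replaced.
    Updated { previous: String, current: String },
}

/// Updates `stored` in place only when `requested` is present and differs.
pub fn reconcile_model(stored: &mut String, requested: Option<&str>) -> ModelChange {
    match requested {
        Some(model) if model != stored.as_str() => {
            // Swap in the new model and keep the old one for the notice.
            let previous = std::mem::replace(stored, model.to_string());
            ModelChange::Updated {
                previous,
                current: model.to_string(),
            }
        }
        _ => ModelChange::Unchanged,
    }
}

/// Formats the user-facing notice, or `None` when nothing changed.
pub fn notice(agent_slug: &str, change: &ModelChange) -> Option<String> {
    match change {
        ModelChange::Updated { previous, current } => Some(format!(
            "updated agent '{agent_slug}' model: {previous} -> {current}"
        )),
        ModelChange::Unchanged => None,
    }
}
```

Separating the decision from the persistence step means the caller can choose whether to write the new model back to the agent config or apply it for the current session only.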

#24 — no `prompt_cache_key` on the OpenAI Responses API → ~0% cache hits

The Responses endpoint needs a stable prompt_cache_key to reliably route prompt-cache hits at low request volume; exo never sent one, so cached input was billed at full rate and UsageRecord.prompt_cached_tokens was always 0.

Fix: a per-conversation cache key.

- `ModelRequest` gains a `prompt_cache_key` field.
- `build_model_request` sets it to the conversation id.
- `build_universal_request` injects it into lingua's per-format extras (`prompt_cache_key`) for the Responses API.

Per-conversation so each turn reuses the growing shared prefix (system prompt + tools + prior history), and each conversation gets its own key — staying well under OpenAI's ~15 req/min-per-key guidance. RLM paths pass None (different caching dynamics, out of scope). No lingua fork needed — the passthrough already exists.
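The flow above can be sketched with std-only types. This is a simplified stand-in, not lingua's API: the `Extras` alias, the `ChatCompletions` variant, the `responses_extras` helper, and the one-argument `build_model_request` signature are assumptions made so the example compiles on its own. Only the `prompt_cache_key` field name and `ProviderFormat::Responses` come from this PR.

```rust
use std::collections::BTreeMap;

/// Stand-in for the provider format enum; only `Responses` appears in the PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum ProviderFormat {
    ChatCompletions, // hypothetical second variant, for contrast
    Responses,
}

/// Reduced to the one field this PR adds.
pub struct ModelRequest {
    pub prompt_cache_key: Option<String>,
}

/// Per-format extra parameters (a simplification of lingua's extras map).
pub type Extras = BTreeMap<ProviderFormat, BTreeMap<String, String>>;

/// Uses the conversation id as the cache key; `None` models the RLM paths.
pub fn build_model_request(conversation_id: Option<&str>) -> ModelRequest {
    ModelRequest {
        prompt_cache_key: conversation_id.map(str::to_owned),
    }
}

/// Adds `prompt_cache_key` for the Responses format only when a key is set.
pub fn responses_extras(request: &ModelRequest) -> Extras {
    let mut extras = Extras::new();
    if let Some(key) = &request.prompt_cache_key {
        let mut responses = BTreeMap::new();
        responses.insert("prompt_cache_key".to_string(), key.clone());
        extras.insert(ProviderFormat::Responses, responses);
    }
    extras
}
```

Because the key is derived from the conversation id, every turn in a conversation sends the same value, while separate conversations never share a key.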

Verified live (gpt-4o-mini, ~2.5k-token shared prefix, two turns in one conversation)

| call | prompt tokens | cached tokens |
| --- | --- | --- |
| 1 | 2513 | 0 |
| 2 | 2541 | 2304 |

/usage after: input : 5,054 tokens (2,304 cached) — the savings PR16's cost tracking was built to surface.
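The figures above can be checked with a little arithmetic: the two calls sum to 5,054 prompt tokens, and 2,304 of the second call's 2,541 prompt tokens (roughly 91%) were served from cache, or about 46% of all input across both calls. The sketch below reproduces that calculation; the `Call` struct and helper names are hypothetical and unrelated to exo's `UsageRecord`.

```rust
/// Token counts for one model call (hypothetical type, for illustration only).
pub struct Call {
    pub prompt_tokens: u64,
    pub cached_tokens: u64,
}

/// Sums prompt and cached tokens across calls: (total_prompt, total_cached).
pub fn totals(calls: &[Call]) -> (u64, u64) {
    calls.iter().fold((0, 0), |(prompt, cached), call| {
        (prompt + call.prompt_tokens, cached + call.cached_tokens)
    })
}

/// Percentage of prompt tokens served from cache; 0.0 when there is no input.
pub fn cached_percent(prompt: u64, cached: u64) -> f64 {
    if prompt == 0 {
        0.0
    } else {
        100.0 * cached as f64 / prompt as f64
    }
}
```

Feeding in the two calls from the table gives totals of (5054, 2304), matching the /usage line.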

Test plan

- `cargo test --workspace` (84 tests): adds `responses_request_includes_prompt_cache_key_when_set` / `..._omits_..._when_unset`, which assert the key lands in (and stays out of) the serialized Responses body.
- `cargo fmt --all -- --check`
- `cargo clippy --workspace --all-targets -- -D warnings`
- Manual: #30 model-switch notice + persisted model change; #24 cache-hit jump shown above.

Stacking note

Base is usage-command (#29), which is itself on feature/message-cost-tracking (#16). Merge order: #16 → #29 → this. Bases auto-retarget as each lands.

🤖 Generated with Claude Code

When the target agent already exists, the REPL reused it and silently dropped `--model`. Now an explicit `--model` that differs from the agent's stored model updates it and prints a notice (`updated agent 'X' model: A -> B`); a no-op when it already matches.

The Responses endpoint sees ~0% prompt-cache hit rate at low request volume unless a stable `prompt_cache_key` is sent. exo never sent one, so cached input was billed at full rate (and PR16's UsageRecord always showed prompt_cached_tokens: 0). Add a per-conversation cache key: `ModelRequest.prompt_cache_key` is set to the conversation id in `build_model_request`, and `build_universal_request` injects it into lingua's per-format extras as `prompt_cache_key` for the Responses API. Per-conversation so each turn reuses the growing shared prefix (system prompt + tools + prior history) while staying well under OpenAI's ~15 req/min-per-key guidance. RLM paths pass None (different caching dynamics, out of scope). Verified live: two turns sharing a ~2.5k-token prefix on gpt-4o-mini report cached tokens 0 -> 2304 on the second call. Unit tests assert the key lands in (and stays out of) the serialized Responses body.

ankrgyl · 2026-05-30T18:18:58Z

+                        if config.model != resolved {
+                            let previous = std::mem::replace(&mut config.model, resolved.clone());
+                            agent.put_config(config).await?;
+                            println!(
+                                "updated agent '{agent_slug}' model: {previous} -> {resolved}"
+                            );
+                        }


i don't think it should side-effect and change the agent config

ankrgyl · 2026-05-30T18:20:04Z

+    if let Some(cache_key) = &request.prompt_cache_key {
+        let mut responses_extras = lingua_json::Map::new();
+        responses_extras.insert(
+            "prompt_cache_key".to_string(),
+            lingua_json::Value::String(cache_key.clone()),
+        );
+        params
+            .extras
+            .insert(ProviderFormat::Responses, responses_extras);
+    }
+


oh hm i think we should probably support this in lingua. but we don't need to block this PR on that.

akrentsel added 2 commits May 27, 2026 20:00

akrentsel force-pushed the fix-model-and-cache branch from a23f228 to b684a04 Compare May 27, 2026 20:00

ankrgyl mentioned this pull request May 30, 2026

fix: honour --model flag when repl agent already exists #35

Merged

ankrgyl reviewed May 30, 2026

akrentsel wants to merge 2 commits into usage-command from fix-model-and-cache
