history: per-message cost, tokens, and latency tracking by akrentsel · Pull Request #16 · ankrgyl/exo

akrentsel · 2026-05-24T22:02:52Z

Summary

Adds an optional UsageRecord to every EventData::Messages event so we have a durable, per-message record of:

Model id (echoed by the provider, or the requested binding as fallback)
Raw token counts — prompt, completion, cached, cache-creation, reasoning
USD cost — computed at call time from the LiteLLM pricing database (loaded at runtime, see below)
TTFT + wall-clock duration — measured in the executor on both streaming and non-streaming paths
server_duration_ms — reserved (lingua does not yet surface a provider-reported processing time)

Pricing: why runtime LiteLLM JSON, not a hardcoded table

OpenAI and Anthropic standard APIs return tokens but not USD in their per-call responses. (OpenAI never; Anthropic only in a separate aggregate Admin API.) So cost must always be computed downstream.

Initial version hardcoded a small price table in Rust. That was a mistake — by the time I wrote the first commit, my table already had Claude Opus 4.7 at $15/$75 per MTok when the real price had dropped to $5/$25. Hand-maintained tables drift.

This version loads LiteLLM's pricing database (2,739 model entries, community-maintained, covers all major providers + Bedrock/Azure/Vertex regional variants):

First use: HTTP fetch into `$XDG_CACHE_HOME/exo/litellm_prices.json`
Subsequent uses (within 24h): cached
After 24h: re-fetch, fall back to stale cache if network fails
All fails: empty table → tokens persist, `cost_usd: None` (no crash)
Env var overrides: `EXO_LITELLM_PRICES_PATH` (local file) and `EXO_LITELLM_PRICES_URL` (alternate source)

Cached-vs-fresh: per-provider accounting matters

Different providers report cached tokens with different conventions, and getting this wrong distorts cost by up to ~10× on cache-heavy requests:

Anthropic-family (anthropic, bedrock_converse, vertex_ai-anthropic_models, azure_ai): prompt_tokens is fresh input only. cache_read and cache_creation are separate. Bill all three additively.
OpenAI-family (openai, mistral, etc.): prompt_tokens is total (including cached). Cached is a subset. Must subtract cached from prompt before billing fresh-input rate.

The first commit got OpenAI wrong (used the additive formula universally). This version branches on LiteLLM's `litellm_provider` field and applies the correct formula. Both formulas have dedicated unit tests with realistic token mixes.

Architecture

`exoharness::pricing` (pure data + math, no network, stays wasm-compatible):
- `PricingTable::from_json_str` parses LiteLLM's schema (silently ignores entries that don't have per-token rates, like image/embedding entries).
- `PricingTable::lookup` does exact + longest-prefix match (dated revisions like `claude-sonnet-4-6-20251022` resolve to `claude-sonnet-4-6`).
- `PricingTable::compute_cost_usd` does per-provider math.
`executor::pricing_loader` (network layer, gated by `tokio::sync::OnceCell`):
- `get_pricing_table()` returns `Arc`; loads once per process, then zero-cost.
- Resolution: `EXO_LITELLM_PRICES_PATH` → cache (fresh) → fetch → stale cache → empty.
`BasicExecutor::with_pricing` + `BasicHarness::with_pricing_table`: explicit-table constructors. Bypass the loader, useful for tests/embedders/air-gapped deployments.

What's in the event JSON now

```json
{
"type": "messages",
"messages": [...],
"response_id": "01J...",
"usage": {
"model": "claude-sonnet-4-6",
"prompt_tokens": 2847,
"completion_tokens": 412,
"prompt_cached_tokens": 12500,
"cost_usd": 0.0146985,
"ttft_ms": 842,
"duration_ms": 3210
}
}
```

All `usage` sub-fields are `Option` + `skip_serializing_if`. Legacy events with no `usage` key continue to deserialize.

Test plan

`cargo test --workspace` — 66 tests pass (was 60 on main; +11 pricing units, +1 end-to-end cost assertion, +2 backward-compat for UsageRecord JSON, -7 from refactoring the previous hardcoded-table tests away).
Pricing unit tests cover: Anthropic additive (no cache, cache hits, cache creation), OpenAI inclusive (with and without cache, subtraction semantics asserted), Bedrock regional surcharge, longest-prefix lookup, sample_spec doc entry skipped, unknown models, provider-style classification.
`pricing_loader::tests::local_path_override_is_honored` exercises the env-var override path.
End-to-end test `usage_record_is_persisted_with_computed_cost` uses `with_pricing_table` to inject an inline fixture — assertion is hermetic, no network dependency in CI.
`cargo test --package exo --test integration_chat -- --ignored` (mocked OpenAI + real local-process sandbox): unchanged, passes.
Loader exercised live: fetch succeeded, cache populated, subsequent calls cache-hit.

Not in scope (intentional)

`/cost` REPL command — persisting silently for now. UI surface is an easy follow-up.
Server-reported duration — lingua doesn't expose this today; field reserved for when it does.
Cache tier resolution — `UniversalUsage` collapses Anthropic's 5-min and 1-hour cache writes into one count. Cost defaults to the 5-minute rate. `cache_creation_input_token_cost_above_1hr` is parsed but unused.
Anthropic's aggregate Usage/Cost Admin API — separate org-level endpoint, aggregate-only, can't be tied to a specific message. Not useful for per-message tracking.

🤖 Generated with Claude Code

akrentsel · 2026-05-24T22:09:34Z

please don't review the code yet – I'm looking into a better way to get pricing details...

akrentsel · 2026-05-25T17:01:55Z

This is addressing #15

…ents Each LLM response now carries an optional UsageRecord that snapshots the model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost computed at call time from a baked-in price table, and both TTFT and wall-clock duration. Why bake in the price table: provider APIs return tokens, not dollars, and prices change. Computing cost downstream means the recorded value can drift or become wrong if rates are revised; computing at call time freezes the price that actually applied to that call. Changes: - crates/exoharness/src/pricing.rs (new): per-model rates (Claude 4.x and OpenAI 4o/o-series for now), longest-prefix lookup so dated revisions (e.g. claude-sonnet-4-6-20251022) resolve, compute_cost_usd helper. - crates/exoharness/src/types.rs: UsageRecord struct; EventData::Messages gains optional usage field with serde(default) for backward compat. - crates/executor: ModelResponse threads model + ttft + duration through; BasicExecutor::complete_model_round measures total duration on both streaming and non-streaming paths; interpret_model_response assembles a UsageRecord and attaches it to the persisted Messages event. Tests: - 7 unit tests in pricing covering exact and prefix lookup, cost arithmetic, unknown-model handling, and missing-cached-rate fallback. - Round-trip tests proving legacy Messages JSON without `usage` still parses, and that emitted JSON contains the new fields. - End-to-end test (harness_basic_tests::usage_record_is_persisted_with_computed_cost) that runs a fake-model send through BasicHarness and asserts the persisted event carries the expected UsageRecord with computed cost ($0.0105 for 1000 prompt + 500 completion on claude-sonnet-4-6). Not in scope (intentional): /cost slash command in the REPL, server-reported duration (lingua does not yet surface this — field reserved as Option for later). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replaces the hand-maintained pricing table with the community-maintained LiteLLM pricing database (model_prices_and_context_window.json), downloaded on first use and cached for 24h. Two motivations: 1. The hardcoded table was already wrong. Claude Opus 4.7 was listed at $15/$75 per MTok; the real rate is $5/$25 (the table predated a price cut). Drift like this is inevitable with hand maintenance. 2. Cached-vs-fresh pricing requires per-provider accounting that the previous code got wrong for OpenAI: - Anthropic-family: prompt_tokens excludes cached. Bill prompt, cache_read, and cache_creation as additive line items. - OpenAI-family: prompt_tokens includes cached. Must subtract cached before billing fresh-input rate, or cached tokens get billed at ~10x their real rate. Both formulas are now implemented and unit-tested with realistic token mixes from each family. Architecture: - `exoharness::pricing` (pure): `PricingTable::from_json_str` parses LiteLLM's schema; `compute_cost_usd` does the per-provider math. No network, no globals. Stays wasm-compatible. - `executor::pricing_loader` (network layer): global `OnceCell` loads the table on first use. Resolution order: `EXO_LITELLM_PRICES_PATH` override → on-disk cache at $XDG_CACHE_HOME/exo/litellm_prices.json (24h TTL) → reqwest fetch from `EXO_LITELLM_PRICES_URL` → stale cache fallback → empty table (cost = None, tokens still persisted; no crash). - `BasicExecutor::with_pricing` and `BasicHarness::with_pricing_table` inject an explicit table, bypassing the global loader. Used by the end-to-end test for deterministic, hermetic cost assertions. Tests: - 11 pricing unit tests using an inline LiteLLM-shape fixture, covering: Anthropic additive (no cache; cache hits; cache creation), OpenAI inclusive (with and without cache hits, with the subtraction semantics explicitly asserted), Bedrock regional surcharge (10% markup on us.anthropic.* entries), longest-prefix lookup for dated revisions, sample_spec doc entry skipped, unknown models return None, provider-style classification per litellm_provider value. - pricing_loader::tests::local_path_override_is_honored exercises the env-var override path. - Existing end-to-end test now uses with_pricing_table to inject an inline fixture, so the $0.0105 assertion no longer depends on the upstream JSON's current rates or on network availability in CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e paths Pin down the post-lingua boundary at the conversation-log level: - Anthropic additive: fresh prompt + cache_read + cache_creation each bill against their own rate (cost = 0.015 on the fixture). - OpenAI inclusive: prompt_tokens includes cached, must be subtracted before billing the fresh-input rate (cost = 0.0008625, vs 0.0009375 if the executor mistakenly used the additive formula). The two provider conventions are exercised in isolation by the pricing.rs unit tests; these new tests prove the same accounting survives ModelResponse -> UsageRecord -> persisted Messages event. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A truncated or garbage cache file (or a 200-with-HTML-body fetch) now resolves to an empty pricing table — cost_usd ends up None — rather than propagating a parse error or filling in wrong numbers. Cost data is best-effort; a bad cache must never take down a turn. Also: only write the cache when the fetched body actually parses, so we never persist garbage that would poison the next run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…cter CI Rebasing onto main (which landed #4/#14/#26/#27) surfaced three things: - New construction sites from main needed the fields this PR adds: `usage: None` on an EventData::Messages in exoharness tests, and `enable_agent_tool_creation: true` on the CreateAgentRequest sites in the cost tests. - #27 turned on `cargo clippy --workspace --all-targets -- -D warnings`. Adding UsageRecord (~170 bytes) inline to EventData::Messages bloated EventData enough to trip `large_enum_variant` on HostToGuestMessage, which transitively embeds it. Box the `usage` field so EventData stays small (serde treats Box<T> identically, so the on-disk JSON is unchanged). Also dropped the redundant `BasicToolRuntime::default()` unit-struct construction. - Reformatted per `cargo fmt` to pass the formatting gate. Local CI now green: fmt, clippy -D warnings, and `cargo test --workspace --all-targets` (78 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…tion - ModelResponse no longer derives Default; nothing constructed it that way. - Document that the pricing table is loaded once per process, so a long-running service would need a 24h refresh to avoid stale rates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

akrentsel · 2026-05-27T16:50:15Z

@ankrgyl this is ready to review, added tests

akrentsel · 2026-05-27T17:00:29Z

@ankrgyl this is ready to review, added tests

Alexsun1one

two small things on the price lookup side, otherwise this looks great. the LiteLLM-as-source-of-truth + per-provider accounting story is the right call, and i like that with_pricing keeps embedders/air-gapped runs honest. inline notes below.

Alexsun1one · 2026-05-27T18:12:12Z

+        }
+        self.entries
+            .iter()
+            .filter(|(key, _)| model.starts_with(key.as_str()))


prefix-match without a separator boundary can fall through in surprising ways. e.g. gpt-4o-mini matches gpt-4o (intended) but also matches gpt-4 (not intended), since "gpt-4o".starts_with("gpt-4") is true; if the more specific entry is missing the longest-prefix winner is still wrong.

if the goal is the claude-sonnet-4-6-20251022 -> claude-sonnet-4-6 case, tightening the filter to require model[key.len()..] be empty or start with - / : keeps that behavior and stops gpt-4o from sliding into gpt-4 when an entry is absent.

Alexsun1one · 2026-05-27T18:12:13Z

+        // two when cached==0 (the typical case).
+        match provider {
+            Some(p) if p.starts_with("anthropic") => Self::Additive,
+            Some(p) if p.starts_with("bedrock") => Self::Additive,


this catches bedrock_converse correctly but also catches plain bedrock provider entries, which on Bedrock covers Mistral, Cohere, Meta Llama, AI21, and friends. those follow OpenAI-style inclusive prompt_tokens, not Anthropic-style additive.

the real-world impact is small today because cache_read on non-Anthropic Bedrock models is uncommon. but if/when those providers add caching, the formula will over-count fresh-input tokens.

probably starts_with("bedrock_converse") only, or an explicit allow-list of bedrock-anthropic provider strings.

ankrgyl · 2026-05-30T18:06:49Z

+
+async fn try_load() -> anyhow::Result<PricingTable> {
+    // 1. Local path override — used by tests and air-gapped setups.
+    if let Ok(path) = std::env::var("EXO_LITELLM_PRICES_PATH") {


i prefer all env vars to be parsed through clap, so that someone could propagate the pricing table as a CLI arg too. it also forces all functions (like this one) to be relatively pure

maybe we add this to a skill in the repo somewhere?

akrentsel mentioned this pull request May 25, 2026

OpenAI Responses API: cache misses on every request (no prompt_cache_key sent) #24

Open

akrentsel and others added 5 commits May 26, 2026 22:49

akrentsel force-pushed the feature/message-cost-tracking branch from 0de06df to 9933121 Compare May 26, 2026 22:56

akrentsel marked this pull request as ready for review May 27, 2026 16:31

akrentsel requested a review from ankrgyl May 27, 2026 16:49

akrentsel mentioned this pull request May 27, 2026

cli: /usage (alias /cost) REPL command #29

Open

3 tasks

Alexsun1one reviewed May 27, 2026

View reviewed changes

akrentsel mentioned this pull request May 27, 2026

fix: honor repl --model (#30) + send prompt_cache_key on Responses API (#24) #31

Draft

4 tasks

ankrgyl approved these changes May 30, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

history: per-message cost, tokens, and latency tracking#16

history: per-message cost, tokens, and latency tracking#16
akrentsel wants to merge 6 commits into
mainfrom
feature/message-cost-tracking

akrentsel commented May 24, 2026 •

edited

Loading

Uh oh!

akrentsel commented May 24, 2026

Uh oh!

akrentsel commented May 25, 2026

Uh oh!

akrentsel commented May 27, 2026

Uh oh!

akrentsel commented May 27, 2026

Uh oh!

Alexsun1one left a comment

Uh oh!

Alexsun1one May 27, 2026

Uh oh!

Alexsun1one May 27, 2026

Uh oh!

ankrgyl May 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

akrentsel commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Pricing: why runtime LiteLLM JSON, not a hardcoded table

Cached-vs-fresh: per-provider accounting matters

Architecture

What's in the event JSON now

Test plan

Not in scope (intentional)

Uh oh!

akrentsel commented May 24, 2026

Uh oh!

akrentsel commented May 25, 2026

Uh oh!

akrentsel commented May 27, 2026

Uh oh!

akrentsel commented May 27, 2026

Uh oh!

Alexsun1one left a comment

Choose a reason for hiding this comment

Uh oh!

Alexsun1one May 27, 2026

Choose a reason for hiding this comment

Uh oh!

Alexsun1one May 27, 2026

Choose a reason for hiding this comment

Uh oh!

ankrgyl May 30, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

akrentsel commented May 24, 2026 •

edited

Loading