Skip to content

history: per-message cost, tokens, and latency tracking#16

Open
akrentsel wants to merge 6 commits into
mainfrom
feature/message-cost-tracking
Open

history: per-message cost, tokens, and latency tracking#16
akrentsel wants to merge 6 commits into
mainfrom
feature/message-cost-tracking

Conversation

@akrentsel
Copy link
Copy Markdown
Collaborator

@akrentsel akrentsel commented May 24, 2026

Summary

Adds an optional UsageRecord to every EventData::Messages event so we have a durable, per-message record of:

  • Model id (echoed by the provider, or the requested binding as fallback)
  • Raw token counts — prompt, completion, cached, cache-creation, reasoning
  • USD cost — computed at call time from the LiteLLM pricing database (loaded at runtime, see below)
  • TTFT + wall-clock duration — measured in the executor on both streaming and non-streaming paths
  • server_duration_ms — reserved (lingua does not yet surface a provider-reported processing time)

Pricing: why runtime LiteLLM JSON, not a hardcoded table

OpenAI and Anthropic standard APIs return tokens but not USD in their per-call responses. (OpenAI never; Anthropic only in a separate aggregate Admin API.) So cost must always be computed downstream.

Initial version hardcoded a small price table in Rust. That was a mistake — by the time I wrote the first commit, my table already had Claude Opus 4.7 at $15/$75 per MTok when the real price had dropped to $5/$25. Hand-maintained tables drift.

This version loads LiteLLM's pricing database (2,739 model entries, community-maintained, covers all major providers + Bedrock/Azure/Vertex regional variants):

  • First use: HTTP fetch into `$XDG_CACHE_HOME/exo/litellm_prices.json`
  • Subsequent uses (within 24h): cached
  • After 24h: re-fetch, fall back to stale cache if network fails
  • All fails: empty table → tokens persist, `cost_usd: None` (no crash)
  • Env var overrides: `EXO_LITELLM_PRICES_PATH` (local file) and `EXO_LITELLM_PRICES_URL` (alternate source)

Cached-vs-fresh: per-provider accounting matters

Different providers report cached tokens with different conventions, and getting this wrong distorts cost by up to ~10× on cache-heavy requests:

  • Anthropic-family (anthropic, bedrock_converse, vertex_ai-anthropic_models, azure_ai): prompt_tokens is fresh input only. cache_read and cache_creation are separate. Bill all three additively.
  • OpenAI-family (openai, mistral, etc.): prompt_tokens is total (including cached). Cached is a subset. Must subtract cached from prompt before billing fresh-input rate.

The first commit got OpenAI wrong (used the additive formula universally). This version branches on LiteLLM's `litellm_provider` field and applies the correct formula. Both formulas have dedicated unit tests with realistic token mixes.

Architecture

  • `exoharness::pricing` (pure data + math, no network, stays wasm-compatible):

    • `PricingTable::from_json_str` parses LiteLLM's schema (silently ignores entries that don't have per-token rates, like image/embedding entries).
    • `PricingTable::lookup` does exact + longest-prefix match (dated revisions like `claude-sonnet-4-6-20251022` resolve to `claude-sonnet-4-6`).
    • `PricingTable::compute_cost_usd` does per-provider math.
  • `executor::pricing_loader` (network layer, gated by `tokio::sync::OnceCell`):

    • `get_pricing_table()` returns `Arc`; loads once per process, then zero-cost.
    • Resolution: `EXO_LITELLM_PRICES_PATH` → cache (fresh) → fetch → stale cache → empty.
  • `BasicExecutor::with_pricing` + `BasicHarness::with_pricing_table`: explicit-table constructors. Bypass the loader, useful for tests/embedders/air-gapped deployments.

What's in the event JSON now

```json
{
"type": "messages",
"messages": [...],
"response_id": "01J...",
"usage": {
"model": "claude-sonnet-4-6",
"prompt_tokens": 2847,
"completion_tokens": 412,
"prompt_cached_tokens": 12500,
"cost_usd": 0.0146985,
"ttft_ms": 842,
"duration_ms": 3210
}
}
```

All `usage` sub-fields are `Option` + `skip_serializing_if`. Legacy events with no `usage` key continue to deserialize.

Test plan

  • `cargo test --workspace` — 66 tests pass (was 60 on main; +11 pricing units, +1 end-to-end cost assertion, +2 backward-compat for UsageRecord JSON, -7 from refactoring the previous hardcoded-table tests away).
  • Pricing unit tests cover: Anthropic additive (no cache, cache hits, cache creation), OpenAI inclusive (with and without cache, subtraction semantics asserted), Bedrock regional surcharge, longest-prefix lookup, sample_spec doc entry skipped, unknown models, provider-style classification.
  • `pricing_loader::tests::local_path_override_is_honored` exercises the env-var override path.
  • End-to-end test `usage_record_is_persisted_with_computed_cost` uses `with_pricing_table` to inject an inline fixture — assertion is hermetic, no network dependency in CI.
  • `cargo test --package exo --test integration_chat -- --ignored` (mocked OpenAI + real local-process sandbox): unchanged, passes.
  • Loader exercised live: fetch succeeded, cache populated, subsequent calls cache-hit.

Not in scope (intentional)

  • `/cost` REPL command — persisting silently for now. UI surface is an easy follow-up.
  • Server-reported duration — lingua doesn't expose this today; field reserved for when it does.
  • Cache tier resolution — `UniversalUsage` collapses Anthropic's 5-min and 1-hour cache writes into one count. Cost defaults to the 5-minute rate. `cache_creation_input_token_cost_above_1hr` is parsed but unused.
  • Anthropic's aggregate Usage/Cost Admin API — separate org-level endpoint, aggregate-only, can't be tied to a specific message. Not useful for per-message tracking.

🤖 Generated with Claude Code

@akrentsel
Copy link
Copy Markdown
Collaborator Author

please don't review the code yet – I'm looking into a better way to get pricing details...

@akrentsel
Copy link
Copy Markdown
Collaborator Author

This is addressing #15

akrentsel and others added 5 commits May 26, 2026 22:49
…ents

Each LLM response now carries an optional UsageRecord that snapshots the
model id, raw token counts (prompt / completion / cached / cache-creation /
reasoning), USD cost computed at call time from a baked-in price table, and
both TTFT and wall-clock duration.

Why bake in the price table: provider APIs return tokens, not dollars, and
prices change. Computing cost downstream means the recorded value can drift
or become wrong if rates are revised; computing at call time freezes the
price that actually applied to that call.

Changes:
- crates/exoharness/src/pricing.rs (new): per-model rates (Claude 4.x and
  OpenAI 4o/o-series for now), longest-prefix lookup so dated revisions
  (e.g. claude-sonnet-4-6-20251022) resolve, compute_cost_usd helper.
- crates/exoharness/src/types.rs: UsageRecord struct; EventData::Messages
  gains optional usage field with serde(default) for backward compat.
- crates/executor: ModelResponse threads model + ttft + duration through;
  BasicExecutor::complete_model_round measures total duration on both
  streaming and non-streaming paths; interpret_model_response assembles a
  UsageRecord and attaches it to the persisted Messages event.

Tests:
- 7 unit tests in pricing covering exact and prefix lookup, cost arithmetic,
  unknown-model handling, and missing-cached-rate fallback.
- Round-trip tests proving legacy Messages JSON without `usage` still
  parses, and that emitted JSON contains the new fields.
- End-to-end test (harness_basic_tests::usage_record_is_persisted_with_computed_cost)
  that runs a fake-model send through BasicHarness and asserts the
  persisted event carries the expected UsageRecord with computed cost
  ($0.0105 for 1000 prompt + 500 completion on claude-sonnet-4-6).

Not in scope (intentional): /cost slash command in the REPL, server-reported
duration (lingua does not yet surface this — field reserved as Option for
later).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-maintained pricing table with the community-maintained
LiteLLM pricing database (model_prices_and_context_window.json),
downloaded on first use and cached for 24h. Two motivations:

1. The hardcoded table was already wrong. Claude Opus 4.7 was listed at
   $15/$75 per MTok; the real rate is $5/$25 (the table predated a price
   cut). Drift like this is inevitable with hand maintenance.

2. Cached-vs-fresh pricing requires per-provider accounting that the
   previous code got wrong for OpenAI:
   - Anthropic-family: prompt_tokens excludes cached. Bill prompt,
     cache_read, and cache_creation as additive line items.
   - OpenAI-family: prompt_tokens includes cached. Must subtract cached
     before billing fresh-input rate, or cached tokens get billed at
     ~10x their real rate.
   Both formulas are now implemented and unit-tested with realistic
   token mixes from each family.

Architecture:

- `exoharness::pricing` (pure): `PricingTable::from_json_str` parses
  LiteLLM's schema; `compute_cost_usd` does the per-provider math.
  No network, no globals. Stays wasm-compatible.

- `executor::pricing_loader` (network layer): global `OnceCell` loads
  the table on first use. Resolution order: `EXO_LITELLM_PRICES_PATH`
  override → on-disk cache at $XDG_CACHE_HOME/exo/litellm_prices.json
  (24h TTL) → reqwest fetch from `EXO_LITELLM_PRICES_URL` → stale cache
  fallback → empty table (cost = None, tokens still persisted; no
  crash).

- `BasicExecutor::with_pricing` and `BasicHarness::with_pricing_table`
  inject an explicit table, bypassing the global loader. Used by the
  end-to-end test for deterministic, hermetic cost assertions.

Tests:

- 11 pricing unit tests using an inline LiteLLM-shape fixture, covering:
  Anthropic additive (no cache; cache hits; cache creation),
  OpenAI inclusive (with and without cache hits, with the subtraction
  semantics explicitly asserted), Bedrock regional surcharge (10%
  markup on us.anthropic.* entries), longest-prefix lookup for dated
  revisions, sample_spec doc entry skipped, unknown models return None,
  provider-style classification per litellm_provider value.

- pricing_loader::tests::local_path_override_is_honored exercises the
  env-var override path.

- Existing end-to-end test now uses with_pricing_table to inject an
  inline fixture, so the $0.0105 assertion no longer depends on the
  upstream JSON's current rates or on network availability in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e paths

Pin down the post-lingua boundary at the conversation-log level:

- Anthropic additive: fresh prompt + cache_read + cache_creation each
  bill against their own rate (cost = 0.015 on the fixture).
- OpenAI inclusive: prompt_tokens includes cached, must be subtracted
  before billing the fresh-input rate (cost = 0.0008625, vs 0.0009375
  if the executor mistakenly used the additive formula).

The two provider conventions are exercised in isolation by the
pricing.rs unit tests; these new tests prove the same accounting
survives ModelResponse -> UsageRecord -> persisted Messages event.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A truncated or garbage cache file (or a 200-with-HTML-body fetch) now
resolves to an empty pricing table — cost_usd ends up None — rather than
propagating a parse error or filling in wrong numbers. Cost data is
best-effort; a bad cache must never take down a turn. Also: only write
the cache when the fetched body actually parses, so we never persist
garbage that would poison the next run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cter CI

Rebasing onto main (which landed #4/#14/#26/#27) surfaced three things:

- New construction sites from main needed the fields this PR adds:
  `usage: None` on an EventData::Messages in exoharness tests, and
  `enable_agent_tool_creation: true` on the CreateAgentRequest sites in
  the cost tests.

- #27 turned on `cargo clippy --workspace --all-targets -- -D warnings`.
  Adding UsageRecord (~170 bytes) inline to EventData::Messages bloated
  EventData enough to trip `large_enum_variant` on HostToGuestMessage,
  which transitively embeds it. Box the `usage` field so EventData stays
  small (serde treats Box<T> identically, so the on-disk JSON is
  unchanged). Also dropped the redundant `BasicToolRuntime::default()`
  unit-struct construction.

- Reformatted per `cargo fmt` to pass the formatting gate.

Local CI now green: fmt, clippy -D warnings, and `cargo test --workspace
--all-targets` (78 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@akrentsel akrentsel force-pushed the feature/message-cost-tracking branch from 0de06df to 9933121 Compare May 26, 2026 22:56
…tion

- ModelResponse no longer derives Default; nothing constructed it that way.
- Document that the pricing table is loaded once per process, so a
  long-running service would need a 24h refresh to avoid stale rates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@akrentsel akrentsel marked this pull request as ready for review May 27, 2026 16:31
@akrentsel akrentsel requested a review from ankrgyl May 27, 2026 16:49
@akrentsel
Copy link
Copy Markdown
Collaborator Author

@ankrgyl this is ready to review, added tests

@akrentsel
Copy link
Copy Markdown
Collaborator Author

@ankrgyl this is ready to review, added tests

Copy link
Copy Markdown
Contributor

@Alexsun1one Alexsun1one left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two small things on the price lookup side, otherwise this looks great. the LiteLLM-as-source-of-truth + per-provider accounting story is the right call, and i like that with_pricing keeps embedders/air-gapped runs honest. inline notes below.

}
self.entries
.iter()
.filter(|(key, _)| model.starts_with(key.as_str()))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefix-match without a separator boundary can fall through in surprising ways. e.g. gpt-4o-mini matches gpt-4o (intended) but also matches gpt-4 (not intended), since "gpt-4o".starts_with("gpt-4") is true; if the more specific entry is missing the longest-prefix winner is still wrong.

if the goal is the claude-sonnet-4-6-20251022 -> claude-sonnet-4-6 case, tightening the filter to require model[key.len()..] be empty or start with - / : keeps that behavior and stops gpt-4o from sliding into gpt-4 when an entry is absent.

// two when cached==0 (the typical case).
match provider {
Some(p) if p.starts_with("anthropic") => Self::Additive,
Some(p) if p.starts_with("bedrock") => Self::Additive,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this catches bedrock_converse correctly but also catches plain bedrock provider entries, which on Bedrock covers Mistral, Cohere, Meta Llama, AI21, and friends. those follow OpenAI-style inclusive prompt_tokens, not Anthropic-style additive.

the real-world impact is small today because cache_read on non-Anthropic Bedrock models is uncommon. but if/when those providers add caching, the formula will over-count fresh-input tokens.

probably starts_with("bedrock_converse") only, or an explicit allow-list of bedrock-anthropic provider strings.


async fn try_load() -> anyhow::Result<PricingTable> {
// 1. Local path override — used by tests and air-gapped setups.
if let Ok(path) = std::env::var("EXO_LITELLM_PRICES_PATH") {
Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i prefer all env vars to be parsed through clap, so that someone could propagate the pricing table as a CLI arg too. it also forces all functions (like this one) to be relatively pure

maybe we add this to a skill in the repo somewhere?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants