history: per-message cost, tokens, and latency tracking #16
please don't review the code yet – I'm looking into a better way to get pricing details...
This is addressing #15
…ents
Each LLM response now carries an optional UsageRecord that snapshots the model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost computed at call time from a baked-in price table, and both TTFT and wall-clock duration.
Why bake in the price table: provider APIs return tokens, not dollars, and prices change. Computing cost downstream means the recorded value can drift or become wrong if rates are revised; computing at call time freezes the price that actually applied to that call.
Changes:
- crates/exoharness/src/pricing.rs (new): per-model rates (Claude 4.x and OpenAI 4o/o-series for now), longest-prefix lookup so dated revisions (e.g. claude-sonnet-4-6-20251022) resolve, compute_cost_usd helper.
- crates/exoharness/src/types.rs: UsageRecord struct; EventData::Messages gains optional usage field with serde(default) for backward compat.
- crates/executor: ModelResponse threads model + ttft + duration through; BasicExecutor::complete_model_round measures total duration on both streaming and non-streaming paths; interpret_model_response assembles a UsageRecord and attaches it to the persisted Messages event.
Tests:
- 7 unit tests in pricing covering exact and prefix lookup, cost arithmetic, unknown-model handling, and missing-cached-rate fallback.
- Round-trip tests proving legacy Messages JSON without `usage` still parses, and that emitted JSON contains the new fields.
- End-to-end test (harness_basic_tests::usage_record_is_persisted_with_computed_cost) that runs a fake-model send through BasicHarness and asserts the persisted event carries the expected UsageRecord with computed cost ($0.0105 for 1000 prompt + 500 completion on claude-sonnet-4-6).
Not in scope (intentional): /cost slash command in the REPL, server-reported duration (lingua does not yet surface this; the field is reserved as Option for later).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hand-maintained pricing table with the community-maintained
LiteLLM pricing database (model_prices_and_context_window.json),
downloaded on first use and cached for 24h. Two motivations:
1. The hardcoded table was already wrong. Claude Opus 4.7 was listed at
$15/$75 per MTok; the real rate is $5/$25 (the table predated a price
cut). Drift like this is inevitable with hand maintenance.
2. Cached-vs-fresh pricing requires per-provider accounting that the
previous code got wrong for OpenAI:
- Anthropic-family: prompt_tokens excludes cached. Bill prompt,
cache_read, and cache_creation as additive line items.
- OpenAI-family: prompt_tokens includes cached. Must subtract cached
before billing fresh-input rate, or cached tokens get billed at
~10x their real rate.
Both formulas are now implemented and unit-tested with realistic
token mixes from each family.
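The two formulas above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual `compute_cost_usd`: the `Rates`, `Tokens`, and `CacheStyle` types and the `cost_usd` function are hypothetical names, and rates are expressed as USD per token.

```rust
/// Per-token USD rates for one model (hypothetical shape, not the PR's struct).
#[derive(Clone, Copy, Debug)]
pub struct Rates {
    pub input: f64,
    pub output: f64,
    pub cache_read: f64,
    pub cache_creation: f64,
}

/// Raw token counts as reported by the provider.
#[derive(Clone, Copy, Debug)]
pub struct Tokens {
    pub prompt: u64,
    pub completion: u64,
    pub cache_read: u64,
    pub cache_creation: u64,
}

/// How a provider family reports cached tokens relative to prompt_tokens.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum CacheStyle {
    /// Anthropic-family: prompt_tokens excludes cached tokens.
    Additive,
    /// OpenAI-family: prompt_tokens includes cached tokens.
    Inclusive,
}

/// Compute the USD cost of one call under the given accounting style.
pub fn cost_usd(style: CacheStyle, r: Rates, t: Tokens) -> f64 {
    // Fresh (uncached) input tokens billed at the full input rate.
    let fresh = match style {
        CacheStyle::Additive => t.prompt,
        // Subtract cached tokens so they are not billed twice;
        // saturating_sub guards against inconsistent provider counts.
        CacheStyle::Inclusive => t.prompt.saturating_sub(t.cache_read),
    };
    fresh as f64 * r.input
        + t.cache_read as f64 * r.cache_read
        + t.cache_creation as f64 * r.cache_creation
        + t.completion as f64 * r.output
}
```

With assumed rates of $3 / $15 per MTok for input / output, 1000 prompt tokens plus 500 completion tokens come to 0.003 + 0.0075 = $0.0105, which matches the figure asserted in the end-to-end test. When `cache_read` is zero the two styles produce the same result.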
Architecture:
- `exoharness::pricing` (pure): `PricingTable::from_json_str` parses
LiteLLM's schema; `compute_cost_usd` does the per-provider math.
No network, no globals. Stays wasm-compatible.
- `executor::pricing_loader` (network layer): global `OnceCell` loads
the table on first use. Resolution order: `EXO_LITELLM_PRICES_PATH`
override → on-disk cache at $XDG_CACHE_HOME/exo/litellm_prices.json
(24h TTL) → reqwest fetch from `EXO_LITELLM_PRICES_URL` → stale cache
fallback → empty table (cost = None, tokens still persisted; no
crash).
- `BasicExecutor::with_pricing` and `BasicHarness::with_pricing_table`
inject an explicit table, bypassing the global loader. Used by the
end-to-end test for deterministic, hermetic cost assertions.
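The resolution order described above can be modelled as a pure decision function. This is only a sketch of the ordering, with no file or network I/O: `PriceSource`, `LoaderInputs`, and `choose_source` are hypothetical names, and `fetch_succeeded` stands for the outcome of the fetch if one were attempted.

```rust
use std::time::Duration;

/// Cache time-to-live from the commit message (24h).
pub const CACHE_TTL: Duration = Duration::from_secs(24 * 60 * 60);

/// Where the pricing table ends up coming from (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
pub enum PriceSource {
    LocalOverride,
    FreshCache,
    Network,
    StaleCache,
    Empty,
}

/// Observed state at load time (hypothetical; the real loader does I/O).
pub struct LoaderInputs {
    /// EXO_LITELLM_PRICES_PATH is set and readable.
    pub override_path_set: bool,
    /// Age of the on-disk cache file, or None if there is no cache.
    pub cache_age: Option<Duration>,
    /// Whether a network fetch would return a parseable body.
    pub fetch_succeeded: bool,
}

/// Apply the resolution order: override, fresh cache, network,
/// stale cache, then an empty table (cost = None, no crash).
pub fn choose_source(i: &LoaderInputs) -> PriceSource {
    if i.override_path_set {
        return PriceSource::LocalOverride;
    }
    if let Some(age) = i.cache_age {
        if age < CACHE_TTL {
            return PriceSource::FreshCache;
        }
    }
    if i.fetch_succeeded {
        return PriceSource::Network;
    }
    if i.cache_age.is_some() {
        return PriceSource::StaleCache;
    }
    PriceSource::Empty
}
```

Keeping the decision separate from the I/O would make the ordering unit-testable without touching the filesystem or network; whether the actual loader is factored this way is not stated in the PR.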
Tests:
- 11 pricing unit tests using an inline LiteLLM-shape fixture, covering:
Anthropic additive (no cache; cache hits; cache creation),
OpenAI inclusive (with and without cache hits, with the subtraction
semantics explicitly asserted), Bedrock regional surcharge (10%
markup on us.anthropic.* entries), longest-prefix lookup for dated
revisions, sample_spec doc entry skipped, unknown models return None,
provider-style classification per litellm_provider value.
- pricing_loader::tests::local_path_override_is_honored exercises the
env-var override path.
- Existing end-to-end test now uses with_pricing_table to inject an
inline fixture, so the $0.0105 assertion no longer depends on the
upstream JSON's current rates or on network availability in CI.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e paths
Pin down the post-lingua boundary at the conversation-log level:
- Anthropic additive: fresh prompt + cache_read + cache_creation each bill against their own rate (cost = 0.015 on the fixture).
- OpenAI inclusive: prompt_tokens includes cached, must be subtracted before billing the fresh-input rate (cost = 0.0008625, vs 0.0009375 if the executor mistakenly used the additive formula).
The two provider conventions are exercised in isolation by the pricing.rs unit tests; these new tests prove the same accounting survives ModelResponse -> UsageRecord -> persisted Messages event.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A truncated or garbage cache file (or a 200-with-HTML-body fetch) now resolves to an empty pricing table — cost_usd ends up None — rather than propagating a parse error or filling in wrong numbers. Cost data is best-effort; a bad cache must never take down a turn. Also: only write the cache when the fetched body actually parses, so we never persist garbage that would poison the next run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cter CI
Rebasing onto main (which landed #4/#14/#26/#27) surfaced three things:
- New construction sites from main needed the fields this PR adds: `usage: None` on an EventData::Messages in exoharness tests, and `enable_agent_tool_creation: true` on the CreateAgentRequest sites in the cost tests.
- #27 turned on `cargo clippy --workspace --all-targets -- -D warnings`. Adding UsageRecord (~170 bytes) inline to EventData::Messages bloated EventData enough to trip `large_enum_variant` on HostToGuestMessage, which transitively embeds it. Box the `usage` field so EventData stays small (serde treats Box<T> identically, so the on-disk JSON is unchanged). Also dropped the redundant `BasicToolRuntime::default()` unit-struct construction.
- Reformatted per `cargo fmt` to pass the formatting gate.
Local CI now green: fmt, clippy -D warnings, and `cargo test --workspace --all-targets` (78 tests).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
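The effect of boxing the `usage` field can be illustrated with `std::mem::size_of`. The types below are simplified, hypothetical stand-ins (the real `UsageRecord` and `EventData` have different fields and variants); the sketch only shows why an inline record inflates every value of the enum while a boxed one costs a single pointer.

```rust
use std::mem::size_of;

/// Hypothetical stand-in for the PR's UsageRecord; fields are illustrative.
pub struct UsageRecord {
    pub model: Option<String>,
    pub prompt_tokens: Option<u64>,
    pub completion_tokens: Option<u64>,
    pub prompt_cached_tokens: Option<u64>,
    pub cost_usd: Option<f64>,
    pub ttft_ms: Option<u64>,
    pub duration_ms: Option<u64>,
}

/// Enum with the record stored inline: every value of the enum is
/// at least as large as the biggest variant.
pub enum InlineEvent {
    Messages { usage: Option<UsageRecord> },
    Other(u8),
}

/// Same enum with the record behind a Box: the variant only holds a
/// pointer, and Option<Box<T>> uses the null niche, so no extra tag.
pub enum BoxedEvent {
    Messages { usage: Option<Box<UsageRecord>> },
    Other(u8),
}

/// Returns (inline size, boxed size) in bytes for comparison.
pub fn event_sizes() -> (usize, usize) {
    (size_of::<InlineEvent>(), size_of::<BoxedEvent>())
}
```

Calling `event_sizes()` on a given target shows the gap directly; the exact byte counts depend on the platform and on the real struct layout, so none are quoted here.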
0de06df to 9933121
…tion
- ModelResponse no longer derives Default; nothing constructed it that way.
- Document that the pricing table is loaded once per process, so a long-running service would need a 24h refresh to avoid stale rates.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ankrgyl this is ready to review, added tests
Alexsun1one left a comment
two small things on the price lookup side, otherwise this looks great. the LiteLLM-as-source-of-truth + per-provider accounting story is the right call, and i like that with_pricing keeps embedders/air-gapped runs honest. inline notes below.
    }
    self.entries
        .iter()
        .filter(|(key, _)| model.starts_with(key.as_str()))
prefix-match without a separator boundary can fall through in surprising ways. e.g. `gpt-4o-mini` matches `gpt-4o` (intended) but also matches `gpt-4` (not intended), since `"gpt-4o".starts_with("gpt-4")` is true; if the more specific entry is missing, the longest-prefix winner is still wrong.
if the goal is the `claude-sonnet-4-6-20251022` -> `claude-sonnet-4-6` case, tightening the filter to require that `model[key.len()..]` be empty or start with `-` / `:` keeps that behavior and stops `gpt-4o` from sliding into `gpt-4` when an entry is absent.
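a rough sketch of the tightened filter, assuming the table keys are available as a slice of strings. `lookup_key` is a hypothetical helper, not the function in the PR, which iterates `self.entries`:

```rust
/// Return the longest key that is a prefix of `model` AND ends on a
/// separator boundary: the remainder must be empty or begin with '-' or ':'.
/// Hypothetical helper for illustration only.
pub fn lookup_key<'a>(keys: &[&'a str], model: &str) -> Option<&'a str> {
    keys.iter()
        .copied()
        .filter(|key| match model.strip_prefix(*key) {
            // Exact match, or the suffix starts with a separator.
            Some(rest) => matches!(rest.chars().next(), None | Some('-') | Some(':')),
            // Not a prefix at all.
            None => false,
        })
        // Longest qualifying key wins, as in the existing lookup.
        .max_by_key(|key| key.len())
}
```

with this, `gpt-4o-mini` no longer resolves to `gpt-4` when `gpt-4o` is missing (the remainder `o-mini` does not start with a separator), while dated revisions such as `claude-sonnet-4-6-20251022` still resolve to `claude-sonnet-4-6`.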
    // two when cached==0 (the typical case).
    match provider {
        Some(p) if p.starts_with("anthropic") => Self::Additive,
        Some(p) if p.starts_with("bedrock") => Self::Additive,
this catches bedrock_converse correctly but also catches plain bedrock provider entries, which on Bedrock covers Mistral, Cohere, Meta Llama, AI21, and friends. those follow OpenAI-style inclusive prompt_tokens, not Anthropic-style additive.
the real-world impact is small today because cache_read on non-Anthropic Bedrock models is uncommon. but if/when those providers add caching, the formula will over-count fresh-input tokens.
probably starts_with("bedrock_converse") only, or an explicit allow-list of bedrock-anthropic provider strings.
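one way the narrowed classification could look, as a sketch. `ProviderStyle` and `classify` are hypothetical names; the provider strings are the ones listed as additive in the PR summary, and defaulting unknown providers to inclusive is an assumption here, not something the PR states:

```rust
/// How a provider reports cached tokens (hypothetical enum name).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ProviderStyle {
    /// prompt_tokens excludes cached tokens.
    Additive,
    /// prompt_tokens includes cached tokens.
    Inclusive,
}

/// Classify a `litellm_provider` value.
/// Only `bedrock_converse` is treated as additive; a plain `bedrock`
/// entry falls through to the inclusive default.
pub fn classify(provider: Option<&str>) -> ProviderStyle {
    match provider {
        Some(p) if p.starts_with("anthropic") => ProviderStyle::Additive,
        Some(p) if p.starts_with("bedrock_converse") => ProviderStyle::Additive,
        Some("vertex_ai-anthropic_models") | Some("azure_ai") => ProviderStyle::Additive,
        // Assumption: everything else, including a missing provider,
        // follows the OpenAI-style inclusive convention.
        _ => ProviderStyle::Inclusive,
    }
}
```

an explicit allow-list of bedrock-anthropic provider strings would work the same way, with one more match arm per string.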
    async fn try_load() -> anyhow::Result<PricingTable> {
        // 1. Local path override, used by tests and air-gapped setups.
        if let Ok(path) = std::env::var("EXO_LITELLM_PRICES_PATH") {
i prefer all env vars to be parsed through clap, so that someone could propagate the pricing table as a CLI arg too. it also forces all functions (like this one) to be relatively pure
maybe we add this to a skill in the repo somewhere?
Summary
Adds an optional `UsageRecord` to every `EventData::Messages` event so we have a durable, per-message record of:
- `server_duration_ms`: reserved (lingua does not yet surface a provider-reported processing time)
Pricing: why runtime LiteLLM JSON, not a hardcoded table
OpenAI and Anthropic standard APIs return tokens but not USD in their per-call responses. (OpenAI never; Anthropic only in a separate aggregate Admin API.) So cost must always be computed downstream.
Initial version hardcoded a small price table in Rust. That was a mistake — by the time I wrote the first commit, my table already had Claude Opus 4.7 at $15/$75 per MTok when the real price had dropped to $5/$25. Hand-maintained tables drift.
This version loads LiteLLM's pricing database (2,739 model entries, community-maintained, covers all major providers + Bedrock/Azure/Vertex regional variants):
Cached-vs-fresh: per-provider accounting matters
Different providers report cached tokens with different conventions, and getting this wrong distorts cost by up to ~10× on cache-heavy requests:
- Anthropic-family (`anthropic`, `bedrock_converse`, `vertex_ai-anthropic_models`, `azure_ai`): `prompt_tokens` is fresh input only. `cache_read` and `cache_creation` are separate. Bill all three additively.
- OpenAI-family (`openai`, `mistral`, etc.): `prompt_tokens` is total (including cached). Cached is a subset. Must subtract cached from prompt before billing fresh-input rate.
The first commit got OpenAI wrong (used the additive formula universally). This version branches on LiteLLM's `litellm_provider` field and applies the correct formula. Both formulas have dedicated unit tests with realistic token mixes.
Architecture
`exoharness::pricing` (pure data + math, no network, stays wasm-compatible):
`executor::pricing_loader` (network layer, gated by `tokio::sync::OnceCell`):
`BasicExecutor::with_pricing` + `BasicHarness::with_pricing_table`: explicit-table constructors. Bypass the loader, useful for tests/embedders/air-gapped deployments.
What's in the event JSON now
```json
{
"type": "messages",
"messages": [...],
"response_id": "01J...",
"usage": {
"model": "claude-sonnet-4-6",
"prompt_tokens": 2847,
"completion_tokens": 412,
"prompt_cached_tokens": 12500,
"cost_usd": 0.0146985,
"ttft_ms": 842,
"duration_ms": 3210
}
}
```
All `usage` sub-fields are `Option` + `skip_serializing_if`. Legacy events with no `usage` key continue to deserialize.
Test plan
Not in scope (intentional)
🤖 Generated with Claude Code