history: per-message cost, tokens, and latency tracking (on self-control) by akrentsel · Pull Request #54 · ankrgyl/exo

akrentsel · 2026-06-12T06:25:16Z

Rebuild of #16 on top of exoclaw-self-control, so cost tracking lands cleanly under the self-modification work.

Why a rebuild

The original #16 commit was accidentally authored against a stale pre-#52 tree, so its diff silently reverted main's AgentCore provider and conversation-list pagination. Both that branch and exoclaw-self-control share base dea9fdb, so the pure feature diff applied here with zero conflicts and no reverts. #16 should be closed in favor of this PR.

What it does

Optional UsageRecord on every EventData::Messages event: model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost, TTFT + wall-clock duration. Cost is policy, computed in userspace, never by the trusted substrate. See docs/cost-tracking-design.md for the full design, including the layering and per-provider math.

All three review decisions from #16 are included: boundary-aware prefix lookup, Bedrock treated as inclusive (additive Bedrock-Claude left as TODO), pricing source threaded through clap with EXO_LITELLM_PRICES_* env fallbacks.
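As a rough illustration of what "boundary-aware prefix lookup" can mean, here is a minimal sketch. It is not the code in crates/cost: the `Rates` shape, the function names, the separator set (`-`, `@`, `:`), and the longest-match rule are all assumptions made for the example. The idea is that a table key only matches a model id if the id equals the key or continues with a separator, so a dated revision such as gpt-4o-2024-08-06 resolves to gpt-4o, while gpt-4o never slides onto a shorter gpt-4 entry.

```rust
use std::collections::HashMap;

/// Per-token USD rates for one model (hypothetical shape for this sketch).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rates {
    input: f64,
    output: f64,
}

/// True if `id` is exactly `key`, or `key` followed by a separator character.
/// The separator set here is an assumption, not taken from the PR.
fn matches_at_boundary(id: &str, key: &str) -> bool {
    match id.strip_prefix(key) {
        Some("") => true,
        Some(rest) => rest.starts_with(|c: char| matches!(c, '-' | '@' | ':')),
        None => false,
    }
}

/// Returns the longest table key that matches `id` at a boundary, with its rates.
fn lookup<'a>(table: &'a HashMap<String, Rates>, id: &str) -> Option<(&'a str, Rates)> {
    table
        .iter()
        .filter(|(key, _)| matches_at_boundary(id, key))
        .max_by_key(|(key, _)| key.len())
        .map(|(key, rates)| (key.as_str(), *rates))
}
```

With keys gpt-4 and gpt-4o in the table, looking up gpt-4o-2024-08-06 would pick gpt-4o, because the gpt-4 key is followed by `o` rather than a separator and is rejected.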

Live verification (gpt-4o, both userspaces)

Rust Basic harness: stored cost_usd matches rate table exactly; duration_ms recorded; user-message events carry usage: null.
Exoclaw / TS harness: same, including completion_reasoning_tokens; TS loader consumed the price cache written by the Rust CLI (shared cache path convention).
Prompt-cache math: second turns with prompt_cached_tokens > 0 on both paths; stored cost matches the inclusive formula (prompt − cached)·in + cached·cache_read + completion·out to the last digit (misclassification as additive would have over-billed ~3×).
Provider-echoed dated model ids (gpt-4o-2024-08-06) resolve via the boundary-aware prefix lookup on both implementations.
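The inclusive formula quoted above can be sketched as follows. This is an illustration only: the struct and field names are hypothetical, and the additive branch is an assumption about how a provider that reports cached and cache-creation tokens separately from prompt tokens would be billed, with a hypothetical `cache_write` rate. Only the inclusive branch is taken directly from the formula in this PR.

```rust
/// Raw token counts for one message (field names are illustrative).
struct Tokens {
    prompt: u64,
    cached: u64,
    cache_creation: u64,
    completion: u64,
}

/// USD per token (hypothetical shape for this sketch).
struct Rates {
    input: f64,
    cache_read: f64,
    cache_write: f64,
    output: f64,
}

enum Billing {
    /// `prompt` already includes `cached`; subtract before applying the input rate.
    Inclusive,
    /// `prompt` excludes cached and cache-creation tokens; they are billed on top.
    Additive,
}

fn cost_usd(t: &Tokens, r: &Rates, billing: Billing) -> f64 {
    let cached = t.cached as f64 * r.cache_read;
    let completion = t.completion as f64 * r.output;
    match billing {
        // (prompt - cached)*in + cached*cache_read + completion*out
        Billing::Inclusive => {
            t.prompt.saturating_sub(t.cached) as f64 * r.input + cached + completion
        }
        // Assumed additive form: prompt*in + cached*cache_read
        //                        + cache_creation*cache_write + completion*out
        Billing::Additive => {
            t.prompt as f64 * r.input
                + cached
                + t.cache_creation as f64 * r.cache_write
                + completion
        }
    }
}
```

Applying the additive branch to counts that are actually inclusive charges the cached tokens twice, once at the input rate and once at the cache-read rate, which is the over-billing the check above guards against.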

Not live-tested (unit-tested only): Anthropic additive path and cache-creation tokens — no direct Anthropic key in the test environment.

Note on docker warm sandboxes

While testing, confirmed that docker warm-sandbox reuse works on this branch across REPL restarts (state persists, no duplicate containers). The standalone fix/docker-warm-sandbox branch (prototype of closed #43) is superseded by exoclaw-self-control's own find_running_docker_warm_sandbox (from "Add Exoclaw sandbox transparency controls") and can be deleted. One nit inherited from that implementation is fixed here in its own commit: the docker listing shelled out without the admin-command timeout the prototype had, while running under the warm_sandboxes mutex — a hung docker daemon would have blocked every sandbox operation in the process. It now goes through run_container_admin_command like the Apple path. Re-verified live after the change.
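For readers unfamiliar with the failure mode, the general shape of a timeout-bounded admin command looks something like the sketch below. This is not run_container_admin_command itself; `run_with_timeout` is a hypothetical helper using only the standard library, and it discards the child's output for brevity, whereas a real container listing would need to capture stdout.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Runs `cmd` and waits at most `timeout` for it to exit.
/// Returns Ok(Some(status)) on exit, Ok(None) if the child was killed on timeout.
fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = cmd
        .stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null())
        .spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // Non-blocking check: Some(status) once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            // Ignore kill errors (the child may have just exited), then reap it.
            let _ = child.kill();
            child.wait()?;
            return Ok(None);
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

The point of bounding the call is that the caller holds a process-wide lock while it runs, so an unbounded wait on a hung daemon would stall every other operation that needs the same lock.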

Known follow-ups (also listed in the design doc): TS path records no ttft_ms/duration_ms; azure_ai provider classified additive (hosts non-Anthropic models too); /usage REPL surface (#29) to be re-ported on top of this.

🤖 Generated with Claude Code

Add an optional UsageRecord to every EventData::Messages event: model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost, and TTFT + wall-clock duration. Fields are Option + skip_serializing_if and the record is boxed; legacy events still parse.

Cost is policy, computed in userspace, never by the trusted substrate:

- crates/cost: a standalone library with the price-table data model, a self-contained LiteLLM loader (explicit path/url, on-disk cache, degrade-to-empty), and per-provider math. Lookup is boundary-aware so dated revisions resolve without sliding a model onto a shorter neighbor's rate. Anthropic-family bills additively; everything else (including Bedrock, a TODO) is inclusive.
- exoharness stays minimal: it holds the UsageRecord schema and persists it verbatim, with no pricing code or dependency.
- Basic executor fills cost from a table loaded once at startup and injected via the CLI (--pricing-path / --pricing-url, env as fallback).
- The TypeScript harness (exoclaw) has its own self-contained cost port (@exo/model-runtime/cost) that owns its data loading (env override, own cache, own fetch), so per-message cost works there with no dependency on the Rust loader or the trusted layer.

RLM is left unwired for now: its multi-call turn has different per-message accounting and is a separate follow-up.

Rebuilt from feature/message-cost-tracking on top of exoclaw-self-control (both share base dea9fdb), replacing the original commit that was accidentally authored against a stale pre-#52 tree.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

find_running_docker_warm_sandbox shelled out to docker ps with a raw Command and no timeout, while its caller ensure_warm_sandbox_ready holds the warm_sandboxes mutex, so a hung docker daemon would block every sandbox operation in the process indefinitely. Route it through run_container_admin_command with WARM_SANDBOX_CLEANUP_TIMEOUT, matching the Apple Container sibling.

Verified live: docker warm-sandbox reuse across REPL restarts still works (state persists, no duplicate containers).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The motivating demo for per-message cost tracking composed with exoclaw self-control: the agent runs a tool-heavy repo-health-report task, reads its own usage records via list_conversation_events, diagnoses waste, modifies its own prompts/harness/bindings, rebuilds via the guardian, and re-runs to prove the saving. Includes the run protocol, candidate self-modifications by ambition tier, success gates (cost, quality, honesty, survival), and rails.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

akrentsel · 2026-06-14T04:03:20Z

sorry for this – got created as a PR accidentally, not real. this was reviewed and merged into main in #56

akrentsel and others added 3 commits June 12, 2026 05:46

akrentsel mentioned this pull request Jun 12, 2026: history: per-message cost, tokens, and latency tracking #16 (Closed, 6 tasks)

akrentsel closed this Jun 14, 2026

akrentsel wants to merge 3 commits into exoclaw-self-control from feature/cost-tracking-self-control
