history: per-message cost, tokens, and latency tracking (on self-control)#54
Closed
akrentsel wants to merge 3 commits into
Add an optional UsageRecord to every EventData::Messages event: model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost, and TTFT + wall-clock duration. Fields are Option + skip_serializing_if and the record is boxed; legacy events still parse.

Cost is policy, computed in userspace, never by the trusted substrate:

- crates/cost: a standalone library with the price-table data model, a self-contained LiteLLM loader (explicit path/url, on-disk cache, degrade-to-empty), and per-provider math. Lookup is boundary-aware so dated revisions resolve without sliding a model onto a shorter neighbor's rate. Anthropic-family bills additively; everything else (including Bedrock, a TODO) is inclusive.
- exoharness stays minimal: it holds the UsageRecord schema and persists it verbatim, with no pricing code or dependency.
- Basic executor fills cost from a table loaded once at startup and injected via the CLI (--pricing-path / --pricing-url, env as fallback).
- The TypeScript harness (exoclaw) has its own self-contained cost port (@exo/model-runtime/cost) that owns its data loading (env override, own cache, own fetch), so per-message cost works there with no dependency on the Rust loader or the trusted layer.

RLM is left unwired for now: its multi-call turn has different per-message accounting and is a separate follow-up.

Rebuilt from feature/message-cost-tracking on top of exoclaw-self-control (both share base dea9fdb), replacing the original commit that was accidentally authored against a stale pre-#52 tree.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
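The boundary-aware lookup described above might look roughly like the sketch below. It is not the code in crates/cost, which is not shown here: the `Rates` struct, the `lookup` function, and the set of boundary characters are all hypothetical, chosen only to illustrate how an exact match can fall back to the longest prefix that ends at a separator.

```rust
use std::collections::HashMap;

/// Hypothetical per-token rates in USD; the field names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rates {
    pub input: f64,
    pub output: f64,
}

/// Characters assumed to separate a base model name from a revision suffix.
/// This set is an assumption, not taken from the real price-table code.
const BOUNDARIES: [char; 3] = ['-', '@', ':'];

/// Returns the rates for `model`: an exact key if one exists, otherwise the
/// longest key that is a prefix of `model` and is followed by a boundary
/// character. A key that merely shares leading characters is rejected.
pub fn lookup<'a>(table: &'a HashMap<String, Rates>, model: &str) -> Option<&'a Rates> {
    if let Some(rates) = table.get(model) {
        return Some(rates);
    }
    table
        .iter()
        .filter(|(key, _)| {
            model
                .strip_prefix(key.as_str())
                .and_then(|rest| rest.chars().next())
                .map_or(false, |c| BOUNDARIES.contains(&c))
        })
        .max_by_key(|(key, _)| key.len())
        .map(|(_, rates)| rates)
}
```

With keys gpt-4 and gpt-4o in the table, gpt-4o-2024-08-06 resolves to the gpt-4o entry, because the gpt-4 prefix is followed by `o` rather than a separator and is therefore skipped.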
find_running_docker_warm_sandbox shelled out to docker ps with a raw Command and no timeout, while its caller ensure_warm_sandbox_ready holds the warm_sandboxes mutex, so a hung docker daemon would block every sandbox operation in the process indefinitely. Route it through run_container_admin_command with WARM_SANDBOX_CLEANUP_TIMEOUT, matching the Apple Container sibling.

Verified live: docker warm-sandbox reuse across REPL restarts still works (state persists, no duplicate containers).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
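To make the hazard concrete, here is a minimal standard-library sketch of bounding a child process with a deadline. It is not run_container_admin_command, whose implementation is not shown here; the `run_with_timeout` name, the polling interval, and the choice to discard output are assumptions made for illustration.

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: runs `cmd` and returns its exit status, or a
/// `TimedOut` error if the child is still running once `timeout` has elapsed.
/// On timeout the child is killed and reaped so it does not linger.
pub fn run_with_timeout(cmd: &mut Command, timeout: Duration) -> io::Result<ExitStatus> {
    // Output is discarded so a chatty child cannot block on a full pipe.
    let mut child = cmd.stdout(Stdio::null()).stderr(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait never blocks: it returns Some(status) only after exit.
        if let Some(status) = child.try_wait()? {
            return Ok(status);
        }
        if Instant::now() >= deadline {
            let _ = child.kill(); // ignore the error if the child just exited
            child.wait()?;
            return Err(io::Error::new(io::ErrorKind::TimedOut, "command timed out"));
        }
        thread::sleep(Duration::from_millis(10));
    }
}
```

A caller holding a mutex while invoking something like this waits at most roughly `timeout`, rather than for as long as the daemon stays hung. A real helper that needs the command's output (for example a container listing) would have to read the pipes concurrently instead of discarding them.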
The motivating demo for per-message cost tracking composed with exoclaw self-control: the agent runs a tool-heavy repo-health-report task, reads its own usage records via list_conversation_events, diagnoses waste, modifies its own prompts/harness/bindings, rebuilds via the guardian, and re-runs to prove the saving. Includes the run protocol, candidate self-modifications by ambition tier, success gates (cost, quality, honesty, survival), and rails.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Sorry for this: it got created as a PR accidentally and is not real. This was reviewed and merged into main in #56.
Rebuild of #16 on top of exoclaw-self-control, so cost tracking lands cleanly under the self-modification work.

Why a rebuild

The original #16 commit was accidentally authored against a stale pre-#52 tree, so its diff silently reverted main's AgentCore provider and conversation-list pagination. Both that branch and exoclaw-self-control share base dea9fdb, so the pure feature diff applied here with zero conflicts and no reverts. #16 should be closed in favor of this PR.

What it does
Optional UsageRecord on every EventData::Messages event: model id, raw token counts (prompt / completion / cached / cache-creation / reasoning), USD cost, TTFT + wall-clock duration. Cost is policy, computed in userspace, never by the trusted substrate. See docs/cost-tracking-design.md for the full design, including the layering and per-provider math.

All three review decisions from #16 are included: boundary-aware prefix lookup, Bedrock treated as inclusive (additive Bedrock-Claude left as TODO), pricing source threaded through clap with EXO_LITELLM_PRICES_* env fallbacks.

Live verification (gpt-4o, both userspaces)
- cost_usd matches rate table exactly; duration_ms recorded; user-message events carry usage: null.
- completion_reasoning_tokens; TS loader consumed the price cache written by the Rust CLI (shared cache path convention).
- prompt_cached_tokens > 0 on both paths; stored cost matches the inclusive formula (prompt − cached)·in + cached·cache_read + completion·out to the last digit (misclassification as additive would have over-billed ~3×).
- Dated revisions (gpt-4o-2024-08-06) resolve via the boundary-aware prefix lookup on both implementations.

Not live-tested (unit-tested only): Anthropic additive path and cache-creation tokens, since there was no direct Anthropic key in the test environment.
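The difference between the two billing conventions can be sketched as follows. The inclusive branch is the formula quoted in the list above. The additive branch is an assumption about what "bills additively" means: the prompt count is taken to already exclude cached and cache-creation tokens, which are then charged on top. All type, field, and function names here are hypothetical and are not taken from crates/cost.

```rust
/// Hypothetical raw token counts for one message.
#[derive(Debug, Clone, Copy)]
pub struct Usage {
    pub prompt: u64,
    pub completion: u64,
    pub cached: u64,
    pub cache_creation: u64,
}

/// Hypothetical per-token prices in USD.
#[derive(Debug, Clone, Copy)]
pub struct TokenRates {
    pub input: f64,
    pub output: f64,
    pub cache_read: f64,
    pub cache_write: f64,
}

/// How a provider reports its prompt tokens.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Billing {
    /// `prompt` already includes cached tokens; subtract them before pricing.
    Inclusive,
    /// `prompt` excludes cached and cache-creation tokens (assumed convention).
    Additive,
}

pub fn cost_usd(u: &Usage, r: &TokenRates, billing: Billing) -> f64 {
    let completion = u.completion as f64 * r.output;
    let cache_read = u.cached as f64 * r.cache_read;
    match billing {
        Billing::Inclusive => {
            // (prompt - cached) * in + cached * cache_read + completion * out
            let uncached = u.prompt.saturating_sub(u.cached) as f64;
            uncached * r.input + cache_read + completion
        }
        Billing::Additive => {
            // prompt * in + cached * cache_read + creation * cache_write + completion * out
            let cache_write = u.cache_creation as f64 * r.cache_write;
            u.prompt as f64 * r.input + cache_read + cache_write + completion
        }
    }
}
```

Applying the additive branch to inclusive counts charges the cached tokens twice, once at the input rate and once at the cache-read rate, which is the kind of over-billing the verification guards against.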
Note on docker warm sandboxes

While testing, confirmed that docker warm-sandbox reuse works on this branch across REPL restarts (state persists, no duplicate containers). The standalone fix/docker-warm-sandbox branch (prototype of closed #43) is superseded by exoclaw-self-control's own find_running_docker_warm_sandbox (from "Add Exoclaw sandbox transparency controls") and can be deleted. One nit inherited from that implementation is fixed here in its own commit: the docker listing shelled out without the admin-command timeout the prototype had, while running under the warm_sandboxes mutex, so a hung docker daemon would have blocked every sandbox operation in the process. It now goes through run_container_admin_command like the Apple path. Re-verified live after the change.

Known follow-ups (also listed in the design doc): TS path records no ttft_ms/duration_ms; azure_ai provider classified additive (hosts non-Anthropic models too); /usage REPL surface (#29) to be re-ported on top of this.

🤖 Generated with Claude Code