fix(engine): bound tool results in model context on replay; 1h cache TTL #362
Merged
Large tool results were capped at 50K chars only inside the live engine loop; the full output was persisted to tool.done and replayed verbatim by the history reconstructor on every subsequent run. Long conversations therefore re-fed the model the entire payload each turn, ballooning context toward the 1M ceiling and dominating cost via repeated cache writes.

Separate the two concerns the tool.done `output` field was conflating:
- `output` (full): UI/display and the conversation record. Unchanged.
- `modelOutput` (bounded): what the model actually saw. New optional field, persisted only when it differs from `output`.

A single shared boundToolResultForModel() produces the bound; the engine and the history reconstructor both call it, so the model's live view and its replayed view of a result are byte-identical. Legacy events without `modelOutput` fall back to bounding `output` at read time, fixing existing conversations too. The bound is pure and deterministic, keeping the replayed prompt prefix stable and cacheable.

Also raise the Anthropic prompt-cache TTL from the 5-minute default to 1h so the cached prefix survives the multi-minute gaps between agentic turns, converting full prefix re-writes into cheap cache reads.

Tests: unit coverage for boundToolResultForModel (line-boundary trim, UI pointer, determinism, limit<=0) and reconstructor replay (legacy bound, modelOutput verbatim, small unchanged, determinism). Updated cache-control and truncation-notice assertions to the new behavior.
…ingle TTL source)
Problem
A small number of long-lived conversations were driving the overwhelming majority of token cost. Root cause, traced end to end:
- Tool results are capped at `MAX_TOOL_RESULT_CHARS` (50K) only inside the live engine loop (`engine.ts`).
- The full output is persisted to `tool.done.output`, and the history reconstructor replays that full value verbatim (`event-reconstructor.ts`) with no cap.
- `tool.done.output` was a single field serving two masters: the UI/record (wants the full payload) and model replay (wants a bound).

Fix
Separate the two concerns:
- `output` (full): UI/display + conversation record. Unchanged.
- `modelOutput` (bounded): exactly what the model saw on the live turn. New optional field on `ToolDoneEvent`, persisted only when it differs from `output`.

A single shared `boundToolResultForModel()` (in `content-helpers.ts`, next to `extractTextForModel`) produces the bound. The engine and the reconstructor both call it, so the model's live view and replayed view are byte-identical. Legacy events without `modelOutput` fall back to bounding `output` at read time, so existing conversations are fixed too. The bound is pure and deterministic, so the replayed prompt prefix stays stable and cacheable. It trims on line boundaries (never mid-record) and preserves the inline-UI pointer behavior for new events. Legacy UI-tool events without a persisted `modelOutput` are line-trimmed on replay: still bounded, and strictly better than the prior full-payload replay.

Framed precisely: this is a replay-fidelity fix. Before this change, replay gave the model more than the live run did. The fix also happens to eliminate the runaway context growth.
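To make the shape of the fix concrete, here is a minimal sketch of a line-boundary bound plus the write-time and read-time paths that share it. It is not the PR's code: the truncation marker wording, the helper names `makeToolDoneEvent` and `replayTextForModel`, and the reduced `ToolDoneEvent` shape are assumptions for illustration, and the inline-UI pointer behavior is not modeled.

```typescript
// Illustrative sketch only; marker text and helper names are assumptions.
const MAX_TOOL_RESULT_CHARS = 50_000;

// Pure and deterministic: the same input always yields the same string.
function boundToolResultForModel(
  output: string,
  limit: number = MAX_TOOL_RESULT_CHARS,
): string {
  // limit <= 0 means "unbounded"; results within the limit pass through.
  if (limit <= 0 || output.length <= limit) return output;

  // Prefer cutting at the last newline inside the limit so no line is split.
  const head = output.slice(0, limit);
  const lastNewline = head.lastIndexOf("\n");
  // A single huge line has no usable newline: fall back to a hard slice.
  const kept = lastNewline > 0 ? head.slice(0, lastNewline) : head;

  // The marker is appended after the kept text, so the result can exceed
  // `limit` by the marker's length.
  const omitted = output.length - kept.length;
  return `${kept}\n[truncated: ${omitted} chars omitted]`;
}

// Reduced, hypothetical event shape with only the two fields discussed.
interface ToolDoneEvent {
  type: "tool.done";
  output: string; // full payload for UI and the conversation record
  modelOutput?: string; // bounded view, stored only when it differs
}

// Write path: persist modelOutput only when bounding changed something.
function makeToolDoneEvent(
  output: string,
  limit: number = MAX_TOOL_RESULT_CHARS,
): ToolDoneEvent {
  const bounded = boundToolResultForModel(output, limit);
  return bounded === output
    ? { type: "tool.done", output }
    : { type: "tool.done", output, modelOutput: bounded };
}

// Read path: replay the stored bound verbatim; legacy events without
// modelOutput are bounded at read time with the same function.
function replayTextForModel(
  event: ToolDoneEvent,
  limit: number = MAX_TOOL_RESULT_CHARS,
): string {
  return event.modelOutput ?? boundToolResultForModel(event.output, limit);
}
```

Because both paths go through one pure function, a legacy event and a newly written event for the same payload replay to the same string, which is what keeps the prompt prefix stable.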
Also: 1h prompt-cache TTL
Both Anthropic cache breakpoints move from the 5-minute default to `ttl: "1h"`. Agentic turns pause far longer than 5 minutes (user steps away; automation waits on I/O), so the prefix constantly lapses and the next turn re-writes the whole thing at the write rate. 1h keeps the prefix alive across those gaps, converting full re-writes into cheap reads.

Tests
- `boundToolResultForModel`: under-limit passthrough, line-boundary trim, single-huge-line hard-slice fallback, inline-UI pointer, determinism, `limit<=0` unbounded.
- Reconstructor replay: legacy events bounded at read time; `modelOutput` replayed verbatim; small result unchanged; determinism; UI metadata still carries full output.
- Updated the cache-control assertions (`ttl: "1h"`) and the prompt-injection truncation-notice assertion (new marker wording). Security property (injection beyond the bound is dropped) preserved.
- `tsc --noEmit` clean, biome clean on `src/`, full unit suite green except one unrelated pre-existing missing-dep (`dompurify`) in an automations-UI test this PR doesn't touch.

Rollout note
The model bound takes effect immediately for all conversations (new events store `modelOutput`; legacy events are bounded on read). The 1h TTL is fleet-wide. No tenant config change required. Expected to cut the dominant cache-write cost substantially and stop context from ballooning on long threads.
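For readers unfamiliar with the TTL setting, the sketch below shows roughly how a 1h TTL could be attached to a cache breakpoint in an Anthropic Messages API request body. The local types, the `PROMPT_CACHE_TTL` constant, and the `withCacheBreakpoint` helper are assumptions for illustration, not names from this PR; only the `cache_control: { type: "ephemeral", ttl: "1h" }` shape comes from the API itself.

```typescript
// Local, hypothetical types mirroring the cache_control field on content blocks.
type CacheTtl = "5m" | "1h";

interface CacheControl {
  type: "ephemeral";
  ttl?: CacheTtl; // when omitted, the API uses its 5-minute default
}

interface TextBlock {
  type: "text";
  text: string;
  cache_control?: CacheControl;
}

// One constant so every breakpoint uses the same TTL (name is illustrative).
const PROMPT_CACHE_TTL: CacheTtl = "1h";

// Returns a copy of the block marked as a cache breakpoint with the shared TTL.
function withCacheBreakpoint(block: TextBlock): TextBlock {
  return {
    ...block,
    cache_control: { type: "ephemeral", ttl: PROMPT_CACHE_TTL },
  };
}

// Example: mark the end of the system prompt as a breakpoint.
const systemBlocks: TextBlock[] = [
  withCacheBreakpoint({ type: "text", text: "You are a helpful agent." }),
];
```

Keeping the TTL in one place means the breakpoints cannot drift apart if the value is changed later.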