fix(engine): bound tool results in model context on replay; 1h cache TTL #362

Merged: mgoldsborough merged 5 commits into main from fix/tool-result-replay-bound on Jun 2, 2026

mgoldsborough commented Jun 2, 2026

Problem

A small number of long-lived conversations were driving the overwhelming majority of token cost. Root cause, traced end to end:

  • Tool results are capped at MAX_TOOL_RESULT_CHARS (50K) only inside the live engine loop (engine.ts).
  • The full output is persisted to tool.done.output, and the history reconstructor replays that full value verbatim (event-reconstructor.ts) with no cap.
  • So on every subsequent run, the model is re-fed the entire payload, which is more than it saw live. Long threads balloon toward the 1M context ceiling, and with prompt caching the giant prefix is re-written into the cache on every turn (cache writes were ~80% of spend, and the write:read ratio was ~2:1, the inverse of what an effective cache should show).

tool.done.output was a single field serving two masters: the UI/record (wants the full payload) and model replay (wants a bound).

Fix

Separate the two concerns:

  • output (full) — UI/display + conversation record. Unchanged.
  • modelOutput (bounded) — exactly what the model saw on the live turn. New optional field on ToolDoneEvent, persisted only when it differs from output.
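
To make the two-field split concrete, here is a minimal sketch of how such an event might look. Only the names ToolDoneEvent, output, and modelOutput come from this PR; the `type` discriminator, the assumption that both fields are plain strings, and the helper makeToolDoneEvent are hypothetical.

```typescript
// Hypothetical shape: only `output` and `modelOutput` are named in this PR.
interface ToolDoneEvent {
  type: "tool.done";
  output: string; // full payload, used for UI/display and the conversation record
  modelOutput?: string; // bounded text the model saw; omitted when equal to `output`
}

// Hypothetical helper: build the event, persisting `modelOutput` only when
// bounding actually changed the text.
function makeToolDoneEvent(output: string, bounded: string): ToolDoneEvent {
  const event: ToolDoneEvent = { type: "tool.done", output };
  if (bounded !== output) {
    event.modelOutput = bounded;
  }
  return event;
}
```

Omitting the field when the two values are equal keeps small results from being stored twice, which matches the "persisted only when it differs" rule above.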

A single shared boundToolResultForModel() (in content-helpers.ts, next to extractTextForModel) produces the bound. The engine and the reconstructor both call it, so the model's live view and replayed view are byte-identical. Legacy events without modelOutput fall back to bounding output at read time, so existing conversations are fixed too. The bound is pure and deterministic, so the replayed prompt prefix stays stable and cacheable.

The bound trims on line boundaries (never mid-record) and preserves the inline-UI pointer behavior for new events. Legacy UI-tool events without a persisted modelOutput are line-trimmed on replay: still bounded, and strictly better than the prior full-payload replay.
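
The bounding behavior described above could be sketched roughly as follows. This is not the code from content-helpers.ts: the signature, the default-limit parameter, and the wording of TRUNCATION_NOTICE are assumptions, and the inline-UI pointer handling is left out entirely.

```typescript
const MAX_TOOL_RESULT_CHARS = 50_000; // 50K, per the PR description

// Hypothetical marker wording; the real truncation notice is not quoted in this PR.
const TRUNCATION_NOTICE = "\n[tool result truncated]";

// Pure function of (text, limit): no clock, randomness, or I/O, so the same
// input always produces the same bounded output.
function boundToolResultForModel(
  text: string,
  limit: number = MAX_TOOL_RESULT_CHARS,
): string {
  // limit <= 0 means "unbounded"; results within the limit pass through untouched.
  if (limit <= 0 || text.length <= limit) return text;

  const head = text.slice(0, limit);
  const lastNewline = head.lastIndexOf("\n");

  // Prefer cutting after the last complete line. If there is no usable newline
  // within the limit (a single huge line), fall back to a hard slice.
  const kept = lastNewline > 0 ? head.slice(0, lastNewline) : head;
  return kept + TRUNCATION_NOTICE;
}
```

In this sketch the notice is appended after trimming, so the returned string can exceed the limit by the length of the notice. Because nothing in the function depends on state outside its arguments, calling it at write time and again at replay time yields the same bytes, which is the property that keeps the cached prefix stable.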

Framed precisely: this is a replay-fidelity fix. Before this change, replay gave the model more than the live run did. Bounding replay also eliminates the runaway context growth.

Also: 1h prompt-cache TTL

Both Anthropic cache breakpoints move from the 5-minute default to ttl: "1h". Agentic turns pause far longer than 5 minutes (user steps away; automation waits on I/O), so the prefix constantly lapses and the next turn re-writes the whole thing at the write rate. 1h keeps the prefix alive across those gaps, converting full re-writes into cheap reads.
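
As an illustration of what a breakpoint with the longer TTL looks like, the sketch below attaches a cache_control entry to a text content block, assuming the request follows the Anthropic Messages API's `cache_control: { type: "ephemeral", ttl: "1h" }` shape. The types and the helper withCacheBreakpoint are hypothetical; the PR does not show how the engine builds its breakpoints.

```typescript
// Assumed to mirror the Messages API cache_control shape; omitting `ttl`
// corresponds to the 5-minute default.
type CacheControl = { type: "ephemeral"; ttl?: "5m" | "1h" };

interface TextBlock {
  type: "text";
  text: string;
  cache_control?: CacheControl;
}

// Hypothetical helper: return a copy of the block marked as a cache
// breakpoint with the given TTL (1h by default, as in this PR).
function withCacheBreakpoint(block: TextBlock, ttl: "5m" | "1h" = "1h"): TextBlock {
  return { ...block, cache_control: { type: "ephemeral", ttl } };
}

const systemBlock = withCacheBreakpoint({ type: "text", text: "You are an agent..." });
```

Returning a copy rather than mutating the input keeps prompt construction side-effect free, which helps when the same blocks are reused across turns.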

Tests

  • boundToolResultForModel: under-limit passthrough, line-boundary trim, single-huge-line hard-slice fallback, inline-UI pointer, determinism, limit<=0 unbounded.
  • Reconstructor replay: large legacy result bounded; modelOutput replayed verbatim; small result unchanged; determinism; UI metadata still carries full output.
  • Updated existing cache-control assertions (now ttl: "1h") and the prompt-injection truncation-notice assertion (new marker wording). Security property (injection beyond the bound is dropped) preserved.

tsc --noEmit is clean, biome is clean on src/, and the full unit suite is green except for one unrelated, pre-existing failure: a missing dependency (dompurify) in an automations-UI test this PR doesn't touch.

Rollout note

The model bound takes effect immediately for all conversations (new events store modelOutput; legacy events are bounded on read). The 1h TTL is fleet-wide. No tenant config change is required. The change is expected to cut the dominant cache-write cost substantially and stop context from ballooning on long threads.
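
The read-time behavior that makes the rollout immediate can be sketched as below. The names PersistedToolDone and textForReplay are hypothetical, both fields are assumed to be plain strings, and the bound is passed in as a parameter so the sketch does not depend on the real helper.

```typescript
// Hypothetical view of a persisted tool.done event.
interface PersistedToolDone {
  output: string;
  modelOutput?: string;
}

// Hypothetical replay helper: decide what text the model sees for a stored event.
function textForReplay(
  event: PersistedToolDone,
  bound: (text: string) => string,
): string {
  // New events with a stored bound: replay exactly what the model saw live.
  if (event.modelOutput !== undefined) return event.modelOutput;
  // Legacy events, and new events small enough that no separate modelOutput
  // was stored: bound `output` at read time (a no-op when it is under the limit).
  return bound(event.output);
}

// Stand-in bound for illustration only: a hard slice at 10 characters.
const demoBound = (text: string): string => text.slice(0, 10);
const replayed = textForReplay({ output: "a".repeat(100) }, demoBound);
```

Since the fallback runs when history is reconstructed rather than when it is written, no migration of stored events is needed for older conversations to get the bound.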

Large tool results were capped at 50K chars only inside the live engine
loop; the full output was persisted to tool.done and replayed verbatim by
the history reconstructor on every subsequent run. Long conversations
therefore re-fed the model the entire payload each turn, ballooning context
toward the 1M ceiling and dominating cost via repeated cache writes.

Separate the two concerns the tool.done `output` field was conflating:
- `output` (full) — UI/display and the conversation record. Unchanged.
- `modelOutput` (bounded) — what the model actually saw. New optional field,
  persisted only when it differs from `output`.

A single shared boundToolResultForModel() produces the bound; the engine and
the history reconstructor both call it, so the model's live view and its
replayed view of a result are byte-identical. Legacy events without
`modelOutput` fall back to bounding `output` at read time, fixing existing
conversations too. The bound is pure and deterministic, keeping the replayed
prompt prefix stable and cacheable.

Also raise the Anthropic prompt-cache TTL from the 5-minute default to 1h so
the cached prefix survives the multi-minute gaps between agentic turns,
converting full prefix re-writes into cheap cache reads.

Tests: unit coverage for boundToolResultForModel (line-boundary trim, UI
pointer, determinism, limit<=0) and reconstructor replay (legacy bound,
modelOutput verbatim, small unchanged, determinism). Updated cache-control
and truncation-notice assertions to the new behavior.
mgoldsborough added the qa-reviewed label (QA review completed with no critical issues) Jun 2, 2026
mgoldsborough merged commit 053b5c2 into main Jun 2, 2026
5 checks passed
mgoldsborough deleted the fix/tool-result-replay-bound branch June 2, 2026 07:10