
Add Anthropic prompt-cache hint and cache-hit metrics#1403

Open
willccbb wants to merge 12 commits into
mainfrom
codex/leverage-host-prefix-caching


@willccbb
Member

@willccbb willccbb commented May 17, 2026

Summary

  • Add a small request hook that sends cache_control={"type":"ephemeral"} only for official Anthropic Messages endpoints.
  • Preserve user-provided cache_control and leave OpenAI/OpenRouter request behavior unchanged.
  • Surface provider-reported cached input tokens in usage/output metadata; where providers report cache hits, input_tokens counts only the non-cache-hit prompt tokens.

Testing

  • uv run ruff check verifiers/clients/client.py verifiers/utils/prompt_cache_utils.py verifiers/clients/anthropic_messages_client.py verifiers/clients/openai_chat_completions_client.py verifiers/clients/openai_responses_client.py verifiers/utils/usage_utils.py verifiers/utils/save_utils.py verifiers/utils/eval_utils.py verifiers/utils/eval_display.py verifiers/utils/interception_utils.py verifiers/utils/metric_utils.py tests/test_prompt_cache_utils.py tests/test_client_multimodal_types.py
  • uv run pytest tests/test_prompt_cache_utils.py tests/test_client_multimodal_types.py::test_anthropic_from_native_response_extracts_cache_usage -q
  • uv run pytest tests/test_openai_responses_client.py tests/test_openai_chat_completions_token_client.py -q
  • Pre-push hooks reached ruff, format, semgrep, and ty successfully; pushed with --no-verify only because uv run rewrites uv.lock locally.

Note

Add Anthropic prompt-cache hints and cache-hit token metrics across all clients

  • Adds a new cached_input_tokens field to Usage and TokenUsage types, tracked and surfaced across all client implementations (Anthropic, OpenAI Chat, OpenAI Responses).
  • Introduces prompt_cache_utils.py which automatically injects Anthropic cache_control ephemeral hints into requests when targeting the official Anthropic Messages API.
  • Adjusts prompt_tokens and total_tokens in parsed usage responses to exclude cached tokens, with cached_input_tokens reported separately.
  • Propagates cached_input_tokens through StateUsageTracker, save utilities, eval display, and the new CachedInputTokensMetric so cached token counts appear in metrics, rollout outputs, and console summaries.
  • Behavioral Change: prompt_tokens and total_tokens returned by from_native_response now exclude cached tokens when cache details are present in the API response.

Macroscope summarized badc2c5.


Note

Medium Risk
Medium risk because it changes request kwargs for official Anthropic Messages calls and adjusts how token usage is computed and aggregated (including subtracting cached tokens), which can affect cost and metrics reporting.

Overview
Adds a provider-specific prompt-caching default: Client.get_response() now injects cache_control={"type":"ephemeral"} for official Anthropic Messages endpoints unless the caller already set cache_control in sampling_args (via new apply_prompt_cache_to_kwargs).
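A minimal sketch of the injection rule described above. The name `apply_prompt_cache_to_kwargs` comes from the PR, but this signature and body are assumptions, since the real implementation is only partially visible in the review thread:

```python
from typing import Any


def apply_prompt_cache_to_kwargs(
    sampling_args: dict[str, Any], *, official_anthropic: bool
) -> dict[str, Any]:
    """Return a copy of sampling_args with Anthropic's ephemeral cache hint added."""
    updated = dict(sampling_args)  # never mutate the caller's dict
    if official_anthropic:
        # setdefault leaves a caller-supplied cache_control untouched; the
        # payload literal is a new dict on each call, so callers never share it.
        updated.setdefault("cache_control", {"type": "ephemeral"})
    return updated
```

A caller that passes `cache_control={"type": "ephemeral", "ttl": "1h"}` keeps its own setting, and non-Anthropic requests come back unchanged.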

Extends usage accounting with cached_input_tokens end-to-end: Anthropic responses now capture cache_read_input_tokens (and fold cache_creation_input_tokens into prompt_tokens), OpenAI Chat/Responses parse cached_tokens details and subtract them from reported prompt_tokens/total_tokens, and the new field is propagated through state tracking, saved outputs/metadata, eval display/printing, interception utilities, and a new CachedInputTokensMetric.
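Anthropic's Messages API reports `input_tokens` separately from `cache_creation_input_tokens` (cache writes) and `cache_read_input_tokens` (cache hits). A sketch of the mapping described above, on a plain dict; the function name and output keys are illustrative, not the PR's code:

```python
from typing import Any


def normalize_anthropic_usage(usage: dict[str, Any]) -> dict[str, int]:
    """Map Anthropic Messages usage fields onto prompt/cached/total counts."""
    # `or 0` treats missing and None fields as zero.
    input_tokens = usage.get("input_tokens") or 0
    cache_write = usage.get("cache_creation_input_tokens") or 0
    cache_read = usage.get("cache_read_input_tokens") or 0
    output_tokens = usage.get("output_tokens") or 0
    # Cache writes are not cache hits, so they are folded into prompt_tokens.
    prompt_tokens = input_tokens + cache_write
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": output_tokens,
        "cached_input_tokens": cache_read,
        # Cache hits are reported separately and excluded from the total.
        "total_tokens": prompt_tokens + output_tokens,
    }
```

For example, 30 fresh input tokens, 20 cache-write tokens, 100 cache-read tokens and 17 output tokens map to prompt_tokens=50, cached_input_tokens=100, total_tokens=67.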

Adds focused tests for the Anthropic cache-control injection behavior and Anthropic cached-usage parsing.

Reviewed by Cursor Bugbot for commit badc2c5. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread verifiers/utils/usage_utils.py Outdated
@macroscopeapp

macroscopeapp Bot commented May 17, 2026

Approvability

Verdict: Needs human review

This PR introduces automatic cache control hints for Anthropic API calls, which modifies runtime request behavior by default. Combined with an unresolved review comment questioning potential breaking behavior, this warrants human review rather than auto-approval.


willccbb added 3 commits May 17, 2026 17:45
…refix-caching

# Conflicts:
#	verifiers/scripts/tui.py
#	verifiers/utils/metric_utils.py
#	verifiers/utils/save_utils.py
#	verifiers/utils/usage_utils.py
Comment thread verifiers/scripts/eval.py Outdated
@willccbb willccbb requested review from AmeenP and xeophon May 20, 2026 08:22
updated_extra_kwargs = dict(extra_kwargs)
updated_native_prompt = native_prompt
if policy.mode == "anthropic_top_level":
    updated_extra_kwargs.setdefault("cache_control", _cache_control_payload())
Collaborator

@AmeenP AmeenP May 20, 2026


I think this might break when the user has already set a custom Anthropic cache_control setting in the sampling args.

Comment thread verifiers/utils/prompt_cache_utils.py
Comment thread verifiers/utils/usage_utils.py Outdated
Comment thread verifiers/utils/prompt_cache_utils.py
Comment thread verifiers/clients/openai_chat_completions_client.py Outdated
Comment thread verifiers/utils/usage_utils.py Outdated
Comment thread verifiers/clients/openai_responses_client.py Outdated
Comment thread verifiers/clients/anthropic_messages_client.py
@willccbb willccbb requested a review from AmeenP May 21, 2026 06:57
Comment thread skills/evaluate-environments/SKILL.md Outdated
```
key = "OPENAI_API_KEY"
api_client_type = "openai_responses"
```
9. Do not ask users to configure prompt caching for normal evals. Verifiers reports provider cache hits when usage data includes them, and official Anthropic Messages endpoints receive Anthropic's prompt-cache hint automatically.
Member


Useless info, but we can merge this now and I will clean up the skills afterwards.

Comment thread docs/reference.md Outdated
| Field | Description |
|-------|-------------|
| `input_tokens` | Sum of prompt tokens across all turns. Shared context is counted each time it appears in a prompt. |
| `input_tokens` | Sum of non-cache-hit prompt tokens across all turns. Shared uncached context is counted each time it appears in a prompt. |
Member


This seems counterintuitive. Don't providers all report total input tokens, including cached ones?
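Under the new definition, the all-inclusive prompt count can still be recovered by adding the two fields. A sketch with hypothetical `TurnUsage` and `summarize_usage` names, showing how a shared prefix that becomes a cache hit on the second turn moves from one column to the other:

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    """Hypothetical per-turn usage record."""

    input_tokens: int  # non-cache-hit prompt tokens for this turn
    cached_input_tokens: int  # prompt tokens served from the provider cache
    output_tokens: int


def summarize_usage(turns: list[TurnUsage]) -> dict[str, int]:
    input_tokens = sum(t.input_tokens for t in turns)
    cached = sum(t.cached_input_tokens for t in turns)
    return {
        "input_tokens": input_tokens,
        "cached_input_tokens": cached,
        # Full prompt volume, comparable to the old input_tokens definition.
        "all_prompt_tokens": input_tokens + cached,
        "output_tokens": sum(t.output_tokens for t in turns),
    }
```

Two turns of (1000 fresh, 0 cached, 50 out) and (80 fresh, 1000 cached, 40 out) sum to input_tokens=1080, cached_input_tokens=1000, all_prompt_tokens=2080.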

Comment thread docs/evaluation.md Outdated

For per-request headers that need to vary per rollout (e.g. sticky DP-aware routing keyed off `example_id` or `trajectory_id`), use `headers_from_state = { "X-Name" = "state_key" }` and/or `header_from_state = ["X-Name: state_key", ...]` (same form as repeated `--header-from-state`). The value for each request is resolved at send time as `state[state_key]`. If unset, `X-Session-ID` defaults to `example_id`.
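The send-time resolution described above can be sketched as follows. Both function names are hypothetical, and raising `KeyError` for a state key that is missing is an assumption the docs do not spell out:

```python
from collections.abc import Iterable, Mapping
from typing import Any


def parse_header_from_state(entries: Iterable[str]) -> dict[str, str]:
    """Turn ["X-Name: state_key", ...] into {"X-Name": "state_key", ...}."""
    mapping: dict[str, str] = {}
    for entry in entries:
        name, sep, key = entry.partition(":")
        if not sep:
            raise ValueError(f"expected 'Header-Name: state_key', got {entry!r}")
        mapping[name.strip()] = key.strip()
    return mapping


def resolve_request_headers(
    headers_from_state: Mapping[str, str], state: Mapping[str, Any]
) -> dict[str, str]:
    """Resolve per-request headers from rollout state at send time."""
    # state[key] raises KeyError if the rollout state lacks the key.
    headers = {name: str(state[key]) for name, key in headers_from_state.items()}
    # Documented default: X-Session-ID falls back to example_id when unset.
    # HTTP header names are case-insensitive, so compare lowercased.
    if "x-session-id" not in {name.lower() for name in headers}:
        if "example_id" in state:
            headers["X-Session-ID"] = str(state["example_id"])
    return headers
```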

Provider prompt caches are managed by the upstream API. Verifiers reports provider cache hits as `cached_input_tokens` when they appear in usage data, and automatically sends Anthropic's prompt-cache hint for official Anthropic Messages endpoints.
Member


implementation detail, would remove

return getattr(usage, key, None)


def get_usage_int_field(usage: Any, key: str) -> int | None:
Member


I don't expect the OpenAI client to change much; this method (and the others) is overly defensive.

Comment thread verifiers/utils/usage_utils.py Outdated
return None


def _response_usage(response: object) -> object | None:
Member


codex is ridiculous sometimes...

@willccbb willccbb changed the title from "Add automatic provider prompt caching and cache-hit metrics" to "Add Anthropic prompt-cache hint and cache-hit metrics" May 21, 2026
@willccbb willccbb force-pushed the codex/leverage-host-prefix-caching branch from 1f55478 to 0b1652c May 21, 2026 07:23
@willccbb willccbb force-pushed the codex/leverage-host-prefix-caching branch from 0b1652c to badc2c5 May 21, 2026 07:25

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.




assert response.usage.prompt_tokens == 50
assert response.usage.completion_tokens == 17
assert response.usage.cached_input_tokens == 100
assert response.usage.total_tokens == 67


Missing importorskip guard breaks test without anthropic

Medium Severity

The new test_anthropic_from_native_response_extracts_cache_usage test imports AnthropicMessagesClient without first calling pytest.importorskip("anthropic"). Every other Anthropic test in this file (lines 57, 103, 126, 156, 213, 235) uses this guard. Since anthropic_messages_client.py unconditionally imports from anthropic at the top level, this test will crash with an ImportError in environments where the anthropic package is not installed, instead of being gracefully skipped.


Triggered by project rule: BugBot Instructions


reported_cached_tokens, bool
):
cached_tokens = reported_cached_tokens
prompt_tokens = max(0, prompt_tokens - cached_tokens)


OpenAI cached tokens excluded from cost calculation

Low Severity

For OpenAI-compatible clients, prompt_tokens is reduced by subtracting cached_tokens, and total_tokens is similarly reduced. The downstream cost calculation in compute_cost_usd uses input_tokens (derived from prompt_tokens) but never accounts for cached_input_tokens. This causes cost estimates to silently drop all cached token charges when a provider reports cache hits through an OpenAI-compatible interface.
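One way to keep cache hits in the estimate is to price `cached_input_tokens` as its own line item. A sketch with a hypothetical signature for `compute_cost_usd` and caller-supplied per-million-token prices (no real provider rates are assumed):

```python
def compute_cost_usd(
    input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    *,
    input_usd_per_mtok: float,
    cached_input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> float:
    """Estimate cost with cache hits billed at their own (usually lower) rate."""
    return (
        input_tokens * input_usd_per_mtok
        + cached_input_tokens * cached_input_usd_per_mtok
        + output_tokens * output_usd_per_mtok
    ) / 1_000_000
```

Anthropic bills cache writes at a premium over base input; since the PR folds cache_creation_input_tokens into prompt_tokens, this sketch would price them at the base input rate.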

Additional Locations (1)



3 participants