Add Anthropic prompt-cache hint and cache-hit metrics by willccbb · Pull Request #1403 · PrimeIntellect-ai/verifiers

willccbb · 2026-05-17T22:27:41Z

Summary

Add a small request hook that sends cache_control={"type":"ephemeral"} only for official Anthropic Messages endpoints.
Preserve user-provided cache_control and leave OpenAI/OpenRouter request behavior unchanged.
Surface provider-reported cached input tokens in usage/output metadata; input_tokens remains non-cache-hit prompt tokens where providers report cache hits.

Testing

uv run ruff check verifiers/clients/client.py verifiers/utils/prompt_cache_utils.py verifiers/clients/anthropic_messages_client.py verifiers/clients/openai_chat_completions_client.py verifiers/clients/openai_responses_client.py verifiers/utils/usage_utils.py verifiers/utils/save_utils.py verifiers/utils/eval_utils.py verifiers/utils/eval_display.py verifiers/utils/interception_utils.py verifiers/utils/metric_utils.py tests/test_prompt_cache_utils.py tests/test_client_multimodal_types.py
uv run pytest tests/test_prompt_cache_utils.py tests/test_client_multimodal_types.py::test_anthropic_from_native_response_extracts_cache_usage -q
uv run pytest tests/test_openai_responses_client.py tests/test_openai_chat_completions_token_client.py -q
Pre-push hooks reached ruff, format, semgrep, and ty successfully; pushed with --no-verify only because uv run rewrites uv.lock locally.

Note

Add Anthropic prompt-cache hints and cache-hit token metrics across all clients

Adds a new cached_input_tokens field to Usage and TokenUsage types, tracked and surfaced across all client implementations (Anthropic, OpenAI Chat, OpenAI Responses).
Introduces prompt_cache_utils.py which automatically injects Anthropic cache_control ephemeral hints into requests when targeting the official Anthropic Messages API.
Adjusts prompt_tokens and total_tokens in parsed usage responses to exclude cached tokens, with cached_input_tokens reported separately.
Propagates cached_input_tokens through StateUsageTracker, save utilities, eval display, and the new CachedInputTokensMetric so cached token counts appear in metrics, rollout outputs, and console summaries.
Behavioral Change: prompt_tokens and total_tokens returned by from_native_response now exclude cached tokens when cache details are present in the API response.

Macroscope summarized badc2c5.

Note

Medium Risk
Medium risk because it changes request kwargs for official Anthropic Messages calls and adjusts how token usage is computed/aggregated (including subtracting cached tokens) which can affect cost/metrics reporting.

Overview
Adds a provider-specific prompt-caching default: Client.get_response() now injects cache_control={"type":"ephemeral"} for official Anthropic Messages endpoints unless the caller already set cache_control in sampling_args (via new apply_prompt_cache_to_kwargs).
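
For readers skimming the thread, the injection rule described above can be sketched roughly as follows. This is not the PR's `apply_prompt_cache_to_kwargs`; the function names, the `"anthropic_messages"` client-type string, and the host check against `api.anthropic.com` are assumptions made for illustration.

```python
from __future__ import annotations

from typing import Any
from urllib.parse import urlparse

# Assumption: "official" means the request targets api.anthropic.com.
OFFICIAL_ANTHROPIC_HOST = "api.anthropic.com"


def is_official_anthropic(base_url: str | None) -> bool:
    """Treat an unset base_url (SDK default) or the official host as official."""
    if base_url is None:
        return True
    return urlparse(base_url).hostname == OFFICIAL_ANTHROPIC_HOST


def with_prompt_cache_hint(
    client_type: str, base_url: str | None, kwargs: dict[str, Any]
) -> dict[str, Any]:
    """Return a copy of kwargs with a default cache_control hint where applicable."""
    updated = dict(kwargs)  # never mutate the caller's dict
    if client_type == "anthropic_messages" and is_official_anthropic(base_url):
        # setdefault leaves any caller-supplied cache_control untouched.
        updated.setdefault("cache_control", {"type": "ephemeral"})
    return updated
```

Under these assumptions, a caller that already passes its own `cache_control` gets it back unchanged, and OpenAI or OpenRouter requests never gain the key.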

Extends usage accounting with cached_input_tokens end-to-end: Anthropic responses now capture cache_read_input_tokens (and fold cache_creation_input_tokens into prompt_tokens), OpenAI Chat/Responses parse cached_tokens details and subtract them from reported prompt_tokens/total_tokens, and the new field is propagated through state tracking, saved outputs/metadata, eval display/printing, interception utilities, and a new CachedInputTokensMetric.
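
The accounting described above can be illustrated with a small sketch. `NormalizedUsage` and the two helper functions are hypothetical names, not the repository's `Usage` type or parsing code; they only mirror the stated rules (Anthropic: cache reads reported separately and cache writes folded into prompt tokens; OpenAI-style: cached tokens subtracted from the reported prompt tokens).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NormalizedUsage:
    prompt_tokens: int        # non-cache-hit input tokens
    completion_tokens: int
    cached_input_tokens: int  # provider-reported cache hits
    total_tokens: int         # prompt_tokens + completion_tokens


def normalize_anthropic(
    input_tokens: int,
    output_tokens: int,
    cache_creation_input_tokens: int = 0,
    cache_read_input_tokens: int = 0,
) -> NormalizedUsage:
    # Anthropic's input_tokens already excludes cache reads and writes,
    # so cache writes are added back and cache reads are kept separate.
    prompt = input_tokens + cache_creation_input_tokens
    return NormalizedUsage(
        prompt, output_tokens, cache_read_input_tokens, prompt + output_tokens
    )


def normalize_openai(
    prompt_tokens: int, completion_tokens: int, cached_tokens: int = 0
) -> NormalizedUsage:
    # OpenAI-style prompt_tokens includes cached tokens, so subtract them.
    uncached = max(0, prompt_tokens - cached_tokens)
    return NormalizedUsage(
        uncached, completion_tokens, cached_tokens, uncached + completion_tokens
    )
```

With made-up inputs, `normalize_anthropic(40, 17, 10, 100)` and `normalize_openai(150, 17, 100)` both yield 50 prompt tokens, 100 cached input tokens, and 67 total tokens.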

Adds focused tests for the Anthropic cache-control injection behavior and Anthropic cached-usage parsing.

^{Reviewed by Cursor Bugbot for commit badc2c5. Bugbot is set up for automated code reviews on this repo. Configure here.}

macroscopeapp · 2026-05-17T22:34:39Z

Approvability

Verdict: Needs human review

This PR introduces automatic cache control hints for Anthropic API calls, which modifies runtime request behavior by default. Combined with an unresolved review comment questioning potential breaking behavior, this warrants human review rather than auto-approval.

AmeenP · 2026-05-20T10:29:31Z

+    updated_extra_kwargs = dict(extra_kwargs)
+    updated_native_prompt = native_prompt
+    if policy.mode == "anthropic_top_level":
+        updated_extra_kwargs.setdefault("cache_control", _cache_control_payload())


I think this might break when the user has already set a custom Anthropic cache-control setting in the sampling args.

xeophon · 2026-05-21T06:53:51Z

 key = "OPENAI_API_KEY"
 api_client_type = "openai_responses"
 ```
+9. Do not ask users to configure prompt caching for normal evals. Verifiers reports provider cache hits when usage data includes them, and official Anthropic Messages endpoints receive Anthropic's prompt-cache hint automatically.


useless info, but we can merge this now and then i will clean the skills afterwards

xeophon · 2026-05-21T06:54:49Z

 | Field | Description |
 |-------|-------------|
-| `input_tokens` | Sum of prompt tokens across all turns. Shared context is counted each time it appears in a prompt. |
+| `input_tokens` | Sum of non-cache-hit prompt tokens across all turns. Shared uncached context is counted each time it appears in a prompt. |


This seems counter-intuitive. Don't all providers report all input tokens, including cached ones?

xeophon · 2026-05-21T06:55:08Z


 For per-request headers that need to vary per rollout (e.g. sticky DP-aware routing keyed off `example_id` or `trajectory_id`), use `headers_from_state = { "X-Name" = "state_key" }` and/or `header_from_state = ["X-Name: state_key", ...]` (same form as repeated `--header-from-state`). The value for each request is resolved at send time as `state[state_key]`. If unset, `X-Session-ID` defaults to `example_id`.

+Provider prompt caches are managed by the upstream API. Verifiers reports provider cache hits as `cached_input_tokens` when they appear in usage data, and automatically sends Anthropic's prompt-cache hint for official Anthropic Messages endpoints.


implementation detail, would remove

xeophon · 2026-05-21T06:56:52Z

    return getattr(usage, key, None)


+def get_usage_int_field(usage: Any, key: str) -> int | None:


i don't expect the openai client to change a lot, this (and the other methods) are overly defensive

xeophon · 2026-05-21T06:58:38Z

+    return None
+
+
+def _response_usage(response: object) -> object | None:


codex is ridiculous sometimes...

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-05-21T07:32:30Z

+    assert response.usage.prompt_tokens == 50
+    assert response.usage.completion_tokens == 17
+    assert response.usage.cached_input_tokens == 100
+    assert response.usage.total_tokens == 67


Missing importorskip guard breaks test without anthropic

Medium Severity

The new test_anthropic_from_native_response_extracts_cache_usage test imports AnthropicMessagesClient without first calling pytest.importorskip("anthropic"). Every other Anthropic test in this file (lines 57, 103, 126, 156, 213, 235) uses this guard. Since anthropic_messages_client.py unconditionally imports from anthropic at the top level, this test will crash with an ImportError in environments where the anthropic package is not installed, instead of being gracefully skipped.

Triggered by project rule: BugBot Instructions
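
A minimal sketch of the guard this comment asks for, assuming the test imports the client inside the function body. The placeholder assertion stands in for the real test body, which is not reproduced here.

```python
import pytest


def test_anthropic_from_native_response_extracts_cache_usage():
    # Skip (instead of failing with ImportError) when the optional SDK is absent.
    pytest.importorskip("anthropic")

    # Import only after the guard: the client module imports anthropic
    # at the top level.
    from verifiers.clients.anthropic_messages_client import AnthropicMessagesClient

    # Placeholder for the real assertions on the parsed usage.
    assert AnthropicMessagesClient is not None
```

`pytest.importorskip` raises pytest's skip exception when the module cannot be imported, so the test is reported as skipped rather than as an error.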

cursor · 2026-05-21T07:32:30Z

+                    reported_cached_tokens, bool
+                ):
+                    cached_tokens = reported_cached_tokens
+                    prompt_tokens = max(0, prompt_tokens - cached_tokens)


OpenAI cached tokens excluded from cost calculation

Low Severity

For OpenAI-compatible clients, prompt_tokens is reduced by subtracting cached_tokens, and total_tokens is similarly reduced. The downstream cost calculation in compute_cost_usd uses input_tokens (derived from prompt_tokens) but never accounts for cached_input_tokens. This causes cost estimates to silently drop all cached token charges when a provider reports cache hits through an OpenAI-compatible interface.

Additional Locations (1)

verifiers/clients/openai_responses_client.py#L396-L398
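
To make the concern concrete, here is a hedged sketch of a cost estimate that bills cached input tokens at their own rate. It is not the repository's `compute_cost_usd`; the function name, parameter names, and per-million-token rates are all hypothetical.

```python
def estimate_cost_usd(
    input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    *,
    input_usd_per_mtok: float,
    cached_input_usd_per_mtok: float,
    output_usd_per_mtok: float,
) -> float:
    """Estimate cost when input_tokens excludes cache hits.

    Cached tokens are usually cheaper but not free, so they need their own
    term once they are no longer counted in input_tokens.
    """
    total = (
        input_tokens * input_usd_per_mtok
        + cached_input_tokens * cached_input_usd_per_mtok
        + output_tokens * output_usd_per_mtok
    )
    return total / 1_000_000


# Made-up rates: 2.00 / 0.50 / 8.00 USD per million tokens.
cost = estimate_cost_usd(
    50, 100, 17,
    input_usd_per_mtok=2.0,
    cached_input_usd_per_mtok=0.5,
    output_usd_per_mtok=8.0,
)
```

With these made-up rates, omitting the cached term would understate the estimate by the cost of the 100 cached tokens.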

willccbb added 2 commits May 16, 2026 12:29

Add prompt cache handling and token accounting

fce4d3d

Drop cache write token exports

7350782

cursor Bot reviewed May 17, 2026

Comment thread verifiers/utils/usage_utils.py Outdated

willccbb added 3 commits May 17, 2026 17:45

Merge remote-tracking branch 'origin/main' into codex/leverage-host-p…

78a690b

…refix-caching # Conflicts: # verifiers/scripts/tui.py # verifiers/utils/metric_utils.py # verifiers/utils/save_utils.py # verifiers/utils/usage_utils.py

Merge remote-tracking branch 'origin/main' into codex/leverage-host-p…

125f4d0

…refix-caching

Fix prompt cache type checks after main merge

10e0030

cursor Bot reviewed May 18, 2026

Comment thread verifiers/scripts/eval.py Outdated

Merge remote-tracking branch 'origin/main' into codex/leverage-host-p…

6973cdf

…refix-caching

willccbb requested review from AmeenP and xeophon May 20, 2026 08:22

Address prompt cache PR feedback

66ef2ce

AmeenP reviewed May 20, 2026

Shrink prompt cache integration

b713c88

cursor Bot reviewed May 21, 2026

Comment thread verifiers/utils/prompt_cache_utils.py

Comment thread verifiers/utils/usage_utils.py Outdated

macroscopeapp Bot reviewed May 21, 2026

Comment thread verifiers/utils/prompt_cache_utils.py

Address prompt cache review comments

3637077

cursor Bot reviewed May 21, 2026

Comment thread verifiers/clients/openai_chat_completions_client.py Outdated

Comment thread verifiers/utils/usage_utils.py Outdated

Address cached usage review comments

b0f02de

cursor Bot reviewed May 21, 2026

Comment thread verifiers/clients/openai_responses_client.py Outdated

Harden responses usage parsing

7829691

cursor Bot reviewed May 21, 2026

Comment thread verifiers/clients/anthropic_messages_client.py

willccbb requested a review from AmeenP May 21, 2026 06:57

xeophon reviewed May 21, 2026

willccbb changed the title ~~Add automatic provider prompt caching and cache-hit metrics~~ Add Anthropic prompt-cache hint and cache-hit metrics May 21, 2026

willccbb force-pushed the codex/leverage-host-prefix-caching branch from 1f55478 to 0b1652c Compare May 21, 2026 07:23

Shrink prompt cache integration

badc2c5

willccbb force-pushed the codex/leverage-host-prefix-caching branch from 0b1652c to badc2c5 Compare May 21, 2026 07:25

cursor Bot reviewed May 21, 2026
