Add Anthropic prompt-cache hint and cache-hit metrics #1403
Approvability
Verdict: Needs human review
This PR introduces automatic cache control hints for Anthropic API calls, which modifies runtime request behavior by default. Combined with an unresolved review comment questioning potential breaking behavior, this warrants human review rather than auto-approval.
…refix-caching # Conflicts: # verifiers/scripts/tui.py # verifiers/utils/metric_utils.py # verifiers/utils/save_utils.py # verifiers/utils/usage_utils.py
| updated_extra_kwargs = dict(extra_kwargs) | ||
| updated_native_prompt = native_prompt | ||
| if policy.mode == "anthropic_top_level": | ||
| updated_extra_kwargs.setdefault("cache_control", _cache_control_payload()) |
I think this might break when the user has already set a custom Anthropic cache control setting in the sampling args
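To make the concern concrete, here is a minimal sketch of how `dict.setdefault` behaves when a caller has already supplied a top-level `cache_control`. The function name `apply_cache_hint` is hypothetical and is not the PR's helper; the `{"type": "ephemeral"}` payload follows the PR summary, and the `ttl` key in the custom value is only an illustrative assumption.

```python
from typing import Any


def apply_cache_hint(extra_kwargs: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical stand-in for the PR's helper: add a default hint to a copy."""
    updated = dict(extra_kwargs)  # shallow copy, mirroring the diff above
    # setdefault writes the key only when it is absent, so a caller-supplied
    # top-level value is left untouched.
    updated.setdefault("cache_control", {"type": "ephemeral"})
    return updated


# No caller setting: the default hint is added.
plain = apply_cache_hint({"max_tokens": 64})
print(plain)

# Caller already set a custom value (the "ttl" key is illustrative only).
custom = {"cache_control": {"type": "ephemeral", "ttl": "1h"}}
kept = apply_cache_hint(custom)
print(kept)
```

Under these assumptions the top-level key is not overwritten. The sketch says nothing about callers who attach `cache_control` to individual message blocks instead of the top-level kwargs, which would need a separate check.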
| key = "OPENAI_API_KEY" | ||
| api_client_type = "openai_responses" | ||
| ``` | ||
| 9. Do not ask users to configure prompt caching for normal evals. Verifiers reports provider cache hits when usage data includes them, and official Anthropic Messages endpoints receive Anthropic's prompt-cache hint automatically. |
useless info, but we can merge this now and then i will clean the skills afterwards
| | Field | Description | | ||
| |-------|-------------| | ||
| | `input_tokens` | Sum of prompt tokens across all turns. Shared context is counted each time it appears in a prompt. | | ||
| | `input_tokens` | Sum of non-cache-hit prompt tokens across all turns. Shared uncached context is counted each time it appears in a prompt. | |
this seems counter-intuitive? don't all providers report all input tokens, incl. cached ones?
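A small worked example may help show what the revised `input_tokens` definition means in practice. The sketch below assumes a provider whose raw prompt count includes cache hits (the OpenAI-style case where cached tokens are subtracted); `TurnUsage`, `summarize`, and the token counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class TurnUsage:
    """Hypothetical per-turn usage record; field names are illustrative."""

    raw_prompt_tokens: int  # all prompt tokens the provider processed
    cached_tokens: int  # the portion served from the provider's prompt cache


def summarize(turns: list[TurnUsage]) -> dict[str, int]:
    """Split prompt tokens into non-cache-hit and cache-hit totals."""
    cached = sum(t.cached_tokens for t in turns)
    # Clamp at zero in case a provider reports more cached than prompt tokens.
    uncached = sum(max(0, t.raw_prompt_tokens - t.cached_tokens) for t in turns)
    return {"input_tokens": uncached, "cached_input_tokens": cached}


# Turn 1 sends 1000 fresh tokens; turn 2 resends them (all cache hits) plus 200 new ones.
turns = [TurnUsage(1000, 0), TurnUsage(1200, 1000)]
summary = summarize(turns)
print(summary)
```

With these made-up numbers the two fields add up to the 2200 prompt tokens a provider would report in total, so the split loses no information as long as both fields are surfaced together.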
| For per-request headers that need to vary per rollout (e.g. sticky DP-aware routing keyed off `example_id` or `trajectory_id`), use `headers_from_state = { "X-Name" = "state_key" }` and/or `header_from_state = ["X-Name: state_key", ...]` (same form as repeated `--header-from-state`). The value for each request is resolved at send time as `state[state_key]`. If unset, `X-Session-ID` defaults to `example_id`. | ||
| Provider prompt caches are managed by the upstream API. Verifiers reports provider cache hits as `cached_input_tokens` when they appear in usage data, and automatically sends Anthropic's prompt-cache hint for official Anthropic Messages endpoints. |
implementation detail, would remove
| return getattr(usage, key, None) | ||
| def get_usage_int_field(usage: Any, key: str) -> int | None: |
i don't expect the openai client to change a lot, this (and the other methods) are overly defensive
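For context on what "defensive" means here, the following is a guess at the general shape of such a helper, reconstructed only from the fragments visible in the diff (`getattr(usage, key, None)`, a `bool` check, and `return None`). The function bodies and the helper name `get_usage_field` are assumptions, not the PR's code.

```python
from __future__ import annotations

from types import SimpleNamespace
from typing import Any


def get_usage_field(usage: Any, key: str) -> Any:
    """Read a field from either a dict-like or an attribute-style usage object."""
    if isinstance(usage, dict):
        return usage.get(key)
    return getattr(usage, key, None)


def get_usage_int_field(usage: Any, key: str) -> int | None:
    """Return the field only if it is a real int; otherwise None."""
    value = get_usage_field(usage, key)
    # bool is a subclass of int in Python, so it has to be excluded explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return None


print(get_usage_int_field({"cached_tokens": 5}, "cached_tokens"))
print(get_usage_int_field(SimpleNamespace(cached_tokens=True), "cached_tokens"))
```

A less defensive alternative, in the spirit of the comment above, would read the attribute directly and let an unexpected shape raise; which trade-off is right depends on how many client response types the helper has to serve.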
| return None | ||
| def _response_usage(response: object) -> object | None: |
codex is ridiculous sometimes...
1f55478 to 0b1652c
0b1652c to badc2c5
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Reviewed by Cursor Bugbot for commit badc2c5.
| assert response.usage.prompt_tokens == 50 | ||
| assert response.usage.completion_tokens == 17 | ||
| assert response.usage.cached_input_tokens == 100 | ||
| assert response.usage.total_tokens == 67 |
Missing importorskip guard breaks test without anthropic
Medium Severity
The new `test_anthropic_from_native_response_extracts_cache_usage` test imports `AnthropicMessagesClient` without first calling `pytest.importorskip("anthropic")`. Every other Anthropic test in this file (lines 57, 103, 126, 156, 213, 235) uses this guard. Since `anthropic_messages_client.py` unconditionally imports from `anthropic` at the top level, this test will crash with an `ImportError` in environments where the `anthropic` package is not installed, instead of being gracefully skipped.
Triggered by project rule: BugBot Instructions
| reported_cached_tokens, bool | ||
| ): | ||
| cached_tokens = reported_cached_tokens | ||
| prompt_tokens = max(0, prompt_tokens - cached_tokens) |
OpenAI cached tokens excluded from cost calculation
Low Severity
For OpenAI-compatible clients, `prompt_tokens` is reduced by subtracting `cached_tokens`, and `total_tokens` is similarly reduced. The downstream cost calculation in `compute_cost_usd` uses `input_tokens` (derived from `prompt_tokens`) but never accounts for `cached_input_tokens`. This causes cost estimates to silently drop all cached token charges when a provider reports cache hits through an OpenAI-compatible interface.
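One way to address the cost finding above is to price cache hits as their own line item. The sketch below is an illustration only: `estimate_cost_usd`, its parameters, and the per-million-token prices are all hypothetical and do not reflect the real signature of `compute_cost_usd` or any provider's actual pricing.

```python
def estimate_cost_usd(
    input_tokens: int,
    cached_input_tokens: int,
    output_tokens: int,
    *,
    input_price_per_mtok: float,
    cached_input_price_per_mtok: float,
    output_price_per_mtok: float,
) -> float:
    """Hypothetical cost estimate that charges cache hits at their own rate."""
    total = (
        input_tokens * input_price_per_mtok
        + cached_input_tokens * cached_input_price_per_mtok
        + output_tokens * output_price_per_mtok
    )
    return total / 1_000_000


# Made-up prices (USD per million tokens), used only to show the gap.
prices = dict(
    input_price_per_mtok=3.0,
    cached_input_price_per_mtok=0.3,
    output_price_per_mtok=15.0,
)
with_cache = estimate_cost_usd(200, 1000, 50, **prices)
# Dropping the cached term reproduces the under-count described in the finding.
without_cache = estimate_cost_usd(200, 0, 50, **prices)
print(with_cache, without_cache)
```

The difference between the two estimates is exactly the cached-token charge, which is the amount that goes missing when `input_tokens` excludes cache hits and nothing else prices them.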


Summary
- Inject `cache_control={"type":"ephemeral"}` only for official Anthropic Messages endpoints.
- Respect caller-set `cache_control` and leave OpenAI/OpenRouter request behavior unchanged.
- `input_tokens` remains non-cache-hit prompt tokens where providers report cache hits.

Testing
- `uv run ruff check verifiers/clients/client.py verifiers/utils/prompt_cache_utils.py verifiers/clients/anthropic_messages_client.py verifiers/clients/openai_chat_completions_client.py verifiers/clients/openai_responses_client.py verifiers/utils/usage_utils.py verifiers/utils/save_utils.py verifiers/utils/eval_utils.py verifiers/utils/eval_display.py verifiers/utils/interception_utils.py verifiers/utils/metric_utils.py tests/test_prompt_cache_utils.py tests/test_client_multimodal_types.py`
- `uv run pytest tests/test_prompt_cache_utils.py tests/test_client_multimodal_types.py::test_anthropic_from_native_response_extracts_cache_usage -q`
- `uv run pytest tests/test_openai_responses_client.py tests/test_openai_chat_completions_token_client.py -q`
- Used `--no-verify` only because `uv run` rewrites `uv.lock` locally.

Note
Add Anthropic prompt-cache hints and cache-hit token metrics across all clients
- Adds a `cached_input_tokens` field to `Usage` and `TokenUsage` types, tracked and surfaced across all client implementations (Anthropic, OpenAI Chat, OpenAI Responses).
- Adds `prompt_cache_utils.py`, which automatically injects Anthropic `cache_control` ephemeral hints into requests when targeting the official Anthropic Messages API.
- Adjusts `prompt_tokens` and `total_tokens` in parsed usage responses to exclude cached tokens, with `cached_input_tokens` reported separately.
- Propagates `cached_input_tokens` through `StateUsageTracker`, save utilities, eval display, and the new `CachedInputTokensMetric` so cached token counts appear in metrics, rollout outputs, and console summaries.
- `prompt_tokens` and `total_tokens` returned by `from_native_response` now exclude cached tokens when cache details are present in the API response.

Macroscope summarized badc2c5.
Note
Medium Risk
Medium risk because it changes request kwargs for official Anthropic Messages calls and adjusts how token usage is computed and aggregated (including subtracting cached tokens), which can affect cost and metrics reporting.
Overview
Adds a provider-specific prompt-caching default: `Client.get_response()` now injects `cache_control={"type":"ephemeral"}` for official Anthropic Messages endpoints unless the caller already set `cache_control` in `sampling_args` (via new `apply_prompt_cache_to_kwargs`).

Extends usage accounting with `cached_input_tokens` end-to-end: Anthropic responses now capture `cache_read_input_tokens` (and fold `cache_creation_input_tokens` into `prompt_tokens`), OpenAI Chat/Responses parse `cached_tokens` details and subtract them from reported `prompt_tokens`/`total_tokens`, and the new field is propagated through state tracking, saved outputs/metadata, eval display/printing, interception utilities, and a new `CachedInputTokensMetric`.

Adds focused tests for the Anthropic cache-control injection behavior and Anthropic cached-usage parsing.
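The Anthropic mapping described in the overview can be sketched as follows. The usage field names (`input_tokens`, `output_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`) are the ones Anthropic reports; the function name `parse_anthropic_usage`, the dict-based input, and the 40/10 split between fresh and cache-creation tokens are assumptions, picked so the result lines up with the values asserted in the test diff above (50, 17, 100, 67).

```python
from typing import Any


def parse_anthropic_usage(usage: dict[str, Any]) -> dict[str, int]:
    """Hypothetical mapping from Anthropic usage fields to the PR's token fields."""
    input_tokens = usage.get("input_tokens") or 0
    cache_creation = usage.get("cache_creation_input_tokens") or 0
    cache_read = usage.get("cache_read_input_tokens") or 0
    output_tokens = usage.get("output_tokens") or 0

    # Cache-creation tokens are folded into prompt_tokens; cache reads are
    # reported separately and excluded from the total.
    prompt_tokens = input_tokens + cache_creation
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": output_tokens,
        "cached_input_tokens": cache_read,
        "total_tokens": prompt_tokens + output_tokens,
    }


parsed = parse_anthropic_usage(
    {
        "input_tokens": 40,  # assumed split, see lead-in
        "cache_creation_input_tokens": 10,  # assumed split, see lead-in
        "cache_read_input_tokens": 100,
        "output_tokens": 17,
    }
)
print(parsed)
```

Keeping cache reads out of `total_tokens` is what makes the total equal `prompt_tokens + completion_tokens` in this sketch, matching the arithmetic implied by the asserted values.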