kv-exec-bench

executable-correctness eval for text generation under KV-cache compression.

the eval suites people use to validate KV compression (RULER, LongBench, Loogle, InfiniteBench, NIAH, the math sets) only ever score token/string overlap, retrieval hits, or an extracted answer. nothing checks whether the output actually runs, or whether a tool call is well-formed.

that's a blind spot for KV compression specifically. substring and boxed-answer metrics can sit flat while a compressed cache quietly breaks structured generation. one dropped token in a function body, or a malformed arguments object, fails hard but barely moves string_match. aggregate accuracy holds, functional correctness degrades.

this measures the two things those suites miss. it runs on top of NVIDIA/kvpress as a dependency (it doesn't patch it), so any press in kvpress works.

the two benchmarks

tool_call picks one tool from a long shared catalog (that catalog is the context the press compresses) and emits a single JSON call. the call is parsed and validated against the tool's JSON schema. no code runs. four metrics, lenient to strict: json_valid, name_match, schema_valid, args_exact.

code_exec completes a HumanEval function and scores unit-test pass@1. this one executes model output, so it's off unless you set KV_EXEC_BENCH_ALLOW_CODE_EXECUTION=1. execution is in a subprocess with a fresh temp cwd, a stripped env, a wall-clock timeout, and (on POSIX) CPU + address-space rlimits. eval-grade isolation, not an adversarial sandbox. run it on throwaway/CI hardware, not a box with secrets in the env.

what it catches

the tool catalog is shared across requests, so a prefill press compresses it once and answers every request from the compressed cache. that's where a dropped or garbled tool definition turns into an invalid call. a single CPU box (AMD Ryzen 5 5500U, no GPU), Qwen2.5-0.5B-Instruct, the inline tool_call set, json_valid (did it parse as a call?) vs schema_valid (is the call correct?):

press	json_valid (base→0.75)	schema_valid (base→0.75)
no press (base)	1.00	0.78
SnapKV	1.00 → 0.89	0.44 → 0.11
Knorm	1.00 → 1.00	0.89 → 0.33
ExpectedAttention	1.00 → 1.00	0.56 → 0.33

for Knorm and ExpectedAttention the parse rate never drops below 1.00 while schema validity falls by half or more: the model keeps emitting things that look like tool calls, but they are increasingly the wrong calls. a string-overlap or parse-only metric never sees that. (code_exec floors at pass@1=0 on a 0.5B model, so the executable contrast needs a capable code model on a GPU.) full grid + figure: scripts/make_figures.py results/results.csv.

paper

a short writeup of this result lives in paper/ (CPU-only study introducing the benchmark). preprint (Zenodo): 10.5281/zenodo.20805561.

install

scorers and dataset builders are light (pandas, jsonschema). the runner pulls the model stack, and kvpress pins transformers<5.3, so keep that in its own env:

pip install -e .            # scorers + tool_call dataset, enough for the tests
pip install -e ".[run]"     # adds kvpress + torch + transformers<5.3 for generation

run

# tool_call, uncompressed baseline
kv-exec-bench tool_call --model Qwen/Qwen2.5-0.5B-Instruct

# tool_call under SnapKV at compression ratio 0.5
kv-exec-bench tool_call --model Qwen/Qwen2.5-0.5B-Instruct --press SnapKV --compression-ratio 0.5

# code_exec (executes model output; opt in, throwaway hardware)
KV_EXEC_BENCH_ALLOW_CODE_EXECUTION=1 \
  kv-exec-bench code_exec --model Qwen/Qwen2.5-0.5B-Instruct --limit 20

--press takes any kvpress press name (SnapKV, StreamingLLM, Knorm, Random, ...) or none. --device cuda:0 on GPU.

from python

from kv_exec_bench import build_tool_call_df, score_tool_call
from kv_exec_bench.runner import run

df = build_tool_call_df()
df = run(df, model="Qwen/Qwen2.5-0.5B-Instruct", press_name="SnapKV", compression_ratio=0.5)
print(score_tool_call(df))

the dataset is six columns (context, question, answer_prefix, answer, task, max_new_tokens); each scorer takes the filled frame and returns a JSON-serializable dict keyed by task. swapping the inline tools for BFCL/xLAM/ToolACE, or adding MBPP next to HumanEval, is a drop-in as long as you produce those columns.

tests

pip install -e ".[dev]"
pytest

scorer tests need no model or GPU; predicted_answer is supplied synthetically.

license

Apache-2.0.

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
docs/superpowers		docs/superpowers
kv_exec_bench		kv_exec_bench
paper		paper
results		results
scripts		scripts
tests		tests
.flake8		.flake8
.gitignore		.gitignore
.zenodo.json		.zenodo.json
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

kv-exec-bench

the two benchmarks

what it catches

paper

install

run

from python

tests

license

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

kv-exec-bench

the two benchmarks

what it catches

paper

install

run

from python

tests

license

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages