Releases: TIGER-AI-Lab/ClawBench
v0.5.0
What's Changed
- ci: add 3 sec delay before download from testpypi for it to refresh by @Perry2004 in #207
- feat: display token usage in USD by @Perry2004 in #208
- refactor: structural multi-harness by @Perry2004 in #210
- fix: make token cost calculation available to all harnesses #211 by @Perry2004 in #212
- fix: tui highlight not moving with selection #213 by @Perry2004 in #214
- build: v0.5.0 minor release by @Perry2004 in #215
Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md
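For readers unfamiliar with the token-cost display added in #208 and #212: converting token usage to USD generally means multiplying each token count by a per-million-token price. The sketch below is not taken from the ClawBench source. `Pricing`, `usage_cost_usd`, and the prices shown are hypothetical placeholders chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical per-million-token prices in USD (not real model prices)."""
    input_per_million: float
    output_per_million: float


def usage_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> float:
    """Convert input/output token counts into a USD cost."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = (
        input_tokens * pricing.input_per_million
        + output_tokens * pricing.output_per_million
    )
    return total / 1_000_000


if __name__ == "__main__":
    # Placeholder prices: $2 per million input tokens, $8 per million output tokens.
    example = Pricing(input_per_million=2.0, output_per_million=8.0)
    print(f"${usage_cost_usd(500_000, 250_000, example):.4f}")
```

Keeping the price table separate from the calculation is one way a single function could serve every harness, which is the kind of sharing #212 describes.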
v0.4.1
What's Changed
- Fix/201 task mismatch by @Perry2004 in #205
- build: v0.4.1 release by @Perry2004 in #206
Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md
v0.4.0
What's Changed
- docs: sync BibTeX key + primaryClass with arXiv listing (READMEs) by @reacher-z in #195
- chore: corpus-aware issue + PR templates (default v2) by @reacher-z in #198
- docs: V1 → V2 comparison table (7 axes, sourced from 6 sub-agent audits) by @reacher-z in #197
- feat: inline LLM judge + V2 leaderboard 6-tab toggle + reproduce-leaderboard recipe by @reacher-z in #196
- Cleanup repo root #202 by @Perry2004 in #203
- build: v0.4.0 release by @Perry2004 in #204
Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md
ClawBench v0.1.0
ClawBench v1.0.0
Initial public release of ClawBench -- a benchmark for evaluating AI agents on 153 everyday online tasks across 144 live production websites.
Highlights
- 153 tasks spanning 15 life categories (daily life, travel, education, job search, etc.)
- 144 live websites -- real production sites, not sandboxed clones
- Isolated Docker containers with Chromium for each run
- Request interceptor that blocks the final irreversible action to prevent real-world side effects
- Five-layer recording: MP4 replay, screenshots, HTTP traffic, DOM actions, agent messages
- Interactive TUI for model selection, test case picking, and run management
- 6 frontier models evaluated: Claude Sonnet 4.6 (33.3%), GLM-5 (24.2%), Gemini 3 Flash (19.0%), Claude Haiku 4.5 (18.3%), GPT-5.4 (6.5%), Gemini 3.1 Flash Lite (3.3%)
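The request interceptor in the highlights above can be pictured as a predicate applied to every outgoing request: if the request matches the task's final, irreversible action, it is blocked instead of sent. The following is a minimal sketch of that decision only, under the assumption that the final action is identified by an HTTP method and a URL pattern. `BlockRule`, `should_block`, and the example URLs are hypothetical and do not reflect the actual ClawBench implementation.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class BlockRule:
    """Hypothetical description of a task's final, irreversible request."""
    method: str
    url_pattern: str  # shell-style glob; note that '?' and '[' are wildcards here


def should_block(method: str, url: str, rule: BlockRule) -> bool:
    """Return True when a request matches the rule and must not reach the site."""
    same_method = method.upper() == rule.method.upper()
    return same_method and fnmatchcase(url, rule.url_pattern)


if __name__ == "__main__":
    rule = BlockRule(method="POST", url_pattern="https://shop.example/checkout/*")
    for method, url in [
        ("GET", "https://shop.example/checkout/review"),
        ("POST", "https://shop.example/cart/add"),
        ("POST", "https://shop.example/checkout/submit"),
    ]:
        action = "BLOCK" if should_block(method, url, rule) else "allow"
        print(f"{action:5s} {method} {url}")
```

In a real browser harness, a predicate like this would be called from the automation layer's request hook, so that every step before the final submission still runs against the live site.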
Links
- Paper: https://arxiv.org/abs/2604.08523
- Project Page: https://claw-bench.com
v0.3.3
What's Changed
- feat: port some claw-eval tasks #72 by @Perry2004 in #168
- Feat/172 more run meta by @Perry2004 in #173
- docs: add Contact section to README (en + zh) by @reacher-z in #164
- Test/interface sanity test by @Perry2004 in #174
- ci: run tests by @Perry2004 in #175
- test: tui tests with mock questionary by @Perry2004 in #176
- Fix/59 sysconf on windows by @Perry2004 in #177
- build: 0.3.3 patch release by @Perry2004 in #178
Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md
v0.3.2
What's Changed
- fix: remove dangling log files by @Perry2004 in #165
- chore: shell format fix by @Perry2004 in #166
- build: v0.3.2 release by @Perry2004 in #167
Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md
v0.3.1
What's Changed
- docs(news): V2 leaderboard ships + V2Trace dataset announce by @reacher-z in #149
- docs: add 'What are you looking for?' audience-targeted entry grid by @reacher-z in #144
- docs: scoring logic — 2-stage rubric, judge prompt, reproducibility by @reacher-z in #148
- Docs/add logo by @Perry2004 in #151
- fix: README scoring.md links → eval/scoring.md by @reacher-z in #150
- docs: update readme FAQ by @Perry2004 in #156
- docs(news): announce TIGER-Lab/ClawBench canonical Space + collection by @reacher-z in #155
- fix: remove ASPCA-related tasks by @Perry2004 in #161
- docs: v0.3.1 patch changelog by @Perry2004 in #162
- build: v0.3.1 patch release by @Perry2004 in #163
Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md
v0.3.0
What's Changed
- docs: update CD and PyPI news by @Perry2004 in #138
- feat: inline LLM judge as second scoring stage by @reacher-z in #139
- feat: add --resume flag to batch runner to skip completed jobs by @erenup in #140
- style: ruff format judge.py and run.py (unblocks main + all open PRs) by @reacher-z in #141
- docs: announce ClawBenchV1Trace dataset + add Datasets section by @reacher-z in #142
- Feat/126 support pi coding agent by @Perry2004 in #143
Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md
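The `--resume` flag from #140 skips jobs that have already completed. One common way to get that behavior, shown here purely as an illustration, is to check for a per-job completion marker before scheduling work. The directory layout, the `result.json` marker name, and the `pending_jobs` helper are all assumptions made for this sketch, not details of the ClawBench batch runner.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def pending_jobs(
    job_ids: Iterable[str],
    output_dir: Path,
    marker: str = "result.json",  # hypothetical completion marker
) -> list[str]:
    """Return the job IDs that have no completion marker under output_dir."""
    return [
        job_id
        for job_id in job_ids
        if not (output_dir / job_id / marker).is_file()
    ]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        # Simulate one finished job and one job that crashed before writing results.
        (out / "task-001").mkdir()
        (out / "task-001" / "result.json").write_text("{}")
        (out / "task-002").mkdir()
        print(pending_jobs(["task-001", "task-002", "task-003"], out))
```

Checking for a file written only at the very end of a job means a run that crashed midway is retried rather than silently skipped.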
v0.2.2
What's Changed
- chore: migrate package to clawbench-eval by @Perry2004 in #137
Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md
v0.2.1
What's Changed
- docs: post v0.2.0 release doc by @Perry2004 in #134
- fix: include static assets into the build artifact so markdowns rende… by @Perry2004 in #136
Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md