Releases · TIGER-AI-Lab/ClawBench

24 May 17:55

Perry2004

v0.5.0

1e913d3

v0.5.0 Latest

What's Changed

ci: add 3 sec delay before download from testpypi for it to refresh by @Perry2004 in #207
feat: display token usage in USD by @Perry2004 in #208
refactor: structural multi-harness by @Perry2004 in #210
fix: make token cost calculation available to all harnesses #211 by @Perry2004 in #212
fix: tui highlight not moving with selection #213 by @Perry2004 in #214
build: v0.5.0 minor release by @Perry2004 in #215
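The token-cost entries above (#208, #212) do not say how the USD figure is derived. A minimal sketch of one common approach follows, assuming per-million-token prices for input and output. The `Pricing` class, the `usage_cost_usd` function, and the prices are illustrative assumptions, not ClawBench's actual API or any provider's rates.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Pricing:
    """Hypothetical price table entry, in USD per one million tokens."""
    input_per_million: Decimal
    output_per_million: Decimal


def usage_cost_usd(input_tokens: int, output_tokens: int, pricing: Pricing) -> Decimal:
    """Convert token counts into a USD cost using per-million-token prices."""
    million = Decimal(1_000_000)
    total = (
        Decimal(input_tokens) * pricing.input_per_million
        + Decimal(output_tokens) * pricing.output_per_million
    )
    return total / million


# Placeholder prices chosen only for the example; not real provider rates.
example_pricing = Pricing(Decimal("3.00"), Decimal("15.00"))
cost = usage_cost_usd(120_000, 8_000, example_pricing)
print(f"${cost:.4f}")  # prints "$0.4800"
```

`Decimal` is used instead of `float` so that small per-run costs add up without binary rounding drift when summed across many runs.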

Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md

Contributors

Perry2004

23 May 19:53

Perry2004

v0.4.1

67b90b4

v0.4.1

What's Changed

Fix/201 task mismatch by @Perry2004 in #205
build: v0.4.1 release by @Perry2004 in #206

Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md

Contributors

Perry2004

23 May 02:36

Perry2004

v0.4.0

8e19b25

v0.4.0

What's Changed

docs: sync BibTeX key + primaryClass with arXiv listing (READMEs) by @reacher-z in #195
chore: corpus-aware issue + PR templates (default v2) by @reacher-z in #198
docs: V1 → V2 comparison table (7 axes, sourced from 6 sub-agent audits) by @reacher-z in #197
feat: inline LLM judge + V2 leaderboard 6-tab toggle + reproduce-leaderboard recipe by @reacher-z in #196
Cleanup repo root #202 by @Perry2004 in #203
build: v0.4.0 release by @Perry2004 in #204

Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md

Contributors

reacher-z and Perry2004

11 Apr 22:31

reacher-z

v0.1.0

8e19b25

ClawBench v0.1.0

ClawBench v1.0.0

Initial public release of ClawBench -- a benchmark for evaluating AI agents on 153 everyday online tasks across 144 live production websites.

Highlights

153 tasks spanning 15 life categories (daily life, travel, education, job search, etc.)
144 live websites -- real production sites, not sandboxed clones
Isolated Docker containers with Chromium for each run
Request interceptor that blocks the final irreversible action to prevent real-world side effects
Five-layer recording: MP4 replay, screenshots, HTTP traffic, DOM actions, agent messages
Interactive TUI for model selection, test case picking, and run management
6 frontier models evaluated: Claude Sonnet 4.6 (33.3%), GLM-5 (24.2%), Gemini 3 Flash (19.0%), Claude Haiku 4.5 (18.3%), GPT-5.4 (6.5%), Gemini 3.1 Flash Lite (3.3%)
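The request interceptor in the highlights is described only by its effect: the final irreversible action is blocked. As a rough illustration of the idea, the sketch below matches outgoing requests against a list of rules and reports whether a request should be stopped. The `BlockRule` type, the `should_block` function, and the example site are hypothetical and are not taken from ClawBench's implementation.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


@dataclass(frozen=True)
class BlockRule:
    """Hypothetical rule: block a request when method, host and path all match."""
    method: str        # HTTP method, e.g. "POST"
    host: str          # exact hostname, e.g. "shop.example.com"
    path_pattern: str  # shell-style glob applied to the URL path


def should_block(method: str, url: str, rules: list[BlockRule]) -> bool:
    """Return True if the request matches any rule and must not be sent."""
    parts = urlsplit(url)
    for rule in rules:
        if (
            method.upper() == rule.method.upper()
            and parts.hostname == rule.host
            and fnmatchcase(parts.path, rule.path_pattern)
        ):
            return True
    return False


# Illustrative rule for a made-up site: stop the order confirmation only.
rules = [BlockRule("POST", "shop.example.com", "/checkout/*/confirm")]

print(should_block("GET", "https://shop.example.com/cart", rules))
print(should_block("POST", "https://shop.example.com/checkout/42/confirm", rules))
```

Under this design the agent can browse, fill forms, and reach the last step on a live site, while the single request that would commit the action is intercepted.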

v0.3.3

What's Changed

feat: port some claw-eval tasks #72 by @Perry2004 in #168
Feat/172 more run meta by @Perry2004 in #173
docs: add Contact section to README (en + zh) by @reacher-z in #164
Test/interface sanity test by @Perry2004 in #174
ci: run tests by @Perry2004 in #175
test: tui tests with mock questionary by @Perry2004 in #176
Fix/59 sysconf on windows by @Perry2004 in #177
build: 0.3.3 patch release by @Perry2004 in #178

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

Contributors

reacher-z and Perry2004

16 May 04:31

Perry2004

v0.3.2

c991b97

v0.3.2

What's Changed

fix: remove dangling log files by @Perry2004 in #165
chore: shell format fix by @Perry2004 in #166
build: v0.3.2 release by @Perry2004 in #167

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

Contributors

Perry2004

14 May 01:46

Perry2004

v0.3.1

bc29f26

v0.3.1

What's Changed

docs(news): V2 leaderboard ships + V2Trace dataset announce by @reacher-z in #149
docs: add 'What are you looking for?' audience-targeted entry grid by @reacher-z in #144
docs: scoring logic — 2-stage rubric, judge prompt, reproducibility by @reacher-z in #148
Docs/add logo by @Perry2004 in #151
fix: README scoring.md links → eval/scoring.md by @reacher-z in #150
docs: update readme FAQ by @Perry2004 in #156
docs(news): announce TIGER-Lab/ClawBench canonical Space + collection by @reacher-z in #155
fix: remove ASPCA-related tasks by @Perry2004 in #161
docs: v0.3.1 patch changelog by @Perry2004 in #162
build: v0.3.1 patch release by @Perry2004 in #163

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

Contributors

reacher-z and Perry2004

10 May 00:56

Perry2004

v0.3.0

13c1f26

v0.3.0

What's Changed

docs: update CD and PiPI news by @Perry2004 in #138
feat: inline LLM judge as second scoring stage by @reacher-z in #139
feat: add --resume flag to batch runner to skip completed jobs by @erenup in #140
style: ruff format judge.py and run.py (unblocks main + all open PRs) by @reacher-z in #141
docs: announce ClawBenchV1Trace dataset + add Datasets section by @reacher-z in #142
Feat/126 support pi coding agent by @Perry2004 in #143
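The `--resume` entry above (#140) says the batch runner skips completed jobs but not how completion is detected. One simple way such a flag could work is sketched below, assuming each finished job leaves a marker file in its output directory. Only the flag name comes from the release notes; `DONE_MARKER`, `pending_jobs`, `build_parser`, and the directory layout are assumptions made for illustration.

```python
import argparse
from pathlib import Path

# Hypothetical marker file written when a job finishes successfully.
DONE_MARKER = "done.marker"


def pending_jobs(job_ids: list[str], output_dir: Path, resume: bool) -> list[str]:
    """Return the jobs to run, skipping finished ones when resume is set."""
    if not resume:
        return list(job_ids)
    return [
        job_id
        for job_id in job_ids
        if not (output_dir / job_id / DONE_MARKER).exists()
    ]


def build_parser() -> argparse.ArgumentParser:
    """Build a minimal CLI parser exposing only the --resume switch."""
    parser = argparse.ArgumentParser(description="Hypothetical batch runner")
    parser.add_argument(
        "--resume",
        action="store_true",
        help="skip jobs whose output directory already has a completion marker",
    )
    return parser


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--resume"])
print(pending_jobs(["job-a", "job-b"], Path("runs"), resume=args.resume))
```

Checking for a marker file rather than for the output directory itself avoids treating a job that crashed halfway as complete.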

New Contributors

@erenup made their first contribution in #140

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

Contributors

erenup, reacher-z, and Perry2004

09 May 01:18

Perry2004

v0.2.2

c887bf6

v0.2.2

What's Changed

chore: migrate package to clawbench-eval by @Perry2004 in #137

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

Contributors

Perry2004

09 May 00:37

Perry2004

v0.2.1

1c1b58d

v0.2.1

What's Changed

docs: post v0.2.0 release doc by @Perry2004 in #134
fix: include static assets into the build artifact so markdowns rende… by @Perry2004 in #136

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

Contributors

Perry2004
