Releases: TIGER-AI-Lab/ClawBench

v0.5.0

24 May 17:55
1e913d3

What's Changed

Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md

v0.4.1

23 May 19:53
67b90b4

What's Changed

Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md

v0.4.0

23 May 02:36
8e19b25

What's Changed

  • docs: sync BibTeX key + primaryClass with arXiv listing (READMEs) by @reacher-z in #195
  • chore: corpus-aware issue + PR templates (default v2) by @reacher-z in #198
  • docs: V1 → V2 comparison table (7 axes, sourced from 6 sub-agent audits) by @reacher-z in #197
  • feat: inline LLM judge + V2 leaderboard 6-tab toggle + reproduce-leaderboard recipe by @reacher-z in #196
  • Cleanup repo root #202 by @Perry2004 in #203
  • build: v0.4.0 release by @Perry2004 in #204

Full Changelog: https://github.com/TIGER-AI-Lab/ClawBench/blob/main/CHANGELOG.md

ClawBench v0.1.0

11 Apr 22:31
8e19b25

ClawBench v1.0.0

Initial public release of ClawBench -- a benchmark for evaluating AI agents on 153 everyday online tasks across 144 live production websites.

Highlights

  • 153 tasks spanning 15 life categories (daily life, travel, education, job search, etc.)
  • 144 live websites -- real production sites, not sandboxed clones
  • Isolated Docker containers with Chromium for each run
  • Request interceptor that blocks the final irreversible action to prevent real-world side effects
  • Five-layer recording: MP4 replay, screenshots, HTTP traffic, DOM actions, agent messages
  • Interactive TUI for model selection, test case picking, and run management
  • 6 frontier models evaluated: Claude Sonnet 4.6 (33.3%), GLM-5 (24.2%), Gemini 3 Flash (19.0%), Claude Haiku 4.5 (18.3%), GPT-5.4 (6.5%), Gemini 3.1 Flash Lite (3.3%)
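The reported success rates can be read back as approximate task counts. The sketch below assumes each percentage is measured over the full 153-task set, which the notes above do not state explicitly; the names `PASS_RATES`, `TOTAL_TASKS`, and `approx_solved` are illustrative and not part of ClawBench.

```python
# Success rates (percent) as listed in the Highlights above.
PASS_RATES = {
    "Claude Sonnet 4.6": 33.3,
    "GLM-5": 24.2,
    "Gemini 3 Flash": 19.0,
    "Claude Haiku 4.5": 18.3,
    "GPT-5.4": 6.5,
    "Gemini 3.1 Flash Lite": 3.3,
}

# Assumption: every rate is computed over all 153 tasks.
TOTAL_TASKS = 153


def approx_solved(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Convert a percentage into the nearest whole number of tasks."""
    return round(rate_percent / 100 * total)


counts = {model: approx_solved(rate) for model, rate in PASS_RATES.items()}

for model, solved in counts.items():
    print(f"{model}: ~{solved}/{TOTAL_TASKS} tasks")
```

Under that assumption, the top score of 33.3% corresponds to roughly 51 of 153 tasks and the lowest, 3.3%, to roughly 5.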

v0.3.3

20 May 01:55
d003fd4

What's Changed

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

v0.3.2

16 May 04:31
c991b97

What's Changed

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

v0.3.1

14 May 01:46
bc29f26

What's Changed

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

v0.3.0

10 May 00:56
13c1f26

What's Changed

  • docs: update CD and PiPI news by @Perry2004 in #138
  • feat: inline LLM judge as second scoring stage by @reacher-z in #139
  • feat: add --resume flag to batch runner to skip completed jobs by @erenup in #140
  • style: ruff format judge.py and run.py (unblocks main + all open PRs) by @reacher-z in #141
  • docs: announce ClawBenchV1Trace dataset + add Datasets section by @reacher-z in #142
  • Feat/126 support pi coding agent by @Perry2004 in #143

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

v0.2.2

09 May 01:18
c887bf6

What's Changed

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md

v0.2.1

09 May 00:37
1c1b58d

What's Changed

  • docs: post v0.2.0 release doc by @Perry2004 in #134
  • fix: include static assets into the build artifact so markdowns rende… by @Perry2004 in #136

Full Changelog: https://github.com/reacher-z/ClawBench/blob/main/CHANGELOG.md