Update: ClawBench entry — V2 corpus (283 tasks, 163 platforms) #249

Open
reacher-z wants to merge 1 commit into Jenqyang:main from reacher-z:update-clawbench-v2

@reacher-z
Contributor

ClawBench is already listed but with V1-only numbers. This PR refreshes
the entry to reflect the V2 release that shipped in May 2026.

What changed since the entry was added:

  • Task count: 153 → 283 (V1 153 + V2 130)
  • Platforms: 144 → 163
  • Scoring: was just request interception; now two-stage (final
    HTTP-request interception + LLM judge) so "agent reached the endpoint
    but submitted the wrong thing" no longer counts as a pass.
  • Public leaderboard at https://claw-bench.com with 6 tabs by corpus ×
    harness.
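
To illustrate the two-stage scoring described above, here is a minimal Python sketch. It is not ClawBench's actual implementation: `InterceptedRequest`, `score_task`, `expected_url`, and the `judge` callable are all hypothetical names, and the judge is modeled as a plain function so the sketch runs without any LLM client.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class InterceptedRequest:
    """Final HTTP request captured from the agent (hypothetical shape)."""
    method: str
    url: str
    body: str


def score_task(
    request: Optional[InterceptedRequest],
    expected_url: str,
    judge: Callable[[InterceptedRequest], bool],
) -> bool:
    """Return True only if both scoring stages pass."""
    # Stage 1: the agent must have sent a final request to the expected endpoint.
    if request is None or request.url != expected_url:
        return False
    # Stage 2: a judge (an LLM in the real benchmark, a stub here) checks
    # whether the submitted content is actually correct.
    return judge(request)
```

With a stub judge such as `lambda r: "qty=2" in r.body`, a request that reaches the right endpoint with the wrong body fails at stage 2, which is the case the older interception-only scoring would have counted as a pass.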

No other entries touched.

@Jenqyang
Owner

Decision: Content looks acceptable.

Reason:

  1. The update itself is within scope and the revised benchmark description looks reviewable.
  2. The current blocker is only that this PR no longer merges cleanly after recent README changes on main.

Next step: Please sync/rebase the latest main and keep the entry wording neutral, then we can merge.
