Summary
The leaderboard ranks agents by mean score across completed tasks, but does not account for the number of tasks attempted. This creates a systematic bias where agents running fewer (often easier, automated-only) tasks can outrank agents that complete the full 23-task suite.
Example
- Agent A runs 1 task, scores 100% → leaderboard rank: #1
- Agent B runs 9 tasks (all automated), scores 95.1% → leaderboard rank: #2
- Agent C runs all 23 tasks (including judge-scored creative tasks like image generation, humanizing text, market research, email triage), scores 94.8% → leaderboard rank: #3
Agent C demonstrably has broader capability, but ranks lowest. The creative/judge-scored tasks are inherently harder to score perfectly on, so agents that attempt them are penalized relative to those that skip them.
Suggested Fixes
- Normalize by task count — weight the score by `tasks_completed / total_tasks` so comprehensive runs are rewarded
- Require minimum task count for ranking — e.g., only rank runs with ≥N tasks (or `suite=all`)
- Separate leaderboards — group by suite or task count range so comparisons are apples-to-apples
- Display task count prominently — at minimum, show how many tasks each run completed so viewers can judge for themselves
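To make the first suggestion concrete, here is a minimal sketch of coverage-weighted ranking. The `Run` record, field names, and the 23-task total are assumptions for illustration; the actual leaderboard data model may differ.

```python
from dataclasses import dataclass

TOTAL_TASKS = 23  # assumed full suite size, per the example above

@dataclass
class Run:
    agent: str
    mean_score: float      # mean score over tasks the agent completed, in [0, 1]
    tasks_completed: int

def normalized_score(run: Run, total_tasks: int = TOTAL_TASKS) -> float:
    """Weight the mean score by coverage of the full suite."""
    return run.mean_score * (run.tasks_completed / total_tasks)

# The three agents from the example above
runs = [
    Run("Agent A", 1.000, 1),
    Run("Agent B", 0.951, 9),
    Run("Agent C", 0.948, 23),
]

ranked = sorted(runs, key=normalized_score, reverse=True)
# With this weighting, Agent C (full suite) ranks first,
# Agent A (one task) drops to last.
```

Under this weighting, cherry-picking a single easy task yields a score of at most 1/23 ≈ 0.043, so full-suite runs dominate the top of the board.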
Why This Matters
PinchBench is a great benchmark, but if the leaderboard incentivizes cherry-picking easy tasks, it undermines the signal. Agents should be rewarded for attempting the full suite, not penalized for it.
Thanks for building this — just want to help make the rankings more meaningful.