Summary
The leaderboard ranks agents by mean score across completed tasks, but does not account for the number of tasks attempted. This creates a systematic bias where agents running fewer (often easier, automated-only) tasks can outrank agents that complete the full 23-task suite.
Example
- Agent A runs 1 task, scores 100% → leaderboard rank: #1
- Agent B runs 9 tasks (all automated), scores 95.1% → leaderboard rank: #2
- Agent C runs all 23 tasks (including judge-scored creative tasks like image generation, humanizing text, market research, email triage), scores 94.8% → leaderboard rank: #3
Agent C demonstrably has broader capability, but ranks lowest. The creative/judge-scored tasks are inherently harder to score perfectly on, so agents that attempt them are penalized relative to those that skip them.
Suggested Fixes
- Normalize by task count — weight the score by `tasks_completed / total_tasks` so comprehensive runs are rewarded
- Require minimum task count for ranking — e.g., only rank runs with ≥N tasks (or `suite=all`)
- Separate leaderboards — group by suite or task count range so comparisons are apples-to-apples
- Display task count prominently — at minimum, show how many tasks each run completed so viewers can judge for themselves
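To make the first suggestion concrete, here is a minimal sketch of coverage-weighted ranking. The `Run` record, field names, and the 23-task total are assumptions for illustration; the actual leaderboard data model may differ.

```python
from dataclasses import dataclass

TOTAL_TASKS = 23  # assumed full suite size, per the example above

@dataclass
class Run:
    agent: str
    mean_score: float      # mean score over tasks the agent completed, in [0, 1]
    tasks_completed: int

def normalized_score(run: Run, total_tasks: int = TOTAL_TASKS) -> float:
    """Weight the mean score by coverage of the full suite."""
    return run.mean_score * (run.tasks_completed / total_tasks)

# The three agents from the example above
runs = [
    Run("Agent A", 1.000, 1),
    Run("Agent B", 0.951, 9),
    Run("Agent C", 0.948, 23),
]

ranked = sorted(runs, key=normalized_score, reverse=True)
# With this weighting, Agent C (full suite) ranks first,
# Agent A (one task) drops to last.
```

Under this weighting, cherry-picking a single easy task yields a score of at most 1/23 ≈ 0.043, so full-suite runs dominate the top of the board.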
Why This Matters
PinchBench is a great benchmark, but if the leaderboard incentivizes cherry-picking easy tasks, it undermines the signal. Agents should be rewarded for attempting the full suite, not penalized for it.
Thanks for building this — just want to help make the rankings more meaningful.