Leaderboard scoring bias: runs with fewer tasks rank higher than comprehensive runs #53

@JoeProAI

Description

Summary

The leaderboard ranks agents by mean score across completed tasks, but does not account for the number of tasks attempted. This creates a systematic bias where agents running fewer (often easier, automated-only) tasks can outrank agents that complete the full 23-task suite.

Example

Consider an agent (Agent C) that completes the full 23-task suite: it demonstrably has broader capability, yet ranks lowest. The creative/judge-scored tasks are inherently harder to score perfectly on, so agents that attempt them are penalized relative to agents that skip them.

Suggested Fixes

  1. Normalize by task count — weight the score by tasks_completed / total_tasks so comprehensive runs are rewarded
  2. Require minimum task count for ranking — e.g., only rank runs with ≥N tasks (or suite=all)
  3. Separate leaderboards — group by suite or task count range so comparisons are apples-to-apples
  4. Display task count prominently — at minimum, show how many tasks each run completed so viewers can judge for themselves
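To make fixes 1 and 2 concrete, here is a minimal sketch of coverage-weighted ranking. The agent names and scores are hypothetical illustrations (the issue doesn't include real leaderboard data), and `tasks_completed` / `total_tasks` are the field names suggested above, not an actual PinchBench API:

```python
# Hypothetical sketch of fixes #1 and #2. All run data below is
# illustrative; only the 23-task suite size comes from the issue.

TOTAL_TASKS = 23            # full PinchBench suite size
MIN_TASKS_FOR_RANKING = 23  # fix #2: e.g. only rank full-suite runs

def normalized_score(mean_score, tasks_completed, total_tasks=TOTAL_TASKS):
    """Fix #1: weight the mean score by task coverage."""
    return mean_score * (tasks_completed / total_tasks)

runs = [
    {"agent": "A", "mean_score": 0.95, "tasks_completed": 8},
    {"agent": "B", "mean_score": 0.90, "tasks_completed": 15},
    {"agent": "C", "mean_score": 0.80, "tasks_completed": 23},
]

# Raw mean ranking rewards the partial run: A > B > C.
raw = sorted(runs, key=lambda r: r["mean_score"], reverse=True)

# Coverage-weighted ranking flips the order in favour of the
# comprehensive run: C > B > A.
weighted = sorted(
    runs,
    key=lambda r: normalized_score(r["mean_score"], r["tasks_completed"]),
    reverse=True,
)

# Fix #2 as a filter instead of a weight:
rankable = [r for r in runs if r["tasks_completed"] >= MIN_TASKS_FOR_RANKING]

print([r["agent"] for r in raw])       # ['A', 'B', 'C']
print([r["agent"] for r in weighted])  # ['C', 'B', 'A']
print([r["agent"] for r in rankable])  # ['C']
```

With the illustrative numbers, the weighted scores are roughly 0.33, 0.59, and 0.80 for A, B, and C, so the full-suite run wins despite its lower raw mean. The two fixes compose: weight by coverage for display, and additionally require a minimum task count before a run appears in the ranked list at all.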

Why This Matters

PinchBench is a great benchmark, but if the leaderboard incentivizes cherry-picking easy tasks, it undermines the signal. Agents should be rewarded for attempting the full suite, not penalized for it.

Thanks for building this — just want to help make the rankings more meaningful.
