This is the meta issue for PinchBench v2. It tracks everything we need to either (1) fix or (2) explicitly decide not to address before we cut a new release and feel good about calling it a major release.
Skill Changes
- Discussion: What tasks should we add next? #52
- Fix or remove task that fails 95% of the time on 90% of models #51
- Feature: Publish per-task variance and retry counts on leaderboard #8
- benchmark: wait for transcript completion instead of assuming subprocess exit means task finished #65
- Thinking Levels #9
Leaderboard Changes
- Discussion: Average vs. Best score leaderboard#42
- Redesign: Left sidebar with categorized filters, search input, and checkbox-style selections leaderboard#39
- Feature: User profiles and contributor leaderboard leaderboard#36
- Feature: Model landing pages and search leaderboard#35
- Leaderboard scoring bias: runs with fewer tasks rank higher than comprehensive runs #53
- Show overall rank of every run leaderboard#47
- Daily, weekly, monthly badges leaderboard#48
- feat: add task selection filter for benchmark runs leaderboard#22
Other