Problem
At least one task in PinchBench fails about 95% of the time on roughly 90% of the models tested. A task that almost no model passes skews aggregate results and wastes compute.
Action Needed
Identify which task this is and either:
- Fix the underlying issue (if it's a test bug)
- Remove it from the benchmark suite (if it's genuinely too hard)
Notes
Look for tasks whose pass rates are consistently low across diverse model types, not only among small/cheap models.
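
One way to run this search is sketched below, assuming the run results can be exported as (model, task, passed) records. That record layout and the function name `find_broken_tasks` are hypothetical and not part of PinchBench; the default thresholds are simply the 95% and 90% figures from the problem statement.

```python
from collections import defaultdict


def find_broken_tasks(results, fail_threshold=0.95, model_share=0.90):
    """Return tasks that fail on nearly every model.

    `results` is an iterable of (model, task, passed) tuples. This layout
    is an assumption for illustration, not PinchBench's actual format.
    A task is flagged when at least `model_share` of the models that ran
    it have a failure rate of `fail_threshold` or higher.
    """
    # (task, model) -> [failure count, total runs]
    counts = defaultdict(lambda: [0, 0])
    for model, task, passed in results:
        entry = counts[(task, model)]
        entry[0] += 0 if passed else 1
        entry[1] += 1

    # task -> list of per-model failure rates
    rates = defaultdict(list)
    for (task, _model), (failures, total) in counts.items():
        rates[task].append(failures / total)

    flagged = []
    for task, per_model in rates.items():
        failing_models = sum(1 for rate in per_model if rate >= fail_threshold)
        if failing_models / len(per_model) >= model_share:
            flagged.append(task)
    return sorted(flagged)


# Made-up sample data: task_a fails on every model, task_b on half of them.
sample = (
    [(f"model_{i}", "task_a", False) for i in range(10)]
    + [(f"model_{i}", "task_b", i % 2 == 0) for i in range(10)]
)
print(find_broken_tasks(sample))  # prints ['task_a']
```

This only narrows down the candidates. Deciding whether a flagged task is a test bug or genuinely too hard still needs a manual look at the task itself. To follow the note about model diversity, the same check could be repeated on a subset that excludes small/cheap models.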