PinchBench looks great for evaluating coding agents specifically.
We built AgentBench-Live, which extends the concept to five domains: Code, Data Analysis, Research, Tool Use, and Multi-step Workflows. Each domain has its own scoring methodology (automated tests, LLM-as-judge, or a composite of the two).
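For concreteness, here's a rough sketch of how per-domain scoring like that could be wired up. This is not AgentBench-Live's actual code; the names (`DomainConfig`, `ScoringMethod`, `score_task`) and the composite weights are hypothetical placeholders:

```python
from dataclasses import dataclass
from enum import Enum

class ScoringMethod(Enum):
    AUTO_TESTS = "auto_tests"    # deterministic test-based scoring
    LLM_JUDGE = "llm_as_judge"   # rubric-based grading by a judge model
    COMPOSITE = "composite"      # weighted blend of the two

@dataclass
class DomainConfig:
    name: str
    method: ScoringMethod
    weights: dict[str, float] | None = None  # only used for COMPOSITE

# Hypothetical registry mirroring the five domains described above
DOMAINS = [
    DomainConfig("code", ScoringMethod.AUTO_TESTS),
    DomainConfig("data_analysis", ScoringMethod.COMPOSITE,
                 weights={"auto_tests": 0.6, "llm_as_judge": 0.4}),
    DomainConfig("research", ScoringMethod.LLM_JUDGE),
    DomainConfig("tool_use", ScoringMethod.AUTO_TESTS),
    DomainConfig("multi_step_workflows", ScoringMethod.COMPOSITE,
                 weights={"auto_tests": 0.5, "llm_as_judge": 0.5}),
]

def score_task(domain: DomainConfig, auto_score: float, judge_score: float) -> float:
    """Combine raw scores according to the domain's scoring method."""
    if domain.method is ScoringMethod.AUTO_TESTS:
        return auto_score
    if domain.method is ScoringMethod.LLM_JUDGE:
        return judge_score
    w = domain.weights or {"auto_tests": 0.5, "llm_as_judge": 0.5}
    return w["auto_tests"] * auto_score + w["llm_as_judge"] * judge_score
```

A structure along these lines would also make it easy to drop PinchBench in as another entry in the registry.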
A key insight from our benchmarks: code tasks are essentially saturated (both Claude Code and Gemini CLI score 1.00), but agents diverge significantly on complex tasks that require planning, data interpretation, and tool orchestration.
Would be great to explore integrating PinchBench's coding benchmarks as a domain in AgentBench-Live, or vice versa. Open to collaboration!