PinchBench looks great for evaluating coding agents specifically.
We built AgentBench-Live, which extends the concept to five domains: Code, Data Analysis, Research, Tool Use, and Multi-step Workflows. Each domain has its own scoring methodology (automated tests, LLM-as-judge, or a composite of the two).
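For concreteness, here's a rough sketch of how per-domain scoring like that could be wired up. This is not AgentBench-Live's actual code; the names (`DomainConfig`, `ScoringMethod`, `score_task`) and the composite weights are hypothetical placeholders:

```python
from dataclasses import dataclass
from enum import Enum

class ScoringMethod(Enum):
    AUTO_TESTS = "auto_tests"    # deterministic test-based scoring
    LLM_JUDGE = "llm_as_judge"   # rubric-based grading by a judge model
    COMPOSITE = "composite"      # weighted blend of the two

@dataclass
class DomainConfig:
    name: str
    method: ScoringMethod
    weights: dict[str, float] | None = None  # only used for COMPOSITE

# Hypothetical registry mirroring the five domains described above
DOMAINS = [
    DomainConfig("code", ScoringMethod.AUTO_TESTS),
    DomainConfig("data_analysis", ScoringMethod.COMPOSITE,
                 weights={"auto_tests": 0.6, "llm_as_judge": 0.4}),
    DomainConfig("research", ScoringMethod.LLM_JUDGE),
    DomainConfig("tool_use", ScoringMethod.AUTO_TESTS),
    DomainConfig("multi_step_workflows", ScoringMethod.COMPOSITE,
                 weights={"auto_tests": 0.5, "llm_as_judge": 0.5}),
]

def score_task(domain: DomainConfig, auto_score: float, judge_score: float) -> float:
    """Combine raw scores according to the domain's scoring method."""
    if domain.method is ScoringMethod.AUTO_TESTS:
        return auto_score
    if domain.method is ScoringMethod.LLM_JUDGE:
        return judge_score
    w = domain.weights or {"auto_tests": 0.5, "llm_as_judge": 0.5}
    return w["auto_tests"] * auto_score + w["llm_as_judge"] * judge_score
```

A structure along these lines would also make it easy to drop PinchBench in as another entry in the registry.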
A key insight from our benchmarks: code tasks are essentially saturated (both Claude Code and Gemini CLI score 1.00), but agents diverge significantly on complex tasks that require planning, data interpretation, and tool orchestration.
Would be great to explore integrating PinchBench's coding benchmarks as a domain in AgentBench-Live, or vice versa. Open to collaboration!