Multi-domain benchmarking: AgentBench-Live covers 5 domains beyond coding #64

@jackjin1997

PinchBench looks great for evaluating coding agents specifically.

We built AgentBench-Live, which extends the concept to 5 domains: Code, Data Analysis, Research, Tool Use, and Multi-step Workflows. Each domain has its own scoring methodology (automated tests, LLM-as-Judge, or composite).
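
To make the "composite" idea concrete, here is a rough sketch of how such a score could be computed. This is not AgentBench-Live's actual scoring code: it assumes a composite blends an automated-test pass rate with a normalized LLM-as-Judge rating, and the names `DomainResult`, `composite_score`, and `test_weight`, the 0.0 to 1.0 scale, and the linear weighting are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DomainResult:
    """Scores for one task, each on an assumed 0.0-1.0 scale (hypothetical)."""
    auto_test_score: float  # fraction of automated tests passed
    judge_score: float      # normalized LLM-as-Judge rating


def composite_score(result: DomainResult, test_weight: float = 0.5) -> float:
    """Blend the two scores with a hypothetical linear weighting."""
    if not 0.0 <= test_weight <= 1.0:
        raise ValueError("test_weight must be between 0 and 1")
    judge_weight = 1.0 - test_weight
    return test_weight * result.auto_test_score + judge_weight * result.judge_score


# Example: all tests pass, but the judge rates the output at 0.5.
print(composite_score(DomainResult(auto_test_score=1.0, judge_score=0.5)))
```

With `test_weight=1.0` this reduces to pure automated testing, and with `test_weight=0.0` to pure LLM-as-Judge, so a single function could in principle cover all three scoring modes.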

Key insight from our benchmarks: code tasks are essentially "solved" (both Claude Code and Gemini CLI score 1.00), but agents diverge significantly on complex tasks requiring planning, data interpretation, and tool orchestration.

Would be great to explore integrating PinchBench's coding benchmarks as a domain in AgentBench-Live, or vice versa. Open to collaboration!
