
Are current benchmarks enough to measure research agents? #114

@black-yt

Description


On Evaluating the Emerging Wave of Auto Research Agents

Hi! 👋

The recent surge of auto research agents has been both surprising and impressive.
Systems like Claude Code, Codex CLI, OpenClaw, and others are demonstrating increasingly strong capabilities in coding, reasoning, and multi-step tool use.

However, this also raises a fundamental question:

How should we rigorously evaluate whether an agent can actually conduct scientific research?

Most existing benchmarks focus on:

  • knowledge recall,
  • reasoning,
  • or code generation,

but they rarely evaluate the end-to-end research process — from raw data to paper-level conclusions.


ResearchClawBench

To address this gap, we developed ResearchClawBench, a benchmark designed specifically for auto research agents.

It follows a two-stage protocol:

  • Stage 1 — Autonomous Research
    The agent is given raw datasets, task instructions, and references, and must independently perform:
    data analysis, code writing, visualization, and report generation.

  • Stage 2 — Paper-level Evaluation
    The generated report is compared against a real published paper, using expert-designed checklists (rubrics) and an LLM-based judge.
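The two-stage protocol above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the `Task`, `ChecklistItem`, and `evaluate_report` names are hypothetical, and a keyword-matching stand-in plays the role of the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    dataset_path: str      # raw dataset given to the agent (Stage 1 input)
    instructions: str      # task instructions
    references: list[str]  # reference materials

@dataclass
class ChecklistItem:
    criterion: str  # one expert-written rubric item
    weight: float   # relative importance of this item

def evaluate_report(report: str,
                    rubric: list[ChecklistItem],
                    judge: Callable[[str, str], bool]) -> float:
    """Stage 2 sketch: score a generated report against an expert rubric.

    `judge(report, criterion)` stands in for the LLM-based judge that
    decides whether the report satisfies one checklist item.  The result
    is the weighted fraction of satisfied items, scaled to 0-100.
    """
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if judge(report, item.criterion))
    return 100.0 * earned / total

# Toy run with a keyword-matching "judge" standing in for the LLM.
rubric = [ChecklistItem("reports test accuracy", 2.0),
          ChecklistItem("includes ablation study", 1.0)]
report = "We measure test accuracy of 91% on the held-out split."
score = evaluate_report(report, rubric,
                        lambda r, c: any(w in r for w in c.split()))
# Only the first (weight-2.0) item is satisfied, so score is ~66.7.
```

A real judge would prompt an LLM with the report and the criterion; the weighted-checklist aggregation is the part this sketch is meant to show.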

The scoring is calibrated such that:

  • 50 ≈ reproducing the original paper (Re-Discovery)
  • 70+ ≈ surpassing it (New Discovery)

Key Characteristics

  • 40 tasks across 10 scientific domains, all derived from real publications
  • Complete datasets and reproducible setups
  • Fine-grained, checklist-based evaluation grounded in expert annotations
  • Support for multiple agents (Claude Code, Codex CLI, OpenClaw, Nanobot) and easy integration of custom agents
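For the last point, a custom agent would presumably plug in behind a small adapter interface. The sketch below is purely illustrative (the `ResearchAgent` class and its `run` signature are assumptions, not the benchmark's real API):

```python
from abc import ABC, abstractmethod

class ResearchAgent(ABC):
    """Hypothetical adapter interface for plugging a custom agent
    into the benchmark harness; names are illustrative only."""

    @abstractmethod
    def run(self, dataset_dir: str, instructions: str) -> str:
        """Execute Stage 1 on one task and return the report as text."""

class EchoAgent(ResearchAgent):
    """Trivial stand-in agent used only to show the wiring."""
    def run(self, dataset_dir: str, instructions: str) -> str:
        return f"Report for task in {dataset_dir}: {instructions}"
```

A wrapper for Claude Code or Codex CLI would implement `run` by launching the agent in the task directory and collecting its written report.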

In preliminary experiments, most current agents achieve scores in the 20–40 range, indicating a non-trivial gap between current capabilities and full paper-level reproduction.


Discussion

We believe benchmarks of this type may be useful for:

  • understanding the actual capabilities of research agents,
  • identifying gaps between coding ability and scientific reasoning,
  • and providing a more standardized way to compare different systems.

If this direction is relevant to your work, we would be interested to see how your system performs under such a setup, and are happy to discuss further.


Links


ResearchClawBench.mp4
