On Evaluating the Emerging Wave of Auto Research Agents
Hi! 👋
The recent surge of auto research agents has been both surprising and impressive.
Systems like Claude Code, Codex CLI, OpenClaw, and others are beginning to demonstrate increasingly strong capabilities in coding, reasoning, and multi-step tool use.
However, this also raises a fundamental question:
How should we rigorously evaluate whether an agent can actually conduct scientific research?
Most existing benchmarks focus on knowledge recall, reasoning, or code generation, but they rarely evaluate the end-to-end research process, from raw data to paper-level conclusions.
ResearchClawBench
To address this gap, we developed ResearchClawBench, a benchmark designed specifically for auto research agents.
It follows a two-stage protocol:
Stage 1 — Autonomous Research
The agent is given raw datasets, task instructions, and references, and must independently perform data analysis, code writing, visualization, and report generation.
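To make Stage 1 concrete, here is a rough sketch of the kind of bundle an agent might receive. The `ResearchTask` class, its field names, and the `workspace()` helper are illustrative assumptions, not ResearchClawBench's actual schema.

```python
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class ResearchTask:
    """Hypothetical task bundle; names are illustrative, not the real schema."""
    task_id: str                 # e.g. a hypothetical "genomics-03"
    instructions: str            # natural-language task description
    data_dir: Path               # directory containing the raw datasets
    references: list[str] = field(default_factory=list)  # background papers

    def workspace(self) -> dict:
        """Assemble the inputs handed to the agent at the start of Stage 1."""
        return {
            "instructions": self.instructions,
            "data_files": sorted(str(p) for p in self.data_dir.glob("*")),
            "references": self.references,
        }
```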
Stage 2 — Paper-level Evaluation
The generated report is compared against a real published paper, using expert-designed checklists (rubrics) and an LLM-based judge.
The scoring is calibrated such that:

- 50 ≈ reproducing the original paper (Re-Discovery)
- 70+ ≈ surpassing it (New Discovery)
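As a rough illustration of how checklist-based judging with these bands could be wired up: the rubric items, the weights, and the `call_llm_judge` placeholder below are assumptions for illustration, not the benchmark's actual rubric or judge.

```python
# Illustrative only: rubric items, weights, and the judge hookup are
# assumptions, not ResearchClawBench's actual implementation.

RUBRIC = [
    # (checklist criterion, weight)
    ("Reports the main effect found in the original paper", 30),
    ("Analysis of the raw data is statistically sound", 20),
    ("Figures support the stated conclusions", 20),
    ("Goes beyond the paper with a validated new finding", 30),
]

def call_llm_judge(report: str, criterion: str) -> float:
    """Placeholder: ask an LLM judge how well `report` satisfies
    `criterion`, returning a fulfillment fraction in [0, 1]."""
    raise NotImplementedError

def score_report(report: str) -> float:
    """Weighted checklist score on a 0-100 scale."""
    total = sum(w for _, w in RUBRIC)
    earned = sum(w * call_llm_judge(report, c) for c, w in RUBRIC)
    return 100 * earned / total

def band(score: float) -> str:
    """Map a score to the calibration bands described above."""
    if score >= 70:
        return "New Discovery"
    if score >= 50:
        return "Re-Discovery"
    return "Below reproduction"
```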
Key Characteristics
- 40 tasks across 10 scientific domains, all derived from real publications
- Complete datasets and reproducible setups
- Fine-grained, checklist-based evaluation grounded in expert annotations
- Support for multiple agents (Claude Code, Codex CLI, OpenClaw, Nanobot) and easy integration of custom agents (see the adapter sketch below)
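For custom agents, integration could look roughly like this adapter sketch; the `AgentAdapter` interface and its `run` signature are hypothetical, not the harness's real API.

```python
from abc import ABC, abstractmethod

class AgentAdapter(ABC):
    """Hypothetical plug-in point; method names are illustrative."""

    @abstractmethod
    def run(self, workspace: dict, timeout_s: int = 3600) -> str:
        """Run the agent on a Stage 1 workspace and return the path
        to the generated report."""

class MyCustomAgent(AgentAdapter):
    def run(self, workspace: dict, timeout_s: int = 3600) -> str:
        # Launch your agent here (CLI invocation, API session, ...) on
        # workspace["instructions"] and workspace["data_files"], then
        # return the path of the report it produced.
        raise NotImplementedError
```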
In preliminary experiments, most current agents achieve scores in the 20–40 range, indicating a non-trivial gap between current capabilities and full paper-level reproduction.
Discussion
We believe benchmarks of this type may be useful for:

- understanding the actual capabilities of research agents,
- identifying gaps between coding ability and scientific reasoning, and
- providing a more standardized way to compare different systems.
If this direction is relevant to your work, we would be interested to see how your system performs under such a setup, and are happy to discuss further.
Links
ResearchClawBench.mp4