
Are current benchmarks enough to measure research agents? #114

@black-yt

Description


On Evaluating the Emerging Wave of Auto Research Agents

Hi! 👋

The recent surge of auto research agents has been both surprising and impressive.
Systems like Claude Code, Codex CLI, OpenClaw, and others are demonstrating increasingly strong capabilities in coding, reasoning, and multi-step tool use.

However, this also raises a fundamental question:

How should we rigorously evaluate whether an agent can actually conduct scientific research?

Most existing benchmarks focus on:

  • knowledge recall,
  • reasoning,
  • or code generation,

but they rarely evaluate the end-to-end research process — from raw data to paper-level conclusions.


ResearchClawBench

To address this gap, we developed ResearchClawBench, a benchmark designed specifically for auto research agents.

It follows a two-stage protocol:

  • Stage 1 — Autonomous Research
    The agent is given raw datasets, task instructions, and references, and must independently perform:
    data analysis, code writing, visualization, and report generation.

  • Stage 2 — Paper-level Evaluation
    The generated report is compared against a real published paper, using expert-designed checklists (rubrics) and an LLM-based judge.
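The two-stage protocol above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual code: the `Task`, `ChecklistItem`, and `evaluate_report` names are hypothetical, and a keyword-matching stand-in plays the role of the LLM judge.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    dataset_path: str      # raw dataset given to the agent (Stage 1 input)
    instructions: str      # task instructions
    references: list[str]  # reference materials

@dataclass
class ChecklistItem:
    criterion: str  # one expert-written rubric item
    weight: float   # relative importance of this item

def evaluate_report(report: str,
                    rubric: list[ChecklistItem],
                    judge: Callable[[str, str], bool]) -> float:
    """Stage 2 sketch: score a generated report against an expert rubric.

    `judge(report, criterion)` stands in for the LLM-based judge that
    decides whether the report satisfies one checklist item.  The result
    is the weighted fraction of satisfied items, scaled to 0-100.
    """
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if judge(report, item.criterion))
    return 100.0 * earned / total

# Toy run with a keyword-matching "judge" standing in for the LLM.
rubric = [ChecklistItem("reports test accuracy", 2.0),
          ChecklistItem("includes ablation study", 1.0)]
report = "We measure test accuracy of 91% on the held-out split."
score = evaluate_report(report, rubric,
                        lambda r, c: any(w in r for w in c.split()))
# Only the first (weight-2.0) item is satisfied, so score is ~66.7.
```

A real judge would prompt an LLM with the report and the criterion; the weighted-checklist aggregation is the part this sketch is meant to show.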

The scoring is calibrated such that:

  • 50 ≈ reproducing the original paper (Re-Discovery)
  • 70+ ≈ surpassing it (New Discovery)

Key Characteristics

  • 40 tasks across 10 scientific domains, all derived from real publications
  • Complete datasets and reproducible setups
  • Fine-grained, checklist-based evaluation grounded in expert annotations
  • Support for multiple agents (Claude Code, Codex CLI, OpenClaw, Nanobot) and easy integration of custom agents
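For the last point, a custom agent would presumably plug in behind a small adapter interface. The sketch below is purely illustrative (the `ResearchAgent` class and its `run` signature are assumptions, not the benchmark's real API):

```python
from abc import ABC, abstractmethod

class ResearchAgent(ABC):
    """Hypothetical adapter interface for plugging a custom agent
    into the benchmark harness; names are illustrative only."""

    @abstractmethod
    def run(self, dataset_dir: str, instructions: str) -> str:
        """Execute Stage 1 on one task and return the report as text."""

class EchoAgent(ResearchAgent):
    """Trivial stand-in agent used only to show the wiring."""
    def run(self, dataset_dir: str, instructions: str) -> str:
        return f"Report for task in {dataset_dir}: {instructions}"
```

A wrapper for Claude Code or Codex CLI would implement `run` by launching the agent in the task directory and collecting its written report.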

In preliminary experiments, most current agents achieve scores in the 20–40 range, indicating a non-trivial gap between current capabilities and full paper-level reproduction.


Discussion

We believe benchmarks of this type may be useful for:

  • understanding the actual capabilities of research agents,
  • identifying gaps between coding ability and scientific reasoning,
  • and providing a more standardized way to compare different systems.

If this direction is relevant to your work, we would be interested to see how your system performs under such a setup, and are happy to discuss further.


Links


ResearchClawBench.mp4
