A curated Markdown reference (in English) of benchmarks for building and evaluating Proposal2Code pipelines, CodeAgents, and end-to-end AI-scientist capabilities. Each entry includes a compact summary, evaluation focus, datasets / scale (when available), and key findings, with primary paper links cited.
- Provide a single, well-organized reference of state-of-the-art benchmarks for evaluating systems that turn research proposals into runnable code and experiments (Proposal2Code), and for measuring the broader capabilities of AI research agents.
- Help prioritize which benchmarks to run for different competence axes (paper understanding, code synthesis, experiment design & execution, data analysis, domain-specific discovery).
| Benchmark | Focus | Task Type | Scale / Dataset | Main Metrics | Key Insights |
|---|---|---|---|---|---|
| PaperBench | Reproducing ML research papers | Paper → Code → Experiment | ~dozens of recent ICML/NeurIPS papers | Implementation accuracy, result matching | Tests practical reproducibility of modern papers |
| MLEBench | Kaggle competition performance | Kaggle ML tasks | 75 Kaggle competitions | Leaderboard score | Measures real-world ML engineering competence |
| ResearchCodeBench | Novel idea implementation | Code-level tasks | 100s of challenges from ~20 papers | Unit test correctness | Exposes gaps in translating novel ideas to code |
| EXP-Bench | Full experiment automation | Hypothesis → Experiment → Analysis | 461 tasks from 51 papers | Stepwise rubric, success rate | Agents rarely succeed end-to-end (<1%) |
| MLR-Bench | Open-ended ML research | Idea → Proposal → Paper | 100s of tasks | LLM+rubric grading | LLMs good at writing, weak at valid experiments |
| SciReplicate-Bench | NLP algorithm reproduction | Algorithm → Code | ~100 tasks | Execution, dependency recall | Highlights incomplete paper descriptions |
| ML-Bench | Repository-level ML coding | Multi-file repo edits | 1000s of GitHub tasks | Pass@k, execution | Tests repo-scale code reasoning |
| SUPER | Repo setup & execution | Env setup + running tasks | Multi-split benchmark | Success of repo execution | Evaluates engineering reliability |
| AAAR-1.0 | Research assistance | Critical review & math tasks | Multi-domain | Correctness, quality | Can AI help real researchers? |
| ScienceAgentBench | Data-driven discovery | Paper → Code | ~100 expert tasks | Program correctness | Agents solve only a minority of tasks |
| CORE-Bench | Computational reproducibility | Paper → Result reproduction | 270 tasks from 90 papers | Execution accuracy | Reproducibility automation still immature |
| Auto-Bench | Scientific discovery via causal graphs | Interactive interventions | Simulated environments | Graph accuracy | Iterative discovery is very hard |
| MLGym | Training RL-style research agents | RL environment | 13 open tasks | Success rate | First gym for AI scientist training |
| SciCode | Science-specific coding | Domain problems | 80 problems, 338 subproblems | Pass@k, execution | Extremely low success on real science coding |
| RE-Bench | Human vs AI R&D | Open-ended ML research | Human expert 8h baselines | Task completion, quality | Humans still stronger with long budgets |
| GAIA | General AI assistant | Tool use, reasoning | 100s of curated Qs | Human-judge accuracy | Good general assistant baseline |
| SWE-bench | GitHub issue resolution | Patch generation | 2,294 real issues | Issue resolution % | Tests multi-file, repo-scale reasoning |
| Humanity’s Last Exam | Ultimate reasoning | Hard academic Qs | 1,000s of Qs | Accuracy | Current LLMs perform very poorly |
| Paper2Code | Paper → runnable repo | CodeAgent pipeline | 100s of ML papers | Repo executability | Direct Proposal2Code evaluation |
| AlphaGo Moment (Arch Disc.) | Large-scale architecture search | Automated ML design | Massive compute-scale | Discovered architectures | First “AlphaGo moment” for arch search |
| BaisBench (omics) | Biology research | Omics Q&A + annotation | Domain-specific | Task accuracy | Agents underperform domain experts |
| SciAssess | Literature analysis | Comprehension & critique | Multi-field | Graded tasks | Bench for survey / paper understanding |
| BixBench | Computational biology | Bioinformatics analysis | Multiple scenarios | Answer correctness | Domain-specific reasoning benchmark |
| ResearchArena | Research survey workflow | Collect & organize papers | Millions of papers | Retrieval & clustering | Evaluates survey-building agents |
| DataSciBench | Data science | Data tasks (open) | Multi-dataset | TFC framework | Tackles ambiguity in data analysis eval |
| InfiAgent-DABench | Data analysis | CSV Q&A | 1,000s of queries | Automatic scoring | Tests agent data-analysis accuracy |
| Tapilot-Crossing | Interactive data analysis | Multi-turn analysis | Multi-domain | Task success | Focuses on adaptive collaboration |
PaperBench tasks agents with end-to-end replication of published ML work: understanding the paper, creating a codebase, running experiments, and matching reported results. The benchmark focuses on reproducibility rubrics that decompose replication into graded sub-tasks (implementation correctness, experiment setup, result matching). PaperBench samples recent high-impact conference papers (e.g., ICML spotlight/oral items) to measure practical research-engineering competence. (arXiv)
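As a concrete illustration of rubric-style grading, the sketch below aggregates graded sub-task scores into a single replication score. The sub-task names and weights are hypothetical, not PaperBench's official rubric.

```python
# Illustrative only: aggregate graded sub-task scores into one replication score.
# The rubric items and weights below are hypothetical, not PaperBench's official rubric.
from dataclasses import dataclass

@dataclass
class RubricItem:
    name: str
    weight: float  # relative importance of this sub-task
    score: float   # grader score in [0, 1]

def replication_score(items: list[RubricItem]) -> float:
    """Weighted average of sub-task scores, normalized to [0, 1]."""
    total_weight = sum(item.weight for item in items)
    return sum(item.weight * item.score for item in items) / total_weight

rubric = [
    RubricItem("implementation_correctness", weight=0.5, score=0.8),
    RubricItem("experiment_setup", weight=0.3, score=0.5),
    RubricItem("result_matching", weight=0.2, score=0.2),
]
print(f"Overall replication score: {replication_score(rubric):.2f}")
```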
MLEBench collects a curated set of 75 Kaggle competitions, so agents are evaluated on realistic ML engineering workflows: dataset prep, model training, hyperparameter tuning, and leaderboard-quality submissions. It establishes human baselines from existing leaderboards and measures the agent's ability to climb a leaderboard. Useful when you want to stress-test model selection, pipeline automation, and iterative experiment tuning. (OpenAI)
ResearchCodeBench is a focused coding benchmark that converts novel ML research contributions into fine-grained coding challenges (hundreds of challenges across ~20 recent papers). Each challenge targets core implementation components (model blocks, loss terms, training loops) and includes correctness tests co-developed with domain experts. The benchmark exposes how well LLMs translate novel (post-pretraining) research ideas into executable code. (arXiv)
EXP-Bench evaluates end-to-end experimental capability: given a research question and partially provided starter code, agents must (1) formulate hypotheses, (2) design experimental procedures, (3) implement and execute experiments, and (4) analyze results. To create realistic tasks, the authors build a semi-autonomous pipeline that extracts and structures experimental details from papers and their open-source code. EXP-Bench curates 461 real research tasks drawn from 51 papers and provides stepwise, gradeable procedures. Evaluations of current LLM-based agents show partial strength on individual aspects (e.g., design or implementation correctness sometimes scores ~20–35%), but the success rate for complete, executable experiments is extremely low (~0.5%), highlighting major gaps in current agents' ability to run full research experiments. EXP-Bench is explicitly intended to surface the bottlenecks in automating real research experiments (hypothesis → implementation → analysis) and to provide realistic, gradable tasks for improving agents. (arXiv)
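The gap between per-step scores and end-to-end success is worth measuring explicitly. A minimal sketch follows; the step names and pass threshold are illustrative, not EXP-Bench's actual grading code.

```python
# Minimal sketch: partial credit per step vs. all-or-nothing end-to-end success.
# Step names and the pass threshold are illustrative, not EXP-Bench's grading code.
STEPS = ["hypothesis", "design", "implementation", "execution", "analysis"]

def grade_task(step_scores: dict[str, float], threshold: float = 1.0) -> dict:
    per_step = {s: step_scores.get(s, 0.0) for s in STEPS}
    # An experiment counts as an end-to-end success only if every step fully passes.
    end_to_end = all(per_step[s] >= threshold for s in STEPS)
    return {"per_step": per_step, "end_to_end": end_to_end}

print(grade_task({"hypothesis": 1.0, "design": 0.3, "implementation": 0.25}))
# Non-trivial partial credit on early steps, but end_to_end == False.
```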
MLR-Bench bundles hundreds of open-ended ML research tasks (sourced from workshop papers) and provides an automated judge (MLR-Judge) that mixes LLM reviewers with explicit rubrics. It also ships an agent scaffold (MLR-Agent) to evaluate the full research pipeline (idea → proposal → experiment → paper). Results emphasize that while LLMs are strong at generating coherent ideas and prose, experimental validity remains a major challenge: many agent outputs contain fabricated or invalid experimental claims. (arXiv)
SciReplicate-Bench extracts algorithm descriptions from recent NLP papers and frames tasks that require both algorithm comprehension and repository-level coding. The benchmark contains ~100 tasks from dozens of papers, annotated with test cases and dependency metadata; evaluation metrics include execution accuracy, reasoning-graph similarity, and dependency recall. This benchmark highlights problems caused by incomplete or inconsistent algorithm descriptions in papers. (arXiv)
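A dependency-recall-style metric can be sketched as the fraction of gold-annotated intra-repository dependencies that the generated code actually invokes; SciReplicate-Bench's exact formulation may differ.

```python
# Sketch of a dependency-recall style metric: the fraction of gold-annotated
# repository dependencies that the generated code actually uses.
# SciReplicate-Bench's exact definition may differ.
def dependency_recall(gold_deps: set[str], used_deps: set[str]) -> float:
    if not gold_deps:
        return 1.0
    return len(gold_deps & used_deps) / len(gold_deps)

gold = {"utils.tokenize", "model.attention", "data.collate_fn"}
used = {"utils.tokenize", "model.attention", "model.extra_block"}
print(dependency_recall(gold, used))  # 2 of 3 gold dependencies used -> ~0.67
```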
ML-Bench evaluates LLMs and autonomous agents on real repository tasks (thousands of examples across many GitHub repos). It focuses on repository-scale reasoning (argument selection for routines, multi-file edits, bash/script generation) and provides two modes: LLM (text→code) and agent (sandboxed execution + iterative action). ML-Bench is useful for measuring the agent’s ability to work inside a full research codebase. (arXiv)
SUPER evaluates an agent’s ability to set up research environments (install deps, config, run experiments) and then execute tasks from public repositories. SUPER contains multiple splits (Expert, Masked, Auto) to stress different capabilities (full problem, focused subproblems, automatic extraction). This is a practical benchmark for measuring engineering reliability when reproducing papers from GitHub. (arXiv)
AAAR-1.0 focuses on research-centric tasks that require deep expertise: equation inference/validation, experiment design, identifying paper weaknesses, and reviewer-style critique. It evaluates whether LLMs can help with the day-to-day intellectual tasks researchers perform and highlights where models do and do not provide reliable outputs for expert audiences. (arXiv)
ScienceAgentBench extracts ~100 real tasks from peer-reviewed papers across multiple disciplines and validates tasks with subject experts. Each task has a canonical Python program as the target output; the benchmark measures program correctness, execution results, and resource costs. Current state-of-the-art agents solve only a minority of tasks, underlining limits in agent-driven data discovery. (arXiv)
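A minimal harness for this kind of program-level evaluation runs the generated program in a subprocess and checks its output; real benchmarks usually compare saved artifacts (figures, CSVs) with task-specific checkers rather than raw stdout. Sketch only, not the official ScienceAgentBench harness.

```python
# Minimal sketch (not the official harness): run an agent-generated program
# with a timeout and compare its stdout against a reference string.
import subprocess
import sys

def run_and_check(program_path: str, expected_stdout: str, timeout_s: int = 300) -> bool:
    try:
        proc = subprocess.run(
            [sys.executable, program_path],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected_stdout.strip()
```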
CORE-Bench is built to test computational reproducibility: reproduce experiment results using provided code/data across multiple disciplines. It contains hundreds of tasks (e.g., 270 tasks from 90 papers) spanning CS, social sciences, and medicine, with automated evaluation pipelines. Baseline agents show modest success (e.g., low accuracy on hardest tasks), indicating reproducibility automation is still immature. (arXiv)
Auto-Bench frames scientific discovery as interactive causal-graph discovery: agents iteratively propose interventions, observe or query an oracle, and update hypotheses. Settings include simulated chemistry and social networks; the benchmark measures an agent’s ability to discover hidden structures and produce valid justifications. Performance drops quickly as complexity rises, signaling gaps in iterative scientific reasoning. (arXiv)
MLGym provides an RL-style environment and 13 open-ended research tasks (vision, NLP, RL, game theory). It’s designed to let researchers train agents using reinforcement learning or imitation approaches for research workflows: idea generation, data synthesis, implementation, experimentation, and iteration. Useful if you want to train agents to improve over time. (arXiv)
SciCode collects realistic, scientist-authored research problems across physical & life sciences and decomposes them into subproblems (e.g., 80 main problems → 338 subproblems). Each subproblem blends domain knowledge, reasoning, and code synthesis; state-of-the-art models solve only a very small fraction in realistic settings — revealing the steep gap between general code generation and research-grade scientific coding. (arXiv)
RE-Bench provides open-ended ML research engineering environments with human expert runs (real 8-hour attempts) and agent baselines. It offers a realistic, time-budgeted comparison showing agents can be much faster (and sometimes score higher in short time budgets), but humans still scale better with longer time — a nuanced, resource-aware evaluation for “automation risk” analysis. (arXiv)
GAIA is a broad assistant benchmark (hundreds of human-designed questions) that stresses tool use, browsing, multi-modality, and practical reasoning — a good sanity check for general assistant competencies before narrowing to research tasks. (Human vs. model gaps are large on challenging questions.) (arXiv)
SWE-bench contains thousands of real GitHub issues + PR fixes and asks models to generate patches that resolve issues across multiple files and contexts. It’s a demanding engineering benchmark — the best models historically resolve only a small percentage of realistic issues. Use it to test multi-file reasoning & patch generation. (arXiv)
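The core evaluation loop is conceptually simple: apply the model's patch, then run the repository's tests. A rough sketch follows; paths and test commands are placeholders, not the official SWE-bench harness.

```python
# Rough sketch of an apply-patch-then-test loop; paths and test commands are
# placeholders, not the official SWE-bench harness.
import subprocess

def resolves_issue(repo_dir: str, patch_file: str, test_cmd: list[str]) -> bool:
    applied = subprocess.run(["git", "-C", repo_dir, "apply", patch_file])
    if applied.returncode != 0:
        return False  # the model's patch does not even apply cleanly
    tests = subprocess.run(test_cmd, cwd=repo_dir)
    return tests.returncode == 0  # resolved only if the designated tests pass

# Example call (placeholder values):
# resolves_issue("workdir/some_repo", "model_patch.diff", ["pytest", "-x", "tests/"])
```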
Humanity's Last Exam (HLE) is a multi-modal, expert-curated benchmark of thousands of extremely challenging, closed-ended academic questions (math, science, humanities). It's intended as a frontier stress-test: current frontier LLMs perform very poorly on HLE, making it a strong evaluation for high-level reasoning and domain mastery. (arXiv)
Paper2Code (PaperCoder) proposes a multi-agent framework that ingests ML papers and autonomously generates a runnable code repository (planning, implementation, and verification). It directly targets the Proposal2Code workflow and reports strong gains when evaluated on related benchmarks (including PaperBench). This is a practical starting point for a CodeAgent design. (arXiv)
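A highly simplified plan → implement → verify loop in this spirit could look like the sketch below; `llm` and `verify` are assumed callables, and the prompts, stages, and retry policy are illustrative rather than PaperCoder's actual design.

```python
# Highly simplified plan -> implement -> verify loop. `llm` and `verify` are
# assumed callables; prompts, stages, and retry policy are illustrative,
# not PaperCoder's actual design.
from typing import Callable

def paper_to_repo(paper_text: str,
                  llm: Callable[[str], str],
                  verify: Callable[[str], bool],
                  max_rounds: int = 3) -> str:
    plan = llm(f"Draft a file-by-file implementation plan for this paper:\n{paper_text}")
    repo = llm(f"Implement the following plan as a runnable repository:\n{plan}")
    for _ in range(max_rounds):
        if verify(repo):  # verification stage, e.g. install deps and run smoke tests
            return repo
        repo = llm(f"The repository failed verification. Fix it:\n{repo}")
    return repo  # best effort after max_rounds
```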
The "AlphaGo Moment" architecture-discovery work describes a multi-agent, large-scale system that autonomously searches model architectures (reporting a large number of autonomous experiments and several discovered novel architectures). It demonstrates the feasibility of highly automated, compute-intensive architecture discovery and provides lessons for scaling Proposal2Code pipelines that include large experimental searches. (arXiv)
- Benchmarking AI scientists in omics / BaisBench — Data-driven biological discovery (cell type annotation tasks + domain question answering). Highlights how data + domain knowledge are essential and how current agents underperform domain experts. (arXiv)
- SciAssess — Benchmarks LLM skill in scientific literature analysis across memorization, comprehension, and analysis tiers (multi-field coverage). Useful for measuring literature-understanding modules in a Proposal2Code stack. (arXiv)
- BixBench — LLM-agent benchmark for computational biology — real analysis scenarios + open-answer questions to evaluate multi-step bioinformatics reasoning. (arXiv)
- ResearchArena — Evaluates LLM agents on the research survey workflow: discovery → selection → organization (offline corpus of millions of papers). Great for measuring agents that must collect & synthesize prior work before proposing experiments. (arXiv)
- DataSciBench — An LLM agent benchmark for data-science tasks with a semi-automated ground-truth pipeline and a Task–Function–Code (TFC) evaluation framework; stresses open-ended data tasks and evaluation ambiguity. (arXiv)
- InfiAgent-DABench — Agent evaluation specifically for CSV / data analysis question answering with closed-form conversion to enable automatic scoring; strong baseline of agent frameworks included. (arXiv)
- Tapilot-Crossing — Interactive data-analysis benchmark that evaluates multi-turn, human-agent collaboration logics and adaptive strategies (useful for interactive Proposal2Code where a human steers experiments). (arXiv)
Map capability → benchmark:
- Paper understanding & idea extraction → ResearchArena, SciAssess, AAAR-1.0.
- Codegen for research algorithms → ResearchCodeBench, SciReplicate, SciCode.
- Reproducibility & repo setup → SUPER, CORE-Bench, ML-Bench.
- End-to-end experiment design & execution → EXP-Bench, MLR-Bench, RE-Bench, PaperBench.
- Domain discovery (biology/omics) → BaisBench, BixBench, SciCode.
Start small and iterate: pick a suite of 2–3 benchmarks matching your goal (e.g., ResearchCodeBench + EXP-Bench + Paper2Code for Proposal2Code), run baseline agents, inspect failure modes (missing deps, incomplete implementations, wrong experimental configs), and design targeted improvements (better dependency resolution, iterative test-and-debug loops, sandboxed execution).
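Failure-mode inspection is easier to act on when logs are bucketed automatically. A small triage sketch follows; the categories and regexes are illustrative assumptions.

```python
# Small triage sketch: bucket agent stderr logs into failure modes before
# deciding what to improve. Categories and regexes are illustrative assumptions.
import re
from collections import Counter

FAILURE_PATTERNS = {
    "missing_dependency": re.compile(r"ModuleNotFoundError|ImportError"),
    "wrong_config": re.compile(r"KeyError|unrecognized arguments|FileNotFoundError"),
    "incomplete_implementation": re.compile(r"NotImplementedError|TODO"),
}

def triage(stderr_logs: list[str]) -> Counter:
    counts = Counter()
    for log in stderr_logs:
        label = next(
            (name for name, pattern in FAILURE_PATTERNS.items() if pattern.search(log)),
            "other",
        )
        counts[label] += 1
    return counts
```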
Combine automated and human review: many benchmarks intentionally require subject-expert validation or multi-stage rubrics — mix automated metrics (execution accuracy, pass@k) with human annotations for robust evaluation.
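For the automated side, pass@k is usually computed with the standard unbiased estimator from the Codex paper (Chen et al., 2021): sample n completions per task, count the c that pass, and estimate the chance that at least one of k randomly drawn samples passes.

```python
# Standard unbiased pass@k estimator (Chen et al., 2021): with n samples per
# task of which c pass, estimate P(at least one of k random samples passes).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=3, k=1))  # per-task estimate; average over tasks for the benchmark score
```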
- PRs / issues welcome — add missing benchmarks or new evaluation notes.
- Suggested license: MIT.