SWE-Bench Pro

Code and data for SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

News

(2/9) We have removed some unit tests that were outdated (e.g., they required the year 2025) or were not intended to be included.

(1/7) We have fixed an issue with tutao instances that caused them to take a long time to evaluate. The relevant run scripts have been updated.

(10/28) We added mini-swe-agent! Results are comparable to SWE-Agent for Sonnet 4.5. Feel free to give it a shot. (credit @miguelrc-scale)

(10/28) We have added the SWE-agent scaffold to reproduce results, along with a step-by-step guide below. We have confirmed that this reproduces the Sonnet 4.5 results. (credit @18vijayb)

(10/3) We have updated results without cap limit here: https://scaleapi.github.io/SWE-bench_Pro-os/

Overview

SWE-Bench Pro is a challenging benchmark evaluating LLMs/Agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.

The dataset is inspired by SWE-Bench: https://github.com/SWE-bench/SWE-bench

To access SWE-bench Pro, copy and run the following code:

from datasets import load_dataset
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

Installation

1. Install Python Dependencies

pip install -r requirements.txt

2. Install Docker

SWE-bench Pro uses Docker for reproducible evaluations.

Follow the instructions in the Docker setup guide to install Docker on your machine. If you're setting up on Linux, we recommend following the post-installation steps as well.
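
For example, on Linux the post-installation steps boil down to letting your user run Docker without sudo. The commands below are a minimal sketch of Docker's standard post-install guidance; see the linked guide for the authoritative steps.

sudo groupadd docker            # the group may already exist
sudo usermod -aG docker $USER   # allow running docker without sudo
newgrp docker                   # pick up the new group in the current shell
docker run hello-world          # verify the installation works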

3. Configure Modal (Recommended) or Use Local Docker (Beta)

modal setup  # Follow the prompts to generate your token

After running, verify your credentials in ~/.modal.toml:

token_id = <token id>
token_secret = <token secret>
active = true

Beta: Local Docker. No additional setup needed. Use the --use_local_docker flag when running evaluations.

Docker Images

We provide prebuilt Docker images for each instance on Docker Hub:

Repository: https://hub.docker.com/r/jefzda/sweap-images

Finding the Correct Image

Each instance in the HuggingFace dataset has a dockerhub_tag column containing the Docker tag for that instance. You can access it directly:

from datasets import load_dataset

dataset = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# Get the Docker image for a specific instance
for row in dataset:
    instance_id = row['instance_id']
    docker_tag = row['dockerhub_tag']
    full_image = f"jefzda/sweap-images:{docker_tag}"
    print(f"{instance_id} -> {full_image}")

Important: our images start bash by default. When running these images, do not invoke bash manually. See #6
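
As an illustration, pulling and starting one of the prebuilt images locally looks like the sketch below. The <docker_tag> placeholder is the value from the dockerhub_tag column; note that no bash command is appended, since the image already starts bash.

docker pull jefzda/sweap-images:<docker_tag>
docker run -it jefzda/sweap-images:<docker_tag>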

Usage

1. Generate Patches

Generate patch predictions using your harness of choice.

To generate patches with SWE-agent, see the SWE-agent git submodule (note: you will need to initialize it as a git submodule; see the official git documentation for details). The submodule contains detailed instructions to:

  • Set up SWE-agent for patch generation
  • Run SWE-agent on SWE-Bench Pro instances
  • Configure model parameters and turn limits

The output will be .pred files containing model-generated patches for each instance.
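
As a sketch, fetching the submodule uses standard git commands (the repository URL is shown here for illustration; the submodule path inside your checkout may differ):

git clone https://github.com/scaleapi/SWE-bench_Pro-os.git
cd SWE-bench_Pro-os
git submodule update --init --recursive   # fetch the SWE-agent scaffold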

2. Gather Patches

After generating patches, use the gather_patches.py helper script to collect all patches into a single JSON file for evaluation:

python helper_code/gather_patches.py \
    --directory <path_to_pred_files> \
    --prefix <model_name> \
    --output <output_file>.json

Parameters:

  • --directory: Directory containing instance folders with .pred files (e.g., from SWE-agent output or downloaded trajectories)
  • --prefix: Prefix identifier for your model/run (e.g., "gpt4", "claude-sonnet", "sample1")
  • --output: Output JSON file path

Example:

python helper_code/gather_patches.py \
    --directory swe_bench_pro_results/sample1 \
    --prefix sample1 \
    --output sample1_patches.json

This will create a JSON file in the format expected by the evaluation script:

[
  {
    "instance_id": "instance_...",
    "patch": "diff --git ...",
    "prefix": "sample1"
  }
]

3. Evaluate Patches

Evaluate patch predictions on SWE-Bench Pro:

python swe_bench_pro_eval.py \
    --raw_sample_path=swe_bench_pro_full.csv \
    --patch_path=<your_patches>.json \
    --output_dir=<output_directory> \
    --scripts_dir=run_scripts \
    --num_workers=100 \
    --dockerhub_username=jefzda

You can test with the gold patches, which are included in the HuggingFace dataset. There is a helper script in helper_code that extracts the gold patches into the required JSON format.
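
If you prefer not to use the helper script, the sketch below shows one way to build that JSON directly from the HuggingFace dataset. It assumes the gold patch is stored in a column named 'patch'; check the dataset schema (or use the helper script) if the column name differs.

import json
from datasets import load_dataset

dataset = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# Collect gold patches in the format expected by swe_bench_pro_eval.py
gold = [
    {
        "instance_id": row["instance_id"],
        "patch": row["patch"],  # assumed column name for the gold patch
        "prefix": "gold",
    }
    for row in dataset
]

with open("gold_patches.json", "w") as f:
    json.dump(gold, f, indent=2)

The resulting gold_patches.json can then be passed to the evaluation script via --patch_path.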

Reproducing Leaderboard Results

To reproduce leaderboard results end-to-end, follow these steps:

  1. Complete setup in the SWE-agent submodule. We recommend using the Docker image to run the scaffold, via just.
  2. Run the scaffold. We have included an example for Claude Sonnet 4.5 (claude.yaml) but feel free to use any model. It also supports vllm for local models. Note that we recommend using the DockerHub images rather than building the Docker images from scratch. You can also execute it locally without Modal.
  3. Compile predictions with helper_code/gather_patches.py.
  4. Run the evaluation script swe_bench_pro_eval.py (a consolidated sketch of all four steps follows below).
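
Put together, the four steps look roughly like the sketch below. The scaffold invocation in step 2 is left to the submodule's guide, and the file names here (sonnet45_patches.json, eval_output) are only examples.

# 1. Initialize the SWE-agent submodule
git submodule update --init --recursive

# 2. Run the scaffold (e.g., with the claude.yaml example config);
#    follow the submodule's step-by-step guide for the exact command.

# 3. Compile predictions into a single JSON file
python helper_code/gather_patches.py \
    --directory <path_to_pred_files> \
    --prefix sonnet45 \
    --output sonnet45_patches.json

# 4. Evaluate the compiled patches
python swe_bench_pro_eval.py \
    --raw_sample_path=swe_bench_pro_full.csv \
    --patch_path=sonnet45_patches.json \
    --output_dir=eval_output \
    --scripts_dir=run_scripts \
    --num_workers=100 \
    --dockerhub_username=jefzda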
