Code and data for the following works:
- SWE-bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
- HuggingFace: https://huggingface.co/datasets/ScaleAI/SWE-bench_Pro
- Public Leaderboard: https://scale.com/leaderboard/swe_bench_pro_public
- Commercial (Private) Leaderboard: https://scale.com/leaderboard/swe_bench_pro_commercial
(2/9) We have removed some unit tests that were outdated (e.g., they required the year to be 2025) or were never intended to be included.
(1/7) We have fixed an issue where tutao instances took a long time to evaluate. The relevant run scripts have been updated.
(10/28) We added mini-swe-agent! Results are comparable to SWE-Agent for Sonnet 4.5. Feel free to give it a shot. (credit @miguelrc-scale)
(10/28) We have added the SWE-Agent scaffold to reproduce results, along with a step-by-step guide below. We have confirmed that this reproduces the Sonnet 4.5 results. (credit @18vijayb)
(10/3) We have updated results without cap limit here: https://scaleapi.github.io/SWE-bench_Pro-os/
SWE-Bench Pro is a challenging benchmark evaluating LLMs/Agents on long-horizon software engineering tasks. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
The dataset is inspired by SWE-Bench: https://github.com/SWE-bench/SWE-bench
To access SWE-bench Pro, copy and run the following code:
from datasets import load_dataset
swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
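As a quick sanity check, the sketch below prints the number of instances and the column names (the load is repeated so the snippet stands alone; beyond instance_id and dockerhub_tag, which are used later in this README, no particular columns are assumed):

```python
from datasets import load_dataset

swebench = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# Number of tasks and the columns each task provides.
print(len(swebench))
print(swebench.column_names)       # includes instance_id and dockerhub_tag
print(swebench[0]['instance_id'])  # first task's identifier
```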
Install the Python dependencies:

pip install -r requirements.txt

SWE-bench Pro uses Docker for reproducible evaluations. Follow the instructions in the Docker setup guide to install Docker on your machine. If you're setting up on Linux, we also recommend following the post-installation steps.
Evaluations run on Modal by default. Set up your Modal token:

modal setup  # Follow the prompts to generate your token

After running, verify your credentials in ~/.modal.toml:
token_id = <token id>
token_secret = <token secret>
active = true
Beta: Local Docker. No additional setup needed. Use the --use_local_docker flag when running evaluations.
We provide prebuilt Docker images for each instance on Docker Hub:
Repository: https://hub.docker.com/r/jefzda/sweap-images
Each instance in the HuggingFace dataset has a dockerhub_tag column containing the Docker tag for that instance. You can access it directly:
from datasets import load_dataset
dataset = load_dataset('ScaleAI/SWE-bench_Pro', split='test')
# Get the Docker image for each instance
for row in dataset:
    instance_id = row['instance_id']
    docker_tag = row['dockerhub_tag']
    full_image = f"jefzda/sweap-images:{docker_tag}"
    print(f"{instance_id} -> {full_image}")

Important: Bash runs by default in our images. When running these images, you should not manually invoke bash. See #6
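If you want to pre-fetch the instance images before starting an evaluation, here is a minimal sketch that shells out to the Docker CLI; it assumes docker is installed and on your PATH, and that you have enough disk space for the full image set:

```python
import subprocess

from datasets import load_dataset

dataset = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# Pull each instance image so evaluation runs don't block on downloads.
for row in dataset:
    image = f"jefzda/sweap-images:{row['dockerhub_tag']}"
    subprocess.run(["docker", "pull", image], check=True)
```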
Generate patch predictions using your harness of choice.
To generate patches with SWE-agent, see the SWE-agent git submodule (note: it is included as a git submodule, so initialize it with git submodule update --init --recursive; see the official git documentation for details). The submodule contains detailed instructions on how to:
- Set up SWE-agent for patch generation
- Run SWE-agent on SWE-Bench Pro instances
- Configure model parameters and turn limits
The output will be .pred files containing model-generated patches for each instance.
After generating patches, use the gather_patches.py helper script to collect all patches into a single JSON file for evaluation:
python helper_code/gather_patches.py \
--directory <path_to_pred_files> \
--prefix <model_name> \
--output <output_file>.json

Parameters:
- --directory: Directory containing instance folders with .pred files (e.g., from SWE-agent output or downloaded trajectories)
- --prefix: Prefix identifier for your model/run (e.g., "gpt4", "claude-sonnet", "sample1")
- --output: Output JSON file path
Example:
python helper_code/gather_patches.py \
--directory swe_bench_pro_results/sample1 \
--prefix sample1 \
--output sample1_patches.json

This will create a JSON file in the format expected by the evaluation script:
[
{
"instance_id": "instance_...",
"patch": "diff --git ...",
"prefix": "sample1"
}
]
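Before launching a long evaluation run, you may want to sanity-check the gathered file against this format. A minimal sketch, using a hypothetical file name:

```python
import json

# Hypothetical example path; point this at your gathered patches file.
with open("sample1_patches.json") as f:
    predictions = json.load(f)

for entry in predictions:
    # Each entry should carry an instance ID, a patch diff, and the run prefix.
    for key in ("instance_id", "patch", "prefix"):
        assert key in entry, f"entry is missing {key!r}: {entry}"
    assert entry["patch"].strip(), f"empty patch for {entry['instance_id']}"

print(f"{len(predictions)} predictions look well-formed")
```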
Evaluate patch predictions on SWE-Bench Pro:

python swe_bench_pro_eval.py \
--raw_sample_path=swe_bench_pro_full.csv \
--patch_path=<your_patches>.json \
--output_dir=<output_directory> \
--scripts_dir=run_scripts \
--num_workers=100 \
--dockerhub_username=jefzda

You can test with the gold patches, which are included in the HuggingFace dataset. There is a helper script in helper_code that extracts the gold patches into the required JSON format.
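If you prefer to assemble the gold-patch file yourself rather than use the helper script, a minimal sketch is shown below; it assumes the gold patch is stored in a column named patch, and "gold" is just an arbitrary prefix:

```python
import json

from datasets import load_dataset

dataset = load_dataset('ScaleAI/SWE-bench_Pro', split='test')

# Collect gold patches into the JSON format expected by swe_bench_pro_eval.py.
predictions = [
    {
        "instance_id": row["instance_id"],
        "patch": row["patch"],  # assumption: the gold patch column is named "patch"
        "prefix": "gold",       # arbitrary run identifier
    }
    for row in dataset
]

with open("gold_patches.json", "w") as f:
    json.dump(predictions, f, indent=2)
```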
To reproduce leaderboard results end-to-end, follow these steps:
- Complete setup in the SWE-agent submodule. We recommend using the Docker image to run the scaffold, via just.
- Run the scaffold. We have included an example for Claude Sonnet 4.5 (claude.yaml), but feel free to use any model. It also supports vllm for local models. Note that we recommend using the DockerHub images rather than building the Docker images from scratch. You can also execute it locally without Modal.
- Compile predictions with helper_code/gather_patches.py.
- Run the evaluation script swe_bench_pro_eval.py.