SWE-bench
SWE-bench is a benchmark suite for real-world software engineering: each task asks a model to resolve an actual GitHub issue from a popular open-source repository.
- Paper: "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?"
Cognition Labs' Devin just claimed a 13.86% success rate.
- It isn't "official" until it's on the swebench.com leaderboard, but we'll assume this is SOTA and the number to beat.
- Blog: Devin SWE-bench summary
- GitHub: Devin SWE-bench results & evaluation harness
This is what we want to beat:
- Analyze Devin's results repo (see the sketch after this list)
  - What are the patterns of success and failure?
  - What can we learn from their approach?
    - An agentic loop with GPT-4, RAG, and ___?
  - What architecture/algorithms are needed for success, and how best to build them?
- Set up the benchmark
- Run a basic agent to get an initial baseline
- Iterate until we solve an issue Devin didn't
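A first-pass tally for the analysis step, assuming a hypothetical results layout in which each attempted instance's patch is stored as a file under a pass/ or fail/ directory (adjust the paths and file naming to however the repo actually organizes outcomes); the `org__repo-number` instance-id format is the real SWE-bench convention:

```python
from collections import Counter
from pathlib import Path

# Hypothetical layout: one patch file per attempted instance, grouped
# under pass/ and fail/ -- adjust to the actual results repo structure.
RESULTS = Path("devin-swebench-results")

def repo_of(instance_id: str) -> str:
    # Real SWE-bench convention: instance ids look like "django__django-12345"
    return instance_id.rsplit("-", 1)[0]

tallies: dict[str, Counter] = {}
for outcome in ("pass", "fail"):
    for patch in (RESULTS / outcome).glob("*.patch"):  # hypothetical naming
        tallies.setdefault(repo_of(patch.stem), Counter())[outcome] += 1

# Per-repo resolution rates surface where Devin succeeds and fails.
for repo, counts in sorted(tallies.items()):
    total = counts["pass"] + counts["fail"]
    print(f"{repo}: {counts['pass']}/{total} resolved")
```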
For the initial baseline, the SWE-bench harness's inference script can be run directly against an API model; `--model_args` passes decoding parameters through to the model, and `--max_cost` caps total API spend:

```bash
python run_api.py \
    --dataset_name_or_path princeton-nlp/SWE-bench \
    --model_name_or_path gpt-4-0613 \
    --output_dir results \
    --model_args "temperature=0.7,top_p=0.95" \
    --max_cost 200
```
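This writes a predictions file (one generated patch per task instance) into `results/`. Scoring is a separate step: the repo's evaluation harness applies each predicted patch and replays the instance's tests, which is also how we'd verify a solve on any issue Devin missed. See the SWE-bench README for the evaluation harness's current invocation.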