
SWE‐bench

Christopher David edited this page Mar 17, 2024 · 19 revisions

Background

Objective

This is what we want to beat: Devin's reported 13.86% issue-resolution rate on SWE-bench.

Strategy

  • Analyze Devin's results repo
    • What are the patterns of success and failure?
    • What can we learn from their approach?
      • Agentic loop with GPT-4, RAG, and ___?
    • What architecture/algorithms are needed for success, and how best to build them?
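The "agentic loop" hypothesized above can be sketched as a plan/act/observe cycle. This is a hypothetical illustration, not Devin's actual architecture: `model`, `retrieve_context`, and `run_tests` are stand-ins for an LLM call, a RAG retrieval step, and a test harness.

```python
from typing import Callable, Optional

def agentic_loop(
    issue: str,
    model: Callable[[str], str],             # LLM call: prompt -> proposed patch
    retrieve_context: Callable[[str], str],  # RAG step: issue -> relevant code
    run_tests: Callable[[str], bool],        # harness: patch -> pass/fail
    max_iters: int = 5,
) -> Optional[str]:
    """Iteratively propose patches until the tests pass (hypothetical sketch)."""
    feedback = ""
    for _ in range(max_iters):
        context = retrieve_context(issue)
        prompt = f"Issue:\n{issue}\n\nContext:\n{context}\n{feedback}"
        patch = model(prompt)
        if run_tests(patch):
            return patch  # success: a patch the test suite accepts
        # Feed the failure back into the next prompt
        feedback = f"\nPrevious attempt failed tests:\n{patch}"
    return None  # gave up after max_iters
```

The open question from the list above is what fills in the blank: what retrieval, planning, and execution machinery sits around this loop.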

Implementation

  • Set up the benchmark
  • Run a basic agent to get an initial baseline
  • Iterate until we solve an issue Devin didn't
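The last step reduces to comparing two sets of resolved SWE-bench instance IDs: ours versus the ones in Devin's results repo. A minimal sketch (the instance IDs in the usage example are illustrative, not taken from either results set):

```python
def compare_results(ours: set, devins: set, total: int) -> dict:
    """Summarize overlap between two sets of resolved SWE-bench instance IDs."""
    return {
        "our_rate": len(ours) / total,
        "devin_rate": len(devins) / total,
        "both": ours & devins,
        "only_ours": ours - devins,    # the goal: non-empty means progress
        "only_devin": devins - ours,   # targets to study next
    }

# Illustrative instance IDs:
devin = {"django__django-11099", "sympy__sympy-13480"}
us = {"django__django-11099", "astropy__astropy-6938"}
summary = compare_results(us, devin, total=100)
```

Here `summary["only_ours"]` holds the instances only our agent resolved; the milestone is hit as soon as it is non-empty.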

Scratchpad

```shell
python run_api.py \
    --dataset_name_or_path princeton-nlp/SWE-bench \
    --model_name_or_path gpt-4-0613 \
    --output_dir results \
    --model_args "temperature=0.7,top_p=0.95" \
    --max_cost 200
```
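The inference run writes model predictions that SWE-bench's separate evaluation harness then scores. As a quick sanity check before evaluation, the predictions file can be summarized like this; the JSONL field names (`instance_id`, `model_patch`) follow the SWE-bench predictions format, but verify them against the version of the repo in use.

```python
import json

def summarize_predictions(path: str) -> dict:
    """Count total predictions and how many contain a non-empty patch."""
    total = empty = 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            pred = json.loads(line)
            total += 1
            if not pred.get("model_patch", "").strip():
                empty += 1  # model produced no patch for this instance
    return {"total": total, "empty_patches": empty}
```

A high `empty_patches` count usually points to prompt or context-length problems worth fixing before paying for a full evaluation run.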