PrimeIntellect-ai/experiments-autonomous-speedrunning

Prime Intellect

Autonomous Speedrunning Experiment

Raw archive of autonomous agents (Claude Code / Opus 4.7 and Codex / GPT 5.5) competing on the track_3_optimization benchmark from modded-nanogpt: reach validation loss 3.28 in as few training steps as possible. Only the optimizer, schedules, initialization, and a small set of hyperparameters can change.

Blog post: TODO: link

The reference Muon optimizer reaches the target in 3500 steps. The best public record at the start was 3225 steps. By v2, both agents pass 3035 steps. By v3, Claude reaches 2930 steps and Codex 2950.

Record configs from these runs are submitted as PRs against the modded-nanogpt fork.

What's in here

Harness files the agents were given, the plans and threads they wrote, ~10k run logs, and all generated variants. Everything the blog references lives here.

| Folder | What it is |
| --- | --- |
| v1/ | First wave. Beat the Muon baseline at 3500 steps. |
| novelty/ | Novelty-constrained wave. Every idea must pass a novelty check (not just a known method or hyperparameter tweak). |
| v2/ | Second wave, starting from the best v1/novelty results. Pushing toward 3000. |
| v3/ | Third wave, starting from v2 frontier and public PRs. Under-3000 and under-2900 search. |

Each wave:

<wave>/
  claude-code/
    AGENTS.md
    goal.md
    plan.md
    scratchpad/
  codex/
    AGENTS.md
    goal.md
    plan.md
    scratchpad/

The two agents ran independently and pursued different strategies, so the two folders are not duplicates. Compare each agent's plan.md with the THREAD.md, runs.jsonl, and variants/ inside its scratchpad/.

Scratchpad contents

  • THREAD.md: chronological event log and reasoning trail
  • runs.jsonl: run ledger
  • runs/*.log: training logs
  • variants/*.py: generated training scripts and candidate recipes
  • sbatch-stubs/*.sh (or top-level run_*.sh): launch scripts
  • sweeps/<name>/: grouped hyperparameter sweeps
  • ideas/*.md, papers/*.md, picklist.md, audits.md: literature notes, idea writeups, novelty checks, candidate triage
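To get a quick sense of how much material one scratchpad holds, the layout above can be turned into a set of glob patterns and counted. This is a minimal sketch: the patterns come from the list above, while the function name `inventory` and the label names are illustrative and not part of the repository.

```python
from pathlib import Path

# Glob patterns taken from the scratchpad listing; labels are illustrative.
SCRATCHPAD_GLOBS = {
    "logs": "runs/*.log",
    "variants": "variants/*.py",
    "launchers": "sbatch-stubs/*.sh",
    "top_level_launchers": "run_*.sh",
    "sweeps": "sweeps/*",
    "ideas": "ideas/*.md",
    "papers": "papers/*.md",
}


def inventory(scratchpad):
    """Count the entries matching each pattern under one scratchpad directory."""
    root = Path(scratchpad)
    return {
        label: len(list(root.glob(pattern)))
        for label, pattern in SCRATCHPAD_GLOBS.items()
    }
```

For example, `inventory("v3/codex/scratchpad")` would return one count per label. A scratchpad that uses top-level `run_*.sh` launchers instead of `sbatch-stubs/` will simply show 0 for `launchers`.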

Anatomy of a run

A "run" is one launch of one candidate. Inside a scratchpad it lives across four files:

  1. variants/<name>.py — the candidate training script the agent generated.
  2. sbatch-stubs/<name>.sh (or run_*.sh) — the launcher that submitted it.
  3. runs/<name>.log — the training log produced.
  4. A row in runs.jsonl — parsed metrics (step_to_3_28, final_val_loss, total_steps, optimizer/HP fields) plus the path back to the source record.

Match a runs.jsonl row to its log and variant by name/uuid.
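The matching step can be sketched as below. It assumes that each ledger row exposes the run name under a key such as `name`; that key is a guess, so check the actual rows (or use the uuid field) before relying on it. The helper names `load_ledger` and `locate_run_files` are illustrative.

```python
import json
from pathlib import Path


def load_ledger(path):
    """Read a runs.jsonl ledger into a list of dicts, skipping blank lines."""
    rows = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def locate_run_files(scratchpad, name):
    """Map a run name to its variant, launcher, and log paths.

    A value is None when the expected file does not exist, for example
    when the wave used top-level run_*.sh launchers instead of sbatch-stubs/.
    """
    root = Path(scratchpad)
    candidates = {
        "variant": root / "variants" / f"{name}.py",
        "launcher": root / "sbatch-stubs" / f"{name}.sh",
        "log": root / "runs" / f"{name}.log",
    }
    return {kind: (p if p.exists() else None) for kind, p in candidates.items()}
```

A typical use would be to load `scratchpad/runs.jsonl`, pick a row of interest, and pass its name to `locate_run_files` to open the log and the script side by side.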

Aggregated run export

data/runs_self_contained/ is the flat, cross-wave view of every run, useful if you want to filter all ~10k runs without walking individual scratchpads.

Top-level files:

  • manifest.json — export policy, counts, and schema fields.
  • runs.jsonl — one JSON object per run.
  • runs.csv — flat table for quick filtering.
  • dropped_runs.jsonl — inventory rows omitted from the export.
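As an example of filtering the flat export, the sketch below scans a runs.jsonl file and keeps, for each agent, the run that reached the target in the fewest steps. The `step_to_3_28` field name appears in this README; the `agent` key and the convention that runs which never reached 3.28 carry a null value are assumptions, so consult manifest.json for the actual schema.

```python
import json


def best_per_agent(jsonl_path, agent_key="agent", metric="step_to_3_28"):
    """Return {agent: row} with the lowest non-null metric for each agent."""
    best = {}
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            row = json.loads(line)
            steps = row.get(metric)
            if steps is None:
                # Assumed convention: the run never reached the target loss.
                continue
            agent = row.get(agent_key)
            if agent not in best or steps < best[agent][metric]:
                best[agent] = row
    return best
```

Calling `best_per_agent("data/runs_self_contained/runs.jsonl")` would then give one row per agent label, which can be traced to its per-run directory through the export id.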

Per-run files under agents/<agent>/runs/<export_id>/:

  • metadata.json — structured metadata: final_val_loss, min_val_loss, final_step, train_steps, step_to_3_28, num_val_points, train_time_s, step_avg_ms.
  • train.log — copied training log.
  • launched_script.py — copied train/config script (present when resolvable).
  • source_snapshot.py — exact logged source snapshot (present when emitted).
  • console.log / launch_stub.sh — console log and sbatch launcher when available.
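Because several per-run files are only present when they could be resolved, it can help to check which ones exist before reading them. The following sketch walks the per-run directories, loads each metadata.json, and records which optional files are present. It assumes the `agents/` tree sits directly under the export root passed in; `summarize_export` is an illustrative name.

```python
import json
from pathlib import Path

# Files the README describes as present only when available.
OPTIONAL_FILES = (
    "launched_script.py",
    "source_snapshot.py",
    "console.log",
    "launch_stub.sh",
)


def summarize_export(export_root):
    """Collect one summary dict per agents/<agent>/runs/<export_id>/ directory."""
    summaries = []
    pattern = "agents/*/runs/*/metadata.json"
    for meta_path in sorted(Path(export_root).glob(pattern)):
        run_dir = meta_path.parent
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        summaries.append({
            "agent": run_dir.parent.parent.name,  # agents/<agent>/runs/<id>
            "export_id": run_dir.name,
            "final_val_loss": meta.get("final_val_loss"),
            "step_to_3_28": meta.get("step_to_3_28"),
            "has_train_log": (run_dir / "train.log").exists(),
            "optional_present": [
                name for name in OPTIONAL_FILES if (run_dir / name).exists()
            ],
        })
    return summaries
```

Runs whose `optional_present` list lacks `source_snapshot.py` can still be inspected through `launched_script.py` when that file was resolvable.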

Counts: 10,428 runs exported from 10,485 inventory rows; the other 57 were dropped for a missing config_path.

| Agent | Runs |
| --- | --- |
| cc_v1 | 605 |
| codex_v1 | 2,165 |
| cc_novelty | 81 |
| codex_novelty | 254 |
| cc_v2 | 459 |
| codex_v2 | 2,729 |
| cc_v3 | 1,059 |
| codex_v3 | 3,076 |
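The per-agent counts can be cross-checked against the stated totals with a few lines of arithmetic. The figures below are copied from the table; reading the `cc_` prefix as Claude Code is an assumption based on the folder names.

```python
# Exported run counts per agent label, as listed in the table.
RUNS_PER_AGENT = {
    "cc_v1": 605,
    "codex_v1": 2165,
    "cc_novelty": 81,
    "codex_novelty": 254,
    "cc_v2": 459,
    "codex_v2": 2729,
    "cc_v3": 1059,
    "codex_v3": 3076,
}

INVENTORY_ROWS = 10_485
DROPPED = 57

total_exported = sum(RUNS_PER_AGENT.values())

# Group by the prefix before the first underscore ("cc" or "codex").
totals_by_tool = {}
for label, count in RUNS_PER_AGENT.items():
    tool = label.split("_")[0]
    totals_by_tool[tool] = totals_by_tool.get(tool, 0) + count

print(total_exported)            # 10428
print(INVENTORY_ROWS - DROPPED)  # 10428
print(totals_by_tool)            # {'cc': 2204, 'codex': 8224}
```

The table sums to the 10,428 exported runs, which also equals the 10,485 inventory rows minus the 57 dropped ones.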
