GitHub - PrimeIntellect-ai/experiments-autonomous-speedrunning: autonomous nanogpt optimizer speedrun

Prime Intellect

Autonomous Speedrunning Experiment

Raw archive of autonomous agents (Claude Code / Opus 4.7 and Codex / GPT 5.5) competing on the track_3_optimization benchmark from modded-nanogpt: reach validation loss 3.28 in as few training steps as possible. Only the optimizer, schedules, initialization, and a small set of hyperparameters can change.

Blog post: TODO: link

The reference Muon optimizer reaches the target in 3500 steps. The best public record at the start was 3225 steps. By v2, both agents pass 3035. By v3, Claude reaches 2930 and Codex 2950.
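To put these step counts in relative terms, here is a small sketch that computes the fractional reduction in training steps for the v3 results, against both the Muon reference and the prior public record. The helper name `step_reduction` and the constant names are illustrative and do not come from the repository; only the four step counts are taken from the text above.

```python
def step_reduction(baseline_steps, new_steps):
    """Fractional reduction in steps relative to a baseline (0.10 means 10% fewer)."""
    return (baseline_steps - new_steps) / baseline_steps


MUON_BASELINE = 3500  # reference Muon optimizer, from the README
PUBLIC_RECORD = 3225  # best public record at the start, from the README
V3_RESULTS = {"Claude": 2930, "Codex": 2950}  # v3 step counts, from the README

for agent, steps in V3_RESULTS.items():
    vs_muon = step_reduction(MUON_BASELINE, steps)
    vs_record = step_reduction(PUBLIC_RECORD, steps)
    print(f"{agent}: {vs_muon:.1%} fewer steps than Muon, "
          f"{vs_record:.1%} fewer than the prior public record")
```

Under these numbers, the v3 results use roughly 16% fewer steps than the Muon reference and roughly 9% fewer than the prior public record.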

Record configs from these runs are submitted as PRs against the modded-nanogpt fork.

What's in here

Harness files the agents were given, the plans and threads they wrote, ~10k run logs, and all generated variants. Everything the blog references lives here.

| Folder | What it is |
| --- | --- |
| `v1/` | First wave. Beat the Muon baseline at 3500 steps. |
| `novelty/` | Novelty-constrained wave. Every idea must pass a novelty check (not just a known method or hyperparameter tweak). |
| `v2/` | Second wave, starting from the best v1/novelty results. Pushing toward 3000. |
| `v3/` | Third wave, starting from v2 frontier and public PRs. Under-3000 and under-2900 search. |

Each wave:

<wave>/
  claude-code/
    AGENTS.md
    goal.md
    plan.md
    scratchpad/
  codex/
    AGENTS.md
    goal.md
    plan.md
    scratchpad/

The two agents ran independently and pursued different strategies. Compare their plan.md, THREAD.md, runs.jsonl, and variants/ rather than assuming the folders are duplicates.

Scratchpad contents

THREAD.md: chronological event log and reasoning trail
runs.jsonl: run ledger
runs/*.log: training logs
variants/*.py: generated training scripts and candidate recipes
sbatch-stubs/*.sh (or top-level run_*.sh): launch scripts
sweeps/<name>/: grouped hyperparameter sweeps
ideas/*.md, papers/*.md, picklist.md, audits.md: literature notes, idea writeups, novelty checks, candidate triage
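As a quick way to see what a given scratchpad holds, the sketch below counts files matching the patterns listed above. It assumes the patterns are relative to the scratchpad directory; the function name `scratchpad_inventory` and the dictionary keys are hypothetical. Not every scratchpad necessarily contains every item, so zero counts are expected in places.

```python
from pathlib import Path

# Glob patterns taken from the list above, relative to a scratchpad directory.
# The dictionary keys are illustrative labels, not names used by the repository.
PATTERNS = {
    "logs": "runs/*.log",
    "variants": "variants/*.py",
    "sbatch_stubs": "sbatch-stubs/*.sh",
    "run_scripts": "run_*.sh",
    "ideas": "ideas/*.md",
    "papers": "papers/*.md",
}


def scratchpad_inventory(scratchpad):
    """Count files per pattern and note whether THREAD.md and runs.jsonl exist."""
    scratchpad = Path(scratchpad)
    counts = {label: len(list(scratchpad.glob(pattern)))
              for label, pattern in PATTERNS.items()}
    counts["has_thread"] = (scratchpad / "THREAD.md").is_file()
    counts["has_ledger"] = (scratchpad / "runs.jsonl").is_file()
    return counts
```

For example, `scratchpad_inventory("v3/codex/scratchpad")` would return a dictionary of counts for that scratchpad, assuming the repository is checked out in the current directory.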

Anatomy of a run

A "run" is one launch of one candidate. Inside a scratchpad it lives across four files:

variants/<name>.py — the candidate training script the agent generated.
sbatch-stubs/<name>.sh (or run_*.sh) — the launcher that submitted it.
runs/<name>.log — the training log produced.
A row in runs.jsonl — parsed metrics (step_to_3_28, final_val_loss, total_steps, optimizer/HP fields) plus the path back to the source record.

Match a runs.jsonl row to its log and variant by name/uuid.
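The matching step can be sketched as follows. This is a minimal example that assumes each `runs.jsonl` row carries its run name under a key called `name`. That key is a guess, so check the actual rows and pass a different `name_key` if needed. The function names `load_ledger` and `resolve_run` are hypothetical.

```python
import json
from pathlib import Path


def load_ledger(scratchpad):
    """Read runs.jsonl (one JSON object per line) into a list of dicts."""
    rows = []
    with (Path(scratchpad) / "runs.jsonl").open() as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                rows.append(json.loads(line))
    return rows


def resolve_run(scratchpad, row, name_key="name"):
    """Map a ledger row to its variant, launcher, and log paths.

    `name_key` is an assumption about the ledger schema. A value of None
    means the expected file was not found under that name.
    """
    scratchpad = Path(scratchpad)
    name = row[name_key]
    candidates = {
        "variant": scratchpad / "variants" / f"{name}.py",
        "launcher": scratchpad / "sbatch-stubs" / f"{name}.sh",
        "log": scratchpad / "runs" / f"{name}.log",
    }
    return {kind: (path if path.is_file() else None)
            for kind, path in candidates.items()}
```

Launchers stored as top-level `run_*.sh` files will come back as `None` here, since this sketch only checks `sbatch-stubs/`.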

Aggregated run export

data/runs_self_contained/ is the flat, cross-wave view of every run, useful if you want to filter all ~10k runs without walking individual scratchpads.

Top-level files:

manifest.json — export policy, counts, and schema fields.
runs.jsonl — one JSON object per run.
runs.csv — flat table for quick filtering.
dropped_runs.jsonl — inventory rows omitted from the export.

Per-run files under agents/<agent>/runs/<export_id>/:

metadata.json — structured metadata: final_val_loss, min_val_loss, final_step, train_steps, step_to_3_28, num_val_points, train_time_s, step_avg_ms.
train.log — copied training log.
launched_script.py — copied train/config script (present when resolvable).
source_snapshot.py — exact logged source snapshot (present when emitted).
console.log / launch_stub.sh — console log and sbatch launcher when available.
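One way to use the per-run layout is to walk every `metadata.json` and keep the run with the lowest `step_to_3_28` for each agent. The sketch below assumes the `<agent>` directory names are labels such as `cc_v3`, and that `step_to_3_28` is missing or null for runs that never reached the target. Both are assumptions about the export, not documented behavior. The function name `best_runs` is hypothetical.

```python
import json
from pathlib import Path


def best_runs(export_root):
    """Return {agent: (export_id, step_to_3_28)} with the fewest steps per agent."""
    best = {}
    pattern = "agents/*/runs/*/metadata.json"
    for meta_path in Path(export_root).glob(pattern):
        export_id = meta_path.parent.name          # .../runs/<export_id>/
        agent = meta_path.parent.parent.parent.name  # .../agents/<agent>/
        meta = json.loads(meta_path.read_text())
        steps = meta.get("step_to_3_28")
        if steps is None:
            # Assumed to mean the run never reached validation loss 3.28.
            continue
        if agent not in best or steps < best[agent][1]:
            best[agent] = (export_id, steps)
    return best
```

Calling `best_runs("data/runs_self_contained")` from the repository root would then give one entry per agent label. For quick filtering without walking the tree, the top-level `runs.csv` is likely the simpler starting point.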

Counts: 10,428 runs exported (57 dropped for missing config_path from 10,485 inventory rows).

| agent | runs |
| --- | --- |
| `cc_v1` | 605 |
| `codex_v1` | 2,165 |
| `cc_novelty` | 81 |
| `codex_novelty` | 254 |
| `cc_v2` | 459 |
| `codex_v2` | 2,729 |
| `cc_v3` | 1,059 |
| `codex_v3` | 3,076 |
