Raw archive of autonomous agents (Claude Code / Opus 4.7 and Codex / GPT 5.5) competing on the track_3_optimization benchmark from modded-nanogpt: reach validation loss 3.28 in as few training steps as possible. Only the optimizer, schedules, initialization, and a small set of hyperparameters can change.
Blog post: TODO: link
The reference Muon optimizer reaches the target in 3500 steps. The best public record at the start was 3225. By v2, both agents pass 3035. By v3, Claude reaches 2930 and Codex 2950.
Record configs from these runs are submitted as PRs against the modded-nanogpt fork.
This archive contains the harness files the agents were given, the plans and threads they wrote, ~10k run logs, and all generated variants. Everything the blog references lives here.

| Folder | What it is |
|---|---|
| `v1/` | First wave. Beat the Muon baseline at 3500 steps. |
| `novelty/` | Novelty-constrained wave. Every idea must pass a novelty check (not just a known method or hyperparameter tweak). |
| `v2/` | Second wave, starting from the best v1/novelty results. Pushing toward 3000. |
| `v3/` | Third wave, starting from v2 frontier and public PRs. Under-3000 and under-2900 search. |

Each wave:

    <wave>/
      claude-code/
        AGENTS.md
        goal.md
        plan.md
        scratchpad/
      codex/
        AGENTS.md
        goal.md
        plan.md
        scratchpad/

The two agents ran independently and pursued different strategies. Compare their plan.md, THREAD.md, runs.jsonl, and variants/ rather than assuming the folders are duplicates.
- `THREAD.md`: chronological event log and reasoning trail
- `runs.jsonl`: run ledger
- `runs/*.log`: training logs
- `variants/*.py`: generated training scripts and candidate recipes
- `sbatch-stubs/*.sh` (or top-level `run_*.sh`): launch scripts
- `sweeps/<name>/`: grouped hyperparameter sweeps
- `ideas/*.md`, `papers/*.md`, `picklist.md`, `audits.md`: literature notes, idea writeups, novelty checks, candidate triage

A "run" is one launch of one candidate. Inside a scratchpad it lives across four files:
- `variants/<name>.py`: the candidate training script the agent generated.
- `sbatch-stubs/<name>.sh` (or `run_*.sh`): the launcher that submitted it.
- `runs/<name>.log`: the training log produced.
- A row in `runs.jsonl`: parsed metrics (`step_to_3_28`, `final_val_loss`, `total_steps`, optimizer/HP fields) plus the path back to the source record.

Match a runs.jsonl row to its log and variant by name/uuid.
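
A minimal sketch of that lookup, assuming each row carries a `name` field whose value matches the file stems under `variants/` and `runs/`. The key name is hypothetical (the real ledger may use a uuid field instead), so inspect a row before relying on it. `load_rows` and `locate_run_files` are illustrative helper names, not part of the archive.

```python
import json
from pathlib import Path


def load_rows(ledger_path):
    """Read a JSONL ledger into a list of dicts, skipping blank lines."""
    rows = []
    with open(ledger_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                rows.append(json.loads(line))
    return rows


def locate_run_files(scratchpad, row, key="name"):
    """Return paths for a row's variant and log; None where a file is missing.

    `key` is an assumed field name, not confirmed by the archive schema.
    """
    scratchpad = Path(scratchpad)
    stem = row.get(key)
    if stem is None:
        return {"variant": None, "log": None}
    candidates = {
        "variant": scratchpad / "variants" / f"{stem}.py",
        "log": scratchpad / "runs" / f"{stem}.log",
    }
    # Keep only paths that actually exist on disk.
    return {kind: (path if path.is_file() else None)
            for kind, path in candidates.items()}
```

For example, `locate_run_files("v3/codex/scratchpad", row)` would return the two paths for one ledger row, with `None` marking any file that was not found.
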
data/runs_self_contained/ is the flat, cross-wave view of every run, useful if you want to filter all ~10k runs without walking individual scratchpads.
Top-level files:
- `manifest.json`: export policy, counts, and schema fields.
- `runs.jsonl`: one JSON object per run.
- `runs.csv`: flat table for quick filtering.
- `dropped_runs.jsonl`: inventory rows omitted from the export.

Per-run files under agents/<agent>/runs/<export_id>/:
- `metadata.json`: structured metadata (`final_val_loss`, `min_val_loss`, `final_step`, `train_steps`, `step_to_3_28`, `num_val_points`, `train_time_s`, `step_avg_ms`).
- `train.log`: copied training log.
- `launched_script.py`: copied train/config script (present when resolvable).
- `source_snapshot.py`: exact logged source snapshot (present when emitted).
- `console.log` / `launch_stub.sh`: console log and sbatch launcher when available.

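
The per-run layout can also be scanned directly. A hedged sketch, assuming `step_to_3_28` is null or absent for runs that never reached the target (an assumption about the export, not something documented here). `fastest_runs` is a hypothetical helper name.

```python
import json
from pathlib import Path


def fastest_runs(export_root, limit=5):
    """Rank runs by step_to_3_28 using agents/<agent>/runs/<export_id>/metadata.json."""
    results = []
    for meta_path in Path(export_root).glob("agents/*/runs/*/metadata.json"):
        with open(meta_path, encoding="utf-8") as fh:
            meta = json.load(fh)
        steps = meta.get("step_to_3_28")
        if steps is None:  # assumed to mean the run never reached 3.28
            continue
        # Path ends with .../agents/<agent>/runs/<export_id>/metadata.json
        agent, export_id = meta_path.parts[-4], meta_path.parts[-2]
        results.append((steps, agent, export_id))
    results.sort()  # fewest steps first
    return results[:limit]
```

Calling `fastest_runs("data/runs_self_contained")` would then list the `(steps, agent, export_id)` tuples with the fewest steps to target, assuming that directory is the export root.
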
Counts: 10,428 runs exported (57 dropped for missing config_path from 10,485 inventory rows).

| agent | runs |
|---|---|
| cc_v1 | 605 |
| codex_v1 | 2,165 |
| cc_novelty | 81 |
| codex_novelty | 254 |
| cc_v2 | 459 |
| codex_v2 | 2,729 |
| cc_v3 | 1,059 |
| codex_v3 | 3,076 |