Built on top of Karpathy's autoresearch, this fork adds the missing piece: the agent can actually see what happened during training, not just the final number. Providing the agent with the right context signifincalty improves efficiency of the autoresearch agent.
Comparison done on a single H100 using Claude Opus 4.6 within Claude Code.
| autoresearch | auto-log-research | |
|---|---|---|
| What the agent sees | Final val_bpb number | Full training curves, LR schedules, MFU, step times, cross-run comparisons |
| How it decides | Was the number lower? | Why was it lower? What do the dynamics tell me? |
| Experiment tracking | TSV with 5 columns | Structured archives with per-step metrics + agent research notes |
| Analysis | None -- agent flies blind | Automated baseline stats + agent-driven investigation with Python |
| Knowledge accumulation | Reset every run | insights.md (validated findings + current best tracking) + ideas_queue.md (prioritized hypotheses) |
| Research | Hyperparameter search | Hypothesis-driven experimentation |
The core insight: an agent that understands why a change helped will make better next moves than one that only knows whether it helped.
The agent runs in a loop. Each iteration:
- Re-read instructions -- re-read
program.mdto stay on track (context gets long) - Train (5 min) -- run the model on a single GPU, logging per-step metrics to
metrics.jsonl - Analyze -- automated stats: loss trajectory, convergence detection, cross-run comparison
- Investigate -- the agent writes Python to dig deeper: plots loss curves, compares trajectories across runs, checks where differences emerge
- Synthesize -- writes findings to the run's
analysis.md: what was surprising, what patterns emerge - Learn -- updates
insights.md(what works) andideas_queue.md(what to try next) - Decide -- compare against all-time best: keep only if it beats the historical best, otherwise revert to best commit
- Implement -- pick next idea from queue, modify
train.py, commit, repeat
Every step is tracked. Every decision has a paper trail. The agent builds up a knowledge base that gets sharper with each experiment.
Per training step (to metrics.jsonl):
- Training loss (raw + smoothed)
- Learning rate schedule, momentum, weight decay
- Step time, tokens/sec, MFU
Per run (to .auto-log-research/<commit>/):
run_summary.json-- all hyperparameters + final metricsanalysis.md-- automated stats + agent's investigation notesrun.log-- full training outputmetrics.jsonl-- per-step metrics*.png-- auto-generated and custom plots
Across runs:
results.tsv-- master experiment loginsights.md-- validated hypotheses + current best commit/val_bpb (agent curates this)ideas_queue.md-- prioritized experiment queue
Requirements: A single NVIDIA GPU, Python 3.10+, uv.
# 1. Install uv (if you don't have it)
curl -LsSf https://astral.sh/uv/install.sh | sh
# 2. Install dependencies
uv sync
# 3. Download data and train tokenizer (one-time, ~2 min)
uv run prepare.py
# 4. Run Claude Code agent
cat program.md | claude -p Spin up Claude Code (or your preferred agent) in this repo, then prompt:
Hi have a look at program.md and let's kick off a new experiment! let's do the setup first.
The agent reads program.md, sets up a branch, runs the baseline, and enters the autonomous experiment loop. It will keep going indefinitely -- leave it overnight and come back to a log of experiments, each with full analysis.
prepare.py -- constants, data prep + runtime utilities (do not modify)
train.py -- model, optimizer, training loop (agent modifies this)
analyze.py -- post-run analysis + bookkeeping (do not modify)
program.md -- agent instructions (human modifies this)
pyproject.toml -- dependencies
.auto-log-research/ -- experiment tracking (gitignored)
results.tsv master experiment log
insights.md agent's validated knowledge base + current best tracking
ideas_queue.md prioritized experiment queue
<commit>/ per-run archive
analysis.md baseline stats + agent investigation
run_summary.json config + metrics
run.log training output
metrics.jsonl per-step metrics
- Hypothesis-driven, not grid search. The agent has access to full training dynamics. It should use them to form and test hypotheses, not just enumerate hyperparameter combinations.
- Knowledge accumulates.
insights.mdandideas_queue.mdpersist across runs. The agent builds a growing understanding of what works for this specific model/data/compute setup. - Always build on the best. The agent tracks the all-time best commit and val_bpb. Failed experiments revert to the best, not just the previous run. This prevents drift away from good configurations.
- Everything is logged. Per-step metrics go to
metrics.jsonl.analyze.pycomputes baseline statistics. The agent adds deeper analysis. Nothing is lost. - Single file to modify. The agent only touches
train.py. Everything is fair game: architecture, optimizer, hyperparameters, training loop, batch size, model size. - Fixed time budget. Training always runs for exactly 5 minutes. This makes experiments directly comparable regardless of what the agent changes.
This project builds on karpathy/autoresearch, which is a simplified single-GPU implementation of nanochat. The training code, model architecture (GPT with Muon optimizer), and core experiment loop come from there. auto-log-research adds the experiment tracking, analysis pipeline, and agentic investigation layer.
MIT