A unified benchmarking framework for systematically evaluating training techniques, architectural modifications, and optimization strategies for large language model (LLM) pretraining.
The Transformer ecosystem has accumulated a large number of training "tricks" — spanning optimizers (e.g., Muon, SOAP), learning rate and batch schedules, architectural variants (e.g., SwiGLU, value embeddings, U-Net skips, XSA), initialization schemes, data ordering strategies, and more. These techniques originate from diverse sources such as the NanoGPT Speedrun, Parameter Golf, and Slowrun competitions, yet are rarely compared under identical, controlled conditions.
This repository provides a standardized experimental harness that isolates the effect of each trick by holding everything else constant — same model skeleton, same data pipeline, same tokenizer, same hardware budget — so that observed gains can be attributed with confidence.
Every trick is evaluated under two complementary ablation regimes:
| Regime | Control Variable | What It Measures |
|---|---|---|
| Fixed Compute | Wall-clock time (= approximate FLOPs) | Efficiency: how much validation loss improves within a fixed compute budget. |
| Fixed Tokens | Total training tokens | Sample efficiency: how much validation loss improves given the same amount of data. |
By reporting results on both axes, we disentangle tricks that merely redistribute compute (e.g., trading depth for width) from those that yield genuine algorithmic improvements.
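In practice, the two regimes differ only in when a run is stopped. As a minimal illustration (the function and field names below mirror the manifest fields described later, but are otherwise hypothetical and not the harness's actual API):

```python
import time

def should_stop(control: dict, tokens_seen: int, start_time: float) -> bool:
    """Illustrative stopping rule for the two ablation regimes."""
    if control["mode"] == "fixed_compute":
        # Fixed Compute: stop once the wall-clock budget is exhausted,
        # regardless of how many tokens were consumed.
        return time.time() - start_time >= control["target_wallclock_seconds"]
    if control["mode"] == "fixed_tokens":
        # Fixed Tokens: stop once the token budget is reached,
        # regardless of how long it took.
        return tokens_seen >= control["target_train_tokens"]
    raise ValueError(f"unknown control mode: {control['mode']}")
```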
```text
├── exp/                           # Experiment directory (one sub-folder per trick)
│   ├── baseline-sp1024/           # Reference baseline
│   │   ├── train_gpt.py           # Trainer script
│   │   └── baseline-sp1024.json   # Experiment manifest
│   ├── seq-2048/                  # Seq-length ablation (same tokenizer/data, seq_len=2048)
│   │   ├── train_gpt.py
│   │   └── seq-2048.json
│   ├── seq-4096/                  # Seq-length ablation (same tokenizer/data, seq_len=4096)
│   │   ├── train_gpt.py
│   │   └── seq-4096.json
│   ├── muon/                      # Example trick: Muon optimizer
│   └── run_experiments.py         # Unified experiment scheduler
├── data/                          # Data pipeline & tokenizer assets
├── docs/                          # Design documentation
├── TRICK_SUMMARY_TRAIN_ARCH.md    # Curated catalog of known tricks
└── pyproject.toml
```
Taking `my_trick` as an example:

1. Create the method directory and prepare a trainer script

```bash
mkdir -p exp/my_trick
cp exp/baseline-sp1024/train_gpt.py exp/my_trick/train_gpt.py
```

Edit `exp/my_trick/train_gpt.py` and annotate every modification with:

```python
# trick: <brief description of the change>
```

Keep everything else identical to the baseline to ensure a fair comparison.
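As an illustration, if the trick were to swap the MLP activation for SwiGLU (one of the variants listed above), the annotated change in `exp/my_trick/train_gpt.py` might look like the sketch below. The class and attribute names are hypothetical and should be adapted to whatever the baseline trainer actually uses:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        # trick: replace the baseline GELU MLP with a SwiGLU block
        # (two up-projections combined by an element-wise SiLU gate)
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # trick: SwiGLU activation in place of the baseline activation
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```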
Each experiment directory should include a `README.md` that covers:
- Method overview: what the trick changes and why.
- Impact on training: how it affects throughput, memory, convergence, etc.
- BPB analysis: comparison with the baseline under both evaluation regimes, and discussion of possible factors behind the observed BPB change (e.g., better gradient signal, longer context, reduced overhead).
See `exp/seq-2048/README.md` for a concrete example.
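If it helps to start from a blank page, a minimal skeleton following the three sections above could look like this (the table cells are placeholders, not real results):

```markdown
# my_trick

## Method overview
What the trick changes relative to the baseline trainer, and why.

## Impact on training
Throughput, memory footprint, convergence behavior, any new hyperparameters.

## BPB analysis
| Regime | Baseline BPB | my_trick BPB |
|---|---|---|
| Fixed Compute | ... | ... |
| Fixed Tokens  | ... | ... |

Discussion of likely factors behind the observed change
(e.g., better gradient signal, longer context, reduced overhead).
```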
2. Create an experiment manifest
```bash
cp exp/baseline-sp1024/baseline-sp1024.json exp/my_trick/my_trick.json
```

Update `trainer_path` and `name` in `exp/my_trick/my_trick.json`:

```json
{
  "version": 1,
  "trainer_path": "exp/my_trick/train_gpt.py",
  "experiments": [
    {
      "name": "my_trick-fixed_time_10min",
      "control": { "mode": "fixed_compute", "target_wallclock_seconds": 600 }
    },
    {
      "name": "my_trick-fixed_tokens_10b",
      "control": { "mode": "fixed_tokens", "target_train_tokens": 10000000000 }
    }
  ]
}
```

For pure seq-length ablations, keep `data_path`, `tokenizer_path`, and `vocab_size` unchanged, and only change `train_seq_len` in the copied manifest.
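For example, a seq-2048 ablation manifest would differ from the baseline only in `train_seq_len`. The sketch below assumes these fields live at the top level of the manifest; check `exp/baseline-sp1024/baseline-sp1024.json` for where `train_seq_len`, `data_path`, `tokenizer_path`, and `vocab_size` actually sit before editing:

```json
{
  "version": 1,
  "trainer_path": "exp/seq-2048/train_gpt.py",
  "train_seq_len": 2048,
  "experiments": [
    {
      "name": "seq-2048-fixed_tokens_10b",
      "control": { "mode": "fixed_tokens", "target_train_tokens": 10000000000 }
    }
  ]
}
```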
3. Dry-run to verify the configuration

```bash
uv run python exp/run_experiments.py exp/my_trick/my_trick.json --dry-run
```

4. Launch

```bash
uv run python exp/run_experiments.py exp/my_trick/my_trick.json
```

All experiments in the manifest are executed sequentially on 8×H100 GPUs.
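To produce a like-for-like comparison, run the reference baseline manifest through the same scheduler, so both runs emit results under identical conditions:

```bash
# Baseline first, then the new trick; both use the same scheduler,
# data pipeline, and hardware budget, so their results are directly comparable.
uv run python exp/run_experiments.py exp/baseline-sp1024/baseline-sp1024.json
uv run python exp/run_experiments.py exp/my_trick/my_trick.json
```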
Each experiment produces a `result.json` containing:

- Control: `mode`, target / actual tokens, and wall-clock seconds.
- Metrics: `final_val_bpb` (bits-per-byte on held-out validation).
- Model: architecture config and total parameter count.

Compare against the baseline `result.json` to quantify the gain.
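A minimal way to do that comparison is sketched below. It assumes each experiment writes its `result.json` into its own directory and that `final_val_bpb` sits under a `metrics` key; adjust the paths and key lookup to the actual output layout:

```python
import json
from pathlib import Path

def load_bpb(path: str) -> float:
    """Read final_val_bpb from a result.json (falls back to a flat layout)."""
    result = json.loads(Path(path).read_text())
    metrics = result.get("metrics", result)
    return metrics["final_val_bpb"]

# Hypothetical output locations; adjust to wherever run_experiments.py writes results.
baseline_bpb = load_bpb("exp/baseline-sp1024/result.json")
trick_bpb = load_bpb("exp/my_trick/result.json")

# A negative delta means the trick reached a lower (better) validation BPB
# than the baseline under the same control regime.
print(f"baseline: {baseline_bpb:.4f}  trick: {trick_bpb:.4f}  delta: {trick_bpb - baseline_bpb:+.4f}")
```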
See `TRICK_SUMMARY_TRAIN_ARCH.md` for a curated inventory of 40+ training and architecture tricks drawn from nanogpt-speedrun, parameter-golf, and slowrun, organized by category (optimizer, schedule, architecture, data, initialization, etc.).
This project builds upon the open-source training recipes and competitive results from the NanoGPT Speedrun, Parameter Golf, and Slowrun communities.