Bag of Tricks for Transformers

A unified benchmarking framework for systematically evaluating training techniques, architectural modifications, and optimization strategies for large language model (LLM) pretraining.

Motivation

The Transformer ecosystem has accumulated a large number of training "tricks" — spanning optimizers (e.g., Muon, SOAP), learning rate and batch schedules, architectural variants (e.g., SwiGLU, value embeddings, U-Net skips, XSA), initialization schemes, data ordering strategies, and more. These techniques originate from diverse sources such as the NanoGPT Speedrun, Parameter Golf, and Slowrun competitions, yet are rarely compared under identical, controlled conditions.

This repository provides a standardized experimental harness that isolates the effect of each trick by holding everything else constant — same model skeleton, same data pipeline, same tokenizer, same hardware budget — so that observed gains can be attributed with confidence.

Evaluation Protocol

Every trick is evaluated under two complementary ablation regimes:

Regime	Control Variable	What It Measures
Fixed Compute	Wall-clock time (a proxy for FLOPs on fixed hardware)	Efficiency: how much validation loss improves within a fixed compute budget.
Fixed Tokens	Total training tokens	Sample efficiency: how much validation loss improves given the same amount of data.

By reporting results on both axes, we disentangle tricks that merely redistribute compute (e.g., trading depth for width) from those that yield genuine algorithmic improvements.
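
As an illustration of how the two regimes differ, the following minimal sketch shows a stopping rule keyed on the control mode. The key names (`mode`, `target_wallclock_seconds`, `target_train_tokens`) mirror the manifest example in this README; the function `budget_exhausted` itself is hypothetical and is not taken from `run_experiments.py` or `train_gpt.py`.

```python
def budget_exhausted(control: dict, elapsed_seconds: float, tokens_seen: int) -> bool:
    """Return True once the run has used up its budget for the given regime.

    Hypothetical helper: the real trainer may enforce its budget differently.
    """
    mode = control["mode"]
    if mode == "fixed_compute":
        # Fixed Compute: stop on wall-clock time, regardless of tokens consumed.
        return elapsed_seconds >= control["target_wallclock_seconds"]
    if mode == "fixed_tokens":
        # Fixed Tokens: stop on tokens consumed, regardless of elapsed time.
        return tokens_seen >= control["target_train_tokens"]
    raise ValueError(f"unknown control mode: {mode!r}")
```

Under this reading, a trick that slows each step is penalized in the Fixed Compute regime (fewer tokens fit in the time budget) but not in the Fixed Tokens regime, which is why both numbers are reported.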

Repository Structure

├── exp/                        # Experiment directory (one sub-folder per trick)
│   ├── baseline-sp1024/        # Reference baseline
│   │   ├── train_gpt.py        # Trainer script
│   │   └── baseline-sp1024.json  # Experiment manifest
│   ├── seq-2048/               # Seq-length ablation (same tokenizer/data, seq_len=2048)
│   │   ├── train_gpt.py
│   │   └── seq-2048.json
│   ├── seq-4096/               # Seq-length ablation (same tokenizer/data, seq_len=4096)
│   │   ├── train_gpt.py
│   │   └── seq-4096.json
│   ├── muon/                   # Example trick: Muon optimizer
│   └── run_experiments.py      # Unified experiment scheduler
├── data/                       # Data pipeline & tokenizer assets
├── docs/                       # Design documentation
├── TRICK_SUMMARY_TRAIN_ARCH.md # Curated catalog of known tricks
└── pyproject.toml

Quick Start

Adding a New Trick

Taking my_trick as an example:

1. Create the method directory and prepare a trainer script

mkdir -p exp/my_trick
cp exp/baseline-sp1024/train_gpt.py exp/my_trick/train_gpt.py

Edit exp/my_trick/train_gpt.py and annotate every modification with:

# trick: <brief description of the change>

Keep everything else identical to the baseline to ensure a fair comparison.
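
The `# trick:` annotations make it possible to audit a trainer script mechanically. The sketch below lists every annotated line; the helper name `list_trick_annotations` is an assumption made for illustration and is not part of this repository.

```python
import re

# Matches "# trick: <description>" anywhere on a line, including trailing comments.
TRICK_PATTERN = re.compile(r"#\s*trick:\s*(.+)")


def list_trick_annotations(source: str) -> list[tuple[int, str]]:
    """Return (line_number, description) for each '# trick:' comment in source."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = TRICK_PATTERN.search(line)
        if match:
            found.append((lineno, match.group(1).strip()))
    return found
```

A possible use is `list_trick_annotations(open("exp/my_trick/train_gpt.py").read())`, which gives a quick summary of what differs from the baseline before launching a run.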

Each experiment directory should include a README.md that covers:

Method overview: what the trick changes and why.
Impact on training: how it affects throughput, memory, convergence, etc.
BPB analysis: comparison with the baseline under both evaluation regimes, and discussion of possible factors behind the observed BPB change (e.g., better gradient signal, longer context, reduced overhead).

See exp/seq-2048/README.md for a concrete example.

2. Create an experiment manifest

cp exp/baseline-sp1024/baseline-sp1024.json exp/my_trick/my_trick.json

Update trainer_path and each experiment's name in exp/my_trick/my_trick.json:

{
  "version": 1,
  "trainer_path": "exp/my_trick/train_gpt.py",
  "experiments": [
    {
      "name": "my_trick-fixed_time_10min",
      "control": { "mode": "fixed_compute", "target_wallclock_seconds": 600 }
    },
    {
      "name": "my_trick-fixed_tokens_10b",
      "control": { "mode": "fixed_tokens", "target_train_tokens": 10000000000 }
    }
  ]
}

For pure seq-length ablations, keep data_path, tokenizer_path, and vocab_size unchanged, and only change train_seq_len in the copied manifest.
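
Before launching, it can help to sanity-check a manifest by hand. The following sketch checks only the fields visible in the example above (`trainer_path`, `experiments`, `name`, `control`, and the per-mode target key). It is an assumption-based illustration, not the validation that `run_experiments.py` performs, and `validate_manifest` is a hypothetical name.

```python
# Target key that each control mode is expected to carry, per the example manifest.
REQUIRED_TARGET = {
    "fixed_compute": "target_wallclock_seconds",
    "fixed_tokens": "target_train_tokens",
}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if not manifest.get("trainer_path"):
        problems.append("missing trainer_path")
    seen_names = set()
    for i, exp in enumerate(manifest.get("experiments", [])):
        name = exp.get("name")
        if not name:
            problems.append(f"experiments[{i}]: missing name")
        elif name in seen_names:
            problems.append(f"experiments[{i}]: duplicate name {name!r}")
        seen_names.add(name)
        control = exp.get("control", {})
        mode = control.get("mode")
        target_key = REQUIRED_TARGET.get(mode)
        if target_key is None:
            problems.append(f"experiments[{i}]: unknown mode {mode!r}")
            continue
        target = control.get(target_key)
        if not isinstance(target, (int, float)) or target <= 0:
            problems.append(f"experiments[{i}]: {target_key} must be a positive number")
    return problems
```

Loading the file with `json.load` and passing the result to `validate_manifest` catches a forgotten rename or a mismatched mode and target before any GPU time is spent.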

3. Dry-run to verify the configuration

uv run python exp/run_experiments.py exp/my_trick/my_trick.json --dry-run

4. Launch

uv run python exp/run_experiments.py exp/my_trick/my_trick.json

All experiments in the manifest are executed sequentially on 8×H100 GPUs.

Output

Each experiment produces a result.json containing:

Control: mode, target / actual tokens and wall-clock seconds.
Metrics: final_val_bpb (bits-per-byte on held-out validation).
Model: architecture config and total parameter count.
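
For readers unfamiliar with the metric, bits-per-byte can be derived from the usual per-token cross-entropy. The sketch below assumes the loss is a natural-log cross-entropy averaged over tokens and that the token and byte counts refer to the same validation text; how this repository computes final_val_bpb is not shown here, so treat the function as illustrative.

```python
import math


def bits_per_byte(mean_loss_nats_per_token: float, num_tokens: int, num_bytes: int) -> float:
    """Convert mean per-token cross-entropy (in nats) into bits per byte of raw text."""
    # nats -> bits: divide by ln(2)
    bits_per_token = mean_loss_nats_per_token / math.log(2)
    # per-token -> per-byte: scale by the tokens-to-bytes ratio of the evaluated text
    return bits_per_token * num_tokens / num_bytes
```

Because the result is normalized by bytes rather than tokens, BPB remains comparable across runs that tokenize the same text differently.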

Compare against the baseline result.json to quantify the gain.
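
A comparison along these lines could be scripted as follows. The nesting assumed here (`control.mode` and `metrics.final_val_bpb`) is a guess based on the field groups listed above, not a documented schema, so adjust the keys to match the actual result.json layout. `bpb_gain` is a hypothetical helper.

```python
def bpb_gain(baseline: dict, trick: dict) -> dict:
    """Compare two result.json payloads that were produced under the same regime."""
    base_mode = baseline["control"]["mode"]
    trick_mode = trick["control"]["mode"]
    if base_mode != trick_mode:
        # Comparing a fixed_compute run to a fixed_tokens run would be misleading.
        raise ValueError(f"regime mismatch: {base_mode!r} vs {trick_mode!r}")
    base_bpb = baseline["metrics"]["final_val_bpb"]
    trick_bpb = trick["metrics"]["final_val_bpb"]
    delta = trick_bpb - base_bpb  # negative means the trick lowered BPB
    return {
        "mode": base_mode,
        "baseline_bpb": base_bpb,
        "trick_bpb": trick_bpb,
        "delta_bpb": delta,
        "relative_change": delta / base_bpb,
    }
```

Each payload can be read with `json.load` from the baseline's and the trick's result.json. A negative `delta_bpb` in both regimes suggests a genuine improvement, while a gain in only one regime points to a compute or data trade-off.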

Trick Catalog

See TRICK_SUMMARY_TRAIN_ARCH.md for a curated inventory of 40+ training and architecture tricks drawn from nanogpt-speedrun, parameter-golf, and slowrun, organized by category (optimizer, schedule, architecture, data, initialization, etc.).

Acknowledgments

This project builds upon the open-source training recipes and competitive results from the NanoGPT Speedrun, Parameter Golf, and Slowrun communities.
