Nanotron is a library for pretraining transformer models. It provides a simple and flexible API to pretrain models on custom datasets, and it is built with the following principles in mind:
- Simplicity: an easy-to-use, flexible API for pretraining models on custom datasets.
- Performance: optimized for speed and scalability, using the latest techniques to train models faster and more efficiently.
📚 Check out our Ultrascale Playbook - A comprehensive guide to efficiently scale LLM training with Nanotron!
```bash
# Requirements: Python>=3.10,<3.12
git clone https://github.com/huggingface/nanotron
cd nanotron
pip install --upgrade pip
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
pip install -e .

# Install dependencies if you want to use the example scripts
pip install datasets transformers
pip install triton "flash-attn>=2.5.0" --no-build-isolation
```
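After installing, you can sanity-check the environment with a short script. This is a generic check (not part of Nanotron's own tooling); the `flash_attn` import only matters if you plan to run the example scripts.

```python
# Generic environment sanity check (not part of Nanotron's tooling).
import torch

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available(), "| GPUs:", torch.cuda.device_count())

import nanotron
print("nanotron imported from:", nanotron.__file__)

try:
    import flash_attn  # only needed for the example scripts
    print("flash-attn:", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed (only required for the examples)")
```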
> [!NOTE]
> If you get an `undefined symbol: ncclCommRegister` error, install torch 2.1.2 instead: `pip install torch==2.1.2 --index-url https://download.pytorch.org/whl/cu121`
> [!TIP]
> We log to wandb automatically if it's installed (`pip install wandb`). If you don't want to use wandb, you can run `wandb disabled`.
The following command will train a tiny Llama model on a single node with 8 GPUs. The model will be saved in the `checkpoints` directory as specified in the config file.

```bash
CUDA_DEVICE_MAX_CONNECTIONS=1 torchrun --nproc_per_node=8 run_train.py --config-file examples/config_tiny_llama.yaml
# or use examples/config_tiny_llama.py to generate your own config
```
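If you'd rather tweak the example config programmatically than edit the YAML by hand, one generic option is to load it with PyYAML, adjust the fields you need, and point `run_train.py` at the copy. This sketch only assumes the file is valid YAML; it does not rely on any particular key names.

```python
# Sketch: derive your own config from the example YAML (key names depend on your config).
import yaml

with open("examples/config_tiny_llama.yaml") as f:
    config = yaml.safe_load(f)

print("top-level sections:", list(config.keys()))

# ... edit the fields you care about here ...

with open("my_config.yaml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
# then: torchrun --nproc_per_node=8 run_train.py --config-file my_config.yaml
```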
For detailed instructions on training your first model, check out our Your First Training guide.
For multi-node training with Slurm, see our Multi-Node Training guide.
To generate text from a trained checkpoint:

```bash
torchrun --nproc_per_node=1 run_generate.py --ckpt-path checkpoints/10/ --tp 1 --pp 1
# We could set a larger TP for faster generation, and a larger PP in case of very large models.
```
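If you want to script several generation runs (for example, trying different TP sizes), one option is to wrap the command above with `subprocess`. The flags are the ones shown above; the assumption that the world size equals TP × PP (no data parallelism for generation) is ours.

```python
# Sketch: launch run_generate.py from Python with a chosen TP/PP split.
# Assumes world size = tp * pp (no data parallelism for generation).
import subprocess

def generate(ckpt_path: str, tp: int = 1, pp: int = 1) -> None:
    subprocess.run(
        [
            "torchrun", f"--nproc_per_node={tp * pp}",
            "run_generate.py",
            "--ckpt-path", ckpt_path,
            "--tp", str(tp),
            "--pp", str(pp),
        ],
        check=True,
    )

generate("checkpoints/10/", tp=1, pp=1)
```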
To debug with VSCode, add the following configuration to your `launch.json` file:

```json
{
    "name": "run_train.py",
    "type": "python",
    "request": "launch",
    "program": "torchrun", // or full path to torchrun by running `which torchrun`
    "console": "integratedTerminal",
    "justMyCode": false,
    "args": [
        "--nproc_per_node=2",
        "run_train.py",
        "--config-file=examples/config_tiny_llama.yaml", // or use examples/config_tiny_llama.py to generate your own config
    ],
    "env": {
        // "NANOTRON_BENCHMARK": "1", // enable to benchmark your training for a couple of steps
        "CUDA_DEVICE_MAX_CONNECTIONS": "1",
        "WANDB_MODE": "disabled",
    }
},
```
> [!NOTE]
> For more info, check the Debugging Nanotron example (on multiple GPUs).
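As an alternative to launching through VSCode, you can attach a debugger from inside the training process with `debugpy`. This is a generic pattern, not something Nanotron ships; the rank and port choices below are arbitrary.

```python
# Generic per-rank debugger attach with debugpy (not Nanotron-specific).
# pip install debugpy, then use VSCode's "Remote Attach" on the chosen port.
import os
import debugpy

if int(os.environ.get("RANK", "0")) == 0:  # listen on rank 0 only to keep it simple
    debugpy.listen(("0.0.0.0", 5678))
    print("Waiting for a debugger to attach on port 5678 ...")
    debugpy.wait_for_client()
```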
You can find more examples in the `/examples` directory:
| Example | Description |
|---|---|
| `custom-dataloader` | Plug a custom dataloader to nanotron |
| `datatrove` | Use the datatrove library to load data |
| `doremi` | Use DoReMi to speed up training |
| `mamba` | Train an example Mamba model |
| `moe` | Train an example Mixture-of-Experts (MoE) model |
| `mup` | Use spectral µTransfer to scale up your model |
| `examples/config_tiny_llama_with_s3_upload.yaml` | For automatically uploading checkpoints to S3 |
We're working on adding more examples soon! Feel free to open a PR to add your own example. 🚀
We've conducted extensive benchmarking of Nanotron across various model sizes and configurations. The complete benchmark data, configurations, and logs are available in our ultrascale-playbook-data repository.
The diagram above showcases the best configurations we discovered for each model size and node count in nanotron v0.5, highlighting optimal MFU (Model FLOPS Utilization) and memory usage. These represent the most efficient training setups identified through our comprehensive benchmarking process. Stay tuned for even more optimizations coming soon! 🚀
For detailed analysis and best practices derived from these benchmarks, see our Ultrascale Playbook.
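For reference, MFU is typically estimated by comparing achieved training throughput against the hardware's peak, using the common ~6 × parameters FLOPs-per-token rule of thumb for the forward and backward pass. The model size, throughput, and peak-FLOPS values below are placeholders, not numbers from our benchmarks.

```python
# Back-of-the-envelope MFU estimate (placeholder numbers, not benchmark results).
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int, peak_flops_per_gpu: float) -> float:
    achieved = 6 * n_params * tokens_per_sec        # ~6*N FLOPs per token for fwd+bwd
    return achieved / (n_gpus * peak_flops_per_gpu)

# e.g. a 1B-parameter model at 400k tokens/s on 8 GPUs with 989 TFLOPS peak (H100 BF16, dense)
print(f"MFU: {mfu(1e9, 4e5, 8, 989e12):.1%}")
```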
We currently support the following features:
- 3D parallelism (DP+TP+PP)
- Expert parallelism for MoEs
- AFAB and 1F1B schedules for PP
- Explicit APIs for TP and PP, which enable easy debugging
- ZeRO-1 optimizer
- FP32 gradient accumulation (see the sketch after this list)
- Parameter tying/sharding
- Custom module checkpointing for large models
- Spectral µTransfer parametrization for scaling up neural networks
- Mamba example
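To illustrate what FP32 gradient accumulation means in practice, here is a generic PyTorch sketch that accumulates bf16 gradients into fp32 buffers across micro-batches. It is a conceptual example only, not Nanotron's actual implementation.

```python
# Conceptual sketch of FP32 gradient accumulation (not Nanotron's implementation):
# compute in bf16, but accumulate micro-batch gradients into fp32 buffers.
import torch

model = torch.nn.Linear(1024, 1024, dtype=torch.bfloat16)
fp32_grads = {name: torch.zeros(p.shape, dtype=torch.float32) for name, p in model.named_parameters()}

for _ in range(4):  # 4 micro-batches per optimizer step
    x = torch.randn(8, 1024, dtype=torch.bfloat16)
    model(x).float().pow(2).mean().backward()
    for name, p in model.named_parameters():
        fp32_grads[name] += p.grad.float()  # fp32 accumulation avoids bf16 rounding drift
        p.grad = None                       # drop the bf16 grad before the next micro-batch
```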
And we have on our roadmap:
- FP8 training
- ZeRO-3 optimizer (a.k.a. FSDP)
- `torch.compile` support
- Ring attention
- Interleaved 1F1B schedule
We would like to thank everyone working on LLMs, especially those sharing their work openly, from which we took great inspiration: Nvidia for `Megatron-LM/apex`, Microsoft for `DeepSpeed`, and HazyResearch for `flash-attn`.