This guide explains how to train models with Nanotron across multiple compute nodes using Slurm, a popular workload manager for high-performance computing (HPC) clusters.
Nanotron provides a convenient script (`slurm_launcher.py`) to simplify launching multi-node training jobs on Slurm clusters. This script handles configuration generation, resource allocation, and job submission in one step.
```bash
python slurm_launcher.py --run_name my_experiment --nodes 4 --model_size base
```
This will:
- Generate a Nanotron configuration file based on your parameters
- Create a Slurm job script with appropriate settings
- Submit the job to the Slurm scheduler
- Save everything needed for reproducibility
The launcher supports many parameters, organized into logical groups:
- `--run_name`: Name for your experiment (will be used in logs and checkpoints)
- `--nodes`: Number of nodes to use (default: 2)
- `--gpus_per_node`: Number of GPUs per node (default: 8)
- `--partition`: Slurm partition to use (default: "hopper-prod")
- `--qos`: Slurm QOS to use (default: "normal")
- `--time_limit`: Time limit for the job in HH:MM:SS format (default: "1:00:00")
- `--email`: Email address for job notifications
- `--tmp_dir`: Temporary directory on compute nodes (default: "/tmp")
- `--pre_launch_commands`: Commands to run before job launch
- `--extra_env`: Additional environment variables to set
- `--model_size`: Predefined size (`tiny`, `small`, `base`, `large`)
- `--hidden_size`, `--intermediate_size`, `--num_layers`, etc.: Custom model dimensions
- `--seed`: Random seed for reproducibility (default: 42)
- `--train_steps`: Number of training steps (default: 10000)
- `--micro_batch_size`: Size of micro batches (default: 2)
- `--grad_accum_steps`: Gradient accumulation steps (default: 8)
- `--learning_rate`: Peak learning rate (default: 3e-4)
- `--min_lr`: Minimum learning rate for decay (default: 3e-5)
- `--weight_decay`: Weight decay (default: 0.01)
- `--grad_clip`: Gradient clipping (default: 1.0)
- `--warmup_steps`: Learning rate warmup steps (default: 1000)
- `--dp`: Data parallelism (DP) degree (default: 8)
- `--pp`: Pipeline parallelism (PP) degree (default: 1)
- `--tp`: Tensor parallelism (TP) degree (default: 2)
Note: Make sure that DP × PP × TP does not exceed your total number of GPUs (nodes × gpus_per_node).
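As a quick sanity check before submitting a job, you can verify this constraint yourself. The sketch below is not part of `slurm_launcher.py`; the values mirror the launcher defaults listed above, and the global-batch-size formula assumes the usual convention (micro batch × gradient accumulation × DP).

```python
# Hypothetical sanity check, not part of slurm_launcher.py.
# Values mirror the launcher defaults documented above.
nodes, gpus_per_node = 2, 8
dp, pp, tp = 8, 1, 2

total_gpus = nodes * gpus_per_node  # 2 * 8 = 16
gpus_needed = dp * pp * tp          # 8 * 1 * 2 = 16
assert gpus_needed <= total_gpus, (
    f"dp*pp*tp = {gpus_needed} exceeds the {total_gpus} available GPUs"
)

# Sequences processed per optimizer step (assuming the usual convention):
micro_batch_size, grad_accum_steps = 2, 8
global_batch_size = micro_batch_size * grad_accum_steps * dp  # 2 * 8 * 8 = 128
print(f"Using {gpus_needed}/{total_gpus} GPUs, global batch = {global_batch_size} sequences")
```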
- `--dataset`: Hugging Face dataset name or path (default: "stas/openwebtext-10k")
- `--text_column`: Column name for text in the dataset (default: "text")
- `--tokenizer`: Tokenizer name or path (default: "robot-test/dummy-tokenizer-wordlevel")
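Before burning multi-node compute, it can be worth checking locally that the dataset and tokenizer resolve and that the text column exists. This is an optional sanity check, not something the launcher runs for you; it assumes the `datasets` and `transformers` libraries are installed and uses the launcher defaults above.

```python
# Optional local sanity check before submitting a multi-node job (assumed workflow,
# not part of slurm_launcher.py).
from datasets import load_dataset
from transformers import AutoTokenizer

dataset_name = "stas/openwebtext-10k"                      # --dataset default
text_column = "text"                                       # --text_column default
tokenizer_name = "robot-test/dummy-tokenizer-wordlevel"    # --tokenizer default

ds = load_dataset(dataset_name, split="train")
assert text_column in ds.column_names, f"{text_column!r} not in {ds.column_names}"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
print(tokenizer(ds[0][text_column])["input_ids"][:10])  # first few token ids
```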
- `--project`: Project name for logging (default: "nanotron")
- `--configs_path`: Where to save configurations (default: "logs/configs")
- `--slurm_logs_path`: Where to save Slurm output logs (default: "logs/slurm_logs")
- `--checkpoints_path`: Where to save model checkpoints (default: "checkpoints")
- `--slurm_scripts_dir`: Directory to save generated Slurm scripts (default: "logs/slurm_scripts")
- `--run_train_script`: Custom training script path (default: "run_train.py")
- `--save_interval`: Interval for saving checkpoints in steps (default: 1000)
- `--save_initial_state`: Save initial model state before training
- `--wandb_disabled`: Disable logging to Weights & Biases
- `--profiler_export_path`: Path to export the profiler TensorBoard data
- `--dry_run`: Generate configs but don't submit the job
- `--show_logs`: Show output of the job as it runs
```bash
python slurm_launcher.py \
    --run_name quick_test \
    --nodes 2 \
    --model_size tiny \
    --train_steps 100 \
    --dataset stas/openwebtext-10k
```
```bash
python slurm_launcher.py \
    --run_name production_run \
    --nodes 8 \
    --model_size large \
    --dp 4 \
    --pp 2 \
    --tp 2 \
    --train_steps 50000 \
    --learning_rate 2e-4 \
    --warmup_steps 2000 \
    --dataset my_dataset \
    --tokenizer my_tokenizer \
    --email [email protected] \
    --time_limit 72:00:00
```
Note: In this example, we're using 16 GPUs for training (4×2×2 = 16) out of 64 available GPUs (8 nodes × 8 GPUs).
```bash
python slurm_launcher.py \
    --run_name custom_arch \
    --nodes 4 \
    --hidden_size 1536 \
    --num_layers 24 \
    --num_heads 16 \
    --num_kv_heads 4 \
    --train_steps 20000
```
```bash
python slurm_launcher.py \
    --run_name test_config \
    --model_size base \
    --dry_run
```
```bash
python slurm_launcher.py \
    --run_name custom_script \
    --nodes 2 \
    --model_size base \
    --run_train_script path/to/my_custom_train.py \
    --slurm_scripts_dir slurm_scripts
```
```bash
python slurm_launcher.py \
    --run_name monitored_run \
    --nodes 2 \
    --model_size base \
    --show_logs
```
If you prefer to set up multi-node training manually, follow these steps:
- Create a Nanotron configuration file (YAML or Python)

- Set appropriate parallelism parameters:

  ```python
  parallelism = ParallelismArgs(
      dp=8,  # Adjust based on (total_gpus / (pp * tp))
      pp=2,  # Pipeline parallelism degree
      tp=2,  # Tensor parallelism degree
      pp_engine="1f1b",
      tp_mode="REDUCE_SCATTER",
      tp_linear_async_communication=True,
  )
  ```
- Create a Slurm batch script:

  ```bash
  #!/bin/bash
  #SBATCH --job-name=nanotron-training
  #SBATCH --nodes=4
  #SBATCH --ntasks-per-node=1
  #SBATCH --gpus-per-node=8
  #SBATCH --partition=your_partition
  #SBATCH --output=logs/%x-%j.out

  export HOSTNAMES=`scontrol show hostnames "$SLURM_JOB_NODELIST"`
  export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
  export MASTER_PORT=6000
  export COUNT_NODE=`scontrol show hostnames "$SLURM_JOB_NODELIST" | wc -l`
  export TMPDIR=/tmp
  export CUDA_DEVICE_MAX_CONNECTIONS=1

  srun bash -c "torchrun \
      --nproc_per_node 8 \
      --nnodes $COUNT_NODE \
      --rdzv_backend c10d \
      --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
      --max_restarts 0 \
      run_train.py --config-file your_config.yaml"
  ```
- Submit the job:

  ```bash
  sbatch your_job_script.sh
  ```
- Node Communication: Ensure your cluster has a high-speed interconnect (such as InfiniBand) for efficient multi-node communication.
- Balanced Parallelism (a rough heuristic is sketched after this list):
  - For small models (< 1B parameters): Focus on data parallelism
  - For medium models (1-10B): Use TP=2, PP=1 or PP=2
  - For large models (> 10B): Increase both TP and PP
- Fault Tolerance: Configure `--save_interval` to save checkpoints regularly in case of job failures.
- Monitoring: Use `--show_logs` to monitor training progress in real time.
- Resource Efficiency: Balance your parallelism settings (`--dp`, `--pp`, `--tp`) to maximize GPU utilization.
- Environment Variables: Use `--extra_env` to set additional environment variables such as NCCL settings.
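The parallelism guidance above can be folded into a small helper. The following is only a rough heuristic restating those rules of thumb, not an official Nanotron utility; real runs usually need tuning for sequence length, memory headroom, and interconnect speed.

```python
# Rough heuristic restating the rules of thumb above; not an official Nanotron utility.
def suggest_parallelism(num_params_b: float, total_gpus: int) -> dict:
    """Suggest (dp, pp, tp) degrees for a model of `num_params_b` billion parameters."""
    if num_params_b < 1:
        tp, pp = 1, 1      # small models: plain data parallelism
    elif num_params_b <= 10:
        tp, pp = 2, 1      # medium models: TP=2, PP=1 (or PP=2 if memory is tight)
    else:
        tp, pp = 4, 2      # large models: increase both TP and PP
    dp = total_gpus // (tp * pp)  # fill the remaining GPUs with data parallelism
    return {"dp": dp, "pp": pp, "tp": tp}

print(suggest_parallelism(7, total_gpus=64))   # {'dp': 32, 'pp': 1, 'tp': 2}
print(suggest_parallelism(30, total_gpus=64))  # {'dp': 8, 'pp': 2, 'tp': 4}
```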
- GPU Communication Errors
  - Check network connectivity between nodes
  - Try setting `CUDA_DEVICE_MAX_CONNECTIONS=1`
  - Use environment variables like `NCCL_DEBUG=WARN` for debugging
- Out of Memory Errors
  - Reduce batch size or sequence length
  - Increase pipeline parallelism
  - Consider gradient checkpointing
- Job Timeouts
  - Increase the `--time_limit` parameter
  - Set appropriate checkpointing intervals with `--save_interval`
For more detailed information, refer to the Nanotron documentation and your cluster's specific Slurm documentation.