# Getting Started

This guide will walk you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job.

## System Requirements

Before installing TorchForge, ensure your system meets the following requirements.

| Component | Requirement | Notes |
|-----------|-------------|-------|
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | macOS and Windows not currently supported |
| **Python** | 3.10 or higher | Python 3.11 recommended |
| **GPU** | NVIDIA with CUDA support | AMD GPUs not currently supported |
| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models |
| **CUDA** | 12.8 | Required for GPU training |
| **RAM** | 32GB+ recommended | Depends on model size |
| **Disk Space** | 50GB+ free | For models, datasets, and checkpoints |
| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) |
| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system |
| **vLLM** | v0.10.0+ | Fast inference with PagedAttention |
| **TorchTitan** | Latest | Production training infrastructure |

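Two of the rows above can be sanity-checked before installing anything, using only the Python standard library (the GPU and CUDA rows require PyTorch and are covered under Verifying Your Setup below):

```python
import shutil
import sys

# Python 3.10+ required (3.11 recommended)
ok = sys.version_info >= (3, 10)
print(f"Python {sys.version_info.major}.{sys.version_info.minor}: {'OK' if ok else 'too old'}")

# 50GB+ free disk space recommended for models, datasets, and checkpoints
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Free disk: {free_gb:.0f} GB: {'OK' if free_gb >= 50 else 'low'}")
```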
## Prerequisites

- **Conda or Miniconda**: For environment management
  - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html)

- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies
  - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation)
  - After installing, authenticate with: `gh auth login`
  - Either HTTPS or SSH works as the authentication protocol

- **Git**: For cloning the repository
  - Usually pre-installed on Linux systems
  - Verify with: `git --version`

**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.

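As a convenience, the command-line prerequisites above can be checked in one pass. This is just a sketch, not part of the install script:

```bash
# Report which prerequisite tools are on PATH
for tool in git gh conda; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing -- see the install links above"
  fi
done
```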
## Installation

TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable.

1. **Clone the Repository**

   ```bash
   git clone https://github.com/meta-pytorch/forge.git
   cd forge
   ```

2. **Create Conda Environment**

   ```bash
   conda create -n forge python=3.10
   conda activate forge
   ```

3. **Run Installation Script**

   ```bash
   ./scripts/install.sh
   ```

   The installation script will:
   - Install system dependencies using DNF (or your package manager)
   - Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan
   - Install TorchForge and all Python dependencies
   - Configure the environment for GPU training

   ```{tip}
   **Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use:
   `./scripts/install.sh --use-sudo`
   ```

   ```{warning}
   When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM.
   ```

## Verifying Your Setup

After installation, verify that all components are working correctly:

1. **Check GPU Availability**

   ```bash
   python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')"
   ```

   Expected output: `GPUs available: 2` (or more)

2. **Check CUDA Version**

   ```bash
   python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
   ```

   Expected output: `CUDA version: 12.8`

3. **Check All Dependencies**

   ```bash
   # Check core components
   python -c "import torch, forge, monarch, vllm; print('All imports successful')"

   # Check specific versions
   python -c "
   import torch
   import forge
   import vllm

   print(f'PyTorch: {torch.__version__}')
   print(f'TorchForge: {forge.__version__}')
   print(f'vLLM: {vllm.__version__}')
   print(f'CUDA: {torch.version.cuda}')
   print(f'GPUs: {torch.cuda.device_count()}')
   "
   ```

4. **Verify Monarch**

   ```bash
   python -c "
   from monarch.actor import Actor, this_host

   # Test basic Monarch functionality
   procs = this_host().spawn_procs({'gpus': 1})
   print('Monarch: Process spawning works')
   "
   ```

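If step 3 fails on one of the imports, `importlib` can report which dependencies are present without stopping at the first error:

```python
import importlib.util

# Report each core dependency without raising on a missing one
for name in ("torch", "forge", "monarch", "vllm"):
    status = "installed" if importlib.util.find_spec(name) else "MISSING"
    print(f"{name}: {status}")
```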
## Quick Start Examples

Now that TorchForge is installed, let's run some training examples.

Here's what training looks like with TorchForge:

```bash
# Install dependencies
conda create -n forge python=3.10
conda activate forge
git clone https://github.com/meta-pytorch/forge
cd forge
./scripts/install.sh

# Download a model
hf download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /tmp/Meta-Llama-3.1-8B-Instruct --exclude "original/consolidated.00.pth"

# Run SFT training (requires 2+ GPUs)
uv run forge run --nproc_per_node 2 \
  apps/sft/main.py --config apps/sft/llama3_8b.yaml

# Run GRPO training (requires 3+ GPUs)
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

### Example 1: Supervised Fine-Tuning (SFT)

Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs**

1. **Access the Model**

   ```{note}
   Model downloads are no longer required, but Hugging Face authentication is still required to access the models.

   Run `huggingface-cli login` first if you haven't already.
   ```

2. **Run Training**

   ```bash
   uv run forge run --nproc_per_node 2 \
     apps/sft/main.py --config apps/sft/llama3_8b.yaml
   ```

   **What's Happening:**
   - `--nproc_per_node 2`: Use 2 GPUs for training
   - `apps/sft/main.py`: SFT training script
   - `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
   - **TorchTitan** handles model sharding across the 2 GPUs
   - **Monarch** coordinates the distributed training

### Example 2: GRPO Training

Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs**

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

**What's Happening:**
- GPU 0: Trainer model (being trained, powered by TorchTitan)
- GPU 1: Reference model (frozen baseline, powered by TorchTitan)
- GPU 2: Policy model (scoring outputs, powered by vLLM)
- **Monarch** orchestrates all three components
- **TorchStore** handles weight synchronization from training to inference

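The "group relative" in GRPO refers to scoring each sampled response against the statistics of its own group of rollouts for the same prompt. A toy sketch of that normalization, for intuition only (this is not TorchForge's actual implementation):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each reward against its group's mean and spread."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mu) / sigma for r in rewards]

# Four responses to one prompt, scored by some reward function
print(group_relative_advantages([1.0, 0.5, 0.0, 0.5]))
```

Responses scoring above their group's mean receive positive advantages and are reinforced; those below receive negative ones.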
## Understanding Configuration Files

TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:

```yaml
# Example: apps/sft/llama3_8b.yaml
model:
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  path: /tmp/Meta-Llama-3.1-8B-Instruct

training:
  batch_size: 4
  learning_rate: 1e-5
  num_epochs: 10
  gradient_accumulation_steps: 4

distributed:
  strategy: fsdp  # Managed by TorchTitan
  precision: bf16

checkpointing:
  save_interval: 1000
  output_dir: /tmp/checkpoints
```

**Key Sections:**
- **model**: Model path and settings
- **training**: Hyperparameters like batch size and learning rate
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
- **checkpointing**: Where and when to save model checkpoints

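One practical consequence of these settings: the effective global batch size is the per-GPU batch size times the gradient accumulation steps times the number of data-parallel GPUs. For the example config above on a hypothetical 2-GPU run:

```python
# Values from the example config above
batch_size = 4   # training.batch_size (per GPU, per step)
grad_accum = 4   # training.gradient_accumulation_steps
num_gpus = 2     # data-parallel workers

effective_batch = batch_size * grad_accum * num_gpus
print(f"Effective global batch size: {effective_batch}")  # 32
```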
## Next Steps

Now that you have TorchForge installed and verified:

1. **Explore Examples**: Check the `apps/` directory for more training examples
2. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
3. **API Documentation**: Explore {doc}`api` for detailed API reference

## Getting Help

If you encounter issues:

1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
2. **File a Bug Report**: Create a new issue with:
   - Your system configuration (output of the diagnostic command below)
   - The full error message
   - Steps to reproduce
   - Expected vs. actual behavior

**Diagnostic command:**

```bash
python -c "
import torch
import forge

try:
    import monarch
    monarch_status = 'OK'
except Exception as e:
    monarch_status = str(e)

try:
    import vllm
    vllm_version = vllm.__version__
except Exception as e:
    vllm_version = str(e)

print(f'PyTorch: {torch.__version__}')
print(f'TorchForge: {forge.__version__}')
print(f'Monarch: {monarch_status}')
print(f'vLLM: {vllm_version}')
print(f'CUDA: {torch.version.cuda}')
print(f'GPUs: {torch.cuda.device_count()}')
"
```

Include this output in your bug reports!

## Additional Resources

- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md)
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md)
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch)
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai)
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan)