
Commit 8b753f8

AlannaBurke and svekars authored
Docs Content Part 1: Homepage and getting started (#448)
Co-authored-by: Svetlana Karslioglu <[email protected]>
1 parent 7102ef2 commit 8b753f8

File tree

5 files changed: +467 −24 lines

docs/source/concepts.md

Lines changed: 0 additions & 4 deletions
This file was deleted.

docs/source/conf.py

Lines changed: 2 additions & 1 deletion
```diff
@@ -140,8 +140,8 @@ def get_version_path():
     "navbar_center": "navbar-nav",
     "canonical_url": "https://meta-pytorch.org/forge/",
     "header_links_before_dropdown": 7,
-    "show_nav_level": 2,
     "show_toc_level": 2,
+    "navigation_depth": 3,
 }

 theme_variables = pytorch_sphinx_theme2.get_theme_variables()
@@ -173,6 +173,7 @@ def get_version_path():
     "colon_fence",
     "deflist",
     "html_image",
+    "substitution",
 ]

 # Configure MyST parser to treat mermaid code blocks as mermaid directives
```

docs/source/getting_started.md

Lines changed: 278 additions & 6 deletions
# Getting Started

This guide walks you through installing TorchForge, understanding its dependencies, verifying your setup, and running your first training job.

## System Requirements

Before installing TorchForge, ensure your system meets the following requirements.
| Component | Requirement | Notes |
|-----------|-------------|-------|
| **Operating System** | Linux (Fedora/Ubuntu/Debian) | macOS and Windows are not currently supported |
| **Python** | 3.10 or higher | Python 3.11 recommended |
| **GPU** | NVIDIA with CUDA support | AMD GPUs are not currently supported |
| **Minimum GPUs** | 2+ for SFT, 3+ for GRPO | More GPUs enable larger models |
| **CUDA** | 12.8 | Required for GPU training |
| **RAM** | 32 GB+ recommended | Depends on model size |
| **Disk Space** | 50 GB+ free | For models, datasets, and checkpoints |
| **PyTorch** | Nightly build | Latest distributed features (DTensor, FSDP) |
| **Monarch** | Pre-packaged wheel | Distributed orchestration and actor system |
| **vLLM** | v0.10.0+ | Fast inference with PagedAttention |
| **TorchTitan** | Latest | Production training infrastructure |
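As a quick preflight, a few of the table's thresholds can be checked from Python. This is a sketch, not a TorchForge tool; the function name and thresholds simply mirror the table above, and the GPU check only runs if PyTorch is already installed:

```python
import shutil
import sys

def check_requirements(min_python=(3, 10), min_free_gb=50):
    """Check a few thresholds from the requirements table; returns (name, ok) pairs."""
    results = []
    results.append((f"Python >= {min_python[0]}.{min_python[1]}",
                    sys.version_info[:2] >= min_python))
    free_gb = shutil.disk_usage(".").free / 1e9
    results.append((f"Disk >= {min_free_gb} GB free", free_gb >= min_free_gb))
    try:
        import torch  # only available after installation
        results.append(("CUDA GPUs >= 2", torch.cuda.device_count() >= 2))
    except ImportError:
        results.append(("CUDA GPUs >= 2 (torch not installed yet)", False))
    return results

for name, ok in check_requirements():
    print(f"[{'PASS' if ok else 'FAIL'}] {name}")
```

GPU count, CUDA version, and the remaining rows are covered by the verification steps later in this guide.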
## Prerequisites

- **Conda or Miniconda**: For environment management
  - Download from [conda.io](https://docs.conda.io/en/latest/miniconda.html)
- **GitHub CLI (gh)**: Required for downloading pre-packaged dependencies
  - Install instructions: [github.com/cli/cli#installation](https://github.com/cli/cli#installation)
  - After installing, authenticate with `gh auth login`
  - You can use either HTTPS or SSH as the authentication protocol
- **Git**: For cloning the repository
  - Usually pre-installed on Linux systems
  - Verify with `git --version`

**Installation note:** The installation script provides pre-built wheels with PyTorch nightly already included.
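The three prerequisites above are all command-line tools, so their presence on `PATH` can be checked in one go. A small sketch (the helper name is illustrative, not part of the repo):

```python
import shutil

def missing_tools(tools=("conda", "gh", "git")):
    """Return the prerequisite commands that are not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]

missing = missing_tools()
if missing:
    print("Missing prerequisites:", ", ".join(missing))
else:
    print("All prerequisites found")
```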
## Installation

TorchForge uses pre-packaged wheels for all dependencies, making installation faster and more reliable.

1. **Clone the Repository**

   ```bash
   git clone https://github.com/meta-pytorch/forge.git
   cd forge
   ```

2. **Create Conda Environment**

   ```bash
   conda create -n forge python=3.10
   conda activate forge
   ```

3. **Run Installation Script**

   ```bash
   ./scripts/install.sh
   ```

The installation script will:
- Install system dependencies using DNF (or your package manager)
- Download pre-built wheels for PyTorch nightly, Monarch, vLLM, and TorchTitan
- Install TorchForge and all Python dependencies
- Configure the environment for GPU training

```{tip}
**Using sudo instead of conda**: If you prefer installing system packages directly rather than through conda, use:
`./scripts/install.sh --use-sudo`
```

```{warning}
When adding packages to `pyproject.toml`, use `uv sync --inexact` to avoid removing Monarch and vLLM.
```
## Verifying Your Setup

After installation, verify that all components are working correctly.

1. **Check GPU Availability**

   ```bash
   python -c "import torch; print(f'GPUs available: {torch.cuda.device_count()}')"
   ```

   Expected output: `GPUs available: 2` (or more)

2. **Check CUDA Version**

   ```bash
   python -c "import torch; print(f'CUDA version: {torch.version.cuda}')"
   ```

   Expected output: `CUDA version: 12.8`

3. **Check All Dependencies**

   ```bash
   # Check core components
   python -c "import torch, forge, monarch, vllm; print('All imports successful')"

   # Check specific versions
   python -c "
   import torch
   import forge
   import vllm

   print(f'PyTorch: {torch.__version__}')
   print(f'TorchForge: {forge.__version__}')
   print(f'vLLM: {vllm.__version__}')
   print(f'CUDA: {torch.version.cuda}')
   print(f'GPUs: {torch.cuda.device_count()}')
   "
   ```

4. **Verify Monarch**

   ```bash
   python -c "
   from monarch.actor import Actor, this_host

   # Test basic Monarch functionality
   procs = this_host().spawn_procs({'gpus': 1})
   print('Monarch: Process spawning works')
   "
   ```
## Quick Start Examples

Now that TorchForge is installed, let's run some training examples.

Here's the full workflow at a glance, from installation to training:

```bash
# Install dependencies
conda create -n forge python=3.10
conda activate forge
git clone https://github.com/meta-pytorch/forge
cd forge
./scripts/install.sh

# Download a model
hf download meta-llama/Meta-Llama-3.1-8B-Instruct --local-dir /tmp/Meta-Llama-3.1-8B-Instruct --exclude "original/consolidated.00.pth"

# Run SFT training (requires 2+ GPUs)
uv run forge run --nproc_per_node 2 \
  apps/sft/main.py --config apps/sft/llama3_8b.yaml

# Run GRPO training (requires 3+ GPUs)
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```
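The two examples below have different GPU floors: 2+ for SFT, 3+ for GRPO. Before launching, you can gate on the detected GPU count; a tiny sketch using the minimums from this guide (the helper name is illustrative, not a TorchForge API):

```python
def runnable_examples(available_gpus):
    """Which of this guide's examples fit the available GPU count?

    Minimums from this guide: SFT needs 2+, GRPO needs 3+.
    """
    minimums = {"sft": 2, "grpo": 3}
    return [name for name, need in minimums.items() if available_gpus >= need]

# In practice you would pass torch.cuda.device_count() here
print(runnable_examples(2))  # ['sft']
print(runnable_examples(3))  # ['sft', 'grpo']
```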
### Example 1: Supervised Fine-Tuning (SFT)

Fine-tune Llama 3 8B on your data. **Requires: 2+ GPUs**

1. **Access the Model**

   ```{note}
   Model downloads are no longer required, but Hugging Face authentication is required to access the models.

   Run `huggingface-cli login` first if you haven't already.
   ```

2. **Run Training**

   ```bash
   uv run forge run --nproc_per_node 2 \
     apps/sft/main.py --config apps/sft/llama3_8b.yaml
   ```

**What's Happening:**
- `--nproc_per_node 2`: Use 2 GPUs for training
- `apps/sft/main.py`: SFT training script
- `--config apps/sft/llama3_8b.yaml`: Configuration file with hyperparameters
- **TorchTitan** handles model sharding across the 2 GPUs
- **Monarch** coordinates the distributed training
### Example 2: GRPO Training

Train a model using reinforcement learning with GRPO. **Requires: 3+ GPUs**

```bash
python -m apps.grpo.main --config apps/grpo/qwen3_1_7b.yaml
```

**What's Happening:**
- GPU 0: Trainer model (being trained, powered by TorchTitan)
- GPU 1: Reference model (frozen baseline, powered by TorchTitan)
- GPU 2: Policy model (scoring outputs, powered by vLLM)
- **Monarch** orchestrates all three components
- **TorchStore** handles weight synchronization from training to inference
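The core idea behind GRPO is that each sampled response is scored relative to the other responses in its own group, rather than against a learned value function. As an illustration of that idea only (TorchForge's actual implementation lives in `apps/grpo/main.py` and may differ), the group-relative advantages can be sketched as:

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Group-relative advantages, GRPO-style.

    For a group of G responses sampled from one prompt:
        advantage_i = (r_i - mean(group)) / (std(group) + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt, scored by a reward function
print(group_relative_advantages([1.0, 0.0, 0.5, 0.5]))
```

Because the baseline is the group mean, the advantages of a group always sum to (approximately) zero: above-average responses are reinforced, below-average ones are penalized.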
## Understanding Configuration Files

TorchForge uses YAML configuration files to manage training parameters. Let's examine a typical config:

```yaml
# Example: apps/sft/llama3_8b.yaml
model:
  name: meta-llama/Meta-Llama-3.1-8B-Instruct
  path: /tmp/Meta-Llama-3.1-8B-Instruct

training:
  batch_size: 4
  learning_rate: 1e-5
  num_epochs: 10
  gradient_accumulation_steps: 4

distributed:
  strategy: fsdp  # Managed by TorchTitan
  precision: bf16

checkpointing:
  save_interval: 1000
  output_dir: /tmp/checkpoints
```

**Key Sections:**
- **model**: Model path and settings
- **training**: Hyperparameters like batch size and learning rate
- **distributed**: Multi-GPU strategy (FSDP, tensor parallel, etc.) handled by TorchTitan
- **checkpointing**: Where and when to save model checkpoints
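One implication of the `training` section worth keeping in mind: the per-GPU `batch_size` combines with `gradient_accumulation_steps` and the GPU count to determine how many samples each optimizer step actually sees. A quick sketch of the arithmetic (the helper name is illustrative; this follows the usual convention for data-parallel training, not a TorchForge API):

```python
def effective_batch_size(batch_size, grad_accum_steps, num_gpus):
    """Samples processed per optimizer step under data parallelism."""
    return batch_size * grad_accum_steps * num_gpus

# With the config above (batch_size: 4, gradient_accumulation_steps: 4) on 2 GPUs:
print(effective_batch_size(4, 4, 2))  # 32
```

So halving `batch_size` to fit memory can be offset by doubling `gradient_accumulation_steps` without changing the effective batch.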
226+
227+
## Next Steps
228+
229+
Now that you have TorchForge installed and verified:
230+
231+
1. **Explore Examples**: Check the `apps/` directory for more training examples
232+
2. **Read Tutorials**: Follow {doc}`tutorials` for step-by-step guides
233+
3. **API Documentation**: Explore {doc}`api` for detailed API reference
234+
235+
## Getting Help
236+
237+
If you encounter issues:
238+
239+
1. **Search Issues**: Look through [GitHub Issues](https://github.com/meta-pytorch/forge/issues)
240+
2. **File a Bug Report**: Create a new issue with:
241+
- Your system configuration (output of diagnostic command below)
242+
- Full error message
243+
- Steps to reproduce
244+
- Expected vs actual behavior
245+
246+
**Diagnostic command:**
247+
```bash
248+
python -c "
249+
import torch
250+
import forge
251+
252+
try:
253+
import monarch
254+
monarch_status = 'OK'
255+
except Exception as e:
256+
monarch_status = str(e)
257+
258+
try:
259+
import vllm
260+
vllm_version = vllm.__version__
261+
except Exception as e:
262+
vllm_version = str(e)
263+
264+
print(f'PyTorch: {torch.__version__}')
265+
print(f'TorchForge: {forge.__version__}')
266+
print(f'Monarch: {monarch_status}')
267+
print(f'vLLM: {vllm_version}')
268+
print(f'CUDA: {torch.version.cuda}')
269+
print(f'GPUs: {torch.cuda.device_count()}')
270+
"
271+
```
272+
273+
Include this output in your bug reports!
## Additional Resources

- **Contributing Guide**: [CONTRIBUTING.md](https://github.com/meta-pytorch/forge/blob/main/CONTRIBUTING.md)
- **Code of Conduct**: [CODE_OF_CONDUCT.md](https://github.com/meta-pytorch/forge/blob/main/CODE_OF_CONDUCT.md)
- **Monarch Documentation**: [meta-pytorch.org/monarch](https://meta-pytorch.org/monarch)
- **vLLM Documentation**: [docs.vllm.ai](https://docs.vllm.ai)
- **TorchTitan**: [github.com/pytorch/torchtitan](https://github.com/pytorch/torchtitan)
