10 changes: 10 additions & 0 deletions docs/kubernetes.md
@@ -258,9 +258,19 @@ torchrun \
--node-rank=$RANK \
--nproc-per-node=8 \
--rdzv-endpoint=my-exp-trainer-0.$HEADLESS_SERVICE:29501 \
--log-dir=/data/outputs/logs/trainer/torchrun \
--local-ranks-filter=0 \
--redirect=3 \
--tee=3 \
src/prime_rl/trainer/sft/train.py @ configs/train.toml
```

`--local-ranks-filter=0 --tee=3` keeps only local rank 0's stdout/stderr on the
pod's console, so Loki and the dashboard see each log line once instead of N
times for an N-GPU pod. `--redirect=3 --log-dir=...` still writes every rank's
stdout/stderr to per-rank files under the mounted PVC for debugging. The value
`3` selects both stdout and stderr. This matches what the launcher does on
single-node / SLURM deployments.
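
To read a non-zero rank's output after a run, the per-rank files can be located under the `--log-dir` path. The following is a minimal sketch, assuming torchrun names the files `stdout.log` and `stderr.log` inside per-rank subdirectories. The intermediate directory nesting may differ between PyTorch versions, so the sketch searches recursively instead of hard-coding it. `list_rank_logs` is a hypothetical helper, not part of prime-rl.

```shell
#!/bin/sh
# Hypothetical helper: list every per-rank log file under a torchrun --log-dir.
# Assumption: files are named stdout.log / stderr.log. The search is recursive,
# so the run/attempt/rank directory layout does not need to be known.
list_rank_logs() {
    log_dir="$1"
    find "$log_dir" -type f \( -name 'stdout.log' -o -name 'stderr.log' \) | sort
}

# Example (path taken from the torchrun command above):
#   list_rank_logs /data/outputs/logs/trainer/torchrun
```

Piping the result into `xargs tail -n 20` is one way to skim the end of every rank's log from inside the pod.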

## Troubleshooting

### Can't access shared storage
6 changes: 5 additions & 1 deletion k8s/prime-rl/values.yaml
@@ -99,7 +99,11 @@ trainer:

# Auto-start configuration (set to false to use sleep infinity for debugging)
autoStart: false
command: "" # e.g., "uv run trainer @ /app/examples/reverse_text/rl/train.toml --output-dir /data/outputs"
# Single GPU: "uv run trainer @ /app/examples/reverse_text/rl/train.toml --output-dir /data/outputs"
# Multi-GPU: use torchrun and pass --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=... so
# only local rank 0's stdout/stderr reaches the pod console (Loki/dashboard see each line once)
# while every rank's stdout/stderr is still written to per-rank files. See docs/kubernetes.md.
command: ""

# Helps reduce CUDA memory fragmentation with PyTorch allocator
pytorchCudaAllocConf: "expandable_segments:True"