docs(k8s): filter trainer torchrun logs to rank 0 instead of silencing #2552
Draft
samsja wants to merge 2 commits into
The trainer torchrun snippet in docs/kubernetes.md previously omitted --local-ranks-filter / --tee / --redirect / --log-dir, so every rank's stdout reached the pod's console and Loki ingested N copies of each line (visible as the duplicated lines in the dashboard's Trainer log tab on an N-GPU trainer pod).

Add the same flags the local launcher and SLURM templates use: --local-ranks-filter=0 + --tee=3 keep only rank 0 on the pod console, while --redirect=3 + --log-dir=/data/outputs/logs/trainer/torchrun still writes every rank's stdout/stderr to per-rank files under the mounted PVC for debugging.

This is an alternative to #2550 (which fixed the same dashboard duplication by silencing loguru on non-zero ranks in setup_logger). Doing it at torchrun keeps per-rank logs available on disk and avoids the in-process rank_zero_only flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…attern

Mention the --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=... torchrun flags directly in the values.yaml trainer.command comment so users authoring multi-GPU helm values see them without having to read docs/kubernetes.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
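As a rough illustration of the flag combination these commits describe, the sketch below assembles a torchrun argument list in Python. The four logging flags and the log directory come from the commit messages. The `--nproc-per-node=8` value, the `build_torchrun_cmd` helper, and the `train.py --config trainer.toml` entrypoint are hypothetical placeholders, not the command documented in docs/kubernetes.md.

```python
import shlex

# Log directory named in this PR; it is assumed to live on the mounted PVC.
LOG_DIR = "/data/outputs/logs/trainer/torchrun"


def build_torchrun_cmd(nproc_per_node: int, entrypoint: list[str]) -> list[str]:
    """Assemble a torchrun argv that keeps only local rank 0 on the console."""
    return [
        "torchrun",
        f"--nproc-per-node={nproc_per_node}",
        "--local-ranks-filter=0",  # console output limited to local rank 0
        "--tee=3",                 # tee both stdout and stderr (3 = both streams)
        "--redirect=3",            # write both streams of every rank to files
        f"--log-dir={LOG_DIR}",    # base directory for the per-rank files
        *entrypoint,
    ]


# Hypothetical entrypoint, used only to make the example complete.
cmd = build_torchrun_cmd(8, ["train.py", "--config", "trainer.toml"])
print(shlex.join(cmd))
```

Building the command as a list and joining it with `shlex.join` keeps the flags easy to diff against the local launcher and SLURM templates that the commit message says already use them.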
Why

On an 8-GPU k8s trainer pod, the dashboard's Trainer log tab shows every line 8 times because the torchrun snippet documented in docs/kubernetes.md (and implied by the empty trainer.command in k8s/prime-rl/values.yaml) doesn't filter per-rank stdout. Each rank's stdout reaches the pod's console, k8s ships every line to Loki, and the dashboard groups by role=trainer.

This is the same problem #2550 fixes, but #2550 fixes it by building loguru with no sinks on non-zero ranks (setup_logger(..., rank_zero_only=True)), which throws away per-rank info entirely (per-rank throughput / memory / debug traces on rank 5 are simply gone).

What

- docs/kubernetes.md: add --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=/data/outputs/logs/trainer/torchrun to the documented torchrun invocation. These are the same flags the local launcher (src/prime_rl/entrypoints/rl.py:311-314), the SFT launcher (src/prime_rl/entrypoints/sft.py:123-126), and both SLURM templates (src/prime_rl/templates/multi_node_{rl,sft}.sbatch.j2) already use.
- k8s/prime-rl/values.yaml: update the trainer.command comment so users authoring multi-GPU helm values see the same torchrun pattern without leaving the chart.

Result: the pod console shows only rank 0, and every rank's stdout/stderr is written to /data/outputs/logs/trainer/torchrun/{rdzv_id}/attempt_0/{rank}/{stdout,stderr}.log under the mounted PVC, matching the layout already described in docs/logging.md.

Relation to #2550

This is intended as a replacement for #2550, not a stack on top of it. With these docs / chart-comment changes in place, the rank_zero_only flag in setup_logger isn't needed, and we keep per-rank logs available for debugging hangs / per-rank divergence.

🤖 Generated with Claude Code
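To make the per-rank layout from the Result note concrete, here is a minimal sketch that expands the {rdzv_id}/attempt_0/{rank}/{stdout,stderr}.log pattern into paths for one rank. The `rank_log_files` helper and the `example-run` id are made up for illustration; the actual run directory name is chosen by torchrun at launch.

```python
from pathlib import PurePosixPath

# Base log directory taken from the --log-dir value in this PR.
LOG_DIR = PurePosixPath("/data/outputs/logs/trainer/torchrun")


def rank_log_files(rdzv_id: str, rank: int, attempt: int = 0) -> dict[str, PurePosixPath]:
    """Return the stdout/stderr log paths for one rank, following the stated layout."""
    base = LOG_DIR / rdzv_id / f"attempt_{attempt}" / str(rank)
    return {stream: base / f"{stream}.log" for stream in ("stdout", "stderr")}


# "example-run" is a placeholder id, not a value torchrun would generate.
files = rank_log_files("example-run", rank=5)
print(files["stdout"])
print(files["stderr"])
```

A helper like this could be handy when debugging a hang on a single rank, since it points straight at that rank's files on the PVC instead of the rank-0-only pod console.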