docs(k8s): filter trainer torchrun logs to rank 0 instead of silencing #2552

Draft
samsja wants to merge 2 commits into main from fix/k8s-trainer-torchrun-log-filter

@samsja commented May 19, 2026

Why

On an 8-GPU k8s trainer pod, the dashboard's Trainer log tab shows every line 8 times because the torchrun snippet documented in docs/kubernetes.md (and implied by the empty trainer.command in k8s/prime-rl/values.yaml) doesn't filter per-rank stdout. Each rank's stdout reaches the pod's console, k8s ships every line to Loki, and the dashboard groups by role=trainer.

This is the same problem #2550 addresses. However, #2550 solves it by configuring loguru with no sinks on non-zero ranks (setup_logger(..., rank_zero_only=True)), which discards per-rank output entirely: per-rank throughput, memory, and debug traces from, say, rank 5 are simply gone.

What

  • docs/kubernetes.md: add --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=/data/outputs/logs/trainer/torchrun to the documented torchrun invocation — the same flags the local launcher (src/prime_rl/entrypoints/rl.py:311-314), the SFT launcher (src/prime_rl/entrypoints/sft.py:123-126), and both SLURM templates (src/prime_rl/templates/multi_node_{rl,sft}.sbatch.j2) already use.
  • k8s/prime-rl/values.yaml: update the trainer.command comment so users authoring multi-GPU helm values see the same torchrun pattern without leaving the chart.
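
For reviewers who want to see how the four flags compose, here is a minimal sketch that assembles the torchrun argv in Python. Only the flag values and the log directory come from this PR; the helper name `build_torchrun_command`, the script name `train.py`, and its `--config trainer.toml` arguments are placeholders for illustration, not the actual prime-rl entrypoint.

```python
import shlex

# Log directory named in this PR; it lives on the mounted PVC.
LOG_DIR = "/data/outputs/logs/trainer/torchrun"


def build_torchrun_command(nproc_per_node: int, script: str, script_args: list[str]) -> list[str]:
    """Return a torchrun argv that echoes only local rank 0 to the console
    while still writing every rank's stdout/stderr to files under LOG_DIR.

    Hypothetical helper: the repository builds its command elsewhere.
    """
    return [
        "torchrun",
        f"--nproc-per-node={nproc_per_node}",
        "--local-ranks-filter=0",  # only local rank 0 is shown on the console
        "--tee=3",                 # 3 = stdout and stderr: tee to console and files
        "--redirect=3",            # 3 = stdout and stderr: write to per-rank files
        f"--log-dir={LOG_DIR}",
        script,                    # placeholder script name
        *script_args,              # placeholder script arguments
    ]


cmd = build_torchrun_command(8, "train.py", ["--config", "trainer.toml"])
print(shlex.join(cmd))
```

The printed line can be pasted into the trainer.command value or a shell once the placeholder script and arguments are swapped for the real ones.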

Result:

  • Only rank 0's stdout/stderr reach the pod's console → Loki ingests each log line once → dashboard shows it once.
  • Every rank's stdout/stderr is still written to /data/outputs/logs/trainer/torchrun/{rdzv_id}/attempt_0/{rank}/{stdout,stderr}.log under the mounted PVC, matching the layout already described in docs/logging.md.
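
When debugging a hang on a specific rank, the per-rank files can be located from the layout above. The sketch below only formats paths following that layout; the helper name `rank_log_paths` and the rendezvous id `my-run` are hypothetical, and the real `{rdzv_id}` is whatever torchrun was launched with.

```python
from pathlib import PurePosixPath

# Log directory passed to torchrun via --log-dir in this PR.
LOG_DIR = PurePosixPath("/data/outputs/logs/trainer/torchrun")


def rank_log_paths(rdzv_id: str, rank: int, attempt: int = 0) -> dict[str, PurePosixPath]:
    """Return the stdout/stderr log paths for one rank, following the
    {rdzv_id}/attempt_N/{rank}/{stdout,stderr}.log layout described above."""
    rank_dir = LOG_DIR / rdzv_id / f"attempt_{attempt}" / str(rank)
    return {stream: rank_dir / f"{stream}.log" for stream in ("stdout", "stderr")}


# "my-run" is a made-up rendezvous id used only for this example.
paths = rank_log_paths("my-run", rank=5)
print(paths["stdout"])  # /data/outputs/logs/trainer/torchrun/my-run/attempt_0/5/stdout.log
```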

Relation to #2550

This is intended as a replacement for #2550, not a stack on top of it. With these docs / chart-comment changes in place the rank_zero_only flag in setup_logger isn't needed, and we keep per-rank logs available for debugging hangs / per-rank divergence.

🤖 Generated with Claude Code

samsja and others added 2 commits May 18, 2026 18:05
The trainer torchrun snippet in docs/kubernetes.md previously omitted
--local-ranks-filter / --tee / --redirect / --log-dir, so every rank's
stdout reached the pod's console and Loki ingested N copies of each line
(visible as the duplicated lines in the dashboard's Trainer log tab on
an N-GPU trainer pod).

Add the same flags the local launcher and SLURM templates use:
--local-ranks-filter=0 + --tee=3 keep only rank 0 on the pod console,
while --redirect=3 + --log-dir=/data/outputs/logs/trainer/torchrun
still writes every rank's stdout/stderr to per-rank files under the
mounted PVC for debugging.

This is an alternative to #2550 (which fixed the same dashboard
duplication by silencing loguru on non-zero ranks in setup_logger).
Doing it at torchrun keeps per-rank logs available on disk and avoids
the in-process rank_zero_only flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…attern

Mention the --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=...
torchrun flags directly in the values.yaml trainer.command comment so
users authoring multi-GPU helm values see them without having to read
docs/kubernetes.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>