docs(k8s): filter trainer torchrun logs to rank 0 instead of silencing by samsja · Pull Request #2552 · PrimeIntellect-ai/prime-rl

samsja · 2026-05-19T01:06:27Z

Why

On an 8-GPU k8s trainer pod, the dashboard's Trainer log tab shows every line 8 times because the torchrun snippet documented in docs/kubernetes.md (and implied by the empty trainer.command in k8s/prime-rl/values.yaml) doesn't filter per-rank stdout. Each rank's stdout reaches the pod's console, k8s ships every line to Loki, and the dashboard groups by role=trainer.

This is the same problem #2550 addresses, but #2550 does so by configuring loguru with no sinks on non-zero ranks (setup_logger(..., rank_zero_only=True)), which discards per-rank information entirely: per-rank throughput, memory, and debug traces (on rank 5, say) are simply gone.

What

docs/kubernetes.md: add --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=/data/outputs/logs/trainer/torchrun to the documented torchrun invocation — the same flags the local launcher (src/prime_rl/entrypoints/rl.py:311-314), the SFT launcher (src/prime_rl/entrypoints/sft.py:123-126), and both SLURM templates (src/prime_rl/templates/multi_node_{rl,sft}.sbatch.j2) already use.
k8s/prime-rl/values.yaml: update the trainer.command comment so users authoring multi-GPU helm values see the same torchrun pattern without leaving the chart.
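
For illustration, the flag set described above can be assembled as an argument list. This is a minimal sketch, not the exact snippet from docs/kubernetes.md: the helper name `build_trainer_torchrun_args`, the `--nproc-per-node` value, and the `train.py` script name are placeholders assumed here. Only the four logging flags and the log directory come from this PR.

```python
from typing import List

# Log directory as given in the PR description.
LOG_DIR = "/data/outputs/logs/trainer/torchrun"


def build_trainer_torchrun_args(nproc_per_node: int, script: str) -> List[str]:
    """Assemble a torchrun argv that keeps only local rank 0 on the console.

    `nproc_per_node` and `script` are hypothetical inputs for this sketch;
    the logging flags are the ones named in the PR.
    """
    return [
        "torchrun",
        f"--nproc-per-node={nproc_per_node}",
        "--local-ranks-filter=0",  # console output limited to local rank 0
        "--tee=3",                 # tee both stdout and stderr
        "--redirect=3",            # redirect both stdout and stderr to files
        f"--log-dir={LOG_DIR}",    # per-rank files land under this directory
        script,
    ]


args = build_trainer_torchrun_args(8, "train.py")
print(" ".join(args))
```

Building the command as a list rather than a single string makes it easy to drop into a container `command:` array and avoids shell-quoting surprises.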

Result:

Only rank 0's stdout/stderr reach the pod's console → Loki ingests each log line once → dashboard shows it once.
Every rank's stdout/stderr is still written to /data/outputs/logs/trainer/torchrun/{rdzv_id}/attempt_0/{rank}/{stdout,stderr}.log under the mounted PVC, matching the layout already described in docs/logging.md.
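
When debugging a single rank, the on-disk layout above can be turned into concrete paths. The following is a small sketch under stated assumptions: `rank_log_paths` is a hypothetical helper, and the `example-run` rendezvous id is made up for illustration; the directory pattern itself is the one quoted in this PR.

```python
from pathlib import PurePosixPath
from typing import Dict


def rank_log_paths(
    log_dir: str, rdzv_id: str, rank: int, attempt: int = 0
) -> Dict[str, PurePosixPath]:
    """Return the stdout/stderr log paths for one rank.

    Follows the layout {log_dir}/{rdzv_id}/attempt_{n}/{rank}/{stdout,stderr}.log
    described in the PR. PurePosixPath is used because the paths live inside
    a Linux pod, regardless of where this helper runs.
    """
    base = PurePosixPath(log_dir) / rdzv_id / f"attempt_{attempt}" / str(rank)
    return {"stdout": base / "stdout.log", "stderr": base / "stderr.log"}


# "example-run" is a placeholder rendezvous id, not a value from the PR.
paths = rank_log_paths("/data/outputs/logs/trainer/torchrun", "example-run", rank=5)
print(paths["stdout"])
# /data/outputs/logs/trainer/torchrun/example-run/attempt_0/5/stdout.log
```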

Relation to #2550

This is intended as a replacement for #2550, not a stack on top of it. With these docs / chart-comment changes in place the rank_zero_only flag in setup_logger isn't needed, and we keep per-rank logs available for debugging hangs / per-rank divergence.

🤖 Generated with Claude Code

The trainer torchrun snippet in docs/kubernetes.md previously omitted --local-ranks-filter / --tee / --redirect / --log-dir, so every rank's stdout reached the pod's console and Loki ingested N copies of each line (visible as the duplicated lines in the dashboard's Trainer log tab on an N-GPU trainer pod).

Add the same flags the local launcher and SLURM templates use: --local-ranks-filter=0 + --tee=3 keep only rank 0 on the pod console, while --redirect=3 + --log-dir=/data/outputs/logs/trainer/torchrun still writes every rank's stdout/stderr to per-rank files under the mounted PVC for debugging.

This is an alternative to #2550 (which fixed the same dashboard duplication by silencing loguru on non-zero ranks in setup_logger). Doing it at torchrun keeps per-rank logs available on disk and avoids the in-process rank_zero_only flag.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…attern

Mention the --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=... torchrun flags directly in the values.yaml trainer.command comment so users authoring multi-GPU helm values see them without having to read docs/kubernetes.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samsja and others added 2 commits May 18, 2026 18:05

JannikSt mentioned this pull request May 19, 2026

fix(trainer): suppress duplicated trainer logs on non-zero ranks #2550

Closed

samsja wants to merge 2 commits into main from fix/k8s-trainer-torchrun-log-filter
