fix(trainer): suppress duplicated trainer logs on non-zero ranks by JannikSt · Pull Request #2550 · PrimeIntellect-ai/prime-rl

JannikSt · 2026-05-19T00:22:16Z

Why

On an 8-GPU trainer pod, the dashboard's Trainer log tab on /dashboard/training/<id> shows every log line 8 times - e.g. 8 identical "Starting training loop" lines and 8 near-identical "Step 0 | Time ..." lines (each with slightly different per-rank throughput/memory).

torchrun --nproc-per-node=N spawns N processes, each writes to stdout, k8s ships each stdout to Loki as separate streams, and the platform groups by role=trainer. The existing --local-ranks-filter plumbing in entrypoints/rl.py only applies to the launcher-managed single-node path; on k8s the trainer module is invoked directly via torchrun, bypassing that filter.

What

setup_logger gains a rank_zero_only flag. When True and RANK > 0 (read from torchrun's env var, set before dist init), the loguru instance is built with no sinks so every logger.info / .success / .debug / .warning call becomes a no-op on non-zero ranks.
trainer/rl/train.py and trainer/sft/train.py opt in.
All other setup_logger callers (orchestrator, inference, RL launcher) are unchanged - they're single-process and never had the fan-out.
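
The rank check described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `is_silent_rank` borrows the name of the variable visible in the review diff, but the function signature, the `environ` parameter, and the choice to treat an unset `RANK` as rank 0 are assumptions made here so the logic can be exercised without launching torchrun.

```python
import os
from typing import Mapping, Optional


def is_silent_rank(rank_zero_only: bool, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when this process should get a logger with no sinks.

    Hypothetical helper: the signature and the `environ` parameter are
    assumptions for illustration, not the PR's real interface.
    """
    if not rank_zero_only:
        # Callers that did not opt in keep their usual logging.
        return False
    env = os.environ if environ is None else environ
    # torchrun exports RANK before the script starts, so no process group
    # is needed. Assumption: an unset RANK means a single-process run.
    return int(env.get("RANK", "0")) > 0


if __name__ == "__main__":
    for rank in ("0", "1", "7"):
        print(rank, is_silent_rank(True, {"RANK": rank}))
```

Because the check depends only on the environment, it can run at the very top of an entrypoint, before any distributed initialization.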

Notes

Uses global RANK (not LOCAL_RANK) so it works correctly on multi-node setups where only one trainer pod has global rank 0.
Reads os.environ["RANK"] directly rather than going through torch.distributed.get_rank(), so it works for any log line emitted before dist.init_process_group.
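
To see why the global rank matters, the sketch below simulates the environment of every process in a hypothetical 2-node, 8-GPU-per-node job and counts how many processes would keep logging under each choice. The cluster shape and the rank layout (global rank = node index × GPUs per node + local rank) are assumptions for illustration; real assignments depend on the rendezvous.

```python
NODES = 2          # hypothetical cluster shape
GPUS_PER_NODE = 8  # hypothetical processes per node


def env_for(node: int, local_rank: int) -> dict:
    """Build the rank-related env vars one process would see (assumed layout)."""
    return {
        "RANK": str(node * GPUS_PER_NODE + local_rank),
        "LOCAL_RANK": str(local_rank),
    }


envs = [env_for(n, l) for n in range(NODES) for l in range(GPUS_PER_NODE)]

# Filtering on the global rank leaves exactly one logging process in the job.
logging_by_rank = [e for e in envs if int(e["RANK"]) == 0]
# Filtering on the local rank leaves one logging process per node.
logging_by_local_rank = [e for e in envs if int(e["LOCAL_RANK"]) == 0]

print(len(logging_by_rank), len(logging_by_local_rank))  # prints "1 2"
```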

Note

Low Risk
Low risk: changes only logging behavior by silencing non-zero distributed ranks, which could reduce per-rank visibility but does not affect training computation.

Overview
Reduces duplicated logs in distributed training runs. setup_logger now supports rank_zero_only, which detects non-zero RANK (from torchrun) and creates a loguru logger with no sinks so log calls become no-ops on those ranks.

Both RL and SFT trainers opt into this (rank_zero_only=True) to prevent N-way repeated stdout lines when Kubernetes/Loki merges per-rank streams.

Reviewed by Cursor Bugbot for commit e3fe296. Bugbot is set up for automated code reviews on this repo.

Every torchrun rank writes to its own stdout, but in k8s all rank stdouts merge into a single Loki stream grouped by role=trainer. The dashboard's Trainer log tab then shows N copies of every line (one per GPU), e.g. 8 identical 'Starting training loop' lines and 8 near-identical 'Step 0 | ...' lines on an 8-GPU pod.

The existing torchrun --local-ranks-filter plumbing in entrypoints/rl.py only applies to the launcher-managed single-node path; on k8s the trainer is invoked directly via torchrun, bypassing that filter.

Fix it at the logger level: add a rank_zero_only flag to setup_logger that builds a sink-less loguru instance on non-zero global ranks (read from torchrun's RANK env var, set before dist init). The RL and SFT trainer entrypoints opt in. Default behavior is unchanged for all other setup_logger callers (orchestrator, inference, launcher).

JannikSt · 2026-05-19T00:25:59Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3fe296b87

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-19T00:27:08Z

-        logger.add(sys.stdout, format=format, level=log_level.upper(), colorize=True)
+    # Install console handler (enqueue=True only for JSON mode to avoid blocking in async contexts).
+    # Silent ranks get a logger with no sinks so all log calls become no-ops.
+    if not is_silent_rank:


Keep fatal errors visible on non-zero ranks

When rank_zero_only=True is used by the RL/SFT trainers, this branch leaves non-zero ranks with no log sinks at all, so clean_exit's get_logger().opt(exception=True).error(...) becomes a no-op before it calls sys.exit(1). If only rank >0 hits a data/model/distributed error, torchrun will report a child failure but the original traceback is swallowed by clean_exit, making those failures effectively undiagnosable. Consider still emitting error/exception-level logs to stderr on silent ranks or bypassing the no-sink logger for fatal cleanup paths.
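
One way to act on this suggestion is to give non-zero ranks an error-only sink rather than no sink at all. The sketch below uses Python's standard `logging` module as a stand-in for loguru so it runs without extra dependencies; `build_logger` and its parameters are hypothetical and are not part of prime-rl.

```python
import logging
import sys


def build_logger(name: str, rank: int, level: str = "INFO") -> logging.Logger:
    """Rank 0 logs normally to stdout; other ranks only emit ERROR and above to stderr.

    Hypothetical helper using stdlib logging as a stand-in for loguru.
    """
    logger = logging.getLogger(name)
    logger.handlers.clear()         # start from a clean handler list
    logger.propagate = False        # avoid duplicate output via the root logger
    logger.setLevel(logging.DEBUG)  # let the handler decide what is emitted

    if rank == 0:
        handler = logging.StreamHandler(sys.stdout)
        handler.setLevel(level.upper())
    else:
        # Quiet rank: routine lines are dropped, failures stay visible.
        handler = logging.StreamHandler(sys.stderr)
        handler.setLevel(logging.ERROR)

    handler.setFormatter(logging.Formatter(f"[rank {rank}] %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = build_logger("trainer", rank=3)
    log.info("Starting training loop")  # dropped on rank 3
    try:
        raise RuntimeError("bad shard")
    except RuntimeError:
        log.exception("Fatal error")    # traceback still reaches stderr
```

With this layout, a traceback logged in a cleanup path on rank 3 still appears on stderr, while the per-step lines that caused the duplication stay suppressed.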


mikasenghaas · 2026-05-19T00:48:48Z

lol @samsja wants this

JannikSt · 2026-05-19T04:29:14Z

Superseded by #2552 (Sami's launcher-flag approach) — better fix because it preserves per-rank logs on the PVC instead of silencing non-zero ranks entirely. Platform-side change to thread the same --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=... flags into the chart's torchrun command lives in PrimeIntellect-ai/platform PR #2071 (commit 36320d073).
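
As a rough sketch of what threading those flags into a launch command could look like, the helper below assembles an argument list. The four logging flags are copied verbatim from the comment above; `build_torchrun_cmd`, its parameters, and the idea of passing the log directory and entrypoint in from the caller are assumptions, since the comment elides the actual directory and does not show the chart's command.

```python
from typing import List, Sequence


def build_torchrun_cmd(nproc_per_node: int, log_dir: str, entrypoint: Sequence[str]) -> List[str]:
    """Assemble a torchrun argv with the logging flags quoted in the comment.

    Hypothetical helper: `log_dir` and `entrypoint` are caller-supplied
    because the source does not state them.
    """
    return [
        "torchrun",
        f"--nproc-per-node={nproc_per_node}",
        "--local-ranks-filter=0",  # console output limited to local rank 0
        "--tee=3",                 # flag value as written in the comment
        "--redirect=3",            # flag value as written in the comment
        f"--log-dir={log_dir}",    # where per-rank log files are written
        *entrypoint,
    ]


if __name__ == "__main__":
    # The log directory and script name below are placeholders, not real paths.
    print(" ".join(build_torchrun_cmd(8, "/path/to/logs", ["train.py"])))
```

The design difference from this PR is that the filtering happens in the launcher, so every rank still writes its full log to a file and only the console stream is narrowed.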

JannikSt marked this pull request as ready for review May 19, 2026 00:25

chatgpt-codex-connector Bot reviewed May 19, 2026


samsja mentioned this pull request May 19, 2026

docs(k8s): filter trainer torchrun logs to rank 0 instead of silencing #2552

Draft

JannikSt closed this May 19, 2026

JannikSt wants to merge 1 commit into main from improvement/rank0-only-trainer-logs
