Skip to content

fix(trainer): suppress duplicated trainer logs on non-zero ranks#2550

Closed
JannikSt wants to merge 1 commit into
mainfrom
improvement/rank0-only-trainer-logs
Closed

fix(trainer): suppress duplicated trainer logs on non-zero ranks#2550
JannikSt wants to merge 1 commit into
mainfrom
improvement/rank0-only-trainer-logs

Conversation

@JannikSt
Copy link
Copy Markdown
Member

@JannikSt JannikSt commented May 19, 2026

Why

On an 8-GPU trainer pod, the dashboard's Trainer log tab on /dashboard/training/<id> shows every log line 8 times - e.g. 8 identical "Starting training loop" lines and 8 near-identical "Step 0 | Time ..." lines (each with slightly different per-rank throughput/memory).

torchrun --nproc-per-node=N spawns N processes, each writes to stdout, k8s ships each stdout to Loki as separate streams, and the platform groups by role=trainer. The existing --local-ranks-filter plumbing in entrypoints/rl.py only applies to the launcher-managed single-node path; on k8s the trainer module is invoked directly via torchrun, bypassing that filter.

What

  • setup_logger gains a rank_zero_only flag. When True and RANK > 0 (read from torchrun's env var, set before dist init), the loguru instance is built with no sinks so every logger.info / .success / .debug / .warning call becomes a no-op on non-zero ranks.
  • trainer/rl/train.py and trainer/sft/train.py opt in.
  • All other setup_logger callers (orchestrator, inference, RL launcher) are unchanged - they're single-process and never had the fan-out.

Notes

  • Uses global RANK (not LOCAL_RANK) so it works correctly on multi-node setups where only one trainer pod has global rank 0.
  • Reads os.environ["RANK"] directly rather than going through torch.distributed.get_rank(), so it works for any log line emitted before dist.init_process_group.

Note

Low Risk
Low risk: changes only logging behavior by silencing non-zero distributed ranks, which could reduce per-rank visibility but does not affect training computation.

Overview
Reduces duplicated logs in distributed training runs. setup_logger now supports rank_zero_only, which detects non-zero RANK (from torchrun) and creates a loguru logger with no sinks so log calls become no-ops on those ranks.

Both RL and SFT trainers opt into this (rank_zero_only=True) to prevent N-way repeated stdout lines when Kubernetes/Loki merges per-rank streams.

Reviewed by Cursor Bugbot for commit e3fe296. Bugbot is set up for automated code reviews on this repo. Configure here.

Every torchrun rank writes to its own stdout, but in k8s all rank
stdouts merge into a single Loki stream grouped by role=trainer. The
dashboard's Trainer log tab then shows N copies of every line (one per
GPU), e.g. 8 identical 'Starting training loop' lines and 8 near-
identical 'Step 0 | ...' lines on an 8-GPU pod.

The existing torchrun --local-ranks-filter plumbing in entrypoints/rl.py
only applies to the launcher-managed single-node path; on k8s the
trainer is invoked directly via torchrun, bypassing that filter.

Fix it at the logger level: add a rank_zero_only flag to setup_logger
that builds a sink-less loguru instance on non-zero global ranks (read
from torchrun's RANK env var, set before dist init). The RL and SFT
trainer entrypoints opt in. Default behavior is unchanged for all other
setup_logger callers (orchestrator, inference, launcher).
@JannikSt JannikSt marked this pull request as ready for review May 19, 2026 00:25
@JannikSt
Copy link
Copy Markdown
Member Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e3fe296b87

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

logger.add(sys.stdout, format=format, level=log_level.upper(), colorize=True)
# Install console handler (enqueue=True only for JSON mode to avoid blocking in async contexts).
# Silent ranks get a logger with no sinks so all log calls become no-ops.
if not is_silent_rank:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep fatal errors visible on non-zero ranks

When rank_zero_only=True is used by the RL/SFT trainers, this branch leaves non-zero ranks with no log sinks at all, so clean_exit's get_logger().opt(exception=True).error(...) becomes a no-op before it calls sys.exit(1). If only rank >0 hits a data/model/distributed error, torchrun will report a child failure but the original traceback is swallowed by clean_exit, making those failures effectively undiagnosable. Consider still emitting error/exception-level logs to stderr on silent ranks or bypassing the no-sink logger for fatal cleanup paths.

Useful? React with 👍 / 👎.

@mikasenghaas
Copy link
Copy Markdown
Member

lol @samsja wants this

@JannikSt
Copy link
Copy Markdown
Member Author

Superseded by #2552 (Sami's launcher-flag approach) — better fix because it preserves per-rank logs on the PVC instead of silencing non-zero ranks entirely. Platform-side change to thread the same --local-ranks-filter=0 --tee=3 --redirect=3 --log-dir=... flags into the chart's torchrun command lives in PrimeIntellect-ai/platform PR #2071 (commit 36320d073).

@JannikSt JannikSt closed this May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants