fix(trainer): suppress duplicated trainer logs on non-zero ranks#2550
fix(trainer): suppress duplicated trainer logs on non-zero ranks#2550JannikSt wants to merge 1 commit into
Conversation
Every torchrun rank writes to its own stdout, but in k8s all rank stdouts merge into a single Loki stream grouped by role=trainer. The dashboard's Trainer log tab then shows N copies of every line (one per GPU), e.g. 8 identical 'Starting training loop' lines and 8 near- identical 'Step 0 | ...' lines on an 8-GPU pod. The existing torchrun --local-ranks-filter plumbing in entrypoints/rl.py only applies to the launcher-managed single-node path; on k8s the trainer is invoked directly via torchrun, bypassing that filter. Fix it at the logger level: add a rank_zero_only flag to setup_logger that builds a sink-less loguru instance on non-zero global ranks (read from torchrun's RANK env var, set before dist init). The RL and SFT trainer entrypoints opt in. Default behavior is unchanged for all other setup_logger callers (orchestrator, inference, launcher).
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e3fe296b87
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| logger.add(sys.stdout, format=format, level=log_level.upper(), colorize=True) | ||
| # Install console handler (enqueue=True only for JSON mode to avoid blocking in async contexts). | ||
| # Silent ranks get a logger with no sinks so all log calls become no-ops. | ||
| if not is_silent_rank: |
There was a problem hiding this comment.
Keep fatal errors visible on non-zero ranks
When rank_zero_only=True is used by the RL/SFT trainers, this branch leaves non-zero ranks with no log sinks at all, so clean_exit's get_logger().opt(exception=True).error(...) becomes a no-op before it calls sys.exit(1). If only rank >0 hits a data/model/distributed error, torchrun will report a child failure but the original traceback is swallowed by clean_exit, making those failures effectively undiagnosable. Consider still emitting error/exception-level logs to stderr on silent ranks or bypassing the no-sink logger for fatal cleanup paths.
Useful? React with 👍 / 👎.
|
lol @samsja wants this |
|
Superseded by #2552 (Sami's launcher-flag approach) — better fix because it preserves per-rank logs on the PVC instead of silencing non-zero ranks entirely. Platform-side change to thread the same |
Why
On an 8-GPU trainer pod, the dashboard's Trainer log tab on
/dashboard/training/<id>shows every log line 8 times - e.g. 8 identical "Starting training loop" lines and 8 near-identical "Step 0 | Time ..." lines (each with slightly different per-rank throughput/memory).torchrun --nproc-per-node=Nspawns N processes, each writes to stdout, k8s ships each stdout to Loki as separate streams, and the platform groups byrole=trainer. The existing--local-ranks-filterplumbing inentrypoints/rl.pyonly applies to the launcher-managed single-node path; on k8s the trainer module is invoked directly via torchrun, bypassing that filter.What
setup_loggergains arank_zero_onlyflag. WhenTrueandRANK > 0(read from torchrun's env var, set before dist init), the loguru instance is built with no sinks so everylogger.info/.success/.debug/.warningcall becomes a no-op on non-zero ranks.trainer/rl/train.pyandtrainer/sft/train.pyopt in.setup_loggercallers (orchestrator, inference, RL launcher) are unchanged - they're single-process and never had the fan-out.Notes
RANK(notLOCAL_RANK) so it works correctly on multi-node setups where only one trainer pod has global rank 0.os.environ["RANK"]directly rather than going throughtorch.distributed.get_rank(), so it works for any log line emitted beforedist.init_process_group.Note
Low Risk
Low risk: changes only logging behavior by silencing non-zero distributed ranks, which could reduce per-rank visibility but does not affect training computation.
Overview
Reduces duplicated logs in distributed training runs.
setup_loggernow supportsrank_zero_only, which detects non-zeroRANK(fromtorchrun) and creates a loguru logger with no sinks so log calls become no-ops on those ranks.Both RL and SFT trainers opt into this (
rank_zero_only=True) to prevent N-way repeated stdout lines when Kubernetes/Loki merges per-rank streams.Reviewed by Cursor Bugbot for commit e3fe296. Bugbot is set up for automated code reviews on this repo. Configure here.