WIP: fix two DDP wandb logging bugs (duplicate runs + dropped eval metrics) by eugenevinitsky · Pull Request #468 · Emerge-Lab/PufferDrive

eugenevinitsky · 2026-06-02T11:37:39Z

WIP. Two independent DDP-only logging bugs surfaced while running the multi-agent nightly on 4 GPUs.

1. Duplicate wandb runs (one per rank)

Logger creation in train() was ungated, so under torchrun every rank called wandb.init() → world_size duplicate runs in the same group. Now only rank 0 builds the logger; other ranks keep logger=None (PuffeRL wraps that in NoLogger). Eval was already rank-0-only, so nothing else changes.
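A minimal sketch of the rank-0 gate described above, not the actual diff. It assumes the rank is read from the `RANK` environment variable that torchrun exports to each worker; `get_rank` and `make_logger_if_rank0` are hypothetical names, and the factory argument stands in for whatever builds the wandb logger in train().

```python
import os


def get_rank() -> int:
    # torchrun exports RANK for every worker; fall back to 0 for single-process runs.
    return int(os.environ.get("RANK", "0"))


def make_logger_if_rank0(make_logger):
    """Call the (hypothetical) logger factory on rank 0 only; other ranks get None."""
    return make_logger() if get_rank() == 0 else None
```

With this shape, the call that would trigger wandb.init() never runs on ranks 1..world_size-1, and the resulting None is what PuffeRL wraps in NoLogger.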

2. Eval metrics silently dropped by wandb

Training logs at agent_steps = dist_sum(global_step) (summed across ranks); eval logged at the raw per-rank global_step, which is world_size× smaller. wandb rejected every eval log as non-monotonic ("step ... less than current step ... ignored"), so validation metrics never showed up.
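The step mismatch can be illustrated with a toy model. `ToyStepLog` below is a hypothetical stand-in that only mimics the "ignore any step below the current one" behaviour described above; it is not wandb's implementation. The numbers (4 ranks, 1000 steps per rank) are made up for illustration, and every rank is assumed to be at the same global_step.

```python
class ToyStepLog:
    """Toy logger: drops any entry whose step is below the highest step seen so far."""

    def __init__(self):
        self.current_step = 0
        self.accepted = []
        self.dropped = []

    def log(self, name, step):
        if step < self.current_step:
            self.dropped.append((name, step))  # non-monotonic step, ignored
            return False
        self.current_step = step
        self.accepted.append((name, step))
        return True


world_size = 4
per_rank_step = 1000                      # raw global_step on one rank
agent_steps = world_size * per_rank_step  # what a sum across ranks yields here

# Before the fix: training logs at the summed step, eval at the per-rank step.
buggy = ToyStepLog()
buggy.log("train", agent_steps)
buggy.log("eval", per_rank_step)

# After the fix: eval logs at the same aggregate step as training.
fixed = ToyStepLog()
fixed.log("train", agent_steps)
fixed.log("eval", agent_steps)
```

In the buggy sequence the eval entry lands in `dropped`; in the fixed sequence both entries are accepted.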

1. Only rank 0 creates the run logger. Logger creation was ungated, so under torchrun every rank called wandb.init()/NeptuneLogger and produced world_size duplicate runs. Non-rank-0 ranks now keep logger=None (PuffeRL wraps it in a NoLogger).

2. Eval logs at the aggregate step, not the per-rank one. Training logs at agent_steps = dist_sum(global_step) (summed across ranks) while eval logged at the raw per-rank global_step, which is world_size× smaller. wandb dropped the eval metrics as non-monotonic ("step ... less than current step ... ignored"), so validation metrics never showed up. Stash agent_steps in mean_and_log and log eval at it (both the in-loop and final force=True maybe_run).

Single-GPU is unchanged since dist_sum returns the raw value when not distributed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

Fixes two DDP-only logging bugs: (1) every rank was creating its own wandb/Neptune/TB run, producing world_size duplicate runs per training job, and (2) eval metrics were being silently rejected by wandb because eval logged at the raw per-rank global_step while training logged at the rank-summed agent_steps, making the eval step non-monotonic.

Changes:

Gate logger construction in train() to rank 0 only; non-rank-0 ranks pass logger=None and are wrapped in NoLogger by PuffeRL.
Cache the agent_steps (rank-summed global_step) computed inside mean_and_log on self.agent_steps, and use it as the step for both the in-loop and final force=True _eval_manager.maybe_run calls.
Initialize self.agent_steps = 0 in PuffeRL.__init__.
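The caching pattern in the list above might look roughly like the sketch below. `ToyTrainer` is a hypothetical stand-in for PuffeRL, not its real code: its `dist_sum` fakes the cross-rank sum by multiplying by world_size (assuming all ranks hold the same global_step), and `run_eval` stands in for the _eval_manager.maybe_run calls.

```python
class ToyTrainer:
    """Hypothetical trainer showing the 'stash agent_steps, reuse it for eval' pattern."""

    def __init__(self, world_size=1):
        self.world_size = world_size
        self.global_step = 0  # per-rank step counter
        self.agent_steps = 0  # rank-summed step; initialised so eval can read it early
        self.logged = []

    def dist_sum(self, value):
        # Stand-in for a cross-rank sum; with world_size == 1 it returns the raw value.
        return value * self.world_size

    def mean_and_log(self):
        # Compute the aggregate step once and cache it for later eval logging.
        self.agent_steps = self.dist_sum(self.global_step)
        self.logged.append(("train", self.agent_steps))

    def run_eval(self):
        # Eval reuses the cached aggregate step instead of the per-rank global_step.
        self.logged.append(("eval", self.agent_steps))


ddp = ToyTrainer(world_size=4)
ddp.global_step = 250
ddp.mean_and_log()
ddp.run_eval()

single = ToyTrainer()
single.global_step = 250
single.mean_and_log()
single.run_eval()
```

Training and eval end up on the same step axis in both the 4-rank and the single-process case, which matches the claim that single-GPU behaviour is unchanged.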


Copilot AI review requested due to automatic review settings June 2, 2026 11:37

Copilot started reviewing on behalf of eugenevinitsky June 2, 2026 11:37

Copilot AI reviewed Jun 2, 2026


vcharraut approved these changes Jun 2, 2026


eugenevinitsky wants to merge 1 commit into emerge/temp_training from ev/fix-ddp-wandb-logging


3 participants
