
feat: add multi-node WandB system metrics aggregation#1358

Open
penfever wants to merge 1 commit into NovaSky-AI:main from penfever:penfever/multi-node-wandb-metrics

Conversation


@penfever penfever commented Mar 20, 2026

Summary

  • In multi-node training, GPU utilization and system metrics are only captured for the head node
  • Adds WandbNodeLogger Ray actor spawned on each worker node, using wandb mode="shared" to aggregate system metrics from all nodes into a single wandb run
  • Single-node training is unaffected — the WandbNodeLogger only spawns when Ray detects multiple alive nodes
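The spawn condition in the last bullet can be sketched as a small helper (the function name `worker_node_ids` is illustrative, not from the PR; the dict shape mirrors what `ray.nodes()` returns):

```python
def worker_node_ids(nodes, head_node_id):
    """Return the IDs of alive non-head nodes.

    `nodes` is a list of dicts shaped like ray.nodes() output
    (keys "NodeID" and "Alive"). On single-node runs the result
    is empty, so no WandbNodeLogger actors are spawned there.
    """
    alive = [n for n in nodes if n.get("Alive")]
    if len(alive) <= 1:  # single node: leave the wandb setup untouched
        return []
    return [n["NodeID"] for n in alive if n["NodeID"] != head_node_id]
```

Dead nodes are filtered out first, so a cluster that has shrunk back to one alive node behaves exactly like single-node training.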

Test plan

  • Single-node: verify wandb init still works, no errors from WandbNodeLogger code path
  • Multi-node: verify system metrics panels appear for all nodes in the wandb run
  • Verify no import errors when ray is not initialized (e.g., local testing)
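The third test-plan item (no errors when Ray is absent or uninitialized) suggests a defensive guard along these lines — a hedged sketch with an illustrative function name:

```python
def alive_nodes_or_empty():
    """Return Ray's alive-node list, or [] when ray is not importable
    or not initialized, so local/single-process runs never enter the
    multi-node code path."""
    try:
        import ray
        if not ray.is_initialized():
            return []
        return [n for n in ray.nodes() if n.get("Alive")]
    except Exception:
        return []
```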

🤖 Generated with Claude Code



@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces multi-node WandB system metrics aggregation using Ray actors. The changes correctly identify head and worker nodes, initializing WandB in 'shared' mode with appropriate primary and label settings. The WandbNodeLogger actor is designed to run on worker nodes to collect system metrics. Overall, the implementation aligns with the stated goal of aggregating system metrics from all nodes into a single WandB run.


try:
    logger_actor = WandbNodeLogger.options(
        num_cpus=0,


Severity: medium

Setting num_cpus=0 for the WandbNodeLogger actor might prevent it from effectively collecting and reporting system metrics, as wandb's internal processes might require some CPU cycles. While the intention is to minimize resource usage, a small non-zero value (e.g., 0.1 or 0.01) is generally recommended for actors that perform background tasks to ensure they receive enough CPU time to function correctly.

Suggested change:
-        num_cpus=0,
+        num_cpus=0.1,

            x_label=x_label,
        ),
    )
    self.wandb = run


Severity: medium

To ensure proper resource cleanup and to explicitly terminate the WandB run on worker nodes, consider adding a finish_run method to the WandbNodeLogger class. This method could call self.wandb.finish() and then be invoked from the main Tracking class's finish or __del__ methods for each remote_logger_actor. This ensures all metrics are properly synced and resources are released when the training run concludes.

Suggested change:
         self.wandb = run
+
+    def finish_run(self):
+        if self.wandb:
+            self.wandb.finish()
+            self.wandb = None


try:
    nodes = ray.nodes()
    self.remote_loggers = []


Severity: medium

The self.remote_loggers list stores references to the WandbNodeLogger Ray actors. It's important to manage the lifecycle of these actors to ensure they are properly shut down and release resources when the main Tracking instance is no longer needed. Consider adding logic in the Tracking class's finish or __del__ method to iterate through self.remote_loggers and call a remote finish_run method on each actor (after implementing finish_run in WandbNodeLogger).
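A sketch of the cleanup the reviewer describes; the `finish_run` method and the injectable `get` parameter (defaulting to `ray.get` in real use) are illustrative, not taken from the PR:

```python
def finish_remote_loggers(remote_loggers, get=None):
    """Call finish_run on every WandbNodeLogger actor handle and block
    until all worker-node wandb runs have flushed, then drop the handles."""
    if get is None:
        import ray
        get = ray.get  # block on the remote calls in real deployments
    refs = [logger.finish_run.remote() for logger in remote_loggers]
    get(refs)
    remote_loggers.clear()  # release actor handles so Ray can reap them
```

`Tracking.finish()` (or `__del__`) would call this with `self.remote_loggers` so the actors shut down with the run rather than lingering until the driver exits.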


In multi-node training, GPU utilization and system metrics are only
captured for the head node. This adds a WandbNodeLogger Ray actor
spawned on each worker node, using wandb mode="shared" to aggregate
system metrics (GPU util, memory) from all nodes into a single run.

Single-node training is unaffected — the WandbNodeLogger only spawns
when Ray detects multiple alive nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@penfever force-pushed the penfever/multi-node-wandb-metrics branch from 03e2b87 to 21f2646 on March 20, 2026 at 14:06