feat: add multi-node WandB system metrics aggregation #1358

penfever wants to merge 1 commit into NovaSky-AI:main
Conversation
Code Review
This pull request introduces multi-node WandB system metrics aggregation using Ray actors. The changes correctly identify head and worker nodes, initializing WandB in 'shared' mode with appropriate primary and label settings. The WandbNodeLogger actor is designed to run on worker nodes to collect system metrics. Overall, the implementation aligns with the stated goal of aggregating system metrics from all nodes into a single WandB run.
skyrl/train/utils/tracking.py (outdated)

    try:
        logger_actor = WandbNodeLogger.options(
            num_cpus=0,
Setting num_cpus=0 for the WandbNodeLogger actor might prevent it from effectively collecting and reporting system metrics, since wandb's internal processes need some CPU cycles. While the intention is to minimize resource usage, a small non-zero value (e.g., 0.1 or 0.01) is generally recommended for actors that perform background tasks, so they receive enough CPU time to function correctly.
Suggested change:

    -            num_cpus=0,
    +            num_cpus=0.1,
                x_label=x_label,
            ),
        )
        self.wandb = run
To ensure proper resource cleanup and to explicitly terminate the WandB run on worker nodes, consider adding a finish_run method to the WandbNodeLogger class. This method could call self.wandb.finish() and then be invoked from the main Tracking class's finish or __del__ methods for each remote_logger_actor. This ensures all metrics are properly synced and resources are released when the training run concludes.
Suggested change:

        self.wandb = run

    def finish_run(self):
        if self.wandb:
            self.wandb.finish()
            self.wandb = None
skyrl/train/utils/tracking.py (outdated)

    try:
        nodes = ray.nodes()
        self.remote_loggers = []
The self.remote_loggers list stores references to the WandbNodeLogger Ray actors. It's important to manage the lifecycle of these actors to ensure they are properly shut down and release resources when the main Tracking instance is no longer needed. Consider adding logic in the Tracking class's finish or __del__ method to iterate through self.remote_loggers and call a remote finish_run method on each actor (after implementing finish_run in WandbNodeLogger).
In multi-node training, GPU utilization and system metrics are only captured for the head node. This change adds a WandbNodeLogger Ray actor spawned on each worker node, using wandb mode="shared" to aggregate system metrics (GPU utilization, memory) from all nodes into a single run. Single-node training is unaffected: the WandbNodeLogger only spawns when Ray detects multiple alive nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
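The "multiple alive nodes" check can be expressed as plain filtering over the dicts that `ray.nodes()` returns (each entry carries an `"Alive"` flag and a `"NodeID"` string). A sketch — the helper name and the head-node-by-ID detection are assumptions, not the PR's exact code:

```python
def alive_worker_node_ids(nodes, head_node_id):
    """Return NodeIDs of alive nodes other than the head.

    `nodes` is shaped like the list of dicts ray.nodes() returns;
    only the "Alive" and "NodeID" keys are used here.
    """
    return [
        n["NodeID"]
        for n in nodes
        if n["Alive"] and n["NodeID"] != head_node_id
    ]


# Sample shaped like ray.nodes() output (IDs are made up):
sample = [
    {"NodeID": "head", "Alive": True},
    {"NodeID": "w1", "Alive": True},
    {"NodeID": "w2", "Alive": False},  # dead node: skipped
]
workers = alive_worker_node_ids(sample, "head")
print(workers)  # → ['w1']
# Spawn one WandbNodeLogger per entry only when workers is non-empty,
# which leaves single-node training untouched.
```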
Force-pushed 03e2b87 to 21f2646
Summary

- WandbNodeLogger Ray actor spawned on each worker node, using wandb mode="shared" to aggregate system metrics from all nodes into a single wandb run
- WandbNodeLogger only spawns when Ray detects multiple alive nodes

Test plan
🤖 Generated with Claude Code