feat: add multi-node WandB system metrics aggregation #1358

penfever wants to merge 1 commit into NovaSky-AI:main
Conversation
Code Review
This pull request introduces multi-node WandB system metrics aggregation using Ray actors. The changes correctly identify head and worker nodes, initializing WandB in 'shared' mode with appropriate primary and label settings. The WandbNodeLogger actor is designed to run on worker nodes to collect system metrics. Overall, the implementation aligns with the stated goal of aggregating system metrics from all nodes into a single WandB run.
skyrl/train/utils/tracking.py (outdated)

    try:
        logger_actor = WandbNodeLogger.options(
            num_cpus=0,
Setting num_cpus=0 for the WandbNodeLogger actor might prevent it from effectively collecting and reporting system metrics, since wandb's internal processes need some CPU cycles. While the intention is to minimize resource usage, a small non-zero value (e.g., 0.1 or 0.01) is generally recommended for actors that perform background tasks, so they receive enough CPU time to function correctly.
Suggested change:

    -            num_cpus=0,
    +            num_cpus=0.1,
                x_label=x_label,
            ),
        )
        self.wandb = run
To ensure proper resource cleanup and to explicitly terminate the WandB run on worker nodes, consider adding a finish_run method to the WandbNodeLogger class. This method could call self.wandb.finish() and then be invoked from the main Tracking class's finish or __del__ methods for each remote_logger_actor. This ensures all metrics are properly synced and resources are released when the training run concludes.
Suggested change:

        self.wandb = run

    def finish_run(self):
        if self.wandb:
            self.wandb.finish()
            self.wandb = None
skyrl/train/utils/tracking.py (outdated)

    try:
        nodes = ray.nodes()
        self.remote_loggers = []
The self.remote_loggers list stores references to the WandbNodeLogger Ray actors. It's important to manage the lifecycle of these actors to ensure they are properly shut down and release resources when the main Tracking instance is no longer needed. Consider adding logic in the Tracking class's finish or __del__ method to iterate through self.remote_loggers and call a remote finish_run method on each actor (after implementing finish_run in WandbNodeLogger).
In multi-node training, GPU utilization and system metrics are only captured for the head node. This change adds a WandbNodeLogger Ray actor spawned on each worker node, using wandb mode="shared" to aggregate system metrics (GPU utilization, memory) from all nodes into a single run. Single-node training is unaffected: the WandbNodeLogger only spawns when Ray detects multiple alive nodes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
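The "multiple alive nodes" check can be expressed as plain filtering over the dicts that `ray.nodes()` returns (each entry carries an `"Alive"` flag and a `"NodeID"` string). A sketch — the helper name and the head-node-by-ID detection are assumptions, not the PR's exact code:

```python
def alive_worker_node_ids(nodes, head_node_id):
    """Return NodeIDs of alive nodes other than the head.

    `nodes` is shaped like the list of dicts ray.nodes() returns;
    only the "Alive" and "NodeID" keys are used here.
    """
    return [
        n["NodeID"]
        for n in nodes
        if n["Alive"] and n["NodeID"] != head_node_id
    ]


# Sample shaped like ray.nodes() output (IDs are made up):
sample = [
    {"NodeID": "head", "Alive": True},
    {"NodeID": "w1", "Alive": True},
    {"NodeID": "w2", "Alive": False},  # dead node: skipped
]
workers = alive_worker_node_ids(sample, "head")
print(workers)  # → ['w1']
# Spawn one WandbNodeLogger per entry only when workers is non-empty,
# which leaves single-node training untouched.
```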
Force-pushed 03e2b87 to 21f2646
Summary

- WandbNodeLogger Ray actor spawned on each worker node, using wandb mode="shared" to aggregate system metrics from all nodes into a single wandb run
- WandbNodeLogger only spawns when Ray detects multiple alive nodes

Test plan
🤖 Generated with Claude Code