[train] add TrainControllerState metrics #52805

Merged
merged 35 commits into ray-project:master on May 12, 2025

Conversation

@matthewdeng matthewdeng commented May 6, 2025

Summary

  1. Refactors the metrics handling logic out into an internal metrics module.
  2. Implements a new metric for tracking TrainController state over time, similar to Ray Core's Task/Actor state metrics.
  3. Adds a Grafana dashboard panel that shows the states.

Refactoring

All the metrics handling logic is now abstracted away in an internal metrics module.

python/ray/train/v2/_internal/metrics
├── __init__.py
├── base.py
├── controller.py
└── worker.py

As a result, the ControllerMetricsCallback and WorkerMetricsCallback can now be thin layers that map the callback events to calls to MetricsTracker.update.

from typing import Dict, TypeVar

T = TypeVar("T")


class Metric:
    def start(self):
        """Start tracking and reporting this metric."""
        ...

    def record(self, tags: Dict[str, str], value: T):
        """Record a value for the given tag combination."""
        ...

    def get_value(self, tags: Dict[str, str]) -> T:
        """Return the current value for the given tag combination."""
        ...

    def reset(self):
        """Reset the stored value back to the metric's default."""
        ...

    def shutdown(self):
        """Stop reporting and clear the exported gauge values."""
        ...
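
To make the "thin layer" mapping concrete, here is a minimal sketch (not the PR's actual code) of a concrete metric plus a callback that forwards a single event to it; the class and method names below are illustrative assumptions, not the PR's API.

# Minimal sketch with hypothetical names; not the PR's actual implementation.
from typing import Dict, Tuple


class SumMetric:
    """Accumulates float values per tag combination."""

    def __init__(self, name: str):
        self._name = name
        self._values: Dict[Tuple[Tuple[str, str], ...], float] = {}

    def record(self, tags: Dict[str, str], value: float) -> None:
        key = tuple(sorted(tags.items()))
        self._values[key] = self._values.get(key, 0.0) + value

    def get_value(self, tags: Dict[str, str]) -> float:
        return self._values.get(tuple(sorted(tags.items())), 0.0)


class ControllerMetricsCallbackSketch:
    """Thin layer: maps a callback event to a metric update."""

    def __init__(self, metric: SumMetric, run_name: str):
        self._metric = metric
        self._run_name = run_name

    def after_worker_group_start(self, elapsed_s: float) -> None:
        # Forward the event to the metric; no metrics logic lives in the callback.
        self._metric.record({"ray_train_run_name": self._run_name}, elapsed_s)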

TrainControllerState

Added a new metric to track TrainControllerState.

Its tag keys are defined as ("ray_train_run_name", "ray_train_controller_state"); take this into account when defining the visualization for this metric.

Example

Ran this locally, with a small modification to print out calls to ray.util.metrics.Gauge.set().

Repro script

import time
from ray.train.torch import TorchTrainer

def train_func():
    time.sleep(10)

trainer = TorchTrainer(train_func)
trainer.fit()

Logs

(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'INITIALIZING'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'INITIALIZING'}
(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'SCHEDULING'}
(TrainController pid=39720) Started training worker group of size 1: 
(TrainController pid=39720) - (ip=127.0.0.1, pid=39728) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=39728) Setting up process group for: env:// [rank=0, world_size=1]
(TrainController pid=39720) SET train_worker_group_start_total_time_s 1.9262119578197598 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'SCHEDULING'}
(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'RUNNING'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'RUNNING'}
(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'FINISHED'}
(TrainController pid=39720) SET train_worker_group_shutdown_total_time_s 0.004093542229384184 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_worker_group_start_total_time_s 0.0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_worker_group_shutdown_total_time_s 0.0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'INITIALIZING', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'SCHEDULING', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'RUNNING', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'FINISHED', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}

Dashboard

Added a new dashboard panel that shows the state.

Note that the dashboard uses 15-second increments, so states that last less than this period may not show up.
[Screenshot: Grafana dashboard panel showing the TrainController state over time]

Repro script:

import time

from ray.train import RunConfig
from ray.train.backend import Backend, BackendConfig
from ray.train.v2.api.data_parallel_trainer import DataParallelTrainer


class SlowBackend(Backend):
    def on_start(self, worker_group, backend_config):
        time.sleep(100)

    def on_shutdown(self, worker_group, backend_config):
        time.sleep(100)


class SlowBackendConfig(BackendConfig):
    @property
    def backend_cls(self):
        return SlowBackend


def train_func():
    time.sleep(100)


run_config = RunConfig(name="slow-backend-run")


trainer = DataParallelTrainer(
    train_func, backend_config=SlowBackendConfig(), run_config=run_config
)
trainer.fit()

@matthewdeng matthewdeng marked this pull request as ready for review May 6, 2025 22:50

@justinvyu justinvyu left a comment

The "additional tags" argument seems like a confusing implementation detail of the state update metric. Wondering if a pattern like this is simpler:

from contextlib import nullcontext
from enum import Enum

from ray.util.metrics import Gauge


class EnumMetric(Metric):
    def __init__(self, enum_cls):
        self._enum_cls = enum_cls
        self._value = None
        self._lock = nullcontext()

    def create_gauge(self):
        self._gauge = Gauge(...)

    def set_lock(self, lock):
        self._lock = lock

    def update(self, new_state: Enum):
        with self._lock:
            self._value = new_state

    def push(self):
        with self._lock:
            # Report 1 for the current state and 0 for every other state.
            for option in self._enum_cls:
                if self._value == option:
                    self._gauge.set(1, {"ray_train_controller_state": option.name})
                else:
                    self._gauge.set(0, {"ray_train_controller_state": option.name})


controller_state_metric = EnumMetric(ControllerState, ...)
tracker = MetricsTracker([controller_state_metric])  # This creates and sets the lock on all metrics passed in
controller_state_metric.update(curr_controller_state)

Comment on lines 55 to 56
value: The value to update the metric with. The value will be added to the existing value
for the metric-tags combination, or set if the metric-tags combination does not exist.

Contributor

this description doesn't match what's happening below.

Implicitly adding values for cumulative metrics when you call "update" is a bit misleading.

What about keeping an "accumulation_fn" in the Metric dataclass (e.g. lambda accumulated_val, curr: curr and lambda accumulated_val, curr: accumulated_val + curr)? Then, use the metric's accumulation function to update the underlying value.
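
A rough sketch of the suggested pattern, assuming a dataclass-style Metric; the class, field, and metric names below are illustrative, not the PR's final implementation:

from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class MetricSketch:
    name: str
    default: float = 0.0
    # "latest value" semantics: lambda accumulated, curr: curr
    # cumulative semantics:     lambda accumulated, curr: accumulated + curr
    accumulation_fn: Callable[[float, float], float] = lambda acc, curr: curr
    _values: Dict[Tuple[Tuple[str, str], ...], float] = field(default_factory=dict)

    def record(self, tags: Dict[str, str], value: float) -> None:
        key = tuple(sorted(tags.items()))
        accumulated = self._values.get(key, self.default)
        self._values[key] = self.accumulation_fn(accumulated, value)


# A cumulative timer adds successive values instead of overwriting them.
total_start_time = MetricSketch(
    "train_worker_group_start_total_time_s",
    accumulation_fn=lambda acc, curr: acc + curr,
)
total_start_time.record({"ray_train_run_name": "run-1"}, 1.5)
total_start_time.record({"ray_train_run_name": "run-1"}, 0.5)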

Contributor Author

Ooh this is a great callout. I was overindexing a bit on this particular new metric that I added.

Comment on lines 30 to 34
name="train_controller_state",
type=int,
default=0,
description="The current state of the controller",
tag_keys=CONTROLLER_TAG_KEYS + (CONTROLLER_STATE_TAG_KEY,),

Contributor

what's the purpose of these tag keys?

Contributor Author

This is meant to validate the tag keys that need to be specified when logging this metric. There's actually validation that happens at the lower level if the tags aren't passed in, but I can add a quick validation at the update/record layer as well so it fast-fails!
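
For illustration, a fast-fail check of that kind might look like the following sketch (a hypothetical helper, not necessarily the PR's exact validation):

from typing import Dict, Tuple


def validate_tags(required_tag_keys: Tuple[str, ...], tags: Dict[str, str]) -> None:
    """Raise immediately if the provided tags don't match the metric's tag keys."""
    missing = set(required_tag_keys) - set(tags)
    unexpected = set(tags) - set(required_tag_keys)
    if missing or unexpected:
        raise ValueError(
            f"Expected tag keys {required_tag_keys}; "
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )


# The controller state metric requires both of its tag keys.
validate_tags(
    ("ray_train_run_name", "ray_train_controller_state"),
    {"ray_train_run_name": "run-1", "ray_train_controller_state": "RUNNING"},
)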

matthewdeng and others added 8 commits May 6, 2025 18:19

@justinvyu justinvyu left a comment

Is gauge.set() a very cheap operation? Was there a point to doing the background thread loop in the first place?
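
For context, the kind of background push loop being questioned generally looks like this (a generic sketch, not the PR's code; metric.push() is an assumed interface):

import threading
import time


def start_push_loop(metrics, interval_s: float = 5.0) -> threading.Event:
    """Periodically push all metrics to their gauges until the returned event is set."""
    stop = threading.Event()

    def _loop() -> None:
        while not stop.is_set():
            for metric in metrics:
                metric.push()
            time.sleep(interval_s)

    threading.Thread(target=_loop, daemon=True).start()
    return stop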

@matthewdeng matthewdeng added the go add ONLY when ready to merge, run all tests label May 12, 2025

def reset(self):
    self._gauge.set(self._default, self._base_tags)
    self._current_value = 0.0

Contributor

should we set it to default?

Contributor

or maybe just remove "Default" from the base class since it's not used there

Contributor Author

yes good call, I will remove since Metric is now abstract!

Contributor Author

Oh actually I will keep it and convert reset to use the default, because I think it's nice to guarantee this in the logic of get_value.
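
A small sketch of what that guarantee could look like (class and attribute names are illustrative, not the PR's final code):

from typing import Dict, Optional


class GaugeBackedMetricSketch:
    def __init__(self, gauge, default: float, base_tags: Dict[str, str]):
        self._gauge = gauge
        self._default = default
        self._base_tags = base_tags
        self._current_value: Optional[float] = None

    def get_value(self) -> float:
        # Always returns the default until a value has been recorded.
        return self._default if self._current_value is None else self._current_value

    def reset(self) -> None:
        # Reset back to the default rather than a hard-coded 0.0.
        self._current_value = self._default
        self._gauge.set(self._default, self._base_tags)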

matthewdeng and others added 3 commits May 12, 2025 12:06

@justinvyu justinvyu left a comment

🆒

@justinvyu justinvyu merged commit 491b4c8 into ray-project:master May 12, 2025
5 checks passed
ran1995data pushed a commit to ran1995data/ray that referenced this pull request May 13, 2025
zhaoch23 pushed a commit to Bye-legumes/ray that referenced this pull request May 14, 2025
iamjustinhsu pushed a commit to iamjustinhsu/ray that referenced this pull request May 15, 2025
lk-chen pushed a commit to lk-chen/ray that referenced this pull request May 17, 2025