[train] add TrainControllerState metrics #52805

Merged
merged 35 commits into ray-project:master on May 12, 2025

Conversation

@matthewdeng matthewdeng commented May 6, 2025

Summary

  1. Refactors the metrics handling logic out into an internal metrics module.
  2. Implements a new metric for tracking TrainController state over time, similar to Ray Core's Task/Actor state metrics.
  3. Adds a Grafana dashboard panel that shows the states.

Refactoring

All the metrics handling logic is now abstracted away in an internal metrics module.

python/ray/train/v2/_internal/metrics
├── __init__.py
├── base.py
├── controller.py
└── worker.py

As a result, the ControllerMetricsCallback and WorkerMetricsCallback can now be thin layers that map the callback events to calls to MetricsTracker.update.

from typing import Dict, TypeVar

T = TypeVar("T")


class Metric:
    def start(self):
        """Start tracking and reporting this metric."""
        ...

    def record(self, tags: Dict[str, str], value: T):
        """Record a value for the given tag combination."""
        ...

    def get_value(self, tags: Dict[str, str]) -> T:
        """Return the current value for the given tag combination."""
        ...

    def reset(self):
        """Reset the stored value back to the metric's default."""
        ...

    def shutdown(self):
        """Stop reporting and clear the exported gauge values."""
        ...
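
To make the "thin layer" mapping concrete, here is a minimal sketch (not the PR's actual code) of a concrete metric plus a callback that forwards a single event to it; the class and method names below are illustrative assumptions, not the PR's API.

# Minimal sketch with hypothetical names; not the PR's actual implementation.
from typing import Dict, Tuple


class SumMetric:
    """Accumulates float values per tag combination."""

    def __init__(self, name: str):
        self._name = name
        self._values: Dict[Tuple[Tuple[str, str], ...], float] = {}

    def record(self, tags: Dict[str, str], value: float) -> None:
        key = tuple(sorted(tags.items()))
        self._values[key] = self._values.get(key, 0.0) + value

    def get_value(self, tags: Dict[str, str]) -> float:
        return self._values.get(tuple(sorted(tags.items())), 0.0)


class ControllerMetricsCallbackSketch:
    """Thin layer: maps a callback event to a metric update."""

    def __init__(self, metric: SumMetric, run_name: str):
        self._metric = metric
        self._run_name = run_name

    def after_worker_group_start(self, elapsed_s: float) -> None:
        # Forward the event to the metric; no metrics logic lives in the callback.
        self._metric.record({"ray_train_run_name": self._run_name}, elapsed_s)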

TrainControllerState

Added a new metric to track TrainControllerState.

Its tag keys are defined as ("ray_train_run_name", "ray_train_controller_state"); take this into account when defining the visualization for this metric.

Example

Ran this locally, with a small modification to print out calls to ray.util.metrics.Gauge.set().

Repro script

import time
from ray.train.torch import TorchTrainer

def train_func():
    time.sleep(10)

trainer = TorchTrainer(train_func)
trainer.fit()

Logs

(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'INITIALIZING'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'INITIALIZING'}
(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'SCHEDULING'}
(TrainController pid=39720) Started training worker group of size 1: 
(TrainController pid=39720) - (ip=127.0.0.1, pid=39728) world_rank=0, local_rank=0, node_rank=0
(RayTrainWorker pid=39728) Setting up process group for: env:// [rank=0, world_size=1]
(TrainController pid=39720) SET train_worker_group_start_total_time_s 1.9262119578197598 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'SCHEDULING'}
(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'RUNNING'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'RUNNING'}
(TrainController pid=39720) SET train_controller_state 1 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36', 'ray_train_controller_state': 'FINISHED'}
(TrainController pid=39720) SET train_worker_group_shutdown_total_time_s 0.004093542229384184 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_worker_group_start_total_time_s 0.0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_worker_group_shutdown_total_time_s 0.0 {'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'INITIALIZING', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'SCHEDULING', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'RUNNING', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}
(TrainController pid=39720) SET train_controller_state 0 {'ray_train_controller_state': 'FINISHED', 'ray_train_run_name': 'ray_train_run-2025-05-07_17-56-36'}

Dashboard

Added a new dashboard panel that shows the state.

Note that the dashboard uses 15-second increments, so states that last less than this period may not show up.
[Screenshot: Grafana dashboard panel showing the TrainController state over time]

Repro script:

import time

from ray.train import RunConfig
from ray.train.backend import Backend, BackendConfig
from ray.train.v2.api.data_parallel_trainer import DataParallelTrainer


class SlowBackend(Backend):
    def on_start(self, worker_group, backend_config):
        time.sleep(100)

    def on_shutdown(self, worker_group, backend_config):
        time.sleep(100)


class SlowBackendConfig(BackendConfig):
    @property
    def backend_cls(self):
        return SlowBackend


def train_func():
    time.sleep(100)


run_config = RunConfig(name="slow-backend-run")


trainer = DataParallelTrainer(
    train_func, backend_config=SlowBackendConfig(), run_config=run_config
)
trainer.fit()

@matthewdeng matthewdeng marked this pull request as ready for review May 6, 2025 22:50

@justinvyu justinvyu left a comment

The "additional tags" argument seems like a confusing implementation detail of the state update metric. Wondering if a pattern like this is simpler:

from contextlib import nullcontext
from enum import Enum

from ray.util.metrics import Gauge


class EnumMetric(Metric):
    def __init__(self, enum_cls):
        self._enum_cls = enum_cls
        self._value = None
        self._lock = nullcontext()

    def create_gauge(self):
        self._gauge = Gauge(...)

    def set_lock(self, lock):
        self._lock = lock

    def update(self, new_state: Enum):
        with self._lock:
            self._value = new_state

    def push(self):
        with self._lock:
            # Report 1 for the current state and 0 for every other state.
            for option in self._enum_cls:
                if self._value == option:
                    self._gauge.set(1, {"ray_train_controller_state": option.name})
                else:
                    self._gauge.set(0, {"ray_train_controller_state": option.name})


controller_state_metric = EnumMetric(ControllerState, ...)
tracker = MetricsTracker([controller_state_metric])  # This creates and sets the lock on all metrics passed in
controller_state_metric.update(curr_controller_state)

Comment on lines 55 to 56
value: The value to update the metric with. The value will be added to the existing value
for the metric-tags combination, or set if the metric-tags combination does not exist.

Contributor

this description doesn't match what's happening below.

Implicitly adding values for cumulative metrics when you call "update" is a bit misleading.

What about keeping an "accumulation_fn" in the Metric dataclass (e.g. lambda accumulated_val, curr: curr and lambda accumulated_val, curr: accumulated_val + curr)? Then, use the metric's accumulation function to update the underlying value.
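
A rough sketch of the suggested pattern, assuming a dataclass-style Metric; the class, field, and metric names below are illustrative, not the PR's final implementation:

from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class MetricSketch:
    name: str
    default: float = 0.0
    # "latest value" semantics: lambda accumulated, curr: curr
    # cumulative semantics:     lambda accumulated, curr: accumulated + curr
    accumulation_fn: Callable[[float, float], float] = lambda acc, curr: curr
    _values: Dict[Tuple[Tuple[str, str], ...], float] = field(default_factory=dict)

    def record(self, tags: Dict[str, str], value: float) -> None:
        key = tuple(sorted(tags.items()))
        accumulated = self._values.get(key, self.default)
        self._values[key] = self.accumulation_fn(accumulated, value)


# A cumulative timer adds successive values instead of overwriting them.
total_start_time = MetricSketch(
    "train_worker_group_start_total_time_s",
    accumulation_fn=lambda acc, curr: acc + curr,
)
total_start_time.record({"ray_train_run_name": "run-1"}, 1.5)
total_start_time.record({"ray_train_run_name": "run-1"}, 0.5)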

Contributor Author

Ooh this is a great callout. I was overindexing a bit on this particular new metric that I added.

Comment on lines 30 to 34
name="train_controller_state",
type=int,
default=0,
description="The current state of the controller",
tag_keys=CONTROLLER_TAG_KEYS + (CONTROLLER_STATE_TAG_KEY,),

Contributor

what's the purpose of these tag keys?

Contributor Author

This is meant to validate the tag keys that need to be specified when logging this metric. There's actually validation that happens at the lower level if the tags aren't passed in, but I can add a quick validation at the update/record layer as well so it fast-fails!
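
For illustration, a fast-fail check of that kind might look like the following sketch (a hypothetical helper, not necessarily the PR's exact validation):

from typing import Dict, Tuple


def validate_tags(required_tag_keys: Tuple[str, ...], tags: Dict[str, str]) -> None:
    """Raise immediately if the provided tags don't match the metric's tag keys."""
    missing = set(required_tag_keys) - set(tags)
    unexpected = set(tags) - set(required_tag_keys)
    if missing or unexpected:
        raise ValueError(
            f"Expected tag keys {required_tag_keys}; "
            f"missing={sorted(missing)}, unexpected={sorted(unexpected)}"
        )


# The controller state metric requires both of its tag keys.
validate_tags(
    ("ray_train_run_name", "ray_train_controller_state"),
    {"ray_train_run_name": "run-1", "ray_train_controller_state": "RUNNING"},
)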

matthewdeng and others added 8 commits May 6, 2025 18:19

@justinvyu justinvyu left a comment

Is gauge.set() a very cheap operation? Was there a point to doing the background thread loop in the first place?
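
For context, the kind of background push loop being questioned generally looks like this (a generic sketch, not the PR's code; metric.push() is an assumed interface):

import threading
import time


def start_push_loop(metrics, interval_s: float = 5.0) -> threading.Event:
    """Periodically push all metrics to their gauges until the returned event is set."""
    stop = threading.Event()

    def _loop() -> None:
        while not stop.is_set():
            for metric in metrics:
                metric.push()
            time.sleep(interval_s)

    threading.Thread(target=_loop, daemon=True).start()
    return stop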

@matthewdeng matthewdeng added the go add ONLY when ready to merge, run all tests label May 12, 2025

def reset(self):
    self._gauge.set(self._default, self._base_tags)
    self._current_value = 0.0

Contributor

should we set it to default?

Contributor

or maybe just remove "Default" from the base class since it's not used there

Contributor Author

yes good call, I will remove since Metric is now abstract!

Contributor Author

Oh actually I will keep it and convert reset to use the default, because I think it's nice to guarantee this in the logic of get_value.
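
A small sketch of what that guarantee could look like (class and attribute names are illustrative, not the PR's final code):

from typing import Dict, Optional


class GaugeBackedMetricSketch:
    def __init__(self, gauge, default: float, base_tags: Dict[str, str]):
        self._gauge = gauge
        self._default = default
        self._base_tags = base_tags
        self._current_value: Optional[float] = None

    def get_value(self) -> float:
        # Always returns the default until a value has been recorded.
        return self._default if self._current_value is None else self._current_value

    def reset(self) -> None:
        # Reset back to the default rather than a hard-coded 0.0.
        self._current_value = self._default
        self._gauge.set(self._default, self._base_tags)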

matthewdeng and others added 3 commits May 12, 2025 12:06

@justinvyu justinvyu left a comment

🆒

@justinvyu justinvyu merged commit 491b4c8 into ray-project:master May 12, 2025
5 checks passed
ran1995data pushed a commit to ran1995data/ray that referenced this pull request May 13, 2025
zhaoch23 pushed a commit to Bye-legumes/ray that referenced this pull request May 14, 2025
iamjustinhsu pushed a commit to iamjustinhsu/ray that referenced this pull request May 15, 2025
lk-chen pushed a commit to lk-chen/ray that referenced this pull request May 17, 2025