-
Notifications
You must be signed in to change notification settings - Fork 6.3k
[train] add TrainControllerState metrics #52805
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "additional tags" argument seems like a confusing implementation detail of the state update metric. Wondering if a pattern like this is simpler:
class EnumMetric(Metric):
def __init__(self, enum_cls):
self._lock = nullcontext()
def create_gauge(self):
self._gauge = Gauge(...)
def set_lock(self, lock):
self._lock = lock
def update(self, new_state: Enum):
with self.lock:
self._value = new_state
def push(self):
with self._lock:
for option in Enum.options:
if self._value == option:
self._gauge.set(1, {"ray_train_controller_state": self._value})
else:
self._gauge.set(0, {"ray_train_controller_state": self._value})
controller_state_metric = EnumMetric(ControllerState, ...)
tracker = MetricsTracker([controller_state_metric]) # This creates and sets the lock on all metrics passed in
controller_state_metric.update(curr_controller_state)
value: The value to update the metric with. The value will be added to the existing value | ||
for the metric-tags combination, or set if the metric-tags combination does not exist. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this description doesn't match with what's happening below.
Implicitly adding values for cumulative metrics when you call "update" is a bit misleading.
What about having keeping an "accumulation_fn" in the Metric
dataclass (ex: lambda accumulated_val, curr: curr
and lambda accumulated_val, curr: accumulated_val + curr
). Then, use the metric's accumulation function to update the underlying value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh this is a great callout. I was overindexing a bit on this particular new metric that I added.
name="train_controller_state", | ||
type=int, | ||
default=0, | ||
description="The current state of the controller", | ||
tag_keys=CONTROLLER_TAG_KEYS + (CONTROLLER_STATE_TAG_KEY,), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the purpose of these tag keys?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is meant to validate the tag keys that need to be specified when logging this metric. There's actually validation that happens at the lower level if the tags aren't passed in, but I can add a quick validation at the update
/record
layer as well so it fast-fails!
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is gauge.set() a very cheap operation? Was there a point to doing the background thread loop in the first place?
python/ray/dashboard/modules/metrics/dashboards/train_dashboard_panels.py
Outdated
Show resolved
Hide resolved
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
|
||
def reset(self): | ||
self._gauge.set(self._default, self._base_tags) | ||
self._current_value = 0.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we set it to default?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe just remove "Default" from the base class since it's not used there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes good call, I will remove since Metric
is now abstract!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh actually I will keep it and convert to use default because I think it's nice to guarantee this in the logic of get_value
Co-authored-by: Justin Yu <[email protected]> Signed-off-by: matthewdeng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
Signed-off-by: Matthew Deng <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🆒
1. Refactors out the metrics handling logic into an internal metrics module. 2. Implements a new metric for tracking TrainController state over time, similar to Ray Core's Task/Actor state metrics. 3. Add a Grafana dashboard panel that shows the states. All the metrics handling logic is now abstracted away in an internal `metrics` module. As a result, the `ControllerMetricsCallback` and `WorkerMetricsCallback` can now be thin layers that map the callback events to calls to `MetricsTracker.update`. --------- Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: matthewdeng <[email protected]> Co-authored-by: Justin Yu <[email protected]> Signed-off-by: weiran11 <[email protected]>
1. Refactors out the metrics handling logic into an internal metrics module. 2. Implements a new metric for tracking TrainController state over time, similar to Ray Core's Task/Actor state metrics. 3. Add a Grafana dashboard panel that shows the states. All the metrics handling logic is now abstracted away in an internal `metrics` module. As a result, the `ControllerMetricsCallback` and `WorkerMetricsCallback` can now be thin layers that map the callback events to calls to `MetricsTracker.update`. --------- Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: matthewdeng <[email protected]> Co-authored-by: Justin Yu <[email protected]> Signed-off-by: zhaoch23 <[email protected]>
1. Refactors out the metrics handling logic into an internal metrics module. 2. Implements a new metric for tracking TrainController state over time, similar to Ray Core's Task/Actor state metrics. 3. Add a Grafana dashboard panel that shows the states. All the metrics handling logic is now abstracted away in an internal `metrics` module. As a result, the `ControllerMetricsCallback` and `WorkerMetricsCallback` can now be thin layers that map the callback events to calls to `MetricsTracker.update`. --------- Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: matthewdeng <[email protected]> Co-authored-by: Justin Yu <[email protected]> Signed-off-by: iamjustinhsu <[email protected]>
1. Refactors out the metrics handling logic into an internal metrics module. 2. Implements a new metric for tracking TrainController state over time, similar to Ray Core's Task/Actor state metrics. 3. Add a Grafana dashboard panel that shows the states. All the metrics handling logic is now abstracted away in an internal `metrics` module. As a result, the `ControllerMetricsCallback` and `WorkerMetricsCallback` can now be thin layers that map the callback events to calls to `MetricsTracker.update`. --------- Signed-off-by: Matthew Deng <[email protected]> Signed-off-by: matthewdeng <[email protected]> Co-authored-by: Justin Yu <[email protected]>
Summary
Refactoring
All the metrics handling logic is now abstracted away in an internal
metrics
module.As a result, the
ControllerMetricsCallback
andWorkerMetricsCallback
can now be thin layers that map the callback events to calls toMetricsTracker.update
.TrainControllerState
Added a new metric to track
TrainControllerState
.Its key is defined ("ray_train_run_name", "ray_train_controller_state"), and this should be taken into consideration when defining the visualization for this metric.
Example
Ran this locally, with a small modification to print out calls to
ray.util.metrics.Gauge.set()
.Repro script
Logs
Dashboard
Added a new dashboard panel that shows the state.
Note that the dashboard has 15 second increments, so states that are shorter than this period may not show up.

Repro script: