
@ZichengMa ZichengMa commented Jun 6, 2025

This PR adds a component-level observability framework by embedding a lightweight HTTP server into each Dynamo service. It delivers:

  • Unified HTTP Interface: Exposes both metrics and health probes over a single port per component, simplifying deployment and operations.

  • Component-Level Metrics Migration: @tedzhouhk has already implemented metrics collection in the Rust frontend; this PR may migrate that implementation. For other components, which metrics should be monitored still needs discussion.

  • Health-Check Requirements: Defines appropriate probes and checks to support Kubernetes-style deployment patterns.

  • Python & Rust Bindings: Provides APIs for registering custom health functions and configuring response thresholds programmatically.

@itay itay left a comment

Great start to the doc, left some comments - I think seeing that picture of what a fully deployed graph looks like would be very helpful.


The system **MUST** include a unified HTTP endpoint infrastructure for Dynamo components to expose metrics and health check endpoints

### REQ 2 Performance Metrics Requirements for worker nodes

For the workers (i.e. the frameworks), we also need a requirement to allow us to grab the metrics that those frameworks natively expose. You note below the ones we want to capture ourselves from the frameworks (e.g. TTFT, etc.), but the frameworks have extensive metrics that should be collectable. We need to understand how this is going to work (e.g. are they going to expose them independently, are we going to bring those into our metrics, etc.).

@ZichengMa - @alec-flowers and I have been chatting about this. Right now we have an implementation that works for vLLM but does not for SGLang. We've been thinking about standardizing on Prometheus for this. Happy to iterate.

Contributor

Is the addition here to the proposal then that the worker component that wraps the framework will aggregate metrics from the framework and any other additional metrics into one endpoint? I think that makes sense in general - and we can add that here.

Agreed, the Rust frontend's metrics are deployment-level; we also need worker-level metrics. The challenge here is that worker-level metrics may come from the backend itself rather than from Dynamo scripts.
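
A rough sketch of how the worker wrapper could fold framework-native metrics together with Dynamo-level metrics behind one scrape target, assuming the framework registers its collectors in prometheus_client's default registry (the metric name and port below are illustrative):

```python
# Sketch: aggregate Dynamo-level and framework-native metrics behind one endpoint.
# Assumes the framework (e.g. vLLM) registers its metrics in prometheus_client's
# default REGISTRY; SGLang may need a different bridge, per the discussion above.
from prometheus_client import Histogram, start_http_server, REGISTRY

# A Dynamo-level worker metric we collect ourselves (name is illustrative).
TTFT_SECONDS = Histogram(
    "dynamo_worker_time_to_first_token_seconds",
    "Time to first token as observed by the Dynamo worker wrapper",
)

def record_request(ttft_seconds: float) -> None:
    TTFT_SECONDS.observe(ttft_seconds)

# Because both the framework's collectors and ours live in the same REGISTRY,
# a single scrape target exposes the union of the two metric sets.
start_http_server(9091, registry=REGISTRY)  # port is illustrative
```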

- `/liveness` - Component liveness probe
- `/readiness` - Component readiness probe

### REQ 4 Core Health Check Implementation

We need a description of what the acceptable latency range is for the health-check endpoints, otherwise we risk more and more "checks" being stuffed into them.

Author

I think the latency range should be handled on the k8s controller side? Let the controller determine what an appropriate range is?

It is configurable at the cluster level and per pod.




# Architecture

What I'd like to see is an overall diagram, let's make it Kubernetes-oriented, that shows a canonical graph (e.g. a disagg one) with the individual pods, and within them the components that are running, and the HTTP port/endpoints that are exposed.


The proposed solution consists of three main components:

1. **Unified HTTP Server Port**: Each Dynamo component will embed a single HTTP server that provides a unified interface for both metrics exposure and health check endpoints, eliminating the need for multiple ports or separate servers per component.

You say single HTTP server, but in the case of a component (e.g. the frontend) that has its own independent HTTP server, I imagine that it would be independent of that?

Specifically, the metrics/health-check HTTP server should ideally sit on at least an independent port from the core server (ideally bound only to 127.0.0.1), so as to ensure that these endpoints are not exposed externally.

So long as the fabric, assuming there is one, can access the necessary endpoints to accurately determine health and readiness state, I'm fine with this.

I do worry (because a lot of things tend to be Python-based) that additional endpoints could end up incurring additional processes and thereby significant, unnecessary overhead. I'd like to see a commitment to avoid these issues, even if that means the core implementation is not in Python.

What's an example - that because we start a second process or a second HTTP service that we're consuming more RAM/CPU? Just making sure I understand.

Contributor

@nnshah1 nnshah1 Jun 10, 2025

The proposal here - and maybe it needs to be fleshed out - is that we would be reusing the Rust-based HTTP server.

In terms of the frontend specifically - I think that depends a little on if we consider the frontend to be externally exposed or always behind an ingress ...

It might be simpler for us to define every dynamo component to have a single server with multiple endpoints (health/metrics/functions) - and then for a deployment the frontend would be behind an ingress to handle all auth/security-related things.
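
Purely for illustration, here is what one per-component server with all three endpoint groups could look like if sketched in Python (the actual proposal reuses the Rust-based HTTP server; aiohttp, the port, and the binding choice are stand-ins, not the implementation):

```python
# Illustrative sketch only - the proposal reuses the Rust-based HTTP server.
# Shows the shape of one per-component server exposing metrics and health probes.
from aiohttp import web
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

async def metrics(_request: web.Request) -> web.Response:
    # Prometheus text exposition of whatever the component has registered.
    return web.Response(body=generate_latest(),
                        headers={"Content-Type": CONTENT_TYPE_LATEST})

async def liveness(_request: web.Request) -> web.Response:
    return web.Response(text="OK")   # process is up

async def readiness(_request: web.Request) -> web.Response:
    return web.Response(text="OK")   # component can accept work

app = web.Application()
app.add_routes([
    web.get("/metrics", metrics),
    web.get("/liveness", liveness),
    web.get("/readiness", readiness),
])

# Host/port would be configurable; binding to localhost follows the suggestion
# above to keep these endpoints off the external interface, with the ingress in
# front of the frontend handling anything that should be public.
web.run_app(app, host="127.0.0.1", port=9090)
```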

> What's an example - that because we start a second process or a second HTTP service that we're consuming more RAM/CPU? Just making sure I understand.

Every process has its own memory footprint, and any interaction between processes has to be done via shared memory (more footprint). Additionally, the OS has to manage thread and hardware congestion of multiple processes in a "fair scheduler" manner instead of the more efficient "in-process" ordering that occurs otherwise.

When it's the "right answer", multi-process is great, but it's not free. When it is "just another answer", the overhead needs to be balanced with the benefits. That's all I am saying.

The reason I mentioned Python is because Python doesn't support multi-threading, and instead resorts to multi-process (w/ shared memory, etc.) to effect parallel operations.


To be done

## Metrics Architecture

How are metrics collected from within the component, e.g. if you need to move some metrics from the Python side? If I have a custom component that has custom metrics, how do I easily get those exposed?

whoisj previously requested changes Jun 9, 2025

@whoisj whoisj left a comment

Please change the PR title to something meaningful and provide a reasonable description of the PR.

@ZichengMa ZichengMa changed the title from "Zicheng/metric health check" to "Introduce Unified Metrics & Health-Check HTTP Endpoints for Dynamo Components" on Jun 9, 2025
@whoisj whoisj dismissed their stale review June 9, 2025 18:15

GitHub won't allow me to resolve the request for title and description updates.


# Summary

This proposal introduces a unified HTTP endpoint infrastructure for Dynamo components, enabling comprehensive observability and monitoring capabilities at the component level. The core design centers on embedding an HTTP server within each Dynamo component to expose standardized endpoints for both metrics collection and health monitoring. This approach migrates the existing metrics monitoring system from Hongkuan's [implementation](https://github.com/ai-dynamo/dynamo/pull/1315), while simultaneously introducing robust health check mechanisms including liveness, readiness, and custom health probes. Currently, metrics collection is implemented in the Rust frontend with Prometheus integration, but it lacks a unified approach across all components, so we need to migrate to a component-level HTTP endpoint approach.

@rmccorm4 rmccorm4 Jun 9, 2025

Just curious - have we considered using dynamo runtime concepts like an endpoint that is auto-attached (or customizable) to every worker?

Instead of an HTTP server spawned per worker, we would just have another endpoint, something like this:

@dynamo_endpoint
def health():
  return True

What are the pros/cons of HTTP server per worker vs other alternatives being considered here?

Author

Good question! @tedzhouhk @ishandhanani — from what I’ve seen, our current /metrics and /health routes live on a standalone HTTP server, not as Dynamo endpoints. Do you have any sense why we didn’t fold them into the Dynamo endpoints model instead? What were the trade-offs at the time?

Contributor

The last discussion I was part of - we wanted to add this to the same endpoint model and enable components to specify that it would be visible via HTTP - so still through the runtime.


@rmccorm4 rmccorm4 Jun 9, 2025

> our current /metrics and /health routes live on a standalone HTTP server

The top-level HTTP server sits with the frontend/ingress, forwards requests to components/workers/etc. over the dynamo runtime, and can aggregate some information from its view of the overall distributed system.

Similarly, it could query the health/status of individual workers over the dynamo runtime (e.g. exposed per worker as dynamo endpoints, queried over NATS), compared to each worker spinning up an additional HTTP server (exposed per worker as an additional HTTP server) and querying each of those HTTP servers - so I was curious about that part, if I'm understanding correctly that the proposal is proposing the latter.

Contributor

That is correct, we could additionally do that - but a few points:

  1. I think we wanted each individual component to have liveness/readiness - and not only the ingress - to be able to check health for individual workers / planner / etc. I'm not myself sold on either aggregating through a central HTTP server or distributing to each component - so as long as we can get the same information, either way is ok.

  2. There is the discussion of enabling metrics via HTTP to be able to leverage Prometheus-based scraping. If we move to exposing an HTTP metrics endpoint for every worker, then I would like to reuse the HTTP server for health/status as well.

To echo what @nnshah1 is saying, we need each running pod and container within (this is why it's critical we get the diagram) to be able to independently report out its current health, so that Kubernetes (or nearly any other orchestrator) can manage its lifecycle appropriately. It then benefits us on metrics as well, but that's separate.

Contributor

@nnshah1 nnshah1 Jun 10, 2025

To close the loop here and make it explicit: the idea was to extend the dynamo_endpoint syntax to include a transport field and allow endpoints to be exposed as either NATS targets or HTTP targets:

@dynamo_endpoint(transport='http')
def health():
  return True

In the future this could also be used to expose other endpoints, such as OpenAI-compatible endpoints, from workers directly as well.
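
As a hypothetical sketch of where that could go, the snippet below mixes NATS-targeted work endpoints with HTTP-targeted probes; the transport argument is the proposed extension from this thread, and the import path and function names are assumptions, not an existing API:

```python
# Hypothetical: transport= is the proposed extension discussed above, not an
# existing Dynamo API; the import path and names are illustrative only.
from dynamo.sdk import dynamo_endpoint  # assumed import path

@dynamo_endpoint()                      # default: exposed as a NATS target
async def generate(request):
    ...

@dynamo_endpoint(transport="http")      # served by the component's HTTP server
def health() -> bool:
    return True

@dynamo_endpoint(transport="http")      # future: e.g. an OpenAI-compatible route
async def chat_completions(request):
    ...
```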

Comment on lines +127 to +128
# also exposes http endpoint which will be queried from k8s
@liveness

How do we plan to expose health/liveness/readiness for rust workers?

Contributor

My understanding here is that this is for customization only - so for Rust workers the same would be available.


@whoisj whoisj left a comment

Good stuff. Requested a single change and provided the updated content.

Co-authored-by: Hongkuan Zhou <[email protected]>
Signed-off-by: Neelay Shah <[email protected]>