deps/0000-metrics-health-check.md (175 additions & 0 deletions)
# Metrics and Health Check Probe for Dynamo Component

**Status**: Draft

**Authors**: [Zicheng Ma]

**Category**: Architecture

**Sponsor**: [Neelay Shah, Hongkuan Zhou, Zicheng Ma]

**Required Reviewers**: [Neelay Shah, Hongkuan Zhou, Ishan Dhanani, Kyle Kranen, Maksim Khadkevich, Alec Flowers, Biswa Ranjan Panda]

**Review Date**: [Date for review]

**Pull Request**: [Link to Pull Request of the Proposal itself]

**Implementation PR / Tracking Issue**: [Link to Pull Request or Tracking Issue for Implementation]

# Summary

This proposal introduces a unified HTTP endpoint infrastructure for Dynamo components, enabling comprehensive observability and monitoring capabilities at the component level. The core design centers around embedding an HTTP server within each Dynamo component to expose standardized endpoints for both metrics collection and health monitoring. This approach migrates the existing metrics monitoring system from Hongkuan's [implementation](https://github.com/ai-dynamo/dynamo/pull/1315), while simultaneously introducing robust health check mechanisms including liveness, readiness, and custom health probes. Currently, metrics collection is implemented in the Rust frontend with Prometheus integration, but there is no unified approach across all components, so we need to migrate to a component-level HTTP endpoint approach.

@rmccorm4 rmccorm4 Jun 9, 2025

Just curious - have we considered using dynamo runtime concepts like endpoint that is auto attached (or customizable) to every worker?

Instead of an HTTP server spawned per worker, we instead just have another endpoint with something like this:

```python
@dynamo_endpoint
def health():
    return True
```

What are the pros/cons of HTTP server per worker vs other alternatives being considered here?

Author

Good question! @tedzhouhk @ishandhanani — from what I’ve seen, our current /metrics and /health routes live on a standalone HTTP server, not as Dynamo endpoints. Do you have any sense why we didn’t fold them into the Dynamo endpoints model instead? What were the trade-offs at the time?

Contributor

the last discussion I was part of - we wanted to add this to the same endpoint model and enable components to specify that it would be visible via HTTP - so still through the runtime

@rmccorm4 rmccorm4 Jun 9, 2025

> our current /metrics and /health routes live on a standalone HTTP server

The top-level HTTP server sits with the frontend/ingress; it forwards requests to components/workers/etc. over the Dynamo runtime, and can aggregate some information from its view of the overall distributed system.

Similarly, it could query the health/status of individual workers over the Dynamo runtime (e.g. exposed per worker as Dynamo endpoints and queried over NATS), as opposed to each worker spinning up an additional HTTP server and having each of those HTTP servers queried individually - so I was curious about that part, if I'm understanding correctly that the proposal is proposing the latter.

Contributor

that is correct, we could additionally do that - but a few points:

  1. I think we wanted each individual component to have a liveness / readiness check - not only the ingress - to be able to check health for individual workers / planner / etc. I'm not myself sold on either aggregating through a central HTTP server or distributing to each component - as long as we can get the same information, either way is ok.

  2. There is also the discussion of enabling metrics over HTTP to be able to leverage Prometheus-based scraping. If we move to exposing an HTTP metrics endpoint for every worker, then I would like to reuse that HTTP server for health / status as well.

To echo what @nnshah1 is saying, we need each running pod, and each container within it (this is why it's critical we get the diagram), to be able to independently report its current health, so that Kubernetes (or nearly any other orchestrator) can manage its lifecycle appropriately. It then benefits us on metrics as well, but that's separate.

Contributor

@nnshah1 nnshah1 Jun 10, 2025

To close the loop here and make it explicit: the idea was to extend the dynamo_endpoint syntax to include a transport field and allow endpoints to be exposed as either NATS targets or HTTP targets:

```python
@dynamo_endpoint(transport='http')
def health():
    return True
```

In the future this could also be used to expose other endpoints, such as OpenAI-compatible endpoints, from workers directly as well.


The unified endpoint design provides a consistent interface across all Dynamo components, allowing external monitoring systems, container orchestrators (such as Kubernetes), and operational tools to interact with each component through standard HTTP protocols. By consolidating metrics exposure and health check functionality into a single HTTP server per component, this solution simplifies deployment and reduces infrastructure complexity.

# Motivation

Currently, the Dynamo runtime does not provide direct support for comprehensive metrics collection
or standardized health check mechanisms at the component level. While some metrics reporting to
Prometheus exists in the Rust frontend, there is no unified design for aggregating, querying, and managing metrics
across all Dynamo components. Additionally, there is no standardized way to check the health, liveness, and readiness of individual Dynamo components, making it difficult to monitor system health and implement proper load balancing and failover mechanisms.

This lack of observability infrastructure creates several operational challenges:
- No standardized health check mechanism for container orchestration systems (e.g., Kubernetes)
- Limited visibility into component-level metrics and resource utilization
- No centralized way to query and aggregate metrics/health across the distributed system

## Goals

* Implement a unified HTTP endpoint infrastructure for Dynamo components to expose metrics and health check endpoints
* Enable customizable health checks through Python bindings while maintaining core health checks in Rust
* Support standard observability patterns compatible with container orchestration systems


## Requirements

### REQ 1 Unified HTTP Endpoint Infrastructure

The system **MUST** include a unified HTTP endpoint infrastructure for Dynamo components to expose metrics and health check endpoints.

### REQ 2 Performance Metrics Requirements for Worker Nodes

For the workers (i.e. the frameworks), we also need a requirement to allow us to grab the metrics that those frameworks natively expose. You note below the ones we want to capture ourselves from the frameworks (e.g. TTFT, etc.), but the frameworks have extensive metrics that should be collectable. We need to understand how this is going to work (e.g. are they going to expose them independently, are we going to bring those into our metrics, etc.).

@ZichengMa - @alec-flowers and I have been chatting about this. Right now we have an implementation that works for vLLM but not for SGLang. We've been thinking about standardizing on Prometheus for this. Happy to iterate

Contributor

Is the addition here to the proposal then that the worker component that wraps the framework will aggregate metrics from the framework and any other additional metrics into one endpoint? I think that makes sense in general - and we can add that here.

Agreed, the Rust frontend's metrics are deployment-level. We also need worker-level metrics. The challenge here is that worker-level metrics may come from the backend itself instead of from Dynamo scripts.


The metrics from the Rust frontend that we want to monitor **MUST** include:
- Inflight/Total Requests: updated when a new request arrives (and when it finishes, for inflight)
- TTFT: reported at the first chunk response
- ITL: reported at each new chunk response
- ISL: reported at the first chunk response (TODO: report right after tokenization)
- OSL: reported when the request finishes

The system **MUST** provide a standardized approach to collect and expose native metrics from AI inference frameworks (e.g., vLLM, SGLang, TensorRT-LLM) through the unified HTTP endpoint, using Prometheus format as the standard metrics exposition format.
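
As a non-normative illustration of this requirement, the sketch below records the frontend metrics listed above with a Prometheus client library and renders them in the standard exposition format for the unified endpoint; the metric names, labels, and choice of `prometheus_client` are assumptions for illustration, not part of the proposal.

```python
# Hedged sketch only: metric names and the use of prometheus_client are
# illustrative assumptions, not the proposal's actual implementation.
from prometheus_client import CollectorRegistry, Counter, Gauge, Histogram, generate_latest

registry = CollectorRegistry()
inflight_requests = Gauge("dynamo_inflight_requests", "Requests currently in flight", registry=registry)
total_requests = Counter("dynamo_requests_total", "Total requests received", registry=registry)
ttft_seconds = Histogram("dynamo_ttft_seconds", "Time to first token (TTFT)", registry=registry)
itl_seconds = Histogram("dynamo_itl_seconds", "Inter-token latency (ITL)", registry=registry)

def on_request_start() -> None:
    # Updated when a new request arrives.
    total_requests.inc()
    inflight_requests.inc()

def on_request_finish() -> None:
    # Updated when the request finishes.
    inflight_requests.dec()

def render_metrics() -> bytes:
    # Body that the unified /metrics endpoint would serve in Prometheus exposition format.
    return generate_latest(registry)
```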

### REQ 3 Component Health Check Endpoints

Each Dynamo component **MUST** expose an HTTP endpoint for health monitoring:
- `/health` - Overall component health status

We will use `/health` for both liveness and readiness probes. If extra health checks are needed by the k8s operator, we can add more endpoints.
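
For illustration, a minimal sketch of how a probe might consume `/health`, assuming a 200-when-healthy / non-200-otherwise contract and a hypothetical port; neither detail is fixed by this proposal.

```python
# Hedged sketch: assumes a 200-when-healthy contract and a hypothetical port 9090;
# neither is specified by this proposal yet.
import urllib.error
import urllib.request

def probe_health(base_url: str = "http://127.0.0.1:9090") -> bool:
    """Return True if the component's /health endpoint reports healthy (HTTP 200)."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False
```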


### REQ 4 Core Health Check Implementation

We need to have a description of the acceptable latency range for the health check endpoints; otherwise we risk more and more "checks" being stuffed into them.

Author

I think the latency range should be handled on the k8s controller side? Let the controller determine what is an appropriate range?

it is configurable at the cluster level and per pod.


The Rust runtime **MUST** implement basic health checks including:
- etcd connectivity and lease status
- NATS connectivity and service registration status
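
Although the core checks above are specified to be implemented in the Rust runtime, a conceptual Python sketch (to match the other examples in this document) of how individual check results might be aggregated into the overall `/health` result is shown below; the names and result shape are assumptions.

```python
# Conceptual sketch only: the real core checks are specified to live in the
# Rust runtime. Names and the result shape here are assumptions.
from dataclasses import dataclass

@dataclass
class CheckResult:
    name: str
    healthy: bool
    detail: str = ""

def overall_health(checks: list[CheckResult]) -> tuple[bool, dict]:
    """The component is healthy only if every core check (etcd, NATS, ...) passes."""
    body = {c.name: {"healthy": c.healthy, "detail": c.detail} for c in checks}
    return all(c.healthy for c in checks), body

# Hypothetical usage:
ok, body = overall_health([
    CheckResult("etcd", True, "connected, lease active"),
    CheckResult("nats", True, "connected, service registered"),
])
```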

### REQ 5 Extensible Health Check Framework

The system **MUST** provide Python bindings that allow users to:
- Register custom health check functions



# Architecture

What I'd like to see is an overall diagram, let's make it Kubernetes-oriented, that shows a canonical graph (e.g. a disagg one) with the individual pods, and within them the components that are running, and the HTTP port/endpoints that are exposed.


## Overview

The proposed solution consists of three main components:

1. **Unified HTTP Server Port**: Each Dynamo component will embed a single HTTP server that provides a unified interface for both metrics exposure and health check endpoints, eliminating the need for multiple ports or separate servers per component.

You say single HTTP server, but in the case of a component (e.g. the frontend) that has its own independent HTTP server, I imagine that it would be independent of that?

Specifically, the metric/healthcheck HTTP server should ideally sit at least on an independent port (ideally bound to only 127.0.0.1) from the core server, so as to ensure that these endpoints are not exposed externally.

so long as the fabric, assuming that there is one, can access the necessary endpoints to accurately determine health and readiness state, I'm fine with this.

I do worry (because a lot of things tend to be Python based) that additional endpoints could end up incurring additional processes and thereby significant, unnecessary overhead. I'd like to see a commitment to avoid these issues, even if that means the core implementation is not in Python.

What's an example - that because we start a second process or a second HTTP service that we're consuming more RAM/CPU? Just making sure I understand.

Contributor

@nnshah1 nnshah1 Jun 10, 2025

the proposal here - and maybe it needs to be fleshed out - is that we would be reusing the Rust-based HTTP server.

In terms of the frontend specifically - I think that depends a little on whether we consider the frontend to be externally exposed or always behind an ingress ...

It might be simpler for us to define every dynamo component to have a single server with multiple endpoints (health/metrics/functions) - and then for a deployment the frontend would be behind an ingress to handle all auth / security related things

> What's an example - that because we start a second process or a second HTTP service that we're consuming more RAM/CPU? Just making sure I understand.

Every process has its own memory footprint, and any interaction between processes has to be done via shared memory (more footprint). Additionally, the OS has to manage thread and hardware congestion of multiple processes in a "fair scheduler" manner instead of the more efficient "in-process" ordering that occurs.

When it's the "right answer", multi-process is great, but it's not free. When it is "just another answer", the overhead needs to be balanced with the benefits. That's all I am saying.

The reason I mentioned Python is that Python doesn't support true multi-threading, and instead resorts to multi-process (with shared memory, etc.) to effect parallel operations.


2. **Metrics**: Component-level metrics collection and exposure through standardized HTTP endpoints, migrating from the existing approach implemented for the Rust frontend; each component serves its own metrics data in standard Prometheus format.

3. **Health Check**: Comprehensive health monitoring system with both Rust-implemented core health checks (etcd, NATS connectivity) and extensible Python-binding framework for custom health checks, exposed through standard HTTP endpoints compatible with container orchestration systems.

<figure>
<img src="imgs/metric%20and%20helath%20check%20arch.png" alt="Metrics and Health Check Architecture" width="800">
<figcaption>Figure 1: Metrics and Health Check Architecture Overview</figcaption>
</figure>

## Unified HTTP Server Port

Each Dynamo DRT/component will embed an HTTP server when it first registers an Endpoint. The HTTP server will be used to expose metrics and health check endpoints.

Once the HTTP server is started, an entry containing the HTTP server port will be registered in etcd for discovery. Each DRT holds a lock to avoid race conditions and is responsible for checking whether the HTTP server has already been booted; i.e., when a DRT opens many endpoints at the same time, it checks whether the HTTP server is already running and, if not, boots the HTTP server and registers the port in etcd.
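
A minimal sketch of this boot-once logic, assuming an async runtime; `start_http_server` and `register_port_in_etcd` are hypothetical placeholders rather than the actual runtime API.

```python
# Hedged sketch of the "boot the HTTP server only once per DRT" logic described
# above. start_http_server / register_port_in_etcd are hypothetical placeholders.
import asyncio
from typing import Optional

async def start_http_server() -> int:
    """Placeholder: boot the embedded HTTP server (serving /metrics and /health) and return its port."""
    return 9090  # hypothetical port

async def register_port_in_etcd(port: int) -> None:
    """Placeholder: write the port under the component's etcd key for discovery."""
    ...

class DistributedRuntime:
    def __init__(self) -> None:
        self._http_lock = asyncio.Lock()
        self._http_port: Optional[int] = None

    async def ensure_http_server(self) -> int:
        # Called whenever an endpoint is registered; only the first caller boots the server.
        async with self._http_lock:
            if self._http_port is None:
                self._http_port = await start_http_server()
                await register_port_in_etcd(self._http_port)
            return self._http_port
```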



## Metrics Architecture

How are metrics collected from within the component, e.g. if you need to move some metrics from the Python side? If I have a custom component that has custom metrics, how do I easily get those exposed?


To be done


## Component Health Check Architecture

Each Dynamo component will embed an HTTP server that exposes health check endpoints:

**Core Health Check Implementation (Rust):**
- **etcd Health**: Verify etcd connectivity and lease validity
- **NATS Health**: Check NATS connection and service group registration
- **Runtime Health**: Validate distributed runtime state and component registration

**Extensible Health Check Framework (Python Bindings)**

Python bindings will allow users to register custom health check functions (example from Biswa Ranjan Panda during discussion):

```python
# Custom health check example
from dynamo.runtime import liveness  # implemented in Rust

@service
class MyService:
    # Used by Rust (to renew the etcd lease).
    # Also exposes an HTTP endpoint which will be queried from k8s.
    @liveness
    async def foo(self):
        return self.vllm_engine.health() == HEALTHY
```

Comment on lines +134 to +135

How do we plan to expose health/liveness/readiness for Rust workers?

Contributor

my understanding here is that this is for customization only - so for Rust workers the same would be available.


### Rust Core Implementation Design
Modifications will mainly happen in `lib/src/runtime/distributed.rs` and `lib/src/runtime/component/endpoint.rs`.

Draft PR: https://github.com/ai-dynamo/dynamo/pull/1504

### Python Binding Interface Design
To be done






# Background

## References

- [Kubernetes Health Check Guidelines](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [Prometheus Metrics Format](https://prometheus.io/docs/instrumenting/exposition_formats/)
- [RFC-2119 - Key words for use in RFCs to Indicate Requirement Levels](https://datatracker.ietf.org/doc/html/rfc2119)

## Terminology & Definitions

| **Term** | **Definition** |
| :------------------ | :------------------------------------------------------------------------------------------- |
| **Health Check** | A mechanism to verify if a component or service is functioning correctly |
| **Liveness Probe** | A health check that determines if a component is running and should be restarted if failing |
| **Readiness Probe** | A health check that determines if a component is ready to receive traffic |
| **Metrics Gateway** | A centralized service that collects, aggregates, and serves metrics from multiple components |
| **Scraping** | The process of collecting metrics data from components at regular intervals |
| **TTFT** | Time To First Token |
| **ITL** | Inter Token Latency |
| **ISL** | Input Sequence Length |
| **OSL** | Output Sequence Length |

Binary file added deps/imgs/metric and helath check arch.png