Skip to content

Conversation

nanahpang
Copy link

@nanahpang nanahpang commented Apr 22, 2024

Add the proposal of gRPC TCP per-connection metrics.

Copy link
Contributor

@yousukseung yousukseung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM to me as discussed offline, I'll leave others to approve.

Copy link
Member

@yashykt yashykt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(approved by mistake)

@markdroth markdroth changed the title Create A80-grpc-metrics-for-tcp-connection A80: gRPC Metrics for TCP connection Apr 29, 2024
Copy link
Member

@markdroth markdroth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need review from @ejona86 and @dfawley to make sure these definitions are still okay in terms of potential future implementation in other languages.

@nanahpang nanahpang requested review from ejona86 and dfawley and removed request for atollena May 10, 2024 20:32

| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Histogram (double) | s | None | Records TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints.<br /> RTT = packet acked timestamp - packet sent timestamp. |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Buckets for histogram are explicitly specified for request latency in https://github.com/grpc/proposal/blob/master/A66-otel-stats.md. Do you plan to reuse the same buckets? It might be worth specifying.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion, but using buckets for the min_rtt metric wouldn't offer much significant insights. The value of min_rtt is bound to the physical length of the path between sender and receiver or the queueing time caused by throttling and load. A step-change in min_rtt values usually means that traffic is being throttled or experiencing congestion, or has been re-routed through a different path. I think it's better to opt out buckets in this situation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nanahpang If you do not specify buckets, we will get the default OpenTelemetry buckets which is { 0, 5, 10, 25, 50, 75,100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000}. Are there better defaults that we can suggest or do we leave it to the user to figure it out?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding OpenTelemetry buckets, if you're referring to a plugin used for exporting metrics, that would be a separate concern from the metric types we're defining here. However, I agree that specifying a smaller range for the min_rtt metric buckets is the right approach, especially since we're using seconds as the unit (to align with the open-source standard) while the underlying measurements are usually in microseconds.


| Name | Type | Unit | Labels | Description |
| ------------- | ----- | ----- | ------- | ----------- |
| grpc.tcp.min_rtt | Histogram (double) | s | None | Records TCP's current estimate of minimum round trip time (RTT), typically used as an indication of the network health between two endpoints.<br /> RTT = packet acked timestamp - packet sent timestamp. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nanahpang If you do not specify buckets, we will get the default OpenTelemetry buckets which is { 0, 5, 10, 25, 50, 75,100, 250, 500, 750, 1000, 2500, 5000, 7500, 10000}. Are there better defaults that we can suggest or do we leave it to the user to figure it out?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants