Reduce the metrics cardinality #160

ShoshinNikita · 2023-06-14T15:52:15Z

Right now the metric http_response_time_seconds has one label - path, and r.URL.Path is used as its value. This can be an issue when a service behind Reproxy uses dynamic URLs (for example, /api/v1/users/{user_id}), because every distinct path produces its own set of time series. In this case, the response of /metrics may become too large for Prometheus to process (especially when resources are limited).

We can't just drop this label because of the backward compatibility promise. However, we can make it optional. It would be enabled by default - again, to comply with the backward compatibility promise.

I also have a few more suggestions (more → less important):

Add the label server to all metrics. It feels wrong that only http_requests_total has this label. For example, it prevents me from creating an alert for 500+ statuses for specific upstreams.
Add metrics http_request_size_bytes{server} and http_response_size_bytes{server} that would allow users to monitor the size of requests and responses.
Add new buckets for http_response_time_seconds - for example, 10 and 30. The current "highest" bucket is 5s, but it can take much longer to process some types of requests. At the same time, I understand that it's hard to find a good and fitting maximum value. Another option is to make the buckets configurable - but I am not sure it would be practical.
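
To illustrate the configurable-buckets option from the last suggestion, here is a minimal sketch of parsing a comma-separated bucket list. It is not Reproxy code: parseBuckets and defaultBuckets are hypothetical names, and the default values are assumptions apart from the 5s bound and the proposed 10 and 30 mentioned above.

```go
package main

import (
	"fmt"
	"math"
	"sort"
	"strconv"
	"strings"
)

// defaultBuckets is a hypothetical default set of upper bounds in seconds.
// Only the 5s bound and the proposed 10s and 30s come from the discussion.
var defaultBuckets = []float64{0.1, 0.5, 1, 5, 10, 30}

// parseBuckets turns a comma-separated list such as "0.1,1,5,30" into sorted
// histogram upper bounds. An empty string selects a copy of the defaults.
func parseBuckets(s string) ([]float64, error) {
	if strings.TrimSpace(s) == "" {
		return append([]float64(nil), defaultBuckets...), nil
	}
	parts := strings.Split(s, ",")
	res := make([]float64, 0, len(parts))
	for _, p := range parts {
		v, err := strconv.ParseFloat(strings.TrimSpace(p), 64)
		if err != nil {
			return nil, fmt.Errorf("invalid bucket %q: %w", p, err)
		}
		// Reject NaN, infinities, zero and negative bounds.
		if math.IsNaN(v) || math.IsInf(v, 0) || v <= 0 {
			return nil, fmt.Errorf("bucket must be a positive finite number, got %q", p)
		}
		res = append(res, v)
	}
	sort.Float64s(res)
	// Histogram bounds must be strictly increasing, so reject duplicates.
	for i := 1; i < len(res); i++ {
		if res[i] == res[i-1] {
			return nil, fmt.Errorf("duplicate bucket %v", res[i])
		}
	}
	return res, nil
}

func main() {
	buckets, err := parseBuckets("5, 0.1, 30, 1")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(buckets)
}
```

The resulting slice could then be passed to whatever histogram constructor the metrics code uses; validating up front keeps a bad option value from failing at registration time.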


mikelorant · 2024-05-06T23:20:30Z

This is a major issue when you identify your /metrics endpoint has a payload over 34MB.

@umputun This makes the metrics endpoint unusable for large production systems.

umputun · 2024-05-25T17:30:08Z

Can you provide a PR for this change?

mikelorant · 2024-05-26T04:13:38Z

I have been doing this type of work already for an issue with fastly/fastly-exporter#152 which has resulted in the solution fastly/fastly-exporter#153.

@umputun Let me know which direction you'd like to go in based on the feedback provided by @ShoshinNikita.

My main question is: do we drop the path label, or keep it with a constant value of aggregated? The reason for keeping it is that it wouldn't break any dashboards that rely on this label.
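
For illustration, the "keep the label with a constant value" option could look roughly like this. It is a sketch only: pathLabel, countSeries and the perPath toggle are hypothetical names, not existing Reproxy code, and the real change would live wherever the label value is set.

```go
package main

import "fmt"

// aggregatedValue is the constant label value discussed above.
const aggregatedValue = "aggregated"

// pathLabel returns the value for the "path" label. With perPath enabled
// (the backward-compatible default) it is the raw URL path; otherwise every
// request shares one constant value, so the label still exists.
func pathLabel(urlPath string, perPath bool) string {
	if perPath {
		return urlPath
	}
	return aggregatedValue
}

// countSeries counts the distinct label values the given request paths
// would produce, which is what drives the size of /metrics.
func countSeries(paths []string, perPath bool) int {
	seen := make(map[string]struct{})
	for _, p := range paths {
		seen[pathLabel(p, perPath)] = struct{}{}
	}
	return len(seen)
}

func main() {
	paths := []string{"/api/v1/users/1", "/api/v1/users/2", "/api/v1/users/3"}
	fmt.Println(countSeries(paths, true))  // prints 3
	fmt.Println(countSeries(paths, false)) // prints 1
}
```

Queries that filter or group by path keep working in this variant; they just see a single value instead of one per URL.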

Can you also provide your preferred name for the CLI argument to toggle this feature?

I would only plan to add this specific feature and not the other recommendations from @ShoshinNikita.

aliksend · 2024-12-06T00:24:02Z

I suggest not using r.URL.Path as the "path" label, but using discovery.URLMapper.SrcMatch instead.
It would be a breaking change, so it must be configurable.
After this change, metrics will only contain records for configured routes, not for every requested path.

I also suggest adding the "server" label to all metrics to follow a unified approach, and making the buckets for http_response_time_seconds configurable.

I can make a PR with these improvements.

umputun · 2024-12-06T02:09:35Z

Sounds like a good idea to me. There should probably be some new CLI/env option(s) to turn these "low cardinality" metrics on.
