Latency spike in distributors but not in ingesters #9714
Replies: 4 comments
-
As that line should be reading from the client connection, is it possible that the latency stems from the client? It certainly looks that way to me, especially considering the trace you included (~18 seconds spent reading the proto payload).
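To illustrate the point, here is a minimal, hypothetical Go handler (not Mimir's actual push handler; the endpoint path and names are only illustrative) showing why a slow client shows up as server-side latency: the handler only returns once the whole payload has arrived over the connection, so a slow upload inflates the server-side duration and trace spans even though the server itself is idle.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// pushHandler drains the request body and logs how long the read took.
// If the client writes the payload slowly (congestion, CPU starvation on
// the client, small TCP windows), that time is spent here, inside the
// server handler, and therefore shows up in server-side latency metrics
// and traces even though the server is doing no work of its own.
func pushHandler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	n, err := io.Copy(io.Discard, r.Body) // blocks until the client has sent everything
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	log.Printf("read %d bytes in %s", n, time.Since(start))
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/api/v1/push", pushHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```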
-
I suspect this should be a discussion and not an issue, since there's no indication of a Mimir bug yet. The symptoms so far are of an operational issue.
-
Hi, sorry, I forgot about discussions. Do you by any chance have the permissions to change it into one?
-
Yeah, that's what I am also starting to suspect. The interesting thing is that we don't see any increased latency on the ingress (Contour, based on Envoy), but this might come down to what exactly the Envoy metrics measure.
-
Describe the bug
At random times we are observing latency spikes in distributors that last anywhere from 1-2 minutes to 1-2 hours. They come and go, resolving on their own.
When they happen we don't see any corresponding latency spike in ingesters.
We don't seem to be hitting CPU / memory limits in distributors, and adding more distributor pods doesn't seem to have any effect.
(Screenshots: Mimir overview, Mimir writes, Mimir writes resources, and Mimir writes networking dashboards. The gap in metrics is unrelated; our Prometheus scraping Mimir was stuck in a terminating state.)
From traces we can see the latency seems to be spent in the distributor and not in the ingesters.
The time seems to be spent between 2 span events.
We can see it affects requests from the Prometheus replica of the HA pair that is not currently elected, whose data is not written to the ingesters.
In our dev environment we deployed a modified version of Mimir with 2 extra span events to try to narrow down where the time is spent.
Source with the modifications: https://github.com/grafana/mimir/pull/9707/files.
And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.
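For reference, a rough sketch of what that kind of instrumentation looks like, assuming an OpenTelemetry-style API (the actual PR may use Mimir's own tracing helpers, and the function, package, and event names here are hypothetical):

```go
package distributor

import (
	"context"
	"io"
	"net/http"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// readBodyWithSpanEvents surrounds the request body read with two span
// events, so a trace viewer shows how much of the request duration is
// spent waiting for bytes from the client connection.
func readBodyWithSpanEvents(ctx context.Context, r *http.Request) ([]byte, error) {
	span := trace.SpanFromContext(ctx)

	span.AddEvent("start reading request body")
	body, err := io.ReadAll(r.Body)
	span.AddEvent("done reading request body",
		trace.WithAttributes(attribute.Int("body.bytes", len(body))))

	return body, err
}
```

If the gap between the two events lines up with the observed spikes, the bottleneck is upstream of the distributor's own processing (the client, or whatever sits between the client and the distributor).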
It happens in both of our environments, production and dev.
It seems to affect a subset of our tenants.
To Reproduce
Steps to reproduce the behavior:
These latency spikes seem to happen at random, so we are not sure how to reproduce them.
Expected behavior
No unexpected latency spikes.
Environment
Additional Context
We didn't find anything in the logs of the Mimir distributor, given that it doesn't log each incoming request and doesn't log much in general.