Latency spike in distributors but not in ingesters #9714
Replies: 4 comments
-
As that line should be reading from the client connection, is it possible that the latency stems from the client? It certainly looks that way to me, especially considering the trace you included (~18 seconds spent reading the proto payload).
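To illustrate the point, here is a minimal, hypothetical Go handler (not Mimir's actual push handler; the endpoint path and names are only illustrative) showing why a slow client shows up as server-side latency: the handler only returns once the whole payload has arrived over the connection, so a slow upload inflates the server-side duration and trace spans even though the server itself is idle.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// pushHandler drains the request body and logs how long the read took.
// If the client writes the payload slowly (congestion, CPU starvation on
// the client, small TCP windows), that time is spent here, inside the
// server handler, and therefore shows up in server-side latency metrics
// and traces even though the server is doing no work of its own.
func pushHandler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	n, err := io.Copy(io.Discard, r.Body) // blocks until the client has sent everything
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	log.Printf("read %d bytes in %s", n, time.Since(start))
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/api/v1/push", pushHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```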
-
I suspect this should be a discussion and not an issue, since there's no indication of a Mimir bug yet. The symptoms so far are of an operational issue.
-
Hi, sorry, I forgot about discussions. Do you by any chance have the permissions to change it into one?
-
Yeah, that's what I am also starting to suspect. The interesting thing is that we don't see any increased latency on the ingress (Contour, based on Envoy), but this might come down to what exactly the Envoy metrics measure.
-
Describe the bug
At random times we are observing latency spikes in distributors that last anywhere from 1-2 minutes to 1-2 hours. They come and go, resolving on their own.
When they happen we don't see any corresponding latency spike in ingesters.
We don't seem to be hitting CPU / memory limits in distributors, and adding more distributor pods doesn't seem to have any effect.
(Screenshots: Mimir overview, Mimir writes, Mimir writes resources, and Mimir writes networking dashboards. The gap in metrics is unrelated; our Prometheus scraping Mimir was stuck in a terminating state.)
From traces we can see the latency seems to be spent in the distributor and not in the ingesters.
The time seems to be spent between 2 span events.
We can see it affects requests from the Prometheus replica of the HA pair that is not currently elected, whose data is not written to the ingesters.
In our dev environment we deployed a modified version of Mimir with 2 extra span events to try to narrow down where the time is spent.
Source with the modifications: https://github.com/grafana/mimir/pull/9707/files.
And the source of the latency seems to be https://github.com/grafana/mimir/pull/9707/files#diff-e290efac4355b20d1b6858649bd29946bab12964336247c96ad1370e51e4503bR248.
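For reference, a rough sketch of what that kind of instrumentation looks like, assuming an OpenTelemetry-style API (the actual PR may use Mimir's own tracing helpers, and the function, package, and event names here are hypothetical):

```go
package distributor

import (
	"context"
	"io"
	"net/http"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// readBodyWithSpanEvents surrounds the request body read with two span
// events, so a trace viewer shows how much of the request duration is
// spent waiting for bytes from the client connection.
func readBodyWithSpanEvents(ctx context.Context, r *http.Request) ([]byte, error) {
	span := trace.SpanFromContext(ctx)

	span.AddEvent("start reading request body")
	body, err := io.ReadAll(r.Body)
	span.AddEvent("done reading request body",
		trace.WithAttributes(attribute.Int("body.bytes", len(body))))

	return body, err
}
```

If the gap between the two events lines up with the observed spikes, the bottleneck is upstream of the distributor's own processing (the client, or whatever sits between the client and the distributor).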
It happens in both of our environments, production and dev.
It seems to affect a subset of our tenants.
To Reproduce
Steps to reproduce the behavior:
These latency spikes seem to happen at random, so we are not sure how to reproduce them.
Expected behavior
No unexpected latency spikes.
Environment
Additional Context
We didn't find anything in the logs of the Mimir distributor, given that it doesn't log each incoming request and doesn't log much in general.