Linkerd Proxy memory usage increase & OOM when app responds with ~5MB payload at ~12 requests/sec #11077
-
Kindly asking for some help here 🙏🏼 This issue is plaguing us and there's no clear path forward on how to diagnose and/or fix it, so any advice would be greatly appreciated.
-
Just to confirm, the proxy that's OOMing is the one injected into the application pod that serves the large payload, correct? It would also be useful to know whether the proxy's memory usage remains elevated after traffic stops, or whether it returns to a lower level once traffic is temporarily halted. Regarding your speculations:
This shouldn't be the case: as soon as the proxy receives a chunk of body data from the application, it should forward that chunk to the client immediately. The only time the proxy is supposed to hold an entire body payload in memory is when it's a request body with a …
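To make the distinction concrete on the application side, here's a minimal Go sketch (purely illustrative, not the actual app in question) of the two ways a net/http handler might emit a large body: chunk-by-chunk writes that are flushed as they're produced, versus one big write of a pre-assembled buffer. If the app emits the body in flushed chunks, each chunk can leave the pod as soon as it's written.

```go
// Hypothetical sketch -- not the actual application code.
package main

import (
	"bytes"
	"net/http"
)

func main() {
	payload := bytes.Repeat([]byte("x"), 5<<20) // stand-in for the ~5MB compressed payload

	// Streamed: write in chunks and flush each one, so every chunk can be
	// forwarded downstream as soon as it is produced.
	http.HandleFunc("/streamed", func(w http.ResponseWriter, r *http.Request) {
		flusher, _ := w.(http.Flusher)
		const chunk = 64 << 10 // 64 KiB per write
		for off := 0; off < len(payload); off += chunk {
			end := off + chunk
			if end > len(payload) {
				end = len(payload)
			}
			w.Write(payload[off:end])
			if flusher != nil {
				flusher.Flush()
			}
		}
	})

	// Buffered: one large write; for a body this size net/http will still
	// send it chunked over HTTP/1.1 unless Content-Length is set explicitly.
	http.HandleFunc("/buffered", func(w http.ResponseWriter, r *http.Request) {
		w.Write(payload)
	})

	http.ListenAndServe(":8080", nil)
}
```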
-
Hey @hawkw 👋
Correct, the proxies injected into the application pods in question are the ones getting OOMKilled. Indeed, when we reduce or stop the traffic, the proxies' memory usage drops back to more or less the same levels as before we introduced the traffic.
Ok, in this case I'm out of ideas.
-
Good questions. The response is HTTP/1.1, but I'm not certain about the transfer encoding; let me check and get back to you.
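In the meantime, a rough Go probe along these lines should tell us (a sketch only; the URL is a placeholder for the real in-cluster endpoint): it reports the protocol version, transfer encoding, content length, and actual body size of the large response.

```go
// Rough probe to check how the large response is framed.
// The URL is a placeholder -- substitute the real in-cluster endpoint.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
)

func main() {
	resp, err := http.Get("http://app.example.svc.cluster.local:8080/large-payload")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	fmt.Println("proto:            ", resp.Proto)            // e.g. HTTP/1.1
	fmt.Println("transfer-encoding:", resp.TransferEncoding) // ["chunked"] or empty
	fmt.Println("content-length:   ", resp.ContentLength)    // -1 when unknown/chunked
	fmt.Println("content-encoding: ", resp.Header.Get("Content-Encoding"))

	n, _ := io.Copy(io.Discard, resp.Body) // drain to measure the actual body size
	fmt.Println("body bytes read:  ", n)
}
```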
-
Hey @hawkw 👋🏼 Just wanted to summarize our current status and share some additional observations & tests we've done. Some of these I already shared with you in DM, but I wanted to document them here nonetheless.
We feel confident at this point that there shouldn't be any bottlenecks in the overall transport & components (clients, nginx, application, network, etc.). We're able to scale up the load considerably without any impact on these components when running without the linkerd proxy on the app Pods: at peak load we reached ~850MB/s of throughput, while with linkerd the proxy would OOM at around ~550MB/s. We're not sure how to proceed. We could probably increase the memory allocation for the linkerd proxy, but without understanding why this is happening we don't feel confident in that approach. I also realize we're running a fairly outdated version of Linkerd (2.12.4); do you think there's anything in newer versions that might have an impact on what we're seeing? Any other thoughts or suggestions? 🙏🏼
-
Hello friends 👋🏼
I'm hoping to get some guidance/direction on how to troubleshoot something we're experiencing on one of our services.
Environment: linkerd 2.12.4 on K8s 1.24.9
We have a fairly simple application that serves HTTP requests and sits behind an nginx ingress controller; both the ingress controllers and the application Pods are meshed. The requests hitting this app usually have a small response size; however, from time to time it needs to serve a larger (~5MB, compressed) payload as a response.
Here's the very basic flow: clients → nginx ingress controller (meshed) → application Pods (meshed).
The application is written in Go, using net/http to serve requests. It runs on 8 Pods and is sufficiently sized (based on CPU/memory usage observations); each Pod receives ~10 req/s. However, when the app responds with a ~5MB payload per request, we notice that the linkerd proxy sidecar's memory utilization increases quite rapidly, and if we increase the load a little further (to ~12 req/s per Pod) the linkerd proxy eventually OOMs. There are roughly ~70 inbound connections on each Pod.
Bandwidth-wise, each Pod is responding at ~20MB/s (at ~10 req/s), which drives memory usage on the proxy to ~200MB. When we increase the load slightly, each Pod is sending ~25MB/s, and that's when the linkerd proxy eventually OOMs.
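For reference, the load pattern can be sketched roughly like this (hypothetical Go code and placeholder URL, not our actual load tooling): ~12 req/s against a single Pod, with every response fully drained so the reported MB/s reflects the whole payload.

```go
// Minimal load sketch: ~12 req/s against one pod, each ~5MB response drained.
// URL and rate are placeholders, not our actual tooling.
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	const target = "http://app-pod.example:8080/large-payload" // placeholder
	const rps = 12

	var bytesRead int64
	ticker := time.NewTicker(time.Second / rps)
	defer ticker.Stop()

	go func() {
		for range ticker.C {
			go func() {
				resp, err := http.Get(target)
				if err != nil {
					return
				}
				defer resp.Body.Close()
				n, _ := io.Copy(io.Discard, resp.Body) // drain the full payload
				atomic.AddInt64(&bytesRead, n)
			}()
		}
	}()

	// Report observed throughput once a second.
	for range time.Tick(time.Second) {
		mb := float64(atomic.SwapInt64(&bytesRead, 0)) / (1 << 20)
		fmt.Printf("~%.1f MB/s\n", mb)
	}
}
```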
We're trying to understand where the bottleneck here might be.
We understand we can increase the proxy memory requests & limits, but when we tried that it just seemed to shift the issue downstream (to the nginx ingress controller's linkerd proxies), which caused a much bigger problem, as it impacted all services behind the ingress.
(speculation starts here 😄)
It appears as if the proxy buffers the response data in memory while waiting for the application to finish responding, and releases it only after the full response has been received? The application's inbound latency increases only at p99, to around 500~700ms, when it's serving the larger payload; otherwise we're not seeing anything abnormal and we're not quite sure how to troubleshoot this further.
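One way we could test this speculation (a rough sketch, assuming the proxy's admin endpoint has been port-forwarded from its default port 4191, and without assuming any particular metric names): poll the sidecar's /metrics while the larger responses are in flight and watch whether the memory-related lines track the payload size.

```go
// Sketch: poll the linkerd-proxy admin endpoint and print memory-related
// metric lines while the large responses are being served.
// Assumes the admin port (4191 by default) is port-forwarded to localhost:4191;
// no specific metric names are assumed, we just filter for "memory".
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	for range time.Tick(2 * time.Second) {
		resp, err := http.Get("http://localhost:4191/metrics")
		if err != nil {
			fmt.Println("scrape failed:", err)
			continue
		}
		sc := bufio.NewScanner(resp.Body)
		sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // metric lines can be long
		for sc.Scan() {
			line := sc.Text()
			if !strings.HasPrefix(line, "#") && strings.Contains(line, "memory") {
				fmt.Println(line)
			}
		}
		resp.Body.Close()
		fmt.Println("----")
	}
}
```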
I reviewed other issues/discussions I found on this topic. This one suggests that memory would increase as the proxy needs to handle more connections, which correlates with what we see on the ingress controllers in general, and we have increased the linkerd proxy memory allocations there. However, this doesn't seem to explain what we're seeing on the application Pods: they each have a steady ~70 inbound connections, and this only happens when they are serving the larger (~5MB) payload as a response.
Any guidance/assistance would be greatly appreciated!