Getting 502/504 with Pod Readiness Gates during rolling updates #1719
Comments
We've been having the same issue. We confirmed with AWS that there is some propagation time between when a target is marked draining in a target group and when that target actually stops receiving new connections. So, at the suggestion of other issues I've seen in the old project for this, we added a 20s sleep in a preStop hook.
@calvinbui The pods need a preStop hook to sleep, since most web servers (e.g. nginx/apache) will stop accepting new connections once a soft stop (SIGTERM) is requested. It takes some time for the controller to deregister the pod (after it gets the endpoint change event), and it takes time for the ELB to propagate target changes to its data plane. @AirbornePorcine did you still see 502s with the 20s sleep? Have you enabled the pod readinessGate? If you are using instance mode, you need an extra 30-second sleep (since kube-proxy updates iptables rules every 30 seconds).
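To make the sequencing above concrete, here is a minimal sketch of the preStop-sleep pattern being discussed; the 35-second sleep, grace period, names, and image are illustrative assumptions, not values recommended by the controller project:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                         # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      # Must be longer than the preStop sleep plus the app's own shutdown time.
      terminationGracePeriodSeconds: 60
      containers:
        - name: app
          image: nginx                 # placeholder image
          ports:
            - containerPort: 80
          lifecycle:
            preStop:
              exec:
                # Keep serving while the controller deregisters the target and the
                # ELB propagates the change; add ~30s more when using instance mode.
                command: ["sh", "-c", "sleep 35"]
```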
@M00nF1sh that's correct: even with a 20s sleep and the auto-injected readinessGate, doing a rolling restart of my pods results in a small number of 502s. For reference, this is about 5-6 502s out of 1m total requests in the same time period, so a very small amount, but still not something we want. I'm using IP mode here.
@AirbornePorcine in my own test, the sum of … And the preStop hook sleep only needs to be … Just asked the ELB team whether they have p90/p99 metrics available for …
Ok, so, we just did some additional testing on that sleep timing. The only way we've been able to get zero 502s during a rolling deploy is to set our preStop sleep to at least the target group deregistration delay plus 5 seconds.

Looking back in my emails, I realized this is exactly what AWS support had previously told us to do: don't stop the target from processing requests until at least the target group deregistration delay has elapsed (we added the 5s to account for the controller processing and propagation time, as you mentioned). Next week we'll try tweaking our deregistration delay and see if the same holds true (it's currently 60s, but we really don't want to sleep that long if we can avoid it).

Something you might want to try though, @calvinbui!
Thanks for the comments. After adding a preStop hook with a sleep, I was able to get all 200s during a rolling update of the deployment. I set the deregistration time to 20 seconds and the sleep to 30 seconds. However, during a node upgrade/rolling update I got 503s for around one minute. Are there any recommendations from AWS about that? I'm guessing I would need to bump up the deregistration and probably the sleep times a lot higher to allow the new node to fire up and the new pods to start as well.
After increasing the sleep to 90s and …, node upgrades no longer caused downtime. However, if a deployment only has 1 replica, there is still ~1 min of downtime. For deployments with >=2 replicas, this was not a problem and no downtime was observed. The documentation should be updated, so I'll leave this issue open.

EDIT: For the 1-replica issue, it was because k8s doesn't do a rolling deployment during a cluster/node upgrade. It is considered involuntary, so I had to scale up to 2 replicas and add a PDB.
How about (ab)using a ValidatingAdmissionWebhook to delay pod deletion? The sketch of the idea: have the webhook intercept pod DELETE requests and hold them back until the load balancer has deregistered and drained the target, then let the deletion proceed.

edit: I've implemented this idea in a chart here: https://github.com/foriequal0/pod-graceful-drain
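As a rough illustration of the shape of such a webhook registration, here is a sketch that intercepts pod DELETE requests so a controller can delay them; every name below is hypothetical, and the actual pod-graceful-drain chart is more involved than this:

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-deletion-delay              # hypothetical name
webhooks:
  - name: delay-pod-deletion.example.com   # hypothetical webhook name
    admissionReviewVersions: ["v1"]
    sideEffects: NoneOnDryRun
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["DELETE"]
    clientConfig:
      service:
        name: pod-drain-webhook          # hypothetical service that holds the request
        namespace: kube-system           # while the target is deregistered, then allows it
        path: /validate
    failurePolicy: Ignore                # don't block deletions if the webhook is down
    timeoutSeconds: 30                   # max delay a single admission call can add
```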
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale or close it with /close. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
This is still a serious issue, any update on it? We currently use the solution from @foriequal0, which has been doing a great job so far. I wish this were handled officially by the controller project itself.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale or close it with /close. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
What's the protocol for getting this prioritized? We've hit it as well. This is a serious issue, and while I understand there's a workaround (hack), it's certainly reducing my confidence in running production workloads on this thing.
I'm also seeing this issue, but I think it's not necessarily an issue with the LB Controller. It seems draining for NLBs doesn't work as I would have expected: instead of stopping new connections and letting existing connections continue, it keeps sending new connections to the draining targets for a while. From my testing, the actual delay for a target to be fully deregistered and drained seems to be around 2-3 minutes. Adding settings along the lines of the sketch below to each container exposed behind an NLB has worked for me so far.
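A minimal sketch of the kind of per-container settings being described, assuming a preStop sleep sized to cover the observed 2-3 minutes of NLB drain time; the names, image, and durations are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nlb-backend-example            # hypothetical name
spec:
  terminationGracePeriodSeconds: 240   # must exceed the preStop sleep
  containers:
    - name: api                        # hypothetical container name
      image: nginx                     # placeholder image
      lifecycle:
        preStop:
          exec:
            # Keep the container serving while the NLB finishes draining the target
            # (~2-3 minutes observed above); 180s is an illustrative value.
            command: ["sh", "-c", "sleep 180"]
```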
I would love to be able to get rid of this but it simply seems that the NLBs are extremely slow in performing management operations. I have even seen target registrations take almost 10 minutes.
I completely agree with what @ardove has said. The point of this readinessGate feature is to delay the termination of the pod for as long as the LB needs it. If I have to update my chart to put a sleep in the preStop hook, then this feature is not working. If I have to use the preStop hook, I might as well not use the readinessGate feature at all.

In my observation, the pod is allowed to terminate as soon as the new target becomes ready/healthy. I have seen the old target still draining after the pod terminates, and obviously that's going to result in 502 errors for those requests.

This feature almost works. Without the feature enabled I see 30 seconds to 1 minute of solid 502 errors. With the feature enabled I get brief sluggishness and maybe one or a handful of 502s. Hopefully you can get this fixed, because unfortunately close to good isn't good enough for something like this.
I thought it might be useful to share this KubeCon talk, "The Gotchas of Zero-Downtime Traffic /w Kubernetes", where the speaker goes into the strategies required for zero-downtime rolling updates with Kubernetes deployments (at least as of 2022): https://www.youtube.com/watch?v=0o5C12kzEDI

It can be a bit hard to conceptualise the limitations of the async nature of Ingress/Endpoint objects and Pod termination, so I found the above talk (and live demo) helped a lot. Hopefully it's useful for others.
@M00nF1sh I am implementing the same in my Kubernetes cluster but am unable to work out the sleep time for the preStop hook and terminationGracePeriodSeconds. Currently terminationGracePeriodSeconds is 120 seconds and the deregistration delay is 300 seconds. Do we have any mechanism to calculate this?
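Going by the earlier comments in this thread (sleep at least as long as the deregistration delay plus a small propagation margin, and a grace period longer than the sleep), here is a sketch using the 300-second delay mentioned above; the margins are assumptions, not official guidance:

```yaml
# Illustrative arithmetic only:
#   target group deregistration delay              = 300s
#   controller + ELB propagation margin (assumed)  =  10s
#   preStop sleep                 >= 300 + 10      = 310s
#   app shutdown after SIGTERM (assumed)           =  20s
#   terminationGracePeriodSeconds >= 310 + 20      = 330s
spec:
  terminationGracePeriodSeconds: 330
  containers:
    - name: app                        # hypothetical container name
      image: example/app:latest        # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 310"]
```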
Does anyone have an update on this? After almost two years, I cannot see that it has been solved natively yet.
I wonder if finalizers would solve this problem nicely here 🤔
For clusters using Traefik proxy as ingress, it might also be worth looking into the entrypoint lifecycle feature to control graceful shutdowns: https://doc.traefik.io/traefik/routing/entrypoints/#lifecycle
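For context, a sketch of what that Traefik setting looks like in a static configuration file; the timeout values are illustrative:

```yaml
# traefik.yml (static configuration) - timeouts are illustrative
entryPoints:
  web:
    address: ":80"
    transport:
      lifeCycle:
        requestAcceptGraceTimeout: 30s   # keep accepting requests after SIGTERM
        graceTimeOut: 15s                # then wait for in-flight requests to finish
```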
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale or close it with /close. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle rotten or close it with /close. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
Would https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/ help here?
/remove-lifecycle rotten
Bumping this issue. Adding a sleep() does not sound professional; it's a workaround and only a workaround :/
I am experiencing this issue, too.
Any update? Does the pod readiness gate work with v2.6?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale or close it with /close. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle rotten or close it with /close. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
Hi folks, I wanted to add that I experimented with all the suggested solutions here, and what finally worked for me. I tried an extra sleep during preStop for the container with a matching extra terminationGracePeriod for the pod, reducing the ALB deregistration delay, and explicitly turning the pod healthcheck unhealthy during preStop, among various other experiments and combinations. Even extending the termination to 10 minutes didn't stop traffic from continually flowing from the ALBs, nor the small number of errors right as the pods finished termination.

--> I finally tried switching alb.ingress.kubernetes.io/target-type from ip to instance. After reflecting, I don't know why I thought …
/remove-lifecycle rotten
If anyone faces this, the steps in this long article are what you should follow: https://easoncao.com/zero-downtime-deployment-when-using-alb-ingress-controller-on-amazon-eks-and-prevent-502-error/

This makes the 502/504 errors go away completely.
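For reference, a sketch of the pieces such write-ups typically combine: readiness-gate injection via the namespace label and a shorter deregistration delay on the Ingress. The names and values below are illustrative and not taken from the linked article:

```yaml
# Namespace label that enables automatic readiness-gate injection by the controller;
# the namespace name is illustrative.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
---
# Ingress with a lowered ALB deregistration delay; values are illustrative.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  namespace: my-app
  annotations:
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
```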
I did as you described in your article and still have 502/504 issues when I curl my health endpoint every millisecond.
Hi team, I have followed the above steps, but no luck; I am still facing 502s.
Check that you have "sh" in your container. E.g. if you are using gcr.io/distroless/base, ensure that you use the gcr.io/distroless/base:debug-nonroot-amd64 version, which includes /busybox/sh. The preStop setting in your Kubernetes manifest should also be adjusted to use "/busybox/sh".
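A minimal sketch of that adjustment, assuming a distroless :debug image; the sleep duration is illustrative:

```yaml
# Container spec fragment: distroless :debug images ship busybox,
# so the shell lives at /busybox/sh rather than /bin/sh.
lifecycle:
  preStop:
    exec:
      command: ["/busybox/sh", "-c", "sleep 30"]
```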
Hey Stepan, we are using node:lts-alpine and amazoncorretto:21-alpine-jdk images; sh is present in them.
Hi team, is there any solution to this problem? Adding the readiness gate, reducing the exponential backoff, and a preStop hook: none of them helped fix the issue.
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to its lifecycle rules. You can mark this issue as fresh with /remove-lifecycle stale or close it with /close. Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
/remove-lifecycle stale
I'm making use of the Pod Readiness Gate on Kubernetes Deployments running Golang-based APIs. The goal is to achieve full zero downtime deployments.
During a rolling update of the Kubernetes Deployment, I'm getting 502/504 responses from these APIs. This did not happen when setting target-type: instance. I believe the problem is that AWS does not drain the pod from the LB before Kubernetes terminates it.
Timeline of events:
a. AWS begins de-registering/draining the target
b. Kubernetes begins terminating the pod
This is tested with a looping curl command; the results show intermittent 502/504 responses during the rolling update.
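A loop along these lines is a typical way to run this kind of test; the hostname and path are placeholders, not the ones actually used:

```sh
# Continuously probe the service and log any non-2xx response; URL is a placeholder.
while true; do
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 https://my-app.example.com/healthz)
  if [ "$code" -lt 200 ] || [ "$code" -ge 300 ]; then
    echo "$(date -u +%H:%M:%S) got HTTP $code"
  fi
  sleep 0.1
done
```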