How can we find the right value of sleep time for a zero-downtime rolling update? #2106
Comments
So it's not appropriate to set it to 40 seconds, and we currently don't offer an optimal setting either, since there are a lot of variables involved (see above). You should tune it according to your application and cluster usage.
@M00nF1sh <additional questions> During a rolling update, the pod status changes as shown below.
More detail between "Terminating" -> "Terminated"
I set the preStop hook timeout to 150s (over 2 minutes), but health check requests continue until the Node.js application is terminated with SIGTERM. Is it implemented so that the ALB deregisters a Pod in "Terminating" status when using the "instance" target type? I think there is no way for applications to know that they are in a preStop hook, so they cannot reject health check requests from the ALB. As a result, the applications are terminated while some requests get no response (if the application has no graceful shutdown logic). Can't I redeploy with zero downtime using just k8s and the ALB? What is the expected behaviour of k8s and the ALB controller during the preStop duration? The most important question is when the ALB stops health checking a "Terminating" pod.
@M00nF1sh Regarding my comment above, I was confused about why the ALB health-checks a terminating pod. In our case, a preStop sleep of 5 seconds is enough; after that, the ALB no longer sends traffic to "Terminating" pods. Thank you very much for the conversation.
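For readers following along, the preStop sleep being discussed is an ordinary container lifecycle hook. A minimal sketch of the pod spec fragment (the container name, image, and the 5-second value are illustrative only, mirroring the comment above, and the image is assumed to ship a `sleep` binary):

```yaml
# Pod spec fragment (not a full manifest): delay SIGTERM so the ALB can stop
# sending traffic to the pod before the application begins shutting down.
containers:
  - name: app                      # hypothetical container name
    image: example/app:latest      # hypothetical image; must contain a `sleep` binary
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "5"]  # 5s worked in the commenter's cluster; tune per environment
```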
FWIW, if someone faces the same issue and stumbles upon this thread: we ran into the same problem and contacted AWS support. Their statement was that it can indeed happen that, after deregistration, the ALB still sends new requests to the target. This should be compensated for with a preStop sleep; they recommended 60 seconds to be on the safe side. With 60 seconds, our load tests did not show any 502 errors during rolling upgrades. We were also told that the issue (the ALB sending requests to draining targets) should be fixed in the future, so we expect to eventually be able to decrease the preStop sleep to a lower value.
@MatthiasWinzeler: What should the value of the preStop hook be? Some posts suggest that terminationGracePeriodSeconds > preStop sleep > deregistration delay, while others suggest the preStop sleep only needs to cover controller process time + ELB API propagation time + HTTP req/resp RTT. How can we calculate the preStop hook value?
@jyotibhanot We don't have long-running requests (which I think would require respecting the deregistration delay), so for us only the controller process time + ELB API propagation time + HTTP req/resp RTT applied. To figure it out for your use case, AWS recommended simply testing your applications under some realistic load.
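To make the relationship in the question above concrete, here is a sketch of a Deployment combining the 60-second sleep from the AWS recommendation with a grace period long enough to cover it. All names and numbers are illustrative, and this does not settle whether the sleep also needs to exceed the deregistration delay, which the comments here disagree on:

```yaml
# Sketch: terminationGracePeriodSeconds is set higher than the preStop sleep
# so the kubelet does not SIGKILL the pod before the sleep (plus the app's own
# graceful shutdown) has finished.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                            # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      terminationGracePeriodSeconds: 90   # > preStop sleep + app shutdown time
      containers:
        - name: my-app
          image: example/app:latest       # hypothetical image
          lifecycle:
            preStop:
              exec:
                command: ["sleep", "60"]  # buffer for ALB deregistration to propagate
```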
During a recent terraform apply in integration which rolled the k8s nodes, we saw a number of 502/503 responses from the load balancers. The theory is that this is due to kubernetes-sigs/aws-load-balancer-controller issue #2366: pods and load balancers are updated at the same time, but load balancer updates don't happen instantly, so the load balancer may continue to send traffic to pods which are terminating. [A comment on another issue](kubernetes-sigs/aws-load-balancer-controller#2106 (comment)) suggests that 60 seconds is long enough to avoid any 502s, although the rest of the comments suggest "it depends" on various factors. In any case, 15 seconds does not seem to be long enough for us to roll our nodes without serving some 502s, so we should try a higher value. Higher values will presumably result in slower deployments (not just node rollouts, but anything which requires pods to terminate and new pods to come up). 60 seconds still feels just about tolerable to me, but I don't think we'd want to go much higher than that.
Hello,
I'm using ALB controller v2.2.0 and Ingress with instance target type.
I want to know how to perform a rolling update with zero downtime.
I ran some tests.
According to the comments, is it not appropriate to set it to 40s? #1719 (comment), #1719 (comment)
(controller process time + ELB API propagation time + HTTP req/resp RTT + kube-proxy's iptables update time)
How can I find the right value for the sleep time?
A longer sleep time means more containers are running simultaneously during the rollout, I think.
Also, is the sleep time not related to the deregistration delay of the target group?
(The target group used in the tests above is set to a 300s deregistration delay, but just 70 seconds of sleep was enough to remove the 502 errors.)
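For reference, the deregistration delay discussed here is a target group attribute, which the AWS Load Balancer Controller lets you set through an Ingress annotation. A sketch with hypothetical names, using the 300-second value from the test setup above:

```yaml
# Sketch: Ingress for the AWS Load Balancer Controller with instance targets
# and an explicit deregistration delay on the target group.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app                     # hypothetical name
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: instance
    alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=300
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app       # hypothetical Service name
                port:
                  number: 80
```

Lowering this attribute shortens how long draining targets linger, but the comments above suggest it was the preStop sleep, rather than the deregistration delay, that eliminated the 502s in these tests.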
Expected outcome
Zero-downtime deployment, without 502 errors.
Environment
Additional Context:
Test script (macOS):