Document zero-downtime deployment for IP targets #2131
Comments
/kind documentation |
@kishorj Is there a timeline you're targeting to document how to achieve zero-downtime deployments? If not, could you please give some pointers on how this can be achieved? Looking at the related issues filed, the solutions are mostly around adding a sleep in the preStop step. I'd really appreciate it if you could share your recommendation. |
Found this in the documentation: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.2/deploy/pod_readiness_gate This talks about a deploy scenario where the service can have an outage. Will give this a try today and see if it solves my case. |
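For reference, the readiness gate that page describes is enabled per namespace; a minimal sketch of that setup (the namespace name here is just an example) could look like:

```yaml
# Label the namespace so the controller's webhook injects the
# target-group-binding readiness gate into new pods in this namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: my-app                                   # example namespace
  labels:
    elbv2.k8s.aws/pod-readiness-gate-inject: enabled
```

With the label in place, new pods behind an IP-target target group only become Ready once their target passes the load balancer health check, so a rolling update keeps old pods around until the new ones can actually receive traffic.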
Enabling the Pod Readiness Gate reduced the 5xx errors, but did not completely eliminate them. Found this issue #1719 (comment) where @M00nF1sh has explained the breakdown of things to consider while deciding the preStop sleep value. After setting an appropriate value in preStop, I'm able to deploy without any errors. It was also suggested in one of the issues to enable graceful shutdown in the server, but I found that if the preStop sleep is high enough, not doing graceful shutdown is also fine, since the pod will get fully deregistered from the LB during the sleep phase itself. So by the time the server receives the TERM signal, the LB would've already stopped sending new requests to the pod (and in-flight requests would have also completed). But it's still good to enable it in case there are any other edge cases. |
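As a concrete illustration of the preStop sleep discussed above, a sketch might look like the snippet below; the sleep value, grace period, and image are placeholders that have to be tuned using the breakdown linked from #1719, not recommended values.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                          # example name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      # Must exceed the preStop sleep plus the time the app needs to
      # finish in-flight requests after it finally receives SIGTERM.
      terminationGracePeriodSeconds: 90
      containers:
      - name: app
        image: nginx                 # example image
        ports:
        - containerPort: 80
        lifecycle:
          preStop:
            exec:
              # Keep the pod serving while the controller deregisters the
              # target and the load balancer drains; SIGTERM is only sent
              # after this hook returns.
              command: ["sleep", "60"]
```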
I did create an article about this a while back. https://aws.plainenglish.io/6-tips-to-improve-availability-with-aws-load-balancers-and-kubernetes-ad8d4d1c0f61
|
@keperry Thanks for sharing, that was very helpful. |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs, so this bot triages issues and PRs according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
I haven't got it working yet. Just a simple replacement of the pod (for example, changing from image: nginx to image: httpd) still causes some connections to drop.
Testing with version 2.4, EKS 1.20 |
@sjmiller609 - are you signalling via the readiness probe that the pod should no longer take traffic, by returning a 500 during the "shutdown wait" period? I can't quite tell if your app is doing that. It looks like the "sleep" is handling the "shutdown wait", but if nothing makes the readiness probe fail, kube will keep sending traffic there. Additionally, I would explicitly set the timeout for your readiness probe. |
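One way to implement this suggestion without touching the application itself is to tie the readiness probe to a marker file that the preStop hook deletes before sleeping; this is only a sketch of the pattern being described, not a snippet from the thread, and the paths, image, and timings are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: readiness-flip-example       # example name
spec:
  terminationGracePeriodSeconds: 90
  containers:
  - name: app
    image: nginx                     # example image
    # Readiness follows a marker file created after startup.
    readinessProbe:
      exec:
        command: ["cat", "/tmp/ready"]
      periodSeconds: 5
      failureThreshold: 1
      timeoutSeconds: 2              # explicit probe timeout, as suggested
    lifecycle:
      postStart:
        exec:
          command: ["touch", "/tmp/ready"]
      preStop:
        exec:
          # Fail readiness first, then keep serving while the LB drains.
          command: ["sh", "-c", "rm -f /tmp/ready && sleep 60"]
```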
Thanks, I think this is what I'm missing. I will give this a shot right now! |
I'm giving this a go, but I'm not sure it's quite right, because I think you are saying the workload should continue serving regular traffic, just not the readiness probe
|
Since I will have to work out details in the workload, I will replace my demo service with my actual ingress controller and then report back. |
I think the intended order of events is:
It seems like in my case, my workload can just sleep for 180 seconds, and doesn't need to be customized for the readiness probe. It's just about waiting long enough to satisfy the limitation of the AWS NLB.
I'm trying to understand the purpose of @keperry's suggestion, and I am guessing the reasoning is that by setting readiness to fail, the AWS LB controller will mark the target as unhealthy (not sure?). Then this satisfies the condition in the above quote to "ensure that the instance is unhealthy before you deregister it". References:
Other notes:
I will post my manifests below that I used to get it working in my case. |
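Related to the deregistration timing above, the controller also exposes the target group's deregistration delay through annotations; a sketch for an NLB Service with IP targets might look like the following (the attribute values are illustrative, and the annotation names assume a v2.x controller):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service                   # example name
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: external
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    # How long the NLB keeps draining a deregistered target; the preStop
    # sleep should cover this plus the time it takes to start draining.
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: deregistration_delay.timeout_seconds=120,deregistration_delay.connection_termination.enabled=true
spec:
  type: LoadBalancer
  selector:
    app: web
  ports:
  - port: 80
    targetPort: 8080
```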
Not shown:
The manifests below were working in my test: run the monitoring script and do a "kubectl rollout restart deployments -n istio-system". I think they are not the minimal configuration. Istio configuration:
Configuration of istio
Nginx
|
@sjmiller609 tl;dr: check out this workaround: #1719 (comment) |
Update, this configuration has been working perfectly for a few weeks:
|
Any way to define |
Unfortunately not; we used kustomize on top of Helm (MacGyver solution?) |
@woehrl01 Could you elaborate on your setup with distroless Istio proxies, as I don't see a way to achieve it without some form of preStop? See my comment here istio/istio#47265 (comment) But I don't see how |
@clayvan you're right, and I apologize for not updating this thread. Even though the config I mentioned above does work some of the time, it's not reliable enough to achieve zero downtime on AWS with an NLB. The only way we achieved this is by injecting the already mentioned preStop hook. |
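Since kustomize-on-top-of-Helm was mentioned earlier in the thread, one hypothetical way to inject such a preStop hook into the gateway Deployment is a strategic-merge patch; the deployment and container names, namespace, and durations are assumptions about a typical istio-ingressgateway install, not something confirmed here, and distroless proxy images may not even ship a sleep binary.

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- istio-rendered.yaml                # pre-rendered Helm output (example name)
patches:
- path: gateway-prestop-patch.yaml
---
# gateway-prestop-patch.yaml (strategic merge patch)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: istio-ingressgateway         # assumed deployment name
  namespace: istio-system
spec:
  template:
    spec:
      # Must be longer than the preStop sleep below.
      terminationGracePeriodSeconds: 150
      containers:
      - name: istio-proxy            # assumed container name
        lifecycle:
          preStop:
            exec:
              # Requires a sleep binary in the image; distroless proxy
              # images may need a different approach.
              command: ["sleep", "120"]
```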
@hariomsaini, this solution might not work when using the Istio Gateway Helm chart, because the pipe operator in the Istio config refers to an object. The following patch is for customizing the Istio Deployment:

apiVersion: builtin
kind: PatchTransformer
metadata:
  name: patch-graceful-shutdown
target:
  kind: IstioOperator
patch: |
  - op: add
    path: /spec/components/ingressGateways/0/k8s/overlays/0/patches/-
    value:
      path: spec.template.metadata.annotations.proxy\.istio\.io/config
      value: |
        drainDuration: 360s
        parentShutdownDuration: 361s
        terminationDrainDuration: 362s

The manifest of the Istio Operator will now be:

apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: default-istiocontrolplane
  namespace: istio-system
spec:
  components:
    ingressGateways:
    - enabled: true
      k8s:
        hpaSpec:
          maxReplicas: 30
          minReplicas: 3
        overlays:
        - kind: Deployment
          name: istio-ingressgateway
          patches:
          - path: spec.template.metadata.annotations.proxy\.istio\.io/config
            value: |
              drainDuration: 360s
              parentShutdownDuration: 361s
              terminationDrainDuration: 362s

This is an invalid manifest. Because |
Did anyone achieve zero downtime with the instance target type? |
FWIW there's this documentation, which is the most "complete" I'm aware of. |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues, so this bot triages un-triaged issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs, so this bot triages issues according to its standard lifecycle rules. Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". |
Is your feature request related to a problem?
Document setting up zero-downtime deployments with the AWS Load Balancer Controller.
Describe the solution you'd like
Documentation with the detailed steps.
Describe alternatives you've considered
N/A