
Delay pod deletion to handle deregistraton delay #1775

Closed · wants to merge 1 commit

Conversation

foriequal0

@foriequal0 foriequal0 commented Jan 23, 2021

Pods are deleted during a Deployment rollout and then removed from
Endpoints. When a Pod is deleted, the controller deregisters it from
TargetGroups and the ELB drains its traffic. The delay between Pod
deletion and the propagation of draining is a known cause of ELB 5xx
errors: the ELB keeps sending traffic to terminated Pods, unaware that
they are gone, until they are fully deregistered.

We solve this problem by keeping Pods alive for a delay period. A
well-known workaround attaches a "sleep" command to the preStop
lifecycle hook to keep Pods alive for the delay. However, this is a
duct-tape mitigation, and it can be difficult to hook a "sleep 9999"
into some Deployments depending on their configuration.

In this commit, we use a ValidatingAdmissionWebhook to be notified about
Pods that are about to be deleted, and block their immediate deletion.
This lets us act on them while keeping them alive. We remove all labels
from them, so they are removed from Endpoints and deregistered from
TargetGroups. We also remove ownerReferences so garbage collection
doesn't kick in. Most importantly, the Pods are still alive, so they can
continue to serve new traffic that arrives during the delay. After
keeping them alive for the delay, we can safely delete them, since no
new traffic should be arriving by then.
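For context, the preStop "sleep" workaround mentioned above typically looks like the following (an illustrative sketch, not part of this PR's change; the image name and delay values are placeholders):

```yaml
# Illustrative preStop "sleep" workaround: the container keeps running
# for a fixed delay after deletion starts, giving the ELB time to
# deregister the target before the process receives SIGTERM.
spec:
  terminationGracePeriodSeconds: 120   # must exceed the sleep below
  containers:
    - name: app
      image: example/app:latest        # placeholder image
      lifecycle:
        preStop:
          exec:
            command: ["sleep", "90"]
```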
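A webhook registration for this approach could look roughly like the following (a hedged sketch; the metadata name, webhook name, namespace, service name, and path are placeholders, not taken from this PR):

```yaml
# Sketch of a ValidatingWebhookConfiguration that intercepts Pod DELETE
# requests, so the controller can deny the first deletion, strip labels
# and ownerReferences, and delete the Pod itself after the delay.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: pod-graceful-drain            # placeholder name
webhooks:
  - name: drain.example.com           # placeholder
    admissionReviewVersions: ["v1"]
    sideEffects: NoneOnDryRun
    failurePolicy: Ignore             # don't block deletions if the webhook is down
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]
        operations: ["DELETE"]
    clientConfig:
      service:
        namespace: kube-system        # placeholder
        name: webhook-service         # placeholder
        path: /validate-pod-deletion  # placeholder
```

Setting failurePolicy to Ignore is the safer choice here: if the webhook itself becomes unavailable, pod deletions proceed normally instead of being blocked cluster-wide.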

You can test with the prebuilt docker image: https://github.com/users/foriequal0/packages/container/aws-load-balancer-controller/901501. I've also forked eks-charts/aws-load-balancer-controller: https://github.com/foriequal0/eks-charts/tree/drain/stable/aws-load-balancer-controller. You can use these helm chart values:

```yaml
podGracefulDrainDelay: 90s
image:
  repository: ghcr.io/foriequal0/aws-load-balancer-controller
  tag: v2.1.1-2-geb716265
```

Please see https://github.com/foriequal0/pod-graceful-drain

closes: #1719
closes: #1065
related?: #1403

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jan 23, 2021
@k8s-ci-robot
Contributor

Welcome @foriequal0!

It looks like this is your first PR to kubernetes-sigs/aws-load-balancer-controller 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/aws-load-balancer-controller has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Contributor

Hi @foriequal0. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 23, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: foriequal0
To complete the pull request process, please assign after the PR has been reviewed.
You can assign the PR to them by writing /assign in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@codecov-io

codecov-io commented Jan 23, 2021

Codecov Report

Merging #1775 (eb71626) into main (3807a2d) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1775   +/-   ##
=======================================
  Coverage   46.76%   46.76%           
=======================================
  Files         110      110           
  Lines        5988     5988           
=======================================
  Hits         2800     2800           
  Misses       2925     2925           
  Partials      263      263           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3807a2d...eb71626. Read the comment docs.

@foriequal0 foriequal0 changed the title Delay pod deletion to handle deregistraton delay [WIP] Delay pod deletion to handle deregistraton delay Jan 23, 2021
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 23, 2021
@foriequal0 foriequal0 changed the title [WIP] Delay pod deletion to handle deregistraton delay Delay pod deletion to handle deregistraton delay Jan 25, 2021
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels Jan 25, 2021
@foriequal0
Author

foriequal0 commented Jan 26, 2021

I think we shouldn't remove all labels. Should I remove only the labels that are matched by the service's selector? Also, should I add an option to delay deletion until draining has fully finished (deregistration_delay.timeout_seconds?) for apps that don't support connection draining? In any case, it works perfectly for me at this point.
I also found this sentence in the comment on the service's selector: "If empty or not present, the service is assumed to have an external process managing its endpoints, which Kubernetes will not modify." In that case, should we edit the Endpoints object directly rather than editing the pod's labels?
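To illustrate the selector question above: a Service selects Pods by only a subset of their labels, so removing just the selected labels would be enough to detach a Pod from the Endpoints (an illustrative example; all names and labels here are placeholders):

```yaml
# This Service matches Pods on "app" and "tier" only. Removing those
# two labels from a Pod detaches it from this Service's Endpoints;
# unrelated labels (e.g. "team") could remain untouched.
apiVersion: v1
kind: Service
metadata:
  name: web                # placeholder
spec:
  selector:
    app: web
    tier: frontend
  ports:
    - port: 80
      targetPort: 8080
```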

@foriequal0
Author

foriequal0 commented Jan 26, 2021

I found that kubectl drain [node], which uses the eviction API, bypasses the admission webhook, so the pod is deleted immediately. The same API is used by aws-node-termination-handler. Curiously, the label change still happens afterwards. Also, kubectl drain [node] --disable-eviction, which uses a normal pod delete, stops deleting the remaining pods at the first failure :/.

I'm starting to think this is hackier than I thought. I should make this a separate project.

@mihasya

mihasya commented Jan 27, 2021

Came here after several days of trying to configure this thing JUST SO and still getting handfuls of 500s on every deploy. Thank you for this work; I really hope a clean way to make this work can be figured out. Of course, drain node is an extremely important use case. If you come up with a solution and need help testing, I would gladly run this PR on our staging environment. Thanks for digging into this!

@foriequal0
Author

foriequal0 commented Feb 5, 2021

@mihasya I've made this a separate package here: https://github.com/foriequal0/pod-graceful-drain
Can you test it? If you have any questions, please open an issue there.
I'll close this PR.
