Integrate Karpenter with AWS Health Checks (EC2 instances, EBS volumes, etc.) #7634

Open
balusarakesh opened this issue Jan 24, 2025 · 3 comments
Labels: feature (New feature or request), help-wanted (Extra attention is needed)

Comments

@balusarakesh

Description

Observed Behavior: We noticed that an EC2 node is unreachable and failing health checks, but Karpenter is not terminating it.

Expected Behavior: Karpenter should terminate the node if it is unreachable.

Reproduction Steps (Please include YAML): Not sure how to reproduce this, since the node failed health checks on its own.

  • The node currently has the following two taints (a detection sketch follows this list):
    node.kubernetes.io/unreachable:NoSchedule
    node.kubernetes.io/unreachable:NoExecute
  • I can also confirm in the EC2 console that the instance failed its health checks roughly 6 hours ago.
  • Most of the pods on the node are stuck in the Terminating state.
  • There are no logs related to this node/NodeClaim in Karpenter, even after enabling debug logs.
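
For illustration only, here is a minimal Go sketch of how a controller could detect the node.kubernetes.io/unreachable taints listed above; the hasUnreachableTaint helper and the inline Node object are hypothetical examples, not Karpenter's actual code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// hasUnreachableTaint reports whether a node carries the
// node.kubernetes.io/unreachable taint, regardless of effect.
func hasUnreachableTaint(node *corev1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == "node.kubernetes.io/unreachable" {
			return true
		}
	}
	return false
}

func main() {
	// A Node object mirroring the two taints observed on the broken node.
	node := &corev1.Node{
		Spec: corev1.NodeSpec{
			Taints: []corev1.Taint{
				{Key: "node.kubernetes.io/unreachable", Effect: corev1.TaintEffectNoSchedule},
				{Key: "node.kubernetes.io/unreachable", Effect: corev1.TaintEffectNoExecute},
			},
		},
	}
	fmt.Println("unreachable:", hasUnreachableTaint(node))
}
```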

Versions:

  • Chart Version: v0.34.0
  • Kubernetes Version (kubectl version): 1.29.8
@balusarakesh added the bug and needs-triage labels on Jan 24, 2025
@rschalo (Contributor) commented Jan 28, 2025

Would node repair address your issue here? https://docs.aws.amazon.com/eks/latest/userguide/node-health.html

That said, I don't believe node repair was backported, or that there are plans to do so; Karpenter responds to these events starting in v1.1. @engedaam to confirm.

@rschalo added the triage/needs-information label and removed the needs-triage label on Jan 29, 2025

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

@jonathan-innis (Contributor) commented Feb 21, 2025

Outside of NodeRepair, I think we should check whether there are gaps in NodeRepair that would require us to periodically poll instances for health checks. It shouldn't be particularly difficult to hook into these checks; it's mostly a question of whether that is necessary given the NodeRepair feature.

One thing I can think of is that a health-check failure on the EC2 instance could be acted on much more quickly than the optimistic NotReady check, which requires waiting 30m before terminating the node.
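
Not something Karpenter does today as far as I know, but as a rough sketch (assuming the AWS SDK for Go v2), polling EC2 status checks directly could look roughly like the following; the unhealthyInstanceIDs helper and the example instance ID are made up for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// unhealthyInstanceIDs returns the subset of instanceIDs whose EC2 system or
// instance status checks report "impaired". A provider could feed these IDs
// into its existing node-termination path instead of waiting for the 30m
// NotReady window to elapse.
func unhealthyInstanceIDs(ctx context.Context, client *ec2.Client, instanceIDs []string) ([]string, error) {
	out, err := client.DescribeInstanceStatus(ctx, &ec2.DescribeInstanceStatusInput{
		InstanceIds: instanceIDs,
		// Also return statuses for instances that are not in the running state.
		IncludeAllInstances: aws.Bool(true),
	})
	if err != nil {
		return nil, err
	}
	var unhealthy []string
	for _, status := range out.InstanceStatuses {
		if isImpaired(status.SystemStatus) || isImpaired(status.InstanceStatus) {
			unhealthy = append(unhealthy, aws.ToString(status.InstanceId))
		}
	}
	return unhealthy, nil
}

func isImpaired(summary *types.InstanceStatusSummary) bool {
	return summary != nil && summary.Status == types.SummaryStatusImpaired
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder instance ID standing in for the node reported in this issue.
	ids, err := unhealthyInstanceIDs(ctx, ec2.NewFromConfig(cfg), []string{"i-0123456789abcdef0"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("instances failing EC2 status checks:", ids)
}
```

In a real integration this would presumably run inside the AWS provider's reconciliation loop and hand impaired instances to the same disruption/termination machinery node repair uses, rather than being a standalone program.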

Also, marking this as a feature because (to me) this is about doing a health check integration for EC2 instances in the AWS provider.

@jonathan-innis added the feature and help-wanted labels and removed the bug, lifecycle/stale, and triage/needs-information labels on Feb 21, 2025
@jonathan-innis changed the title from "karpenter not terminating nodes in AWS that failed health checks and unreachable" to "Integrate Karpenter with AWS Health Checks (EC2 instances, EBS volumes, etc.)" on Feb 21, 2025