Integrate Karpenter with AWS Health Checks (EC2 instances, EBS volumes, etc.) #7634

Open
balusarakesh opened this issue Jan 24, 2025 · 3 comments
Labels: feature (New feature or request), help-wanted (Extra attention is needed)

Comments

@balusarakesh

Description

Observed Behavior: We noticed that an EC2 node is unreachable and failing health checks, but Karpenter is not terminating it.

Expected Behavior: Karpenter should terminate the node if it is unreachable.

Reproduction Steps (Please include YAML): Not sure how to reproduce this, since the node failed health checks on its own.

  • The node currently has the following two taints (a detection sketch follows this list):
    node.kubernetes.io/unreachable:NoSchedule
    node.kubernetes.io/unreachable:NoExecute
  • I can also confirm in the EC2 console that the instance failed its health checks roughly 6 hours ago.
  • Most of the pods on the node are stuck in the Terminating state.
  • There are no logs related to this node/NodeClaim in Karpenter, even after enabling debug logs.
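
For illustration only, here is a minimal Go sketch of how a controller could detect the node.kubernetes.io/unreachable taints listed above; the hasUnreachableTaint helper and the inline Node object are hypothetical examples, not Karpenter's actual code.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// hasUnreachableTaint reports whether a node carries the
// node.kubernetes.io/unreachable taint, regardless of effect.
func hasUnreachableTaint(node *corev1.Node) bool {
	for _, t := range node.Spec.Taints {
		if t.Key == "node.kubernetes.io/unreachable" {
			return true
		}
	}
	return false
}

func main() {
	// A Node object mirroring the two taints observed on the broken node.
	node := &corev1.Node{
		Spec: corev1.NodeSpec{
			Taints: []corev1.Taint{
				{Key: "node.kubernetes.io/unreachable", Effect: corev1.TaintEffectNoSchedule},
				{Key: "node.kubernetes.io/unreachable", Effect: corev1.TaintEffectNoExecute},
			},
		},
	}
	fmt.Println("unreachable:", hasUnreachableTaint(node))
}
```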

Versions:

  • Chart Version: v0.34.0
  • Kubernetes Version (kubectl version): 1.29.8
@balusarakesh added the bug and needs-triage labels on Jan 24, 2025
@rschalo (Contributor) commented Jan 28, 2025

Would node repair address your issue here? https://docs.aws.amazon.com/eks/latest/userguide/node-health.html

That said, I don't believe node repair was backported, or that there are plans to do so; Karpenter responds to these events starting in v1.1. @engedaam to confirm.

@rschalo added the triage/needs-information label and removed the needs-triage label on Jan 29, 2025

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.

@jonathan-innis (Contributor) commented Feb 21, 2025

Outside of NodeRepair, I think we should check whether there are gaps in NodeRepair that would require us to periodically poll instances for health checks. It shouldn't be particularly difficult to hook into these checks; it's mostly a question of whether that is necessary given the NodeRepair feature.

One thing I can think of is that a health-check failure on the EC2 instance could be acted on much more quickly than the optimistic NotReady check, which requires waiting 30m before terminating the node.
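
Not something Karpenter does today as far as I know, but as a rough sketch (assuming the AWS SDK for Go v2), polling EC2 status checks directly could look roughly like the following; the unhealthyInstanceIDs helper and the example instance ID are made up for illustration:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
	"github.com/aws/aws-sdk-go-v2/service/ec2/types"
)

// unhealthyInstanceIDs returns the subset of instanceIDs whose EC2 system or
// instance status checks report "impaired". A provider could feed these IDs
// into its existing node-termination path instead of waiting for the 30m
// NotReady window to elapse.
func unhealthyInstanceIDs(ctx context.Context, client *ec2.Client, instanceIDs []string) ([]string, error) {
	out, err := client.DescribeInstanceStatus(ctx, &ec2.DescribeInstanceStatusInput{
		InstanceIds: instanceIDs,
		// Also return statuses for instances that are not in the running state.
		IncludeAllInstances: aws.Bool(true),
	})
	if err != nil {
		return nil, err
	}
	var unhealthy []string
	for _, status := range out.InstanceStatuses {
		if isImpaired(status.SystemStatus) || isImpaired(status.InstanceStatus) {
			unhealthy = append(unhealthy, aws.ToString(status.InstanceId))
		}
	}
	return unhealthy, nil
}

func isImpaired(summary *types.InstanceStatusSummary) bool {
	return summary != nil && summary.Status == types.SummaryStatusImpaired
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Placeholder instance ID standing in for the node reported in this issue.
	ids, err := unhealthyInstanceIDs(ctx, ec2.NewFromConfig(cfg), []string{"i-0123456789abcdef0"})
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("instances failing EC2 status checks:", ids)
}
```

In a real integration this would presumably run inside the AWS provider's reconciliation loop and hand impaired instances to the same disruption/termination machinery node repair uses, rather than being a standalone program.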

Also, marking this as a feature because (to me) this is about doing a health check integration for EC2 instances in the AWS provider.

@jonathan-innis added the feature and help-wanted labels and removed the bug, lifecycle/stale, and triage/needs-information labels on Feb 21, 2025
@jonathan-innis changed the title from "karpenter not terminating nodes in AWS that failed health checks and unreachable" to "Integrate Karpenter with AWS Health Checks (EC2 instances, EBS volumes, etc.)" on Feb 21, 2025