Cannot disrupt NodeClaim: nodeclaim does not have an associated node #7631

Open
IgorKurylo1988 opened this issue Jan 24, 2025 · 3 comments
Labels
bug (Something isn't working), triage/needs-information (Marks that the issue still needs more information to properly triage)

Comments


IgorKurylo1988 commented Jan 24, 2025

Description

Observed Behavior:
The instance with GPU taints does not start, and the node does not join the cluster.
We use a GPU AMI based on amazon-eks-gpu-node-1.30-*.
This is a new installation of Karpenter. We have another cluster running Karpenter v0.32+, and GPU nodes work there.

Expected Behavior:
The instance joins the cluster.

Reproduction Steps (Please include YAML):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 24h
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu
      expireAfter: Never
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values:
            - g5
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
            - 2xlarge
            - 4xlarge
        - key: kubernetes.io/os
          operator: In
          values:
            - linux
        - key: kubernetes.io/arch
          operator: In
          values:
            - amd64
        - key: karpenter.sh/capacity-type
          operator: In
          values:
            - on-demand
        - key: karpenter.k8s.aws/instance-gpu-manufacturer
          operator: In
          values: ["nvidia"]
      taints:
        - key: nvidia.com/gpu
          effect: "NoSchedule"
          value: "true"
---
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: gpu
spec:
  amiFamily: AL2 # Amazon Linux 2
  instanceProfile: "sfly-aws-apc-dev-svc-eks-node-group-InstanceProfile"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery/standard-app-dev-common: "standard-app-dev-common"
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery/standard-app-dev-common: "standard-app-dev-common"
  amiSelectorTerms:
    - id: "ami-080bac37fb480fa75" # GPU AMI based on amazon-eks-gpu-node-1.30-v20250116
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 10000
        deleteOnTermination: true
        throughput: 125
  tags:
    Name: standard-app-dev-common-eks-gpu
    Environment: "dev"
    Provisioner: Karpenter
    ManagedBy: APC
    BusinessUnit: Consumer
    App: EKS
    Role: GPU Compute Node

Versions:

  • Chart Version: v1.1.1
  • Kubernetes Version (kubectl version): 1.30 - EKS AWS
  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
IgorKurylo1988 added the bug and needs-triage labels on Jan 24, 2025
Contributor

rschalo commented Jan 28, 2025

Does the instance launch at all? If so, can you check the kubelet logs for failures and share the output?

rschalo added the triage/needs-information label and removed the needs-triage label on Jan 29, 2025

This issue has been inactive for 14 days. StaleBot will close this stale issue after 14 more days of inactivity.


allyrr commented Feb 25, 2025

Hello,

I am encountering the same issue.

EKS version: v1.31.5-eks-8cce635
Karpenter version: 1.2.1

With the AL2 AMI family configured in the EC2NodeClass:

spec:
  amiFamily: AL2
  amiSelectorTerms:
  - id: ami-06806e88f71fcc3d2

I get the following events:

k describe NodeClaim dev-g4dn-test-zxd6r
...
Events:
  Type    Reason             Age    From       Message
  ----    ------             ----   ----       -------
  Normal  Launched           3m28s  karpenter  Status condition transitioned, Type: Launched, Status: Unknown -> True, Reason: Launched
  Normal  DisruptionBlocked  3m27s  karpenter  Nodeclaim does not have an associated node
  Normal  Registered         2m43s  karpenter  Status condition transitioned, Type: Registered, Status: Unknown -> True, Reason: Registered
  Normal  DisruptionBlocked  85s    karpenter  Node isn't initialized

Although the NodeClaim status is Unknown, the node is in the Ready state in the EKS cluster.

With Bottlerocket there is no such issue.
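
To see which conditions are holding a NodeClaim back, a small helper can list every status condition that is not True. This is only a minimal sketch and not something taken from this thread: it assumes the NodeClaim was dumped with `kubectl get nodeclaim <name> -o json` and that its conditions sit under `status.conditions` with `type`, `status`, and `reason` fields. The function name `pending_conditions` and the sample data (including the reason string) are hypothetical.

```python
import json


def pending_conditions(nodeclaim: dict) -> list:
    """Return 'Type=Status (Reason)' for every status condition that is not True."""
    conditions = nodeclaim.get("status", {}).get("conditions", [])
    return [
        f"{c.get('type')}={c.get('status')} ({c.get('reason', 'no reason given')})"
        for c in conditions
        if c.get("status") != "True"
    ]


# Hypothetical sample shaped like a NodeClaim object; the values are
# illustrative, not real output from the clusters discussed in this issue.
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "Launched",    "status": "True",    "reason": "Launched"},
      {"type": "Registered",  "status": "True",    "reason": "Registered"},
      {"type": "Initialized", "status": "Unknown", "reason": "ExampleReason"}
    ]
  }
}
""")

for line in pending_conditions(sample):
    print(line)
```

In practice the dict would come from `json.load` on the saved kubectl output instead of the inline sample. An empty result would mean every reported condition is True.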

3 participants