
Helm chart definition makes it so that nvidia-device-plugin is scheduled no matter what, leading to CrashLoopBackOff errors. #48

@amanshanbhag

Description

In the main chart, both the neuron-device-plugin and the nvidia-device-plugin are set to enabled: true. The nvidia-device-plugin has some additional logic that is supposed to schedule the DaemonSet only on specific nodes (those whose taints match the tolerations):

  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
    - key: sagemaker.amazonaws.com/node-health-status
      operator: Equal
      value: Unschedulable
      effect: NoSchedule

This logic doesn't exist for the neuron-device-plugin in the main values.yaml file.

Digging a bit deeper, I see that the neuron-device-plugin has its own chart subdirectory. That chart defines much more thoroughly which nodes the neuron-device-plugin pods schedule on: a nodeAffinity restricts them to the matching instances, which are essentially the Neuron-based ones.

This is quite an odd definition for the Helm charts. The nvidia-device-plugin runs no matter what: it schedules one pod per node regardless of node type, because there is no nodeAffinity restricting it to GPU node types. Tolerations only allow a pod onto tainted nodes; they don't prevent it from scheduling elsewhere. So the plugin lands on any node that has passed the health check, even nodes without the nvidia.com/gpu label.
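To make the distinction concrete: a restriction would need a nodeAffinity block alongside the tolerations, roughly like this (a sketch only; the nvidia.com/gpu.present label key is an assumption based on common GPU Feature Discovery labeling, not taken from this chart):

```yaml
# Hypothetical values.yaml fragment for the nvidia-device-plugin.
# Unlike a toleration, this affinity rule actively prevents the
# DaemonSet from scheduling on nodes without the matching label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.present  # assumed label key
              operator: In
              values: ["true"]
```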

Because of this, nvidia-device-plugin pods get scheduled on non-GPU instances, go into the CrashLoopBackOff state, and are restarted repeatedly. Relevant error:

I0127 15:32:51.363463       1 main.go:317] Retrieving plugins.
E0127 15:32:51.363574       1 factory.go:87] Incompatible strategy detected auto
E0127 15:32:51.363585       1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0127 15:32:51.363590       1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0127 15:32:51.363595       1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0127 15:32:51.363600       1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0127 15:32:51.372916       1 main.go:149] error starting plugins: error creating plugin manager: unable to create plugin manager: invalid device discovery strategy
stream closed EOF for kube-system/hyperpod-dependencies-nvidia-device-plugin-hcmdh (nvidia-device-plugin-ctr)

Can we implement a similar nodeAffinity definition for the nvidia-device-plugin too?
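For illustration, mirroring the neuron-device-plugin's approach could look something like matching on instance type (the label key node.kubernetes.io/instance-type is standard Kubernetes; the instance-type list below is an example I chose, not exhaustive and not taken from the chart):

```yaml
# Illustrative nodeAffinity for the nvidia-device-plugin, modeled on
# the instance-type matching the neuron chart uses for Neuron nodes.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:          # example GPU instance types only
                - p5.48xlarge
                - p4d.24xlarge
                - g5.48xlarge
```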
