In the main chart, both the neuron-device-plugin and nvidia-device-plugin are set to enabled: true. The nvidia-device-plugin has some additional logic that supposedly only schedules the DaemonSet on specific nodes (those matching the tolerations):
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: sagemaker.amazonaws.com/node-health-status
    operator: Equal
    value: Unschedulable
    effect: NoSchedule
This logic doesn't exist for the neuron-device-plugin in the main values.yaml file.
Digging a bit deeper, I see that the neuron-device-plugin has its own subdirectory for charts. Those charts have a much more thorough definition of which nodes the neuron-device-plugin pods should be scheduled on (i.e., only on instances that match the defined nodeAffinity, which is essentially the Neuron-based instances).
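For reference, the kind of definition the Neuron subchart uses looks roughly like the sketch below (illustrative only; the exact label and instance-type list in the subchart may differ):

# Illustrative sketch of a Neuron-only nodeAffinity; the real subchart
# may match a different set of instance types or use a different label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - trn1.2xlarge
                - trn1.32xlarge
                - inf2.xlarge
                - inf2.48xlarge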
This is quite an odd setup for the Helm charts. The nvidia-device-plugin runs no matter what -- it schedules one pod per node, regardless of node type, because there is no nodeAffinity restricting it to specific (GPU) node types. It lands on any node that has passed the health check, even if the node doesn't have the nvidia.com/gpu label.
Because of this, nvidia-device-plugin pods get scheduled on non-GPU instances and go into CrashLoopBackOff, with Kubernetes restarting them repeatedly. Relevant error:
I0127 15:32:51.363463 1 main.go:317] Retrieving plugins.
E0127 15:32:51.363574 1 factory.go:87] Incompatible strategy detected auto
E0127 15:32:51.363585 1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0127 15:32:51.363590 1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0127 15:32:51.363595 1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0127 15:32:51.363600 1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0127 15:32:51.372916 1 main.go:149] error starting plugins: error creating plugin manager: unable to create plugin manager: invalid device discovery strategy
stream closed EOF for kube-system/hyperpod-dependencies-nvidia-device-plugin-hcmdh (nvidia-device-plugin-ctr)
Can we implement a similar nodeAffinity definition for the nvidia-device-plugin too?
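For example, something like the following on the nvidia-device-plugin DaemonSet would keep it off non-GPU nodes (just a sketch; the instance-type values and the choice of label are assumptions, not a final list):

# Assumed approach: match GPU instance types explicitly. An alternative would be
# a GPU-presence label (e.g. from node-feature-discovery), if one is available.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - p4d.24xlarge
                - p5.48xlarge
                - g5.xlarge
                - g5.48xlarge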