In the main chart, both the neuron-device-plugin and nvidia-device-plugin are set to enabled: true. The nvidia-device-plugin has some additional logic that supposedly only schedules the DaemonSet on specific nodes (those matching the tolerations):
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
  - key: sagemaker.amazonaws.com/node-health-status
    operator: Equal
    value: Unschedulable
    effect: NoSchedule
This logic doesn't exist for the neuron-device-plugin in the main values.yaml file.
Digging a bit deeper, I see that the neuron-device-plugin has its own subdirectory for charts. Those charts have a much more thorough definition of which nodes the neuron-device-plugin pods should be scheduled on (i.e., only on instances that match the defined nodeAffinity, which is essentially the Neuron-based instances).
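For reference, the kind of definition the Neuron subchart uses looks roughly like the sketch below (illustrative only; the exact label and instance-type list in the subchart may differ):

# Illustrative sketch of a Neuron-only nodeAffinity; the real subchart
# may match a different set of instance types or use a different label.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - trn1.2xlarge
                - trn1.32xlarge
                - inf2.xlarge
                - inf2.48xlarge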
This is quite an odd setup for the Helm charts. The nvidia-device-plugin runs no matter what -- it schedules one pod per node, regardless of node type, because there is no nodeAffinity restricting it to specific (GPU) node types. It lands on any node that has passed the health check, even if the node doesn't have the nvidia.com/gpu label.
Because of this, nvidia-device-plugin pods get scheduled on non-GPU instances and go into CrashLoopBackOff, with Kubernetes restarting them repeatedly. Relevant error:
I0127 15:32:51.363463 1 main.go:317] Retrieving plugins.
E0127 15:32:51.363574 1 factory.go:87] Incompatible strategy detected auto
E0127 15:32:51.363585 1 factory.go:88] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0127 15:32:51.363590 1 factory.go:89] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0127 15:32:51.363595 1 factory.go:90] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0127 15:32:51.363600 1 factory.go:91] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
E0127 15:32:51.372916 1 main.go:149] error starting plugins: error creating plugin manager: unable to create plugin manager: invalid device discovery strategy
stream closed EOF for kube-system/hyperpod-dependencies-nvidia-device-plugin-hcmdh (nvidia-device-plugin-ctr)
Can we implement a similar nodeAffinity definition for the nvidia-device-plugin too?
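For example, something like the following on the nvidia-device-plugin DaemonSet would keep it off non-GPU nodes (just a sketch; the instance-type values and the choice of label are assumptions, not a final list):

# Assumed approach: match GPU instance types explicitly. An alternative would be
# a GPU-presence label (e.g. from node-feature-discovery), if one is available.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values:
                - p4d.24xlarge
                - p5.48xlarge
                - g5.xlarge
                - g5.48xlarge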