Description
The controller has no dependence on GPUs and is forced to run on a control-plane node (preferably one without GPUs). However, if someone runs it on a node where the nvidia-container-runtime is installed and set as the default runtime for containerd (even though the node has no GPUs), the nvidia-container-runtime may trigger the nvidia-container-cli, which in turn will attempt to load NVML to look for "all" GPUs on the node, and this will fail for obvious reasons.
That's not a system configuration that we expect, and users might see something like

```
runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown
```

which may be difficult to debug.
For a better user experience, @klueska noted that we should set `NVIDIA_VISIBLE_DEVICES=void` in our Helm template for the controller (quotes from Slack).
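A minimal sketch of what that could look like in the controller's Deployment template follows. The deployment name, container name, and image are illustrative placeholders, not the project's actual manifest; only the `NVIDIA_VISIBLE_DEVICES=void` env entry is the suggested change:

```yaml
# Sketch only: names and image are placeholders for the real Helm template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: controller          # placeholder name
spec:
  template:
    spec:
      containers:
        - name: controller  # placeholder name
          image: controller:latest  # placeholder image
          env:
            # "void" tells the nvidia-container-runtime not to inject any GPUs,
            # so nvidia-container-cli (and hence NVML) is never invoked for this pod.
            - name: NVIDIA_VISIBLE_DEVICES
              value: void
```

With this set, the container starts normally even when nvidia-container-runtime is the default runtime on a node without GPUs.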