controller: run with NVIDIA_VISIBLE_DEVICES=void #402

Description

@jgehrcke

The controller has no dependence on GPUs and is forced to run on a control-plane node (preferably one without GPUs).

However, when it runs on a node that has the nvidia-container-runtime installed and set as the default runtime for containerd (even though there are no GPUs on the node), the nvidia-container-runtime will trigger the nvidia-container-cli, which in turn will attempt to load NVML to look for "all" GPUs on the node. That fails for obvious reasons.

That's not a system configuration that we expect, and users might see something like

runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown

which may be difficult to debug.

For a better user experience, @klueska noted that we should set NVIDIA_VISIBLE_DEVICES=void in our Helm template for the controller.

(Quotes above are from Slack.)
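
A minimal sketch of what that could look like in the controller's Deployment template. The file path, helper name, and values keys below are illustrative assumptions, not the chart's actual layout; only the env var itself is the proposed change:

```yaml
# templates/controller.yaml (illustrative excerpt; names and paths are assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "nvidia-dra-driver.fullname" . }}-controller
spec:
  template:
    spec:
      containers:
        - name: controller
          image: {{ .Values.controller.image }}
          env:
            # "void" tells the nvidia-container-runtime that this container
            # needs no GPUs, so nvidia-container-cli is never asked to load
            # NVML on a GPU-less control-plane node.
            - name: NVIDIA_VISIBLE_DEVICES
              value: void
```

With that env var set, the runtime should treat the controller container as not needing any GPU injection even when nvidia is the node's default runtime, so the NVML initialization error above should not occur at container start.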


    Closed