controller: run with NVIDIA_VISIBLE_DEVICES=void #402

Description

@jgehrcke

The controller has no dependence on GPUs and is forced to run on a control-plane node (preferably one without GPUs).

However, when it runs on a node that has the nvidia-container-runtime installed and set as the default runtime for containerd (even though there are no GPUs on the node), the nvidia-container-runtime will trigger the nvidia-container-cli, which in turn will attempt to load NVML to look for "all" GPUs on the node. That fails for obvious reasons.

That's not a system configuration that we expect, and users might see something like

runc create failed: unable to start container process: error during container init: error running prestart hook #0: exit status 1, stdout: , stderr: nvidia-container-cli: initialization error: nvml error: driver not loaded: unknown

which may be difficult to debug.

For a better user experience, @klueska noted that we should set NVIDIA_VISIBLE_DEVICES=void in our Helm template for the controller.

(Quotes above are from Slack.)
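
A minimal sketch of what that could look like in the controller's Deployment template. The file path, helper name, and values keys below are illustrative assumptions, not the chart's actual layout; only the env var itself is the proposed change:

```yaml
# templates/controller.yaml (illustrative excerpt; names and paths are assumed)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "nvidia-dra-driver.fullname" . }}-controller
spec:
  template:
    spec:
      containers:
        - name: controller
          image: {{ .Values.controller.image }}
          env:
            # "void" tells the nvidia-container-runtime that this container
            # needs no GPUs, so nvidia-container-cli is never asked to load
            # NVML on a GPU-less control-plane node.
            - name: NVIDIA_VISIBLE_DEVICES
              value: void
```

With that env var set, the runtime should treat the controller container as not needing any GPU injection even when nvidia is the node's default runtime, so the NVML initialization error above should not occur at container start.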


    Closed