Description
I've been using the GPU Operator to install the NVIDIA driver in my GKE cluster. This worked fine until last week, when the cluster was upgraded from GKE 1.32 to 1.33. One of the big changes is that the containerd version is now 2.0. Now, many of the GPU Operator DaemonSet pods are failing with the following error:
"failed to \"CreatePodSandbox\" for \"gpu-feature-discovery-nv7f9_gpu-operator(b07a2405-29b1-412e-8400-a1367a10e76d)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"gpu-feature-discovery-nv7f9_gpu-operator(b07a2405-29b1-412e-8400-a1367a10e76d)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"2b62cb2980cd477cefbea1155871bcd2f50b2c6fb3e922039554d8b3fc14784f\\\": plugin type=\\\"gke\\\" failed (add): failed to find plugin \\\"gke\\\" in path [/opt/cni/bin]\"
So we looked at the containerd config.toml file, and it appears it was modified with an incorrect block of CNI config:
[plugins."io.containerd.cri.v1.runtime".cni]
bin_dir = "/opt/cni/bin"
conf_dir = "/etc/cni/net.d"
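For comparison, here is roughly what I would expect the CNI section to look like on an unmodified GKE 1.33 node. The bin_dir path below is my assumption based on the error above (the gke plugin is clearly not under /opt/cni/bin, and GKE normally ships its CNI binaries under /home/kubernetes/bin); the node's actual defaults should be confirmed, e.g. with containerd config dump.

```toml
# Sketch only: expected CNI section on a stock GKE node running containerd 2.0.
# The bin_dir value is an assumption about where GKE places its CNI binaries
# (including the "gke" plugin); verify against the node's own config.toml.
[plugins."io.containerd.cri.v1.runtime".cni]
  bin_dir = "/home/kubernetes/bin"
  conf_dir = "/etc/cni/net.d"
```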
This is most likely caused by the nvidia-container-toolkit-daemonset, since it was the last pod that was able to start successfully (and it looks like the toolkit is what modifies this file?). The kind of edit I would expect the toolkit to make is sketched below.
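To be clear about what I mean by the toolkit modifying config.toml: the change I would expect is limited to registering the nvidia runtime, something along these lines. The section names and BinaryName path are assumptions on my part (carried over from what the toolkit writes for containerd 1.x configs), not what is actually on the node.

```toml
# Sketch of the toolkit edit I would expect (assumed layout, not taken from the node):
# it should only add the nvidia runtime entry, not rewrite the CNI section.
[plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.cri.v1.runtime".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
```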
The toolkit is from the container image nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04, and the GPU Operator version is 25.3.2 (the latest release).
Any insight into what is happening and why the config.toml was modified incorrectly for containerd 2.0?