Container Toolkit messed up config.toml in GKE cluster with containerd 2.0 #1222

@quoctruong

Description

I've been using the GPU Operator to install the NVIDIA driver in my GKE cluster. This worked fine until last week, when the cluster was upgraded from GKE 1.32 to 1.33. One of the big changes in that upgrade is that containerd is now version 2.0. Since then, many of the GPU Operator DaemonSet pods have been failing with the following error:

"failed to \"CreatePodSandbox\" for \"gpu-feature-discovery-nv7f9_gpu-operator(b07a2405-29b1-412e-8400-a1367a10e76d)\" with CreatePodSandboxError: \"Failed to create sandbox for pod \\\"gpu-feature-discovery-nv7f9_gpu-operator(b07a2405-29b1-412e-8400-a1367a10e76d)\\\": rpc error: code = Unknown desc = failed to setup network for sandbox \\\"2b62cb2980cd477cefbea1155871bcd2f50b2c6fb3e922039554d8b3fc14784f\\\": plugin type=\\\"gke\\\" failed (add): failed to find plugin \\\"gke\\\" in path [/opt/cni/bin]\"

So we looked at the node's containerd config.toml, and it appears it was modified to add an incorrect section:

    [plugins."io.containerd.cri.v1.runtime".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
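
My guess at why this breaks things (the exact GKE defaults here are an assumption on my part): GKE nodes keep their CNI binaries, including the gke plugin, in a GKE-managed directory rather than the containerd default, so overriding bin_dir back to /opt/cni/bin means the gke plugin can no longer be found. Roughly, I would expect the unmodified section to look something like this:

    # Rough sketch of the expected GKE-managed section; the bin_dir path is an
    # assumption and may differ between GKE versions.
    [plugins."io.containerd.cri.v1.runtime".cni]
      bin_dir = "/home/kubernetes/bin"
      conf_dir = "/etc/cni/net.d"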

This was most likely done by the nvidia-container-toolkit-daemonset, since it was the last pod that managed to start successfully (and it also looks like the toolkit modifies this file?).

The toolkit comes from the container image nvcr.io/nvidia/k8s/container-toolkit:v1.17.8-ubuntu20.04, and the GPU Operator is version 25.3.2 (the latest one).
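
For anyone who wants to check the same thing, the file can be read from an affected node along these lines (the node name is a placeholder):

    # Inspect the containerd config on an affected GPU node; <gpu-node-name> is a placeholder.
    kubectl debug node/<gpu-node-name> -it --image=ubuntu -- \
      chroot /host cat /etc/containerd/config.toml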

Any insight on what is happening and why the config.toml was modified incorrectly for containerd 2.0?
