GPU crashing on 1 node. #1628

@ryanm101

NAME   STATUS   ROLES                                       AGE     VERSION        INTERNAL-IP     EXTERNAL-IP   OS-IMAGE                           KERNEL-VERSION          CONTAINER-RUNTIME
nuc1   Ready    control-plane,etcd,master,worker            2y40d   v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 38 (Server Edition)   6.5.6-200.fc38.x86_64   containerd://1.7.6-k3s1.26
nuc2   Ready    control-plane,coral.ai,etcd,master,worker   127m    v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 39 (Server Edition)   6.6.2-201.fc39.x86_64   containerd://1.7.6-k3s1.26
nuc3   Ready    control-plane,etcd,master,worker            42d     v1.26.9+k3s1   x.x.x.x   <none>        Fedora Linux 38 (Server Edition)   6.5.8-200.fc38.x86_64   containerd://1.7.6-k3s1.26

I am running three master nodes using k3s.
The GPU plugin deploys fine on NUC1 and NUC3.
On NUC2 the container crashes with:

E1216 11:45:32.208374       1 manager.go:146] Failed to serve gpu.intel.com/i915: rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing: dial unix /var/lib/kubelet/device-plugins/kubelet.sock: connect: permission denied"
Cannot register to kubelet service
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).registerWithKubelet
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:352
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).setupAndServe
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:280
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*server).Serve
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/server.go:207
github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin.(*Manager).handleUpdate.func1
	/go/src/github.com/intel/intel-device-plugins-for-kubernetes/pkg/deviceplugin/manager.go:144
runtime.goexit
	/usr/local/go/src/runtime/asm_amd64.s:1598

Command used to provision NUC2:

curl -sfL https://get.k3s.io | K3S_URL=https://cluster.domain:6443 K3S_TOKEN=1:server:1 INSTALL_K3S_VERSION=v1.26.9+k3s1 sh -s - server --flannel-backend=none --disable-network-policy --cluster-cidr=x.x.x.x/x --service-cidr=x.x.x.x/x --cluster-init --disable=servicelb --disable traefik --selinux

The only differences between NUC2 and NUC1/3 are:

  1. NUC2 runs Fedora 39 (FC39); the other two run FC38.
  2. When k3s first started on NUC2 it complained about SELinux and said to add '--selinux' to the startup command. The other two nodes were not started with this flag.
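
A rough sketch of how the SELinux suspicion could be checked on NUC2, run as root. This is an assumption about where to look, not a confirmed diagnosis: it assumes the standard SELinux and audit tools (getenforce, ausearch) are installed, and the helper names selinux_mode and socket_label are made up for this example. It only reads state and changes nothing.

```shell
#!/bin/sh
# Sketch: collect SELinux evidence for the kubelet.sock "permission denied" error.
# Socket path is taken from the error message in the plugin log.
SOCK=/var/lib/kubelet/device-plugins/kubelet.sock

# Print the current SELinux mode, or a note if getenforce is not installed.
selinux_mode() {
    if command -v getenforce >/dev/null 2>&1; then
        getenforce
    else
        echo "unknown (getenforce not installed)"
    fi
}

# Print the SELinux label of the given path, or a note if it does not exist.
socket_label() {
    if [ -e "$1" ]; then
        ls -Z "$1"
    else
        echo "missing: $1"
    fi
}

echo "SELinux mode: $(selinux_mode)"
socket_label "$SOCK"

# List recent AVC denials that mention kubelet, if the audit tools are present.
if command -v ausearch >/dev/null 2>&1; then
    ausearch -m avc -ts recent 2>/dev/null | grep -i kubelet \
        || echo "no recent kubelet AVC denials found"
fi
```

If the mode is Enforcing and an AVC denial for kubelet.sock shows up right after the plugin pod restarts, that would point at SELinux policy rather than the Fedora version itself. Comparing the output with NUC1 or NUC3 would show whether the socket label differs between nodes.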

Any advice appreciated.
I will test re-adding the node without '--selinux' and, if all else fails, move it to FC38.
