What happened?
In a kubespray cluster with a single control plane: after rebooting a worker node (one without any control-plane components), the node does not become Ready again.
What did you expect to happen?
The worker node becomes Ready again after the reboot.
How can we reproduce it (as minimally and precisely as possible)?
Deploy a kubespray cluster with a single control plane.
Reboot a worker node without draining it first.
OS
Linux 6.8.0-39-generic x86_64
PRETTY_NAME="Ubuntu 24.04 LTS"
NAME="Ubuntu"
VERSION_ID="24.04"
VERSION="24.04 LTS (Noble Numbat)"
VERSION_CODENAME=noble
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=noble
LOGO=ubuntu-logo
Version of Ansible
We run ansible via gitlab-ci with quay.io/kubespray/kubespray:v2.25.0, so the versions are:
ansible [core 2.16.7]
  config file = /builds/reddoxx/operations/provisioning/anything-on-pmc/ansible.cfg
  configured module search path = ['/builds/reddoxx/operations/provisioning/anything-on-pmc/library', '/usr/share/ansible']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)
  jinja version = 3.1.4
  libyaml = True

Version of Python
python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)
Version of Kubespray (commit)
7e0a40725 (which is v2.25.0)
Network plugin used
calico
Full inventory with variables
https://gist.github.com/rdxmb/099f6ebd3979369f059a1efdc18f0ec2
Command used to invoke ansible
ansible-playbook -i $INVENTORY /kubespray/cluster.yml
Output of ansible run
--- everything is ok here, so I am not posting the output ---
Anything else we need to know
To me it seems to be a chicken-and-egg problem:
kubelet cannot connect to the apiserver via localhost:6443, where nginx-proxy-node-[n] should be running and proxying to the kubernetes apiserver.
nginx-proxy-node-[n] cannot become ready because kubelet is not working correctly ...
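The two halves of the loop can be checked directly. A hedged diagnostic sketch, shown here against a mock config file rather than the real `/etc/kubernetes/kubelet.conf`:

```shell
# Mock stand-in for /etc/kubernetes/kubelet.conf on a broken worker node
conf=$(mktemp)
printf 'server: https://localhost:6443\n' > "$conf"

# 1) kubelet's apiserver endpoint is pinned to localhost ...
grep 'server:' "$conf"

# 2) ... but localhost:6443 is only served by the nginx-proxy static pod,
#    which kubelet itself has to start. On the real node you would check:
#      ss -ltn | grep 6443    # empty while nginx-proxy-node-[n] is down
```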
root@node-8:~# systemctl status kubelet | tail
Aug 07 16:34:59 node-8 kubelet[917]: E0807 16:34:59.146019 917 controller.go:146] "Failed to ensure lease exists, will retry" err="Get \"https://localhost:6443/apis/coordination.k8s.io/v1/namespaces/kube-node-lease/leases/node-8?timeout=10s\": dial tcp 127.0.0.1:6443: connect: connection refused" interval="7s"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.193916 917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.566827 917 kubelet_node_status.go:352] "Setting node annotation to enable volume controller attach/detach"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568002 917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasSufficientMemory"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568047 917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasNoDiskPressure"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568062 917 kubelet_node_status.go:669] "Recording event message for node" node="node-8" event="NodeHasSufficientPID"
Aug 07 16:34:59 node-8 kubelet[917]: I0807 16:34:59.568089 917 kubelet_node_status.go:70] "Attempting to register node" node="node-8"
Aug 07 16:34:59 node-8 kubelet[917]: E0807 16:34:59.568756 917 kubelet_node_status.go:92] "Unable to register node with API server" err="Post \"https://localhost:6443/api/v1/nodes\": dial tcp 127.0.0.1:6443: connect: connection refused" node="node-8"
Aug 07 16:35:00 node-8 kubelet[917]: I0807 16:35:00.193654 917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
Aug 07 16:35:01 node-8 kubelet[917]: I0807 16:35:01.193301 917 csi_plugin.go:913] Failed to contact API server when waiting for CSINode publishing: Get "https://localhost:6443/apis/storage.k8s.io/v1/csinodes/node-8": dial tcp 127.0.0.1:6443: connect: connection refused
root@node-8:~# grep server /etc/kubernetes/kubelet.conf
    server: https://localhost:6443
root@node-8:~# crictl pods | tail
fb128f73279b0   21 hours ago   NotReady   prometheus-prometheus-0                                reddoxx-cloud-wharf         0   (default)
c9b63478fcff3   21 hours ago   NotReady   max-map-count-setter-fjs9s                             rdx-node-bootstrap-sysctl   0   (default)
c77504859d18e   21 hours ago   NotReady   kube-prometheus-stack-prometheus-node-exporter-g4vsp   kube-prometheus-stack       0   (default)
772c45953bc1c   21 hours ago   NotReady   csi-rbdplugin-g4rqx                                    ceph-csi                    0   (default)
491f1658497c3   21 hours ago   NotReady   minio-operator-7cbcd9b458-t4w6j                        minio-operator              0   (default)
3ed2f5b5913a9   21 hours ago   NotReady   kustomize-controller-54df4985d-b2rbg                   flux-system                 0   (default)
093aedf6b4a90   22 hours ago   NotReady   nodelocaldns-jhxcc                                     kube-system                 0   (default)
8a8f493601caa   22 hours ago   NotReady   calico-node-dx7fj                                      kube-system                 0   (default)
6b1ab648b8e7b   22 hours ago   NotReady   kube-proxy-vvdvk                                       kube-system                 0   (default)
16e3dc577108c   22 hours ago   NotReady   nginx-proxy-node-8                                     kube-system                 0   (default)
There is also a backup file created by kubespray that contains the correct server IP:
root@node-8:~# diff /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.5151.2024-08-06@18\:32\:01~
    server: https://localhost:6443   |       server: https://10.139.131.91:6443
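The diff above can be reproduced self-contained with mock files (the real paths are /etc/kubernetes/kubelet.conf and the kubespray backup next to it):

```shell
# Mock stand-ins for the live config and the kubespray backup
tmp=$(mktemp -d)
printf '    server: https://localhost:6443\n'     > "$tmp/kubelet.conf"
printf '    server: https://10.139.131.91:6443\n' > "$tmp/kubelet.conf.bak"

# diff exits non-zero when the files differ, hence the `|| true`
diff "$tmp/kubelet.conf" "$tmp/kubelet.conf.bak" || true
```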
Workaround
cp /etc/kubernetes/kubelet.conf.5151.2024-08-06@18\:32\:01~ /etc/kubernetes/kubelet.conf
systemctl restart kubelet
root@node-8:~# crictl pods | grep nginx
e8e2c11236d72   52 seconds ago   Ready      nginx-proxy-node-8   kube-system   1   (default)
16e3dc577108c   22 hours ago     NotReady   nginx-proxy-node-8   kube-system   0   (default)
Now the node gets ready again. 🎉
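The restore step can be sketched self-contained on mock files instead of the real /etc/kubernetes paths (the kubelet restart is left as a comment):

```shell
# Mock stand-ins for the live config and the kubespray backup
tmp=$(mktemp -d)
printf '    server: https://localhost:6443\n'     > "$tmp/kubelet.conf"
printf '    server: https://10.139.131.91:6443\n' > "$tmp/kubelet.conf.bak"

# Restore the backup, which still points at the real apiserver IP
cp "$tmp/kubelet.conf.bak" "$tmp/kubelet.conf"
grep 'server:' "$tmp/kubelet.conf"

# On the real node, follow up with: systemctl restart kubelet
```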
Just some more information:
root@node-7:~# ls -la /etc/kubernetes/kubelet.conf /etc/kubernetes/kubelet.conf.4690.2024-08-06@18\:32\:01~
-rw------- 1 root root 1950 Aug  6 18:32 /etc/kubernetes/kubelet.conf
-rw------- 1 root root 1954 Aug  6 18:31 /etc/kubernetes/kubelet.conf.4690.2024-08-06@18:32:01~
Setting loadbalancer_apiserver_localhost: false in group_vars fixes the problem.
After a reboot, the worker node comes back into the cluster.
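A hedged sketch of applying that fix: set loadbalancer_apiserver_localhost in group_vars and re-run the cluster playbook. The inventory path below is an example, not taken from the issue:

```shell
# Example inventory layout; adjust the path to your own inventory
mkdir -p inventory/mycluster/group_vars/all
cat >> inventory/mycluster/group_vars/all/all.yml <<'EOF'
loadbalancer_apiserver_localhost: false
EOF

# Then re-run the playbook, e.g.:
#   ansible-playbook -i $INVENTORY /kubespray/cluster.yml
```

With this setting, kubelet is configured with the apiserver address directly instead of the per-node localhost nginx proxy, so it no longer depends on a static pod it has to start itself.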