# Worker Node Failure

Let's have a look at the Practice Test of the Worker Node Failure.

## Fix the broken cluster

### Fix node01

i. Check the nodes

    ```
    kubectl get nodes
    ```

    We see that `node01` has a status of `NotReady`. This usually means that communication with the node's kubelet has been lost.
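    If you want more detail than the one-line status, `kubectl describe node` shows the node's conditions and last heartbeat times, which confirm that the kubelet has stopped reporting. A quick sketch (the `grep` range is only there to trim the output):

    ```
    kubectl describe node node01 | grep -A 8 "Conditions:"
    ```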
ii. Go to the node and investigate

    ```
    ssh node01
    ```

iii. Check kubelet status

    ```
    systemctl status kubelet
    ```

    We can see from the output that kubelet is not running; in fact, it has exited. Therefore we should try starting it.
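    Before starting it, you can optionally check the last few journal entries to see why it exited (a sketch using the systemd journal; `-n 20` is an arbitrary tail length):

    ```
    journalctl -u kubelet -n 20 --no-pager
    ```

    In this lab the service has simply exited, so starting it is enough.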
iv. Start kubelet

    ```
    systemctl start kubelet
    ```

v. Now check it is OK

    ```
    systemctl status kubelet
    ```

    Now we can see it is `active (running)`, which is good.

vi. Return to controlplane

    ```
    exit
    ```

vii. Check nodes again

    ```
    kubectl get nodes
    ```

    It is good!
## The cluster is broken again. Investigate and fix the issue.

### Fix cluster

i. Check the nodes

    ```
    kubectl get nodes
    ```

    We see that `node01` has a status of `NotReady`. This usually means that communication with the node's kubelet has been lost.
ii. Go to the node and investigate

    ```
    ssh node01
    ```

iii. Check kubelet status

    ```
    systemctl status kubelet
    ```

    We can see from the output that it is crashlooping: the status shows `activating (auto-restart)`, so this is likely a configuration issue.
iv. Check kubelet logs

    ```
    journalctl -u kubelet
    ```

    There is a lot of information; however, the error we are interested in, which is the cause of all the other errors, is this one:

    ```
    "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt: open /etc/kubernetes/pki/WRONG-CA-FILE.crt: no such file or directory"
    ```

    If kubelet cannot load its certificates, then it cannot authenticate with the API server. This is a fatal error, so kubelet exits.
v. Check the indicated directory for certificates

    ```
    ls -l /etc/kubernetes/pki
    ```

    We see it contains `ca.crt`, which we will assume is the correct certificate, therefore we need to find the kubelet configuration file and correct the error there.
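    If you want a quick sanity check that `ca.crt` really is a CA certificate before pointing kubelet at it, `openssl` can print its subject and validity (assuming openssl is installed on the node):

    ```
    openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -subject -dates
    ```

    On a kubeadm cluster the subject is typically `CN = kubernetes`.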
vi. Locate kubelet's configuration file

    kubelet is an operating system service, so its service unit file will give us that info:

    ```
    systemctl cat kubelet
    ```

    Note this line:

    ```
    Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
    ```

    There is the config YAML file.
vii. Fix configuration

    ```
    vi /var/lib/kubelet/config.yaml
    ```

    ```
    apiVersion: kubelet.config.k8s.io/v1beta1
    authentication:
      anonymous:
        enabled: false
      webhook:
        cacheTTL: 0s
        enabled: true
      x509:
        clientCAFile: /etc/kubernetes/pki/WRONG-CA-FILE.crt  # <- Fix this
    authorization:
      mode: Webhook
    ```

    Note that you can perform the same edit with a single `sed` command, which is quicker than editing in vi:

    ```
    sed -i 's/WRONG-CA-FILE.crt/ca.crt/g' /var/lib/kubelet/config.yaml
    ```
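    After the edit, it is worth confirming that the file now points at the certificate that actually exists (a simple check; the path comes from the lab):

    ```
    grep clientCAFile /var/lib/kubelet/config.yaml
    ```

    This should print `clientCAFile: /etc/kubernetes/pki/ca.crt`.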
viii. Check status

    Wait a few seconds; kubelet will be auto-restarted (or restart it manually, as shown after this step).

    ```
    systemctl status kubelet
    ```

    Now we can see it is `active (running)`, which is good. If it is not, then you made a mistake when editing the config file: you probably broke the YAML syntax or did not edit the certificate filename correctly. Return to step vii. above and fix it.
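    If you would rather not wait for systemd's auto-restart cycle, you can restart the service yourself (optional; the auto-restart achieves the same result):

    ```
    systemctl restart kubelet
    ```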
ix. Return to controlplane

    ```
    exit
    ```

x. Check nodes again

    ```
    kubectl get nodes
    ```

    It is good!
## The cluster is broken again. Investigate and fix the issue.

### Fix cluster

i. Check the nodes

    ```
    kubectl get nodes
    ```

    We see that `node01` has a status of `NotReady`. This usually means that communication with the node's kubelet has been lost.
ii. Go to the node and investigate

    ```
    ssh node01
    ```

iii. Check kubelet status

    ```
    systemctl status kubelet
    ```

    We can see it is `active (running)`; however, the API server still thinks there is an issue, so we must again go to the kubelet logs.
iv. Check kubelet logs

    ```
    journalctl -u kubelet
    ```

    There is a lot of information; however, the error we are interested in, which is the cause of all the other errors, is this one:

    ```
    "Unable to register node with API server" err="Post \"https://controlplane:6553/api/v1/nodes\": dial tcp 192.10.46.12:6553: connect: connection refused" node="node01"
    ```

    What do you know about the usual port for the API server? It's not `6553`! kubelet uses a kubeconfig file to connect to the API server just like kubectl does, so we need to locate and fix that.
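    If you are unsure of the correct port, you can confirm it from the controlplane node before editing anything (a sketch; on a kubeadm cluster the API server runs as a static pod, so its manifest lists the serving port):

    ```
    grep secure-port /etc/kubernetes/manifests/kube-apiserver.yaml
    kubectl cluster-info
    ```

    Both should point at port `6443`, the default for kube-apiserver.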
v. Locate kubelet's kubeconfig file

    kubelet is an operating system service, so its service unit file will give us that info:

    ```
    systemctl cat kubelet
    ```

    Note this line:

    ```
    Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
    ```

    There are two kubeconfigs here. The first one is used when a node is created and is joining the cluster. The second one is used for normal operation, so it is the second one we are interested in.
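    You can confirm which file holds the bad server address without opening it (a quick check of the `server:` line):

    ```
    grep server: /etc/kubernetes/kubelet.conf
    ```

    This should currently show `server: https://controlplane:6553`, i.e. the wrong port.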
vi. Fix the kubeconfig

    Port should be `6443`.

    ```
    vi /etc/kubernetes/kubelet.conf
    ```

    ```
    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: REDACTED
        server: https://controlplane:6553  # <- Fix this
      name: default-cluster
    contexts:
    - context:
        cluster: default-cluster
        namespace: default
        user: default-auth
      name: default-context
    current-context: default-context
    kind: Config
    preferences: {}
    users:
    - name: default-auth
      user:
        client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
        client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
    ```

    Note that you can perform the same edit with a single `sed` command, which is quicker than editing in vi:

    ```
    sed -i 's/6553/6443/g' /etc/kubernetes/kubelet.conf
    ```
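    With the port corrected, you can check from node01 that the API server endpoint is now reachable (optional; depending on cluster configuration the endpoint may demand authentication, but any HTTP response at all proves the port is right, whereas the old port refused the connection outright):

    ```
    curl -k https://controlplane:6443/healthz
    ```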
vii. Restart kubelet

    Since kubelet is already running (not crashlooping), we need to restart it so it gets the updated kubeconfig:

    ```
    systemctl restart kubelet
    ```
viii. Check status

    ```
    systemctl status kubelet
    ```

    Now we can see it is `active (running)`, which is good. If it is not, then you made a mistake when editing the kubeconfig, probably broke the YAML syntax. Return to step vi. above and fix it.
ix. Return to controlplane

    ```
    exit
    ```

x. Check nodes again

    ```
    kubectl get nodes
    ```

    It is good! If it is not, then you probably made a mistake setting the port number. Return to `node01` and redo from step vi. above.