Solution Worker Node Failure

  • Let's have a look at the Practice Test for Worker Node Failure

Solution

  1. Fix the broken cluster
    • Fix node01
    1. Check the nodes

      kubectl get nodes

      We see that node01 has a status of NotReady. This usually means that communication with the node's kubelet has been lost.
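
      Optionally, you can get more detail by describing the node; in its Conditions section, Ready will be reported as Unknown when the kubelet has stopped posting status.

      kubectl describe node node01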

    2. Go to the node and investigate

      ssh node01
    3. Check kubelet status

      systemctl status kubelet

      We can see from the output that kubelet is not running; in fact, it has exited. Therefore we should try starting it.

    4. Start kubelet

      systemctl start kubelet
    5. Now check it is OK.

      systemctl status kubelet

      Now we can see it is active (running), which is good.
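
      As an optional extra check, you can confirm that the service is also enabled to start on boot; if it is disabled, the same failure would reappear after a reboot.

      systemctl is-enabled kubelet
      systemctl enable kubelet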

    6. Return to controlplane

      exit
    7. Check nodes again

      kubectl get nodes

      It is good!
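
      If the node does not show Ready straight away, give it a few seconds; the kubelet needs to report status to the API server first. As an optional alternative to polling, you can block until the node becomes Ready:

      kubectl wait --for=condition=Ready node/node01 --timeout=90s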

  2. The cluster is broken again. Investigate and fix the issue.
    • Fix cluster
    1. Check the nodes

      kubectl get nodes

      We see that node01 has a status of NotReady. This usually means that communication with the node's kubelet has been lost.

    2. Go to the node and investigate

      ssh node01
    3. Check kubelet status

      systemctl status kubelet

      We can see from the output that it is crash-looping with the status activating (auto-restart); therefore this is likely a configuration issue.

    4. Check kubelet logs

      journalctl -u kubelet

      There is a lot of information; however, the error we are interested in, which is the cause of all the other errors, is this one:

      "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt: open /etc/kubernetes/pki/WRONG-CA-FILE.crt: no such file or directory"
      

      If kubelet cannot load its certificates, then it cannot authenticate with the API server. This is a fatal error, so kubelet exits.

    5. Check the indicated directory for certificates

      ls -l /etc/kubernetes/pki

      We see that it contains ca.crt, which we will assume is the correct certificate. Therefore we need to find the kubelet configuration file and correct the error there.
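
      If you want extra assurance that this is a valid CA certificate and not just a file with the right name, you can optionally inspect it with openssl; in a kubeadm cluster the subject is typically CN = kubernetes, and the dates should show it has not expired.

      openssl x509 -in /etc/kubernetes/pki/ca.crt -noout -subject -dates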

    6. Locate kubelet's configuration file

      kubelet runs as an operating system service, so its service unit file will give us that information.

      systemctl cat kubelet

      Note this line

      Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
      

      There is the config YAML file
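
      Rather than scrolling through the whole file, you can optionally locate the offending line (and its line number) first:

      grep -n clientCAFile /var/lib/kubelet/config.yaml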

    7. Fix configuration

      vi /var/lib/kubelet/config.yaml
      apiVersion: kubelet.config.k8s.io/v1beta1
      authentication:
        anonymous:
          enabled: false
        webhook:
          cacheTTL: 0s
          enabled: true
        x509:
          clientCAFile: /etc/kubernetes/pki/WRONG-CA-FILE.crt # <- Fix this
      authorization:
        mode: Webhook

      Note that you can perform the same edit with a single sed command. This is quicker than editing in vi.

      sed -i 's/WRONG-CA-FILE.crt/ca.crt/g' /var/lib/kubelet/config.yaml
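
      Either way, it is worth verifying the edit before moving on; the output should now reference /etc/kubernetes/pki/ca.crt.

      grep clientCAFile /var/lib/kubelet/config.yaml
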
    8. Check status

      Wait a few seconds, kubelet will be auto-restarted.

      systemctl status kubelet

      Now we can see it is active (running), which is good. If it is not, then you made a mistake when editing the config file: you probably broke the YAML syntax or did not edit the certificate filename correctly. Return to step 7 above and fix it.
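
      If you need to dig further, you can optionally follow the logs live while the service restarts (press Ctrl+C to stop following):

      journalctl -u kubelet -f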

    9. Return to controlplane

      exit
    10. Check nodes again

      kubectl get nodes

      It is good!

  3. The cluster is broken again. Investigate and fix the issue.
    • Fix cluster
    1. Check the nodes

      kubectl get nodes

      We see that node01 has a status of NotReady. This usually means that communication with the node's kubelet has been lost.

    2. Go to the node and investigate

      ssh node01
    3. Check kubelet status

      systemctl status kubelet

      We can see it is active (running); however, the API server still thinks there is an issue, so we must again look at the kubelet logs.

    4. Check kubelet logs

      journalctl -u kubelet

      There is a lot of information; however, the error we are interested in, which is the cause of all the other errors, is this one:

       "Unable to register node with API server" err="Post \"https://controlplane:6553/api/v1/nodes\": dial tcp 192.10.46.12:6553: connect: connection refused" node="node01"
      

      What do you know about the usual port for the API server? It's not 6553! The default is 6443. kubelet uses a kubeconfig file to connect to the API server just like kubectl does, so we need to locate and fix that file.
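
      As a quick optional check from node01, you can confirm that the API server really is listening on the default port 6443 (this assumes the hostname controlplane resolves on this node, as it does in the error message above):

      curl -k https://controlplane:6443/healthz

      A response of ok confirms the correct endpoint, whereas the same request against port 6553 is refused.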

    5. Locate kubelet's kubeconfig file

      kubelet runs as an operating system service, so its service unit file will give us that information.

      systemctl cat kubelet

      Note this line

      Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
      

      There are two kubeconfigs. The first one is used when a node is created and is joining the cluster. The second one is used for normal operation. It is therefore the second one we are interested in.
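
      You can optionally inspect the file with kubectl itself, which also confirms that it parses as a valid kubeconfig; note the server field, which carries the wrong port.

      kubectl config view --kubeconfig=/etc/kubernetes/kubelet.conf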

    6. Fix the kubeconfig

      Port should be 6443

      vi /etc/kubernetes/kubelet.conf
      apiVersion: v1
      clusters:
      - cluster:
          certificate-authority-data: REDACTED
          server: https://controlplane:6553  # <- Fix this
        name: default-cluster
      contexts:
      - context:
          cluster: default-cluster
          namespace: default
          user: default-auth
        name: default-context
      current-context: default-context
      kind: Config
      preferences: {}
      users:
      - name: default-auth
        user:
          client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
          client-key: /var/lib/kubelet/pki/kubelet-client-current.pem

      Note that you can perform the same edit with a single sed command. This is quicker than editing in vi.

      sed -i 's/6553/6443/g' /etc/kubernetes/kubelet.conf
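
      Again, it is worth verifying the edit before restarting anything; the server line should now read https://controlplane:6443.

      grep server: /etc/kubernetes/kubelet.conf
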
    7. Restart kubelet

      Since kubelet is already running (not crash-looping), we need to restart it so that it picks up the updated kubeconfig.

      systemctl restart kubelet
    8. Check status

      systemctl status kubelet

      Now we can see it is active (running), which is good. If it is not, then you made a mistake when editing the kubeconfig and probably broke the YAML syntax. Return to step 6 above and fix it.

    9. Return to controlplane

      exit
    10. Check nodes again

      kubectl get nodes

      It is good! If it is not, then you probably made a mistake setting the port number. Return to node01 and redo from step 6 above.