import SupportBundleIntro from "../partials/support-bundles/_ec-support-bundle-intro.mdx"
import EmbeddedClusterSupportBundle from "../partials/support-bundles/_generate-bundle-ec.mdx"
import ShellCommand from "../partials/embedded-cluster/_shell-command.mdx"
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Troubleshooting Embedded Cluster

This topic provides information about troubleshooting Replicated Embedded Cluster installations. For more information about Embedded Cluster, including built-in extensions and architecture, see [Embedded Cluster Overview](/vendor/embedded-overview).

## Troubleshoot with Support Bundles

This section includes information about how to collect support bundles for Embedded Cluster installations. For more information about support bundles, see [About Preflight Checks and Support Bundles](/vendor/preflight-support-bundle-about).

### About the Default Embedded Cluster Support Bundle Spec

<SupportBundleIntro/>

<EmbeddedClusterSupportBundle/>


## View Logs

You can view logs for both Embedded Cluster and the k0s systemd service to help troubleshoot Embedded Cluster deployments.

### View Installation Logs for Embedded Cluster

To view installation logs for Embedded Cluster:

1. SSH onto a controller node.

1. Navigate to `/var/log/embedded-cluster` and open the `.log` file to view logs.
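
   For example, you can list the directory and page through the most recent log file (file names vary by installation, so the glob below is only an illustration):

   ```bash
   ls -lt /var/log/embedded-cluster/
   less /var/log/embedded-cluster/*.log
   ```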

### View k0s Logs

You can use the journalctl command line tool to access logs for systemd services, including k0s. For more information about k0s, see the [k0s documentation](https://docs.k0sproject.io/stable/).

To use journalctl to view k0s logs:

1. SSH onto a controller node or a worker node.

1. Use journalctl to view logs for the k0s systemd service that was deployed by Embedded Cluster.

   **Examples:**

   ```bash
   journalctl -u k0scontroller
   ```

   ```bash
   journalctl -u k0sworker
   ```
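
   To follow the logs live or limit output to a recent window, you can add standard journalctl options, for example:

   ```bash
   journalctl -u k0scontroller -f --since "1 hour ago"
   ```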

## Access the Cluster

When troubleshooting, it can be useful to inspect cluster resources and view logs using the kubectl command line tool. For additional suggestions related to troubleshooting applications, see [Troubleshooting Applications](https://kubernetes.io/docs/tasks/debug/debug-application/) in the Kubernetes documentation.

<ShellCommand/>
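
After you open a shell with kubectl access, a quick look at overall cluster health is often a useful starting point, for example:

```bash
kubectl get nodes
kubectl get pods -A
```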

## Troubleshoot Errors

This section provides troubleshooting advice for common errors.

### Installation failure when NVIDIA GPU Operator is included as Helm extension {#nvidia}

#### Symptom

A release that includes the NVIDIA GPU Operator as a Helm extension fails to install.

#### Cause

If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail.

This is the result of a known issue with v24.9.x of the NVIDIA GPU Operator. For more information about the known issue, see [container-toolkit does not modify the containerd config correctly when there are multiple instances of the containerd binary](https://github.com/NVIDIA/nvidia-container-toolkit/issues/982) in the nvidia-container-toolkit repository in GitHub.

For more information about including the GPU Operator as a Helm extension, see [NVIDIA GPU Operator](/vendor/embedded-using#nvidia-gpu-operator) in _Using Embedded Cluster_.

#### Solution

To troubleshoot:

1. Remove any existing containerd services that are running on the host (such as those deployed by Docker).
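
   For example, if Docker installed the extra containerd service, you might stop and disable it before resetting. The exact units to stop depend on how containerd was installed on your host:

   ```bash
   sudo systemctl disable --now docker docker.socket containerd
   ```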

1. Reset and reboot the node:

   ```bash
   sudo ./APP_SLUG reset
   ```
   Where `APP_SLUG` is the unique slug for the application.

   For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

1. Re-install with Embedded Cluster.
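
   For example, the basic installation command looks like the following, where `APP_SLUG` is the unique slug for the application and `license.yaml` is the customer license file:

   ```bash
   sudo ./APP_SLUG install --license license.yaml
   ```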

### Calico networking issues

#### Symptom

Symptoms of Calico networking issues can include:

* A pod is stuck in a CrashLoopBackOff state with failed health checks:

  ```
  Warning Unhealthy 6h51m (x3 over 6h52m) kubelet Liveness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
  Warning Unhealthy 6h51m (x19 over 6h52m) kubelet Readiness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
  ....
  Unhealthy pod/registry-dc699cbcf-pkkbr Readiness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  Unhealthy pod/registry-dc699cbcf-pkkbr Liveness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
  ...
  ```

* A pod log contains an I/O timeout:

  ```
  server APIs: config.k8ssandra.io/v1beta1: Get \"https://***HIDDEN***:443/apis/config.k8ssandra.io/v1beta1\": dial tcp ***HIDDEN***:443: i/o timeout"}
  ```
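
To see these events and confirm whether the Calico pods themselves are healthy, you can use standard kubectl commands, for example (the pod name and namespace are placeholders, and the Calico label and namespace can vary by version):

```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get pods -n kube-system -l k8s-app=calico-node -o wide
```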

#### Cause

Reasons can include:

* Pod CIDR and service CIDR overlap with the host network CIDR.

* Incorrect kernel parameter values.

* VXLAN traffic getting dropped. By default, Calico uses VXLAN as the overlay networking protocol, in Always mode. This mode encapsulates all pod-to-pod traffic in VXLAN packets. If the VXLAN packets are filtered by the network for some reason, pods will not be able to communicate with other pods.

#### Solution

<Tabs>
  <TabItem value="overlap" label="Pod CIDR and service CIDR overlap with the host network CIDR" default>
    To troubleshoot pod CIDR and service CIDR overlapping with the host network CIDR:
    1. Run the following command to verify the pod and service CIDR:
    ```
    cat /etc/k0s/k0s.yaml | grep -i cidr
    podCIDR: 10.244.0.0/17
    serviceCIDR: 10.244.128.0/17
    ```
    The default pod CIDR is 10.244.0.0/16 and service CIDR is 10.96.0.0/12.

    1. View the host's routes, excluding Calico interfaces, and ensure that none of them overlap with the pod and service CIDRs:
    ```
    ip route | grep -v cali
    default via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
    10.152.0.1 dev ens4 proto dhcp scope link src 10.152.0.4 metric 100
    blackhole 10.244.101.192/26 proto 80
    169.254.169.254 via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
    ```

    1. Reset and reboot the installation:

    ```bash
    sudo ./APP_SLUG reset
    ```
    Where `APP_SLUG` is the unique slug for the application.

    For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

    1. Reinstall the application with different CIDRs using the `--cidr` flag:

    ```bash
    sudo ./APP_SLUG install --license license.yaml --cidr 172.16.136.0/16
    ```

    For more information, see [Embedded Cluster Install Options](/reference/embedded-cluster-install).
  </TabItem>
| 165 | + <TabItem value="kernel" label="Incorrect kernel parameter values"> |
| 166 | + Embedded Cluster 1.19.0 and later automatically sets the `net.ipv4.conf.default.arp_filter`, `net.ipv4.conf.default.arp_ignore`, and `net.ipv4.ip_forward` kernel parameters. Additionally, host preflight checks automatically run during installation to verify that the kernel parameters were set correctly. For more information about the Embedded Cluster preflight checks, see [About Host Preflight Checks](/vendor/embedded-using#about-host-preflight-checks) in _Using Embedded Cluster_. |
| 167 | + |
| 168 | + If kernel parameters are not set correctly and these preflight checks fail, you might see a message such as `IP forwarding must be enabled.` or `ARP filtering must be disabled by default for newly created interfaces.`. |
| 169 | + |
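    You can confirm the current values on the host before changing anything, for example:

    ```bash
    sysctl net.ipv4.conf.default.arp_filter net.ipv4.conf.default.arp_ignore net.ipv4.ip_forward
    ```
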
    To troubleshoot incorrect kernel parameter values:

    1. Use sysctl to set the kernel parameters to the correct values:

    ```bash
    echo "net.ipv4.conf.default.arp_filter=0" >> /etc/sysctl.d/99-embedded-cluster.conf
    echo "net.ipv4.conf.default.arp_ignore=0" >> /etc/sysctl.d/99-embedded-cluster.conf
    echo "net.ipv4.ip_forward=1" >> /etc/sysctl.d/99-embedded-cluster.conf

    sysctl --system
    ```

    1. Reset and reboot the installation:

    ```bash
    sudo ./APP_SLUG reset
    ```
    Where `APP_SLUG` is the unique slug for the application.

    For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

    1. Re-install with Embedded Cluster.
  </TabItem>
| 192 | + <TabItem value="vxlan" label="VXLAN traffic dropped"> |
| 193 | + |
| 194 | + As a temporary troubleshooting measure, set the mode to CrossSubnet and see if the issue persists. This mode only encapsulates traffic between pods across different subnets with VXLAN. |
| 195 | + |
| 196 | + ```bash |
| 197 | + kubectl patch ippool default-ipv4-ippool --type=merge -p '{"spec": {"vxlanMode": "CrossSubnet"}}' |
| 198 | + ``` |
| 199 | + |
| 200 | + If this resolves the connectivity issues, there is likely an underlying network configuration problem with VXLAN traffic that should be addressed. |
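
    You can also read the current mode back from the IPPool resource to confirm the change was applied, for example:

    ```bash
    kubectl get ippool default-ipv4-ippool -o jsonpath='{.spec.vxlanMode}{"\n"}'
    ```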
  </TabItem>
</Tabs>