Commit c423c3a

Merge pull request #3057 from replicatedhq/106919

Adding EC troubleshooting page

2 parents cf45be1 + 953c27e commit c423c3a

6 files changed: +263 −67 lines changed
@@ -0,0 +1,46 @@
To access the cluster and use other included binaries:

1. SSH onto a controller node.

   :::note
   You cannot run the `shell` command on worker nodes.
   :::

1. Use the Embedded Cluster shell command to start a shell with access to the cluster:

   ```bash
   sudo ./APP_SLUG shell
   ```
   Where `APP_SLUG` is the unique slug for the application.

   The output looks similar to the following:
   ```
      __4___
    _ \ \ \ \   Welcome to APP_SLUG debug shell.
   <'\ /_/_/_/  This terminal is now configured to access your cluster.
    ((____!___/) Type 'exit' (or CTRL+d) to exit.
    \0\0\0\0\/   Happy hacking.
   ~~~~~~~~~~~
   root@alex-ec-1:/home/alex# export KUBECONFIG="/var/lib/embedded-cluster/k0s/pki/admin.conf"
   root@alex-ec-1:/home/alex# export PATH="$PATH:/var/lib/embedded-cluster/bin"
   root@alex-ec-1:/home/alex# source <(k0s completion bash)
   root@alex-ec-1:/home/alex# source <(cat /var/lib/embedded-cluster/bin/kubectl_completion_bash.sh)
   root@alex-ec-1:/home/alex# source /etc/bash_completion
   ```

   The appropriate kubeconfig is exported, and the location of useful binaries like kubectl and Replicated’s preflight and support-bundle plugins is added to PATH.

1. Use the available binaries as needed.

   **Example**:

   ```bash
   kubectl version
   ```
   ```
   Client Version: v1.29.1
   Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
   Server Version: v1.29.1+k0s
   ```

1. Type `exit` or **Ctrl + D** to exit the shell.
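If the `shell` command is not available, the configuration it performs can be reproduced by hand. A minimal sketch, assuming the default Embedded Cluster paths shown in the banner above (`EC_DIR` is an illustrative variable name, not part of the tooling):

```shell
# Sketch: reproduce what the debug shell configures, per the banner above.
# EC_DIR is an illustrative name; the paths are the defaults shown in the output.
EC_DIR=/var/lib/embedded-cluster
export KUBECONFIG="$EC_DIR/k0s/pki/admin.conf"
# Append the binaries directory to PATH only if it is not already present.
case ":$PATH:" in
  *":$EC_DIR/bin:"*) ;;
  *) export PATH="$PATH:$EC_DIR/bin" ;;
esac
echo "$KUBECONFIG"
```

After these exports, `kubectl`, `preflight`, and `support-bundle` on the node resolve against the embedded cluster.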
@@ -1,5 +1,6 @@
-Embedded Cluster includes a default support bundle spec that collects both host- and cluster-level information.
+Embedded Cluster includes a default support bundle spec that collects both host- and cluster-level information:
 
-The host-level information is useful for troubleshooting failures related to host configuration like DNS, networking, or storage problems. Cluster-level information includes details about the components provided by Replicated, such as the Admin Console and Embedded Cluster operator that manage install and upgrade operations. If the cluster has not installed successfully and cluster-level information is not available, then it is excluded from the bundle.
+* The host-level information is useful for troubleshooting failures related to host configuration like DNS, networking, or storage problems.
+* Cluster-level information includes details about the components provided by Replicated, such as the Admin Console and Embedded Cluster Operator that manage install and upgrade operations. If the cluster has not installed successfully and cluster-level information is not available, then it is excluded from the bundle.
 
 In addition to the host- and cluster-level details provided by the default Embedded Cluster spec, support bundles generated for Embedded Cluster installations also include app-level details provided by any custom support bundle specs that you included in the application release.

docs/partials/support-bundles/_generate-bundle-ec.mdx (+3 −5)
@@ -1,6 +1,4 @@
-There are different steps to generate a support bundle depending on the version of Embedded Cluster installed.
-
-### For Versions 1.17.0 and Later
+### Generate a Bundle For Versions 1.17.0 and Later
 
 For Embedded Cluster 1.17.0 and later, you can run the Embedded Cluster `support-bundle` command to generate a support bundle.
 
@@ -22,7 +20,7 @@ To generate a support bundle:
 
 Where `APP_SLUG` is the unique slug for the application.
 
-### For Versions Earlier Than 1.17.0
+### Generate a Bundle For Versions Earlier Than 1.17.0
 
 For Embedded Cluster versions earlier than 1.17.0, you can generate a support bundle from the shell using the kubectl support-bundle plugin.
 
@@ -42,7 +40,7 @@ To generate a bundle:
 The output looks similar to the following:
 
 ```bash
-__4___
+   __4___
  _ \ \ \ \   Welcome to APP_SLUG debug shell.
 <'\ /_/_/_/  This terminal is now configured to access your cluster.
 ((____!___/) Type 'exit' (or CTRL+d) to exit.
+202
@@ -0,0 +1,202 @@
import SupportBundleIntro from "../partials/support-bundles/_ec-support-bundle-intro.mdx"
import EmbeddedClusterSupportBundle from "../partials/support-bundles/_generate-bundle-ec.mdx"
import ShellCommand from "../partials/embedded-cluster/_shell-command.mdx"
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

# Troubleshooting Embedded Cluster

This topic provides information about troubleshooting Replicated Embedded Cluster installations. For more information about Embedded Cluster, including built-in extensions and architecture, see [Embedded Cluster Overview](/vendor/embedded-overview).

## Troubleshoot with Support Bundles

This section includes information about how to collect support bundles for Embedded Cluster installations. For more information about support bundles, see [About Preflight Checks and Support Bundles](/vendor/preflight-support-bundle-about).

### About the Default Embedded Cluster Support Bundle Spec

<SupportBundleIntro/>

<EmbeddedClusterSupportBundle/>

## View Logs

You can view logs for both Embedded Cluster and the k0s systemd service to help troubleshoot Embedded Cluster deployments.

### View Installation Logs for Embedded Cluster

To view installation logs for Embedded Cluster:

1. SSH onto a controller node.

1. Navigate to `/var/log/embedded-cluster` and open the `.log` file to view logs.

### View k0s Logs

You can use the journalctl command line tool to access logs for systemd services, including k0s. For more information about k0s, see the [k0s documentation](https://docs.k0sproject.io/stable/).

To use journalctl to view k0s logs:

1. SSH onto a controller node or a worker node.

1. Use journalctl to view logs for the k0s systemd service that was deployed by Embedded Cluster.

   **Examples:**

   ```bash
   journalctl -u k0scontroller
   ```
   ```bash
   journalctl -u k0sworker
   ```
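The journalctl output can be long, so a small filter helps surface problems. A minimal sketch, assuming a grep-style filter is acceptable (`filter_errors` is an illustrative helper, and the sample log lines are fabricated for the demonstration, not real k0s output):

```shell
# Sketch: surface error-level lines from k0s logs.
# filter_errors is an illustrative helper, not part of Embedded Cluster.
filter_errors() { grep -iE 'error|fatal|fail'; }

# Illustrative input; on a node you would pipe journalctl output instead:
#   journalctl -u k0scontroller --no-pager | filter_errors
printf 'level=info msg="started"\nlevel=error msg="sync failed"\n' | filter_errors
```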
## Access the Cluster

When troubleshooting, it can be useful to inspect the cluster and view logs using the kubectl command line tool. For additional suggestions related to troubleshooting applications, see [Troubleshooting Applications](https://kubernetes.io/docs/tasks/debug/debug-application/) in the Kubernetes documentation.

<ShellCommand/>

## Troubleshoot Errors

This section provides troubleshooting advice for common errors.

### Installation failure when NVIDIA GPU Operator is included as Helm extension {#nvidia}

#### Symptom

A release that includes the NVIDIA GPU Operator as a Helm extension fails to install.

#### Cause

If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail.

This is the result of a known issue with v24.9.x of the NVIDIA GPU Operator. For more information about the known issue, see [container-toolkit does not modify the containerd config correctly when there are multiple instances of the containerd binary](https://github.com/NVIDIA/nvidia-container-toolkit/issues/982) in the nvidia-container-toolkit repository in GitHub.

For more information about including the GPU Operator as a Helm extension, see [NVIDIA GPU Operator](/vendor/embedded-using#nvidia-gpu-operator) in _Using Embedded Cluster_.

#### Solution

To troubleshoot:

1. Remove any existing containerd services that are running on the host (such as those deployed by Docker).

1. Reset and reboot the node:

   ```bash
   sudo ./APP_SLUG reset
   ```
   Where `APP_SLUG` is the unique slug for the application.

   For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

1. Re-install with Embedded Cluster.

### Calico networking issues

#### Symptom

Symptoms of Calico networking issues can include:

* A pod is stuck in a CrashLoopBackOff state with failed health checks:

   ```
   Warning  Unhealthy  6h51m (x3 over 6h52m)   kubelet  Liveness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
   Warning  Unhealthy  6h51m (x19 over 6h52m)  kubelet  Readiness probe failed: Get "http://<ip:port>/readyz": dial tcp <ip:port>: connect: no route to host
   ...
   Unhealthy  pod/registry-dc699cbcf-pkkbr  Readiness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
   Unhealthy  pod/registry-dc699cbcf-pkkbr  Liveness probe failed: Get "https://<ip:port>/": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)
   ...
   ```

* The pod log contains an I/O timeout:

   ```
   server APIs: config.k8ssandra.io/v1beta1: Get \"https://***HIDDEN***:443/apis/config.k8ssandra.io/v1beta1\": dial tcp ***HIDDEN***:443: i/o timeout"}
   ```

#### Cause

Reasons can include:

* The pod CIDR and service CIDR overlap with the host network CIDR.

* Incorrect kernel parameter values.

* VXLAN traffic getting dropped. By default, Calico uses VXLAN as the overlay networking protocol with Always mode, which encapsulates all pod-to-pod traffic in VXLAN packets. If for some reason the VXLAN packets are filtered by the network, pods are not able to communicate with other pods.

#### Solution

<Tabs>
<TabItem value="overlap" label="Pod CIDR and service CIDR overlap with the host network CIDR" default>
To troubleshoot pod CIDR and service CIDR overlapping with the host network CIDR:
1. Run the following command to verify the pod and service CIDR:
   ```
   cat /etc/k0s/k0s.yaml | grep -i cidr
     podCIDR: 10.244.0.0/17
     serviceCIDR: 10.244.128.0/17
   ```
   The default pod CIDR is 10.244.0.0/16 and the default service CIDR is 10.96.0.0/12.

1. View pod network interfaces, excluding Calico interfaces, and ensure there are no overlapping CIDRs:
   ```
   ip route | grep -v cali
   default via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
   10.152.0.1 dev ens4 proto dhcp scope link src 10.152.0.4 metric 100
   blackhole 10.244.101.192/26 proto 80
   169.254.169.254 via 10.152.0.1 dev ens4 proto dhcp src 10.152.0.4 metric 100
   ```

1. Reset and reboot the installation:

   ```bash
   sudo ./APP_SLUG reset
   ```
   Where `APP_SLUG` is the unique slug for the application.

   For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

1. Reinstall the application with different CIDRs using the `--cidr` flag:

   ```bash
   sudo ./APP_SLUG install --license license.yaml --cidr 172.16.136.0/16
   ```

   For more information, see [Embedded Cluster Install Options](/reference/embedded-cluster-install).
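
The overlap check in the steps above can also be done programmatically. A minimal sketch of an IPv4 CIDR overlap test in shell (`ip_to_int` and `cidr_overlap` are illustrative helpers, not part of the Embedded Cluster tooling; the CIDR values come from the examples above):

```shell
# Sketch: check whether two IPv4 CIDRs overlap.
# ip_to_int and cidr_overlap are illustrative helpers.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

cidr_overlap() {
  # Two CIDRs overlap iff their networks agree under the shorter (less specific) prefix.
  len1=${1#*/}; len2=${2#*/}
  len=$(( len1 < len2 ? len1 : len2 ))
  mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "${1%/*}") & mask )) -eq $(( $(ip_to_int "${2%/*}") & mask )) ]
}

cidr_overlap 10.244.0.0/16 10.244.128.0/17 && echo "overlap"
cidr_overlap 10.244.0.0/16 10.96.0.0/12 || echo "disjoint"
```

Run the function against the host routes from `ip route` and the pod/service CIDRs from `k0s.yaml` to confirm whether a `--cidr` change is needed.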
</TabItem>
<TabItem value="kernel" label="Incorrect kernel parameter values">
Embedded Cluster 1.19.0 and later automatically sets the `net.ipv4.conf.default.arp_filter`, `net.ipv4.conf.default.arp_ignore`, and `net.ipv4.ip_forward` kernel parameters. Additionally, host preflight checks automatically run during installation to verify that the kernel parameters were set correctly. For more information about the Embedded Cluster preflight checks, see [About Host Preflight Checks](/vendor/embedded-using#about-host-preflight-checks) in _Using Embedded Cluster_.

If kernel parameters are not set correctly and these preflight checks fail, you might see a message such as `IP forwarding must be enabled.` or `ARP filtering must be disabled by default for newly created interfaces.`.

To troubleshoot incorrect kernel parameter values:

1. Use sysctl to set the kernel parameters to the correct values:

   ```bash
   echo "net.ipv4.conf.default.arp_filter=0" >> /etc/sysctl.d/99-embedded-cluster.conf
   echo "net.ipv4.conf.default.arp_ignore=0" >> /etc/sysctl.d/99-embedded-cluster.conf
   echo "net.ipv4.ip_forward=1" >> /etc/sysctl.d/99-embedded-cluster.conf

   sysctl --system
   ```

1. Reset and reboot the installation:

   ```bash
   sudo ./APP_SLUG reset
   ```
   Where `APP_SLUG` is the unique slug for the application.
   For more information, see [Reset a Node](/vendor/embedded-using#reset-a-node) in _Using Embedded Cluster_.

1. Re-install with Embedded Cluster.
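
Before resetting, you can confirm the current values on the host. A minimal sketch (`check_param` is an illustrative helper; the expected values are the ones listed above):

```shell
# Sketch: report whether each kernel parameter matches the expected value.
# check_param is an illustrative helper, not part of Embedded Cluster.
check_param() {
  actual=$(sysctl -n "$1" 2>/dev/null) || actual="unreadable"
  if [ "$actual" = "$2" ]; then
    echo "ok: $1=$actual"
  else
    echo "CHECK: $1=$actual (expected $2)"
  fi
}

check_param net.ipv4.conf.default.arp_filter 0
check_param net.ipv4.conf.default.arp_ignore 0
check_param net.ipv4.ip_forward 1
```

Any line flagged `CHECK` indicates a parameter that the sysctl steps above need to correct.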
</TabItem>
<TabItem value="vxlan" label="VXLAN traffic dropped">

As a temporary troubleshooting measure, set the VXLAN mode to CrossSubnet and see if the issue persists. This mode encapsulates only traffic between pods across different subnets with VXLAN.

```bash
kubectl patch ippool default-ipv4-ippool --type=merge -p '{"spec": {"vxlanMode": "CrossSubnet"}}'
```

If this resolves the connectivity issues, there is likely an underlying network configuration problem with VXLAN traffic that should be addressed.
</TabItem>
</Tabs>

docs/vendor/embedded-using.mdx (+8 −60)
@@ -1,7 +1,6 @@
 import UpdateOverview from "../partials/embedded-cluster/_update-overview.mdx"
-import SupportBundleIntro from "../partials/support-bundles/_ec-support-bundle-intro.mdx"
-import EmbeddedClusterSupportBundle from "../partials/support-bundles/_generate-bundle-ec.mdx"
 import EcConfig from "../partials/embedded-cluster/_ec-config.mdx"
+import ShellCommand from "../partials/embedded-cluster/_shell-command.mdx"
 
 # Using Embedded Cluster
 
@@ -168,58 +167,13 @@ For more information about updating, see [Performing Updates with Embedded Clust
 
 ## Access the Cluster
 
-With Embedded Cluster, end-users are rarely supposed to need to use the CLI. Typical workflows, like updating the application and the cluster, are driven through the Admin Console.
+With Embedded Cluster, end users rarely need to use the CLI. Typical workflows, like updating the application and the cluster, can be done through the Admin Console. Nonetheless, there are times when vendors or their customers need to use the CLI for development or troubleshooting.
 
-Nonetheless, there are times when vendors or their customers need to use the CLI for development or troubleshooting.
-
-To access the cluster and use other included binaries:
-
-1. SSH onto a controller node.
-
-1. Use the Embedded Cluster shell command to start a shell with access to the cluster:
-
-   ```
-   sudo ./APP_SLUG shell
-   ```
-
-   The output looks similar to the following:
-   ```
-   __4___
-   _ \ \ \ \ Welcome to APP_SLUG debug shell.
-   <'\ /_/_/_/ This terminal is now configured to access your cluster.
-   ((____!___/) Type 'exit' (or CTRL+d) to exit.
-   \0\0\0\0\/ Happy hacking.
-   ~~~~~~~~~~~
-   root@alex-ec-2:/home/alex# export KUBECONFIG="/var/lib/embedded-cluster/k0s/pki/admin.conf"
-   root@alex-ec-2:/home/alex# export PATH="$PATH:/var/lib/embedded-cluster/bin"
-   root@alex-ec-2:/home/alex# source <(kubectl completion bash)
-   root@alex-ec-2:/home/alex# source /etc/bash_completion
-   ```
-
-   The appropriate kubeconfig is exported, and the location of useful binaries like kubectl and Replicated’s preflight and support-bundle plugins is added to PATH.
-
-   :::note
-   You cannot run the `shell` command on worker nodes.
-   :::
-
-1. Use the available binaries as needed.
-
-   **Example**:
-
-   ```bash
-   kubectl version
-   ```
-   ```
-   Client Version: v1.29.1
-   Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
-   Server Version: v1.29.1+k0s
-   ```
-
-1. Type `exit` or **Ctrl + D** to exit the shell.
+:::note
+If you encounter a typical workflow where your customers have to use the Embedded Cluster shell, reach out to Alex Parker at [email protected]. These workflows might be candidates for additional Admin Console functionality.
+:::
 
-:::note
-If you encounter a typical workflow where your customers have to use the Embedded Cluster shell, reach out to Alex Parker at [email protected]. These workflows might be candidates for additional Admin Console functionality.
-:::
+<ShellCommand/>
 
 ## Reset a Node

@@ -281,13 +235,7 @@ Using the NVIDIA GPU Operator with Embedded Cluster requires configuring the con
 When the containerd options are configured as shown above, the NVIDIA GPU Operator automatically creates the required configurations in the `/etc/k0s/containerd.d/nvidia.toml` file. It is not necessary to create this file manually, or modify any other configuration on the hosts.
 
 :::note
-If you include the NVIDIA GPU Operator as a Helm extension, remove any existing containerd services that are running on the host (such as those deployed by Docker) before attempting to install the release with Embedded Cluster. If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail.
+If you include the NVIDIA GPU Operator as a Helm extension, remove any existing containerd services that are running on the host (such as those deployed by Docker) before attempting to install the release with Embedded Cluster. If there are any containerd services on the host, the NVIDIA GPU Operator will generate an invalid containerd config, causing the installation to fail. For more information, see [Installation failure when NVIDIA GPU Operator is included as Helm extension](#nvidia) in _Troubleshooting Embedded Cluster_.
 
 This is the result of a known issue with v24.9.x of the NVIDIA GPU Operator. For more information about the known issue, see [container-toolkit does not modify the containerd config correctly when there are multiple instances of the containerd binary](https://github.com/NVIDIA/nvidia-container-toolkit/issues/982) in the nvidia-container-toolkit repository in GitHub.
-:::
-
-## Troubleshoot with Support Bundles
-
-<SupportBundleIntro/>
-
-<EmbeddedClusterSupportBundle/>
+:::

sidebars.js (+1)
@@ -245,6 +245,7 @@ const sidebars = {
 },
 'enterprise/embedded-manage-nodes',
 'enterprise/updating-embedded',
+'vendor/embedded-troubleshooting',
 'enterprise/embedded-tls-certs',
 'vendor/embedded-disaster-recovery',
 ],
