
Commit 71c99a3

Add troubleshooting of node maintenance mode
Signed-off-by: Jian Wang <[email protected]>
1 parent f01d373 commit 71c99a3

File tree

8 files changed

+294 -2 lines changed

docs/host/host.md

+13-1
@@ -25,6 +25,16 @@ For admin users, you can click **Enable Maintenance Mode** to evict all VMs from
![node-maintenance.png](/img/v1.2/host/node-maintenance.png)

After a while, the target node enters maintenance mode successfully.

![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png)

:::note important

Review the [known limitations and workarounds](../troubleshooting/host.md#an-enable-maintenance-mode-node-is-stuck-in-the-cordoned-state) before you enable maintenance mode, or when you encounter issues with it.

:::

## Cordoning a Node

Cordoning a node marks it as unschedulable. This feature is useful for performing short tasks on the node during small maintenance windows, like reboots, upgrades, or decommissions. When you’re done, power back on and make the node schedulable again by uncordoning it.
@@ -42,6 +52,8 @@ Before removing a node from a Harvester cluster, determine if the remaining node

If the remaining nodes do not have enough resources, VMs might fail to migrate and volumes might degrade when you remove a node.

If you have volumes created from a customized `StorageClass` with the [Number of Replicas](../advanced/storageclass.md#number-of-replicas) set to **1**, back up those single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes. Otherwise, those volumes cannot be rebuilt or restored from other nodes after this node is removed.

:::

### 1. Check if the node can be removed from the cluster.
@@ -522,4 +534,4 @@ status:
```

The `harvester-node-manager` pod(s) in the `harvester-system` namespace may also contain some hints as to why it is not rendering a file to a node.
This pod is part of a daemonset, so it may be worth checking the pod that is running on the node of interest.

docs/troubleshooting/host.md

+134
@@ -0,0 +1,134 @@
---
sidebar_position: 6
sidebar_label: Host
title: "Host"
---

<head>
<link rel="canonical" href="https://docs.harvesterhci.io/v1.3/troubleshooting/host"/>
</head>

## An enable-maintenance-mode Node Is Stuck in the Cordoned State

After you click **Enable Maintenance Mode** on a Harvester host, the target host is stuck in the `Cordoned` state: the **Enable Maintenance Mode** menu item becomes available again, and the expected **Disable Maintenance Mode** menu item does not appear.

![node-stuck-cordoned.png](/img/v1.3/troubleshooting/node-stuck-cordoned.png)

When you check the Harvester pod logs, you see repeated messages like:
```
time="2024-08-05T19:03:02Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7"
time="2024-08-05T19:03:02Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."

time="2024-08-05T19:03:07Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7"
time="2024-08-05T19:03:07Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."

time="2024-08-05T19:03:12Z" level=info msg="evicting pod longhorn-system/instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7"
time="2024-08-05T19:03:12Z" level=info msg="error when evicting pods/\"instance-manager-68cd2514dd3f6d59b95cbd865d5b08f7\" -n \"longhorn-system\" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget."
```
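
You can also tail these logs directly from the CLI. The following is a minimal sketch; it assumes `kubectl` access to the cluster and that the Harvester pods belong to a deployment named `harvester` in the `harvester-system` namespace:

```
# Follow the Harvester logs and filter for the eviction errors shown above.
# The deployment name `harvester` is an assumption; adjust it to your installation.
kubectl -n harvester-system logs deploy/harvester -f | grep "error when evicting"
```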

The Longhorn `instance-manager` pod uses a PodDisruptionBudget (PDB) to protect itself from accidental eviction and to avoid data loss on volumes. When this error occurs, the `instance-manager` pod is still serving some volumes or replicas.
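
To see which PodDisruptionBudgets are currently protecting the `instance-manager` pods, you can list them from the CLI (a sketch, assuming `kubectl` access; the `instance-manager` name pattern is an assumption based on the log above):

```
# List Longhorn PodDisruptionBudgets and keep only the instance-manager entries
kubectl -n longhorn-system get poddisruptionbudgets | grep instance-manager
```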

There are some known causes, each with a related workaround.

### The Manually Attached Volume

When a Longhorn volume is attached to a host from the [Embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards), the volume causes the error above.

You can check this from the [Embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards).

![attached-volume.png](/img/v1.3/troubleshooting/attached-volume.png)

A manually attached volume is attached to a node name instead of a pod name.

You can also check this from the CLI by fetching the `VolumeAttachment` CRD objects.
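
For example (a sketch, assuming `kubectl` access; `<volume-name>` is a placeholder for the object you want to inspect):

```
# List the Longhorn VolumeAttachment objects, then inspect one of them
kubectl -n longhorn-system get volumeattachments.longhorn.io
kubectl -n longhorn-system get volumeattachments.longhorn.io <volume-name> -o yaml
```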

A volume attached via the Longhorn UI:

```
- apiVersion: longhorn.io/v1beta2
  kind: VolumeAttachment
  ...
  spec:
    attachmentTickets:
      longhorn-ui:
        id: longhorn-ui
        nodeID: node-name
    ...
    volume: pvc-9b35136c-f59e-414b-aa55-b84b9b21ff89
```

A volume attached by the CSI driver:

```
- apiVersion: longhorn.io/v1beta2
  kind: VolumeAttachment
  spec:
    attachmentTickets:
      csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf:
        id: csi-b5097155cddde50b4683b0e659923e379cbfc3873b5b2ee776deb3874102e9bf
        nodeID: node-name
    ...
    volume: pvc-3c6403cd-f1cd-4b84-9b46-162f746b9667
```

:::note

Manually attaching a volume to a host is not recommended.

:::

#### Workaround 1: Set the Longhorn Option `Detach Manually Attached Volumes When Cordoned` to True

The Longhorn option [Detach Manually Attached Volumes When Cordoned](https://longhorn.io/docs/1.6.0/references/settings/#detach-manually-attached-volumes-when-cordoned) defaults to `false`, so the node drain is blocked when there is any manually attached volume. Setting it to `true` automatically detaches such volumes when the node is cordoned.

This option is available starting with Harvester v1.3.1, which embeds Longhorn v1.6.0.

From Harvester v1.4.0, this option is set to `true` by default.
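
If you prefer the CLI over the Longhorn UI, the option can also be changed through the Longhorn `Setting` custom resource. This is a sketch; it assumes the setting name matches the documentation anchor, `detach-manually-attached-volumes-when-cordoned`:

```
# Check the current value, then set it to true
# (the setting name is assumed to match the Longhorn documentation anchor)
kubectl -n longhorn-system get settings.longhorn.io detach-manually-attached-volumes-when-cordoned
kubectl -n longhorn-system patch settings.longhorn.io detach-manually-attached-volumes-when-cordoned \
  --type merge -p '{"value":"true"}'
```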

#### Workaround 2: Manually Detach the Volume

Detach the volume from the [Embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards).

![detached-volume.png](/img/v1.3/troubleshooting/detached-volume.png)

After that, the node enters maintenance mode successfully.

![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png)

### The Single-replica Volume

Harvester supports defining customized `StorageClass` objects, and in some scenarios the [Number of Replicas](../advanced/storageclass.md#number-of-replicas) can even be set to 1.

When such a volume has ever been attached to a host by the CSI driver or by other means, the last and only replica stays on that node, even after the volume is detached.

You can check this from the `Volume` CRD object.
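
For example, you can list every Longhorn volume together with its replica count, state, and owner node to spot single-replica volumes (a sketch, assuming `kubectl` access):

```
# Show each Longhorn volume's replica count, state, and current owner node
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas,STATE:.status.state,OWNER:.status.ownerID
```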

```
- apiVersion: longhorn.io/v1beta2
  kind: Volume
  ...
  spec:
    ...
    numberOfReplicas: 1   # the replica number
    ...
  status:
    ...
    ownerID: nodeName
    ...
    state: attached
```

#### Workaround: Set the Longhorn Option `Node Drain Policy`

The Longhorn [Node Drain Policy](https://longhorn.io/docs/1.6.0/references/settings/#node-drain-policy) defaults to `block-if-contains-last-replica`: Longhorn blocks the drain when the node contains the last healthy replica of a volume.

Setting this option to `allow-if-replica-is-stopped` from the [Embedded Longhorn UI](./harvester.md#access-embedded-rancher-and-longhorn-dashboards) solves this issue.
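
Alternatively, the policy can be changed from the CLI through the Longhorn `Setting` custom resource (a sketch; the setting name `node-drain-policy` is assumed to match the documentation anchor):

```
# Check the current policy, then switch it to allow-if-replica-is-stopped
kubectl -n longhorn-system get settings.longhorn.io node-drain-policy
kubectl -n longhorn-system patch settings.longhorn.io node-drain-policy \
  --type merge -p '{"value":"allow-if-replica-is-stopped"}'
```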

:::note important

If you plan to remove this node after it enters maintenance mode, it is recommended to back up those single-replica volumes or redeploy the related workloads to other nodes in advance so that the volumes are scheduled to other nodes. Otherwise, those volumes cannot be rebuilt or restored from other nodes after this node is removed.

:::

From Harvester v1.4.0, this option is set to `allow-if-replica-is-stopped` by default.

versioned_docs/version-v1.3/host/host.md

+13-1
@@ -25,6 +25,16 @@ For admin users, you can click **Enable Maintenance Mode** to evict all VMs from
![node-maintenance.png](/img/v1.2/host/node-maintenance.png)

After a while, the target node enters maintenance mode successfully.

![node-enter-maintenance-mode.png](/img/v1.3/troubleshooting/node-enter-maintenance-mode.png)

:::note important

Review the [known limitations and workarounds](../troubleshooting/host.md#an-enable-maintenance-mode-node-is-stuck-in-the-cordoned-state) before you enable maintenance mode, or when you encounter issues with it.

:::

## Cordoning a Node

Cordoning a node marks it as unschedulable. This feature is useful for performing short tasks on the node during small maintenance windows, like reboots, upgrades, or decommissions. When you’re done, power back on and make the node schedulable again by uncordoning it.
@@ -42,6 +52,8 @@ Before removing a node from a Harvester cluster, determine if the remaining node

If the remaining nodes do not have enough resources, VMs might fail to migrate and volumes might degrade when you remove a node.

If you have volumes created from a customized `StorageClass` with the [Number of Replicas](../advanced/storageclass.md#number-of-replicas) set to **1**, back up those single-replica volumes in advance. Otherwise, those volumes cannot be rebuilt or restored from other nodes after this node is removed.

:::

### 1. Check if the node can be removed from the cluster.
@@ -522,4 +534,4 @@ status:
```

The `harvester-node-manager` pod(s) in the `harvester-system` namespace may also contain some hints as to why it is not rendering a file to a node.
This pod is part of a daemonset, so it may be worth checking the pod that is running on the node of interest.
