Commit efda82f

Add troubleshooting of Storage Network (#606)
Signed-off-by: Jian Wang <[email protected]>
1 parent 1b102dc commit efda82f

2 files changed

+183
-0
lines changed

docs/advanced/storagenetwork.md

Lines changed: 91 additions & 0 deletions
@@ -290,6 +290,97 @@ metadata:
Omitted...
```

#### Step 4

The storage network is dedicated to [internal communication between Longhorn pods](#same-physical-interfaces), which results in high performance and reliability. However, the storage network still relies on the [external network infrastructure](../networking/deep-dive.md#external-networking) for connectivity (similar to how the [VM VLAN network](../networking/harvester-network.md#create-a-vm-with-vlan-network) functions). When the external network is not connected and configured correctly, you may encounter the following issues:

- A newly created VM becomes stuck in the `Not-Ready` state.

- The `longhorn-manager` pod logs include error messages.

Example:

```
longhorn-manager-j6dhh/longhorn-manager.log:2024-03-20T16:25:24.662251001Z time="2024-03-20T16:25:24Z" level=error msg="Failed rebuilding of replica 10.0.16.26:10000" controller=longhorn-engine engine=pvc-0a151c59-ffa9-4938-9c86-59ebb296bc88-e-c2a7fe77 error="proxyServer=10.52.6.33:8501 destination=10.0.16.23:10000: failed to add replica tcp://10.0.16.26:10000 for volume: rpc error: code = Unknown desc = failed to get replica 10.0.16.26:10000: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.16.26:10000: connect: no route to host\"" node=oml-harvester-9 volume=pvc-0a151c59-ffa9-4938-9c86-59ebb296bc88
```
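
When many such errors appear, it can help to reduce the log to the list of addresses that could not be reached. The following is a minimal sketch, not an official tool: the function name `unreachable_targets` is hypothetical, and the sample line is a shortened copy of the log entry above.

```shell
# Hypothetical helper: print each address that failed with "no route to host".
# Reads longhorn-manager log lines from standard input.
unreachable_targets() {
  grep 'no route to host' |
    grep -oE 'dial tcp [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+' |
    awk '{ print $3 }' |
    sort -u
}

# Sample input, shortened from the log entry shown above.
sample='level=error msg="Failed rebuilding of replica 10.0.16.26:10000" error="dial tcp 10.0.16.26:10000: connect: no route to host"'
printf '%s\n' "$sample" | unreachable_targets
```

On a live cluster, the same function could be fed from `kubectl logs -n longhorn-system <longhorn-manager pod name>`. Addresses in the storage network range that show up here are the ones worth testing in the steps that follow.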

To test the communication between Longhorn pods, perform the following steps:

4.1 Obtain the storage network IP of each Longhorn Instance Manager pod identified in the previous step.

Example:

```
instance-manager-r-43f1624d14076e1d95cd72371f0316e2
storage network IP: 10.0.16.8

instance-manager-e-ba38771e483008ce61249acf9948322f
storage network IP: 10.0.16.14
```

4.2 Log in to those pods.

When you run the command `ip addr`, the output includes IPs that are identical to IPs in the pod annotations. In the following example, one IP is for the pod network, while the other is for the storage network.

Example:

```
$ kubectl exec -i -t -n longhorn-system instance-manager-e-ba38771e483008ce61249acf9948322f -- /bin/sh

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
...
3: eth0@if2277: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 0e:7c:d6:77:44:72 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.52.6.146/32 scope global eth0
...
4: lhnet1@if2278: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether fe:92:4f:fb:dd:20 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.16.14/20 brd 10.0.31.255 scope global lhnet1
...

$ ip route
default via 169.254.1.1 dev eth0
10.0.16.0/20 dev lhnet1 proto kernel scope link src 10.0.16.14
169.254.1.1 dev eth0 scope link
```
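
The `ip route` output shows that `10.0.16.0/20` is reached directly through `lhnet1`, so a peer is only on-link if its storage network IP falls inside that range. The following is a minimal sketch of that check in plain shell arithmetic. The helper names `ip_to_int` and `in_subnet` are hypothetical and are not part of Harvester or Longhorn; the addresses come from the examples above.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  old_ifs=$IFS
  IFS=.
  set -- $1          # split the address into its four octets
  IFS=$old_ifs
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 lies inside network $2 with prefix length $3.
in_subnet() {
  mask=$(( (4294967295 << (32 - $3)) & 4294967295 ))
  [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}

# Two storage network IPs and one pod network IP from the examples above.
for peer in 10.0.16.8 10.0.16.26 10.52.6.146; do
  if in_subnet "$peer" 10.0.16.0 20; then
    echo "$peer: on-link via lhnet1"
  else
    echo "$peer: not in 10.0.16.0/20"
  fi
done
```

An address that fails this check is not reached through the `lhnet1` link-scope route. An address that passes it but is still unreachable points to the external network instead.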

4.3 Start a simple HTTP server in one pod.

Example:

```
$ python3 -m http.server 8000 --bind 10.0.16.14  # Replace 10.0.16.14 with the storage network IP of your pod
```

:::note

Explicitly bind the simple HTTP server to the storage network IP.

:::

4.4 Test the HTTP server in another pod.

Example:

```
# Run from instance-manager-r-43f1624d14076e1d95cd72371f0316e2 (IP 10.0.16.8)

$ curl http://10.0.16.14:8000
```

When the storage network is functioning correctly, the `curl` command returns a list of files on the HTTP server.
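
When several pods run a test server, the same check can be repeated for each storage network IP. The following is a rough sketch under a few assumptions: the function name `probe_peers` is hypothetical, port 8000 matches the `http.server` example, and the three-second connection timeout is an arbitrary choice so that an unreachable peer does not block the loop.

```shell
# Hypothetical helper: probe the test HTTP server on each given IP.
# Returns a non-zero status if at least one peer is unreachable.
probe_peers() {
  failed=0
  for ip in "$@"; do
    if curl --silent --output /dev/null --connect-timeout 3 "http://$ip:8000"; then
      echo "$ip: reachable"
    else
      echo "$ip: unreachable"
      failed=1
    fi
  done
  return "$failed"
}
```

For example, running `probe_peers 10.0.16.14` inside another instance manager pod reports whether the server started in step 4.3 can be reached over the storage network.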

4.5 (Optional) Troubleshoot issues.

The storage network may malfunction because of issues with the external network, such as the following:

- Physical NICs (installed on Harvester nodes) that are associated with the storage network were not added to the same VLAN in the external switches.

- The external switches are not correctly connected and configured.

### Start VM Manually

versioned_docs/version-v1.3/advanced/storagenetwork.md

Lines changed: 92 additions & 0 deletions
@@ -290,6 +290,98 @@ metadata:
Omitted...
```

#### Step 4

The storage network is dedicated to [internal communication between Longhorn pods](#same-physical-interfaces), which results in high performance and reliability. However, the storage network still relies on the [external network infrastructure](../networking/deep-dive.md#external-networking) for connectivity (similar to how the [VM VLAN network](../networking/harvester-network.md#create-a-vm-with-vlan-network) functions). When the external network is not connected and configured correctly, you may encounter the following issues:

- A newly created VM becomes stuck in the `Not-Ready` state.

- The `longhorn-manager` pod logs include error messages.

Example:

```
longhorn-manager-j6dhh/longhorn-manager.log:2024-03-20T16:25:24.662251001Z time="2024-03-20T16:25:24Z" level=error msg="Failed rebuilding of replica 10.0.16.26:10000" controller=longhorn-engine engine=pvc-0a151c59-ffa9-4938-9c86-59ebb296bc88-e-c2a7fe77 error="proxyServer=10.52.6.33:8501 destination=10.0.16.23:10000: failed to add replica tcp://10.0.16.26:10000 for volume: rpc error: code = Unknown desc = failed to get replica 10.0.16.26:10000: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = \"transport: Error while dialing dial tcp 10.0.16.26:10000: connect: no route to host\"" node=oml-harvester-9 volume=pvc-0a151c59-ffa9-4938-9c86-59ebb296bc88
```

To test the communication between Longhorn pods, perform the following steps:

4.1 Obtain the storage network IP of each Longhorn Instance Manager pod identified in the previous step.

Example:

```
instance-manager-r-43f1624d14076e1d95cd72371f0316e2
storage network IP: 10.0.16.8

instance-manager-e-ba38771e483008ce61249acf9948322f
storage network IP: 10.0.16.14
```

4.2 Log in to those pods.

When you run the command `ip addr`, the output includes IPs that are identical to IPs in the pod annotations. In the following example, one IP is for the pod network, while the other is for the storage network.

Example:

```
$ kubectl exec -i -t -n longhorn-system instance-manager-e-ba38771e483008ce61249acf9948322f -- /bin/sh

$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
...
3: eth0@if2277: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
    link/ether 0e:7c:d6:77:44:72 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.52.6.146/32 scope global eth0
...
4: lhnet1@if2278: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
    link/ether fe:92:4f:fb:dd:20 brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.0.16.14/20 brd 10.0.31.255 scope global lhnet1
...

$ ip route
default via 169.254.1.1 dev eth0
10.0.16.0/20 dev lhnet1 proto kernel scope link src 10.0.16.14
169.254.1.1 dev eth0 scope link
```

4.3 Start a simple HTTP server in one pod.

Example:

```
$ python3 -m http.server 8000 --bind 10.0.16.14  # Replace 10.0.16.14 with the storage network IP of your pod
```

:::note

Explicitly bind the simple HTTP server to the storage network IP.

:::

4.4 Test the HTTP server in another pod.

Example:

```
# Run from instance-manager-r-43f1624d14076e1d95cd72371f0316e2 (IP 10.0.16.8)

$ curl http://10.0.16.14:8000
```

When the storage network is functioning correctly, the `curl` command returns a list of files on the HTTP server.

4.5 (Optional) Troubleshoot issues.

The storage network may malfunction because of issues with the external network, such as the following:

- Physical NICs (installed on Harvester nodes) that are associated with the storage network were not added to the same VLAN in the external switches.

- The external switches are not correctly connected and configured.

### Start VM Manually
