Taints are not removed from nodes #23

Open
DipanshuSehjal opened this issue Oct 3, 2022 · 7 comments

Comments

@DipanshuSehjal

Version - HA controller 1.1.0

We have often seen that taints are not removed from nodes, so pods cannot be scheduled. Moreover, the taints come back on the nodes as soon as you remove them manually.
Taints mostly appear when nodes are rebooting, for example during a node upgrade and reboot.
Additionally, both replicas of 2 resources also went into the Outdated state.

For instance,

[~]# kubectl describe node | grep -i taint
Taints:             drbd.linbit.com/lost-quorum:NoSchedule
Taints:             drbd.linbit.com/force-io-error:NoSchedule
Taints:             drbd.linbit.com/lost-quorum:NoSchedule
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode1.galwayan.com | 7000 | Unused | Ok | Outdated | 2022-09-12 14:20:06 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode2.galwayan.com | 7000 | Unused | Ok | Outdated | 2022-09-12 14:20:02 |
| pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | env13-clusternode3.galwayan.com | 7000 | Unused | Ok | Diskless | 2022-09-12 14:20:04 |

Settings defined in the storage class, as per the HA controller requirements:

  DrbdOptions/auto-quorum: suspend-io
  DrbdOptions/Resource/on-no-data-accessible: suspend-io
  DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
  DrbdOptions/Net/rr-conflict: retry-connect
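
For context, here is roughly how those options sit in a complete LINSTOR CSI StorageClass. This is just a sketch: the class name, placement count and storage pool are placeholders, and the provisioner string is an assumption; only the DrbdOptions lines are taken from the settings above.

  apiVersion: storage.k8s.io/v1
  kind: StorageClass
  metadata:
    name: linstor-ha                                       # placeholder name
  provisioner: linstor.csi.linbit.com                      # assumed LINSTOR CSI provisioner
  parameters:
    autoPlace: "3"                                         # placeholder replica count
    storagePool: "my-pool"                                 # placeholder storage pool
    DrbdOptions/auto-quorum: suspend-io
    DrbdOptions/Resource/on-no-data-accessible: suspend-io
    DrbdOptions/Resource/on-suspended-primary-outdated: force-secondary
    DrbdOptions/Net/rr-conflict: retry-connect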
@WanzenBug
Member

Taints mostly appear when nodes are rebooting, for example during a node upgrade and reboot.

That is expected, as the taints (at least the drbd.linbit.com/lost-quorum ones) are added when one node looks unreachable from another. During a reboot this is obviously the case. The question is why the taint is not removed after the node is back online and the satellite + DRBD is running again.

In the above case, it's probably related to the Outdated pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 resource. Could you please collect the kernel logs on all 3 nodes for that resource: journalctl -t kernel --grep pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1.
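
For convenience, something along these lines should grab the logs from all three nodes in one go (assuming direct SSH access to the nodes listed above; the output file names are arbitrary):

for n in env13-clusternode1 env13-clusternode2 env13-clusternode3; do
    # run the kernel-log query on each node and save it locally per node
    ssh "$n.galwayan.com" "journalctl -t kernel --grep pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1" > "$n-kernel.log"
done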

Lastly, there is one drbd.linbit.com/force-io-error taint, which would indicate that one of the nodes has the DRBD device open but is currently trying to become Secondary. Could you check which node has that taint and see what's up with that resource? The output of drbdsetup status -v on that node should also show force-io-error:yes.
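
In case it helps, a rough way to find which node carries that taint and then check the resource there (the PVC name is the one from the example above; jq is assumed to be available):

# list nodes that currently carry the force-io-error taint
kubectl get nodes -o json \
  | jq -r '.items[] | select(.spec.taints[]?.key == "drbd.linbit.com/force-io-error") | .metadata.name'

# then, on the node reported above:
drbdsetup status -v pvc-646fa87b-aeb2-4c51-924d-7019d1a5f0b1 | grep -i force-io-error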

@bmalynovytch

Version - HA controller 1.1.0

We have often seen that taints are not removed from nodes, so pods cannot be scheduled. Moreover, the taints come back on the nodes as soon as you remove them manually. Taints mostly appear when nodes are rebooting, for example during a node upgrade and reboot. Additionally, both replicas of 2 resources also went into the Outdated state.

Same problem here with v1.1.1, but no taints other than drbd.linbit.com/lost-quorum on 2 nodes. What's strange is that these nodes both hold replicas of an UpToDate volume (the same volume). 🤷

@WanzenBug
Member

Which DRBD version are you using? And can you check with drbdadm status whether they are reporting quorum:no? I think there is a bug in the latest DRBD releases that could cause this issue.
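
A quick way to check on each node is something like the following (I believe drbdadm status only prints the quorum field when quorum is lost, so an empty result would mean quorum is fine; both commands are standard drbd-utils):

# show any resources currently reporting lost quorum
drbdadm status | grep -i -B3 'quorum:no'

# full per-resource detail, including the quorum state
drbdsetup status --verbose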

@bmalynovytch

You're right, drbdadm status gives quorum:no 😭

DRBDADM_BUILDTAG=GIT-hash:\ 409097fe02187f83790b88ac3e0d94f3c167adab\ build\ by\ @buildsystem\,\ 2022-09-19\ 12:15:08
DRBDADM_API_VERSION=2
DRBD_KERNEL_VERSION_CODE=0x090201
DRBD_KERNEL_VERSION=9.2.1
DRBDADM_VERSION_CODE=0x091600
DRBDADM_VERSION=9.22.0

@bmalynovytch

bmalynovytch commented Jan 27, 2023

Could be related to LINBIT/drbd#52

@bmalynovytch

Ok, so messing around with more or fewer arbiters (kubectl linstor resource create worker-XYZ pvc-XYZ --drbd-diskless) lets me toggle the quorum value: on when there are fewer than 2 diskful + 3 diskless replicas, and off when there are at least 2 diskful + 3 diskless.

@bmalynovytch

Ok, so messing around with more or fewer arbiters (kubectl linstor resource create worker-XYZ pvc-XYZ --drbd-diskless) lets me toggle the quorum value: on when there are fewer than 2 diskful + 3 diskless replicas, and off when there are at least 2 diskful + 3 diskless.

The trick isn't stable, as the operator will at some point delete the extra arbiters. A better solution, while not ideal, is to force the quorum on the volume to the number of data nodes.
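
For reference, a sketch of that workaround using the same kubectl linstor plugin as above. I believe the DRBD quorum setting maps to the DrbdOptions/Resource/quorum property on the resource definition, but treat the exact key and the resource name (pvc-XYZ) as placeholders to verify:

# with 2 diskful (data) replicas, pin quorum to 2 on the resource definition
kubectl linstor resource-definition set-property pvc-XYZ DrbdOptions/Resource/quorum 2

# verify the resulting property
kubectl linstor resource-definition list-properties pvc-XYZ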
