Handling unexpected k8s node termination in kubernetes #11742
-
Hey, I'm running Redpanda on a Kubernetes cluster. I wanted to know what the mechanism is to recover the Redpanda cluster when a k8s worker node goes down unexpectedly and we bring up a new replacement node for it. I understand that in my setup the Redpanda data resides in a local-storage PV, so the data is lost when the node goes down. Currently I see these errors on the Redpanda broker pods -
Cluster info looks like -
I've raised another question along similar lines in #11743, but here I am testing it by intentionally deleting the k8s worker node on which a broker pod is deployed.
-
What Redpanda version are you on? I'll let the k8s experts chime in, but the warning here means that the node with id=4 is using the same hostname as the dead node with id=2. The dead node is still part of the cluster, so RPCs meant for id=2 are now hitting id=4, and the check that logs the WARN blocks them. Ideally you would decommission a broker before reusing its hostname.
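To make that decommission step concrete, here is a minimal sketch of removing the dead broker through the Admin API before its hostname is reused. The Admin API port (9644), the service hostname, and the `/v1/brokers` and `/v1/brokers/{id}/decommission` endpoint paths are assumptions based on a typical deployment; please verify them against your Redpanda version before relying on this.

```python
# Hypothetical sketch: decommission a dead broker (e.g. id=2) via the Admin API
# before a replacement pod reuses its hostname. The Admin API port (9644) and
# endpoint paths are assumptions -- check them against your Redpanda version.
import requests

ADMIN_URL = "http://redpanda-0.redpanda.redpanda.svc.cluster.local:9644"  # any healthy broker
DEAD_NODE_ID = 2  # the id reported as dead in the WARN log

def list_brokers():
    """Return the cluster's broker list as seen by the Admin API."""
    resp = requests.get(f"{ADMIN_URL}/v1/brokers", timeout=10)
    resp.raise_for_status()
    return resp.json()

def decommission(node_id: int):
    """Ask the cluster to decommission the given node id."""
    resp = requests.put(f"{ADMIN_URL}/v1/brokers/{node_id}/decommission", timeout=10)
    resp.raise_for_status()

if __name__ == "__main__":
    for broker in list_brokers():
        print(broker.get("node_id"), broker.get("membership_status"))
    decommission(DEAD_NODE_ID)
    print(f"Decommission requested for node {DEAD_NODE_ID}")
```

Recent rpk versions expose the same operation as a CLI command (something like `rpk redpanda admin brokers decommission <node-id>`); check `rpk --help` for the exact form in your version.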
-
Which deployment option are you using, the operator or the Helm chart? Could you provide more logs from the operator?
In Redpanda we decided to make the replication factor an invariant. As a result, decommissioning a node requires additional capacity in the cluster to maintain that invariant; in this particular case a fourth node is required. All the partitions are first replicated to the newly added node, and only then is the node removed from the cluster. As long as there is enough capacity in the cluster, the decommissioning process will finish regardless of the state of the node being decommissioned.
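As an illustration of adding that capacity first, here is a hypothetical sketch that scales the Redpanda StatefulSet from three to four replicas so the decommission can complete. The namespace and StatefulSet name `redpanda` are assumptions based on a typical Helm install; if you deploy with the operator, you would instead bump the replica count in the cluster custom resource and let the operator reconcile it.

```python
# Hypothetical sketch: add a fourth broker by scaling the StatefulSet, then
# decommission the dead node once the new pod has joined the cluster.
# Namespace and StatefulSet name are assumptions -- match your deployment.
from kubernetes import client, config

NAMESPACE = "redpanda"     # assumed Helm release namespace
STATEFULSET = "redpanda"   # assumed StatefulSet name

config.load_kube_config()  # use load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

sts = apps.read_namespaced_stateful_set(STATEFULSET, NAMESPACE)
desired = sts.spec.replicas + 1

apps.patch_namespaced_stateful_set_scale(
    STATEFULSET,
    NAMESPACE,
    {"spec": {"replicas": desired}},
)
print(f"Scaled {STATEFULSET} to {desired} replicas; "
      "decommission the dead broker once the new pod joins the cluster")
```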