Handling unexpected k8s node termination in kubernetes #11742
-
Hey, I'm running Redpanda on a Kubernetes cluster. I wanted to know what the mechanism is to recover the Redpanda cluster when a k8s worker node goes down unexpectedly and we bring up a new replacement node for it. I understand that in my setup the Redpanda data resides in a local-storage PV, so the data is lost when the node goes down. Currently I see these errors on the Redpanda broker pods -
Cluster info looks like -
I've raised another question along similar lines in #11743, but here I am testing it by intentionally deleting the k8s worker node on which a broker pod is deployed.
-
What Redpanda version are you on? I'll let the k8s experts chime in, but the warning here means that the node with id=4 is using the same hostname as the dead node with id=2. The dead node is still part of the cluster, so RPCs meant for id=2 are now hitting id=4, and the check that logs the WARN blocks them. Ideally you would decommission a broker before reusing its hostname.
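To make that decommission step concrete, here is a minimal sketch of removing the dead broker through the Admin API before its hostname is reused. The Admin API port (9644), the service hostname, and the `/v1/brokers` and `/v1/brokers/{id}/decommission` endpoint paths are assumptions based on a typical deployment; please verify them against your Redpanda version before relying on this.

```python
# Hypothetical sketch: decommission a dead broker (e.g. id=2) via the Admin API
# before a replacement pod reuses its hostname. The Admin API port (9644) and
# endpoint paths are assumptions -- check them against your Redpanda version.
import requests

ADMIN_URL = "http://redpanda-0.redpanda.redpanda.svc.cluster.local:9644"  # any healthy broker
DEAD_NODE_ID = 2  # the id reported as dead in the WARN log

def list_brokers():
    """Return the cluster's broker list as seen by the Admin API."""
    resp = requests.get(f"{ADMIN_URL}/v1/brokers", timeout=10)
    resp.raise_for_status()
    return resp.json()

def decommission(node_id: int):
    """Ask the cluster to decommission the given node id."""
    resp = requests.put(f"{ADMIN_URL}/v1/brokers/{node_id}/decommission", timeout=10)
    resp.raise_for_status()

if __name__ == "__main__":
    for broker in list_brokers():
        print(broker.get("node_id"), broker.get("membership_status"))
    decommission(DEAD_NODE_ID)
    print(f"Decommission requested for node {DEAD_NODE_ID}")
```

Recent rpk versions expose the same operation as a CLI command (something like `rpk redpanda admin brokers decommission <node-id>`); check `rpk --help` for the exact form in your version.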
-
Which deployment option are you using, the operator or the Helm chart? Could you provide more logs from the operator?
In Redpanda we decided to make the replication factor an invariant. As a result, decommissioning a node requires additional capacity in the cluster to maintain that invariant; in this particular case a fourth node is required. All the partitions are first replicated to the newly added node, and only then is the node removed from the cluster. As long as there is enough capacity in the cluster, the decommissioning process will finish regardless of the state of the node being decommissioned.
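As an illustration of adding that capacity first, here is a hypothetical sketch that scales the Redpanda StatefulSet from three to four replicas so the decommission can complete. The namespace and StatefulSet name `redpanda` are assumptions based on a typical Helm install; if you deploy with the operator, you would instead bump the replica count in the cluster custom resource and let the operator reconcile it.

```python
# Hypothetical sketch: add a fourth broker by scaling the StatefulSet, then
# decommission the dead node once the new pod has joined the cluster.
# Namespace and StatefulSet name are assumptions -- match your deployment.
from kubernetes import client, config

NAMESPACE = "redpanda"     # assumed Helm release namespace
STATEFULSET = "redpanda"   # assumed StatefulSet name

config.load_kube_config()  # use load_incluster_config() when running in-cluster
apps = client.AppsV1Api()

sts = apps.read_namespaced_stateful_set(STATEFULSET, NAMESPACE)
desired = sts.spec.replicas + 1

apps.patch_namespaced_stateful_set_scale(
    STATEFULSET,
    NAMESPACE,
    {"spec": {"replicas": desired}},
)
print(f"Scaled {STATEFULSET} to {desired} replicas; "
      "decommission the dead broker once the new pod joins the cluster")
```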