
Orchestrate scaling down Redpanda resource #90

Open
RafalKorepta opened this issue Mar 12, 2024 · 8 comments

@RafalKorepta (Contributor) commented Mar 12, 2024

If a user scales down below (N/2 + 1) nodes, where N is the replication factor, Redpanda will lose Raft quorum and will be unable to serve any decommission request. The operator should handle this gracefully by scaling down in steps that respect the (N/2 + 1) formula and waiting for the full decommission of the old nodes before continuing.

JIRA Link: K8S-197
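A rough sketch of what that stepwise scale-down could look like (Go, illustration only; quorum and scaleDownSteps are hypothetical helpers, not the operator's actual code):

```go
package main

import "fmt"

// quorum returns the number of brokers needed to keep Raft quorum
// for a cluster of n brokers: floor(n/2) + 1.
func quorum(n int) int { return n/2 + 1 }

// scaleDownSteps returns the intermediate cluster sizes an operator could
// walk through when shrinking from current to desired, never removing more
// brokers in one step than the remaining majority allows. After each step
// the operator would wait for the decommissioned brokers to fully drain
// before taking the next one.
func scaleDownSteps(current, desired int) []int {
	var steps []int
	for current > desired {
		// At most current - quorum(current) = floor((current-1)/2) brokers
		// can be removed in one step while the rest still form a majority.
		removable := current - quorum(current)
		if removable == 0 {
			// With 1 or 2 brokers the majority is all of them, so the last
			// step has to remove a single broker via a normal decommission.
			removable = 1
		}
		next := current - removable
		if next < desired {
			next = desired
		}
		steps = append(steps, next)
		current = next
	}
	return steps
}

func main() {
	// Shrinking a 9-broker cluster to 3 brokers happens in two steps.
	fmt.Println(scaleDownSteps(9, 3)) // [5 3]
}
```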

@alejandroEsc (Contributor)

A few comments:

  • Do we need to store the original size of the cluster somewhere? Cluster info can show you which nodes are down, and can be used to keep track of scaling down both the replicas and the number of nodes needed for a quorum, right? I think this is where Redpanda is a bit special, since you can decommission nodes, effectively changing this formula. So, what do we mean by scaling down here? Do we mean literally just scaling, but not necessarily changing the number of nodes commissioned?
  • Are we OK with using validatingWebhooks? This would still require knowing how many nodes are commissioned.

@chrisseto (Contributor)

Scaling here would be changing the number of active brokers in the Redpanda cluster.

IMO we shouldn't need to keep track of the original. We can always measure the number of active brokers with RPK or kubectl queries. Once the Spec has been updated, the operator should only focus on reconciling that. Rollbacks would need a lot more work.

Are we OK with using validatingWebhooks? This would still require knowing how many nodes are commissioned.

I'm a bit split on this one. I like not duplicating logic when possible. Could we instead rely on Redpanda to refuse decommission requests it can't serve, and push that responsibility back onto the users?
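For reference, one way to "measure the number of active brokers" from the operator side is to count Ready broker pods with client-go. This is only a sketch: the namespace and label selector below are assumptions, and pod readiness is just a proxy for broker membership (RPK or the admin API reports membership directly, including nodes that are still draining).

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// readyBrokerPods counts Redpanda pods that report Ready in the given
// namespace, as a rough proxy for the number of active brokers.
func readyBrokerPods(ctx context.Context, c kubernetes.Interface, namespace string) (int, error) {
	pods, err := c.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		// Assumed selector; adjust to whatever labels the chart actually sets.
		LabelSelector: "app.kubernetes.io/name=redpanda",
	})
	if err != nil {
		return 0, err
	}
	ready := 0
	for _, p := range pods.Items {
		for _, cond := range p.Status.Conditions {
			if cond.Type == corev1.PodReady && cond.Status == corev1.ConditionTrue {
				ready++
			}
		}
	}
	return ready, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	n, err := readyBrokerPods(context.Background(), client, "redpanda") // assumed namespace
	if err != nil {
		panic(err)
	}
	fmt.Println("active brokers:", n)
}
```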

@alejandroEsc (Contributor)

One more thing: quorum is lost if there are fewer than (N+1)/2 nodes, and we can tolerate at most (N-1)/2 failures. Which means that if we have (N+1)/2 failures, we have lost quorum and can no longer read or write.

That said, we lose quorum if we scale down to (N+1)/2 - 1 = (N-1)/2 nodes. This is what I will be using.
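To make that concrete for a small odd cluster, N = 5: quorum needs (5+1)/2 = 3 brokers, we can tolerate (5-1)/2 = 2 failures, and scaling down to (5+1)/2 - 1 = 2 brokers loses quorum.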

@RafalKorepta (Contributor, Author)

NIT:

That said, we lose quorum if we scale down to (N+1)/2 - 1 = (N-1)/2 nodes.

We can afford to lose floor((N-1)/2) nodes.

Other than that, I agree.
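For even N the floor is what matters. With N = 4, (N-1)/2 = 1.5, so we can afford to lose floor(3/2) = 1 broker; the remaining 3 out of 4 still form a majority.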

@alejandroEsc (Contributor)

The scope has changed a bit: we now also want to maintain quorum for topic partitions.

@alejandroEsc (Contributor)

Addressed in #102

@alejandroEsc (Contributor)

For this ticket, what we will do for now is add the quorum validation check we have in the current PR. I will create a new ticket to discuss the issue we should really be fixing, which is scaling down in a controlled fashion. We should probably do that once we move away from Flux.

@alejandroEsc (Contributor)

After some testing, I am checking whether there is a quick and simple win where we prevent scaling below the minimum replication factor.
