Orchestrate scaling down Redpanda resource #90
Comments
A few comments:
Scaling here means changing the number of active brokers in the Redpanda cluster. IMO we shouldn't need to keep track of the original count; we can always measure the number of active brokers with rpk or kubectl queries. Once the Spec has been updated, the operator should only focus on reconciling that. Rollbacks would need a lot more work.
I'm a bit split on this one. I like not duplicating logic when possible. We could instead rely on Redpanda to not decommission nodes that it can't, and push the responsibility back onto the users?
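As a rough illustration of the "measure rather than track" point above, here is a minimal sketch (not the operator's actual code) of reading the current broker count from the Redpanda StatefulSet status with client-go; the package, function name, and the StatefulSet name "redpanda" are assumptions for the example.

```go
package scaling

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// activeBrokers observes the number of ready Redpanda brokers from the
// StatefulSet status instead of keeping track of an "original" count.
// The StatefulSet name "redpanda" is a hypothetical placeholder.
func activeBrokers(ctx context.Context, cs kubernetes.Interface, namespace string) (int32, error) {
	sts, err := cs.AppsV1().StatefulSets(namespace).Get(ctx, "redpanda", metav1.GetOptions{})
	if err != nil {
		return 0, fmt.Errorf("getting statefulset: %w", err)
	}
	return sts.Status.ReadyReplicas, nil
}
```

The same number could be obtained from rpk or the Redpanda Admin API; the point is that the current state is always observable, so the operator only needs the desired Spec.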
One more thing: quorum is lost if there are fewer than (N+1)/2 nodes, and we can tolerate at most (N-1)/2 failures. Which means if we have (N+1)/2 failures, then we have lost quorum and we can no longer read or write. That said, we lose quorum if we replicate down to (N+1)/2 - 1 = (N-1)/2 nodes. This is what I will be using.
NIT
We can afford to lose floor((N-1)/2). Other than that I agree.
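A quick worked check of the arithmetic discussed above, as an illustrative Go snippet: with N replicas, a Raft quorum needs floor(N/2) + 1 live members, so floor((N-1)/2) failures are tolerable.

```go
package main

import "fmt"

// quorumSize returns the minimum number of live replicas a Raft group of
// n members needs to make progress: floor(n/2) + 1.
func quorumSize(n int) int { return n/2 + 1 }

// tolerableFailures returns how many replicas can be lost while keeping
// quorum: floor((n-1)/2).
func tolerableFailures(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{3, 5, 7} {
		fmt.Printf("N=%d quorum=%d tolerable failures=%d\n",
			n, quorumSize(n), tolerableFailures(n))
	}
	// Output:
	// N=3 quorum=2 tolerable failures=1
	// N=5 quorum=3 tolerable failures=2
	// N=7 quorum=4 tolerable failures=3
}
```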
Scope has changed a bit; now we also want to maintain quorum of topic partitions.
Addressed in #102
For this ticket, what we will do for now is add the quorum validation check we have in the current PR. I will create a new ticket to discuss the issue we should actually be fixing, which is scaling down in a controlled fashion. We should probably do that once we move away from Flux.
After some testing, I am checking to see if there is a quick and simple win where we prevent scaling below the minimum replication factor.
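A hedged sketch of what such a guard might look like; the function name and inputs are hypothetical and not the operator's real validation code. It simply rejects a desired broker count lower than the largest replication factor used by any topic.

```go
package main

import "fmt"

// validateScaleDown is an illustrative check: reject a desired broker count
// that falls below the highest topic replication factor, since those
// partitions could no longer be fully replicated.
func validateScaleDown(desiredBrokers, maxTopicReplicationFactor int) error {
	if desiredBrokers < maxTopicReplicationFactor {
		return fmt.Errorf(
			"cannot scale to %d brokers: the highest topic replication factor is %d",
			desiredBrokers, maxTopicReplicationFactor)
	}
	return nil
}

func main() {
	// Example: a cluster with topics replicated 3x cannot scale below 3 brokers.
	fmt.Println(validateScaleDown(2, 3)) // error
	fmt.Println(validateScaleDown(3, 3)) // <nil>
}
```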
If a user scales down by more than (N/2 + 1), where N is the replication factor, then Redpanda will lose Raft quorum and will be unable to serve any decommission request. The operator should handle this gracefully by scaling down using the (N/2 + 1) formula and waiting for the old nodes to be fully decommissioned.
JIRA Link: K8S-197
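As a rough illustration of the controlled scale-down described above (not the operator's implementation), the loop below removes brokers one at a time, refusing any step that would drop the remaining membership below quorum of the current size, and waits for each decommission to complete before removing the next broker. decommissionBroker and waitForDecommission are hypothetical helpers; the real operator would call the Redpanda Admin API.

```go
package main

import (
	"context"
	"fmt"
)

// Hypothetical helpers standing in for Redpanda Admin API calls.
func decommissionBroker(ctx context.Context, id int) error  { return nil }
func waitForDecommission(ctx context.Context, id int) error { return nil }

// scaleDown removes brokers one at a time, never dropping below quorum of
// the current membership (floor(n/2) + 1), and waits for each decommission
// to finish before continuing.
func scaleDown(ctx context.Context, brokerIDs []int, desired int) error {
	for len(brokerIDs) > desired {
		n := len(brokerIDs)
		quorum := n/2 + 1
		if n-1 < quorum {
			return fmt.Errorf("refusing to scale below quorum (%d of %d)", quorum, n)
		}
		victim := brokerIDs[n-1] // e.g. remove the highest ordinal first
		if err := decommissionBroker(ctx, victim); err != nil {
			return err
		}
		if err := waitForDecommission(ctx, victim); err != nil {
			return err
		}
		brokerIDs = brokerIDs[:n-1]
	}
	return nil
}

func main() {
	if err := scaleDown(context.Background(), []int{0, 1, 2, 3, 4}, 3); err != nil {
		fmt.Println("scale down rejected:", err)
		return
	}
	fmt.Println("scaled down to 3 brokers (sketch)")
}
```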