Operator v1: use uncached client for Get Cluster, to avoid races #314
Conversation
Force-pushed from c626639 to f9ae0e2
Controller-runtime forbids returning both a result and an error. However, the ar.Ensure() code provoked such cases: because it reconciles multiple "inner" resources, some of them might return an error while others return a ctrl.Result. To overcome this, ar.Ensure now makes the deliberate choice to prioritize errors - if any non-requeue error is returned by an attached resource, it returns just the error and ignores the ctrl.Result. In addition, one case - where an STS cannot reconcile because "the other STS" is being decommissioned - was changed to not return an error. It's not an error we need to report; it's really just a request for a requeue, to make sure the STS gets reconciled once the condition has settled.
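To illustrate the aggregation rule, here is a minimal sketch, for illustration only: innerResource, errRequeue, and EnsureAll are made-up names, not the actual ar.Ensure implementation.

```go
package example

import (
	"context"
	"errors"

	ctrl "sigs.k8s.io/controller-runtime"
)

// innerResource is a hypothetical interface for one attached resource.
type innerResource interface {
	Ensure(ctx context.Context) (ctrl.Result, error)
}

// errRequeue is a hypothetical sentinel for "please requeue, nothing is wrong".
var errRequeue = errors.New("requeue requested")

// EnsureAll reconciles every inner resource. If any of them returns a real
// (non-requeue) error, that error is returned alone and any ctrl.Result
// collected so far is dropped, so controller-runtime never receives both a
// non-empty Result and an error.
func EnsureAll(ctx context.Context, resources []innerResource) (ctrl.Result, error) {
	var aggregated ctrl.Result
	for _, res := range resources {
		result, err := res.Ensure(ctx)
		switch {
		case err != nil && !errors.Is(err, errRequeue):
			// Errors win: ignore any Result and surface only the error.
			return ctrl.Result{}, err
		case err != nil:
			// A requeue-style "error" is downgraded to a plain requeue.
			aggregated.Requeue = true
		case result.RequeueAfter > 0 && (aggregated.RequeueAfter == 0 || result.RequeueAfter < aggregated.RequeueAfter):
			// Keep the shortest requested requeue delay.
			aggregated.RequeueAfter = result.RequeueAfter
		case result.Requeue:
			aggregated.Requeue = true
		}
	}
	return aggregated, nil
}
```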
It's possible that Get returns a cached, stale value - even if the missed change was performed in the same process. This is because of various trade-offs made in controller-runtime's cache layers. This puts the controller in a dangerous situation: in practice it happens that a previous run reduced status.currentReplicas from 3 to 2, but the next run reads a stale object and still sees 3. To avoid this bug, and other bugs we cannot yet foresee, we perform this one Get with an uncached read. It is a trade-off: we accept a performance hit and increased load on the kube API server, but we value consistency higher. This is limited to this one, most important Get call - the Get of the Cluster resource (one per entire Reconcile). Since controller-runtime has throttling built in, and it's only that one Cluster Get call, we believe this trade-off is acceptable and the benefits outweigh the cost.
Force-pushed from f9ae0e2 to 12db325
In my experience, the problem statement should be solved in the reconciliation logic, not by removing the cache. I understand that there might be more code paths that do not correctly handle a requeue after a Patch call. I would stick with option D). This might be a case where Status is not the perfect place to store the observed state.
I'll take a look in a bit to be able to fully opine as well (since I haven't read the code/problem statement carefully), but just as a gut instinct, the solution of "let's remove the cache" is, IMO, not the correct decision 99+% of the time, due to the additional networking overhead you'd be paying instead of just explicitly handling potential stale cache reads and re-queueing reconciliation accordingly when the cache is updated.
Based on some issues I had been running into with the decommission controller, I do lean towards "let's not use a cache". However, not using a cache is not a magic bullet, as Kubernetes is by nature an eventually consistent system. There's no getting around that. IMO, fetching the same resource more than once at the start of reconciliation is an anti-pattern. It will constantly lead us to situations like this and will be further exacerbated by having multiple controllers acting on the same resource. We won't solve this in a single PR, but we should be working towards it when possible. My preference is to not read from [...]. IIUC, the informers interface exposes some method of forcing a synchronization for cases like this, but I also need to dig around the details thereof. We're not going to have two instances of [...].
I actually wouldn't be opposed to E), assuming we generalize it as a more reliable form of cache that's usable by [...]. Is there a reason we can't just consult the Admin API to inform the controller instead of reading from [...]?
status.currentReplicas is a key element that steers removal of pods; it's something else than the number of existing brokers.
Regarding D): [...]
E): will require some more refactoring, so that the resource version created by Update/Patch can be returned "somewhere" where it gets used.
Basically, I would like to know whether you think a single uncached Get call causes a performance hit bad enough to justify the extra work, given this is in the deprecated operator. Requiring read-after-write semantics for a local change is not unreasonable, eventually consistent system or not. It's indeed not optimal that currentReplicas steers how many brokers we want to have, but it's a core fact of v1, whether we like it or not.
Hey @birdayz, before any sort of merge I think I'd really like to fully understand the problem you're trying to solve. I understand that stale cache reads are a thing, but specifically I'm trying to understand why they are problematic in your case, because, as @chrisseto mentioned, controllers by definition are meant to be eventually consistent.
So, I'm trying to figure out what the exact problem is here. Is [...]
If either of those isn't the case and we're handling brokers going away during whatever operation [...]
If any of the 4 above points is true, it tells me that those are the real problem, not the stale cache read, and we should handle them accordingly. That said, I could also be missing context, as I'm very unfamiliar with the v1 code at this point.
Thanks Andrew for taking a closer look.
No. This is not about reconciliation not getting triggered.
Kind of. It's not about the pod; it's about the RP broker. The pod exists, but the RP broker does not, as it was removed already. We're incorrectly setting decommissionBrokerID AGAIN - a second time - for a broker that was just decommissioned. Example:
They are deterministic.
Yes, it's not idempotent. Seeing currentReplicas=2, reconciling, then seeing currentReplicas=3 again and later currentReplicas=2 again does not result in the same outcome, because it is not aligned with Redpanda's view of the world. Please have a look at the patches below; they try to make it more idempotent. To move this forward quickly, I've sent PRs with two other options.
Option C: [...]
Option D: [...]
I could see another variant that discards the brokerID from status if the broker is not in the Redpanda admin API's broker list here:
Note: these are really patches for this specific symptom of the overall problem. It's quite possible that other code relies on not seeing an old currentReplicas "come back" after an update to that field. That's why I am interested in getting read-after-write semantics for currentReplicas: it is not feasible to retrofit enough tests in operator v1 to ensure we're covering all cases and that stale reads of currentReplicas don't cause problems. This isn't even strict consistency, just seeing a change the operator made in the local process. That's much easier in the v2 operator, where there's a single status patch at the end of the loop; in v1 it's very ugly because there are status patches/updates all over the place.
Closing in favor of #317
This PR is targeted at jb/nodepools-requeue until we merge that one.
Problem statement
I've been debugging a rare condition - seen in about 1-2% of cloud certification runs (it takes about 50-100 clusters, with 6 decommissions each, to reproduce), where we decommission brokers and move them over to another NodePool - in which the cluster ends up with status.decommissionBrokerID set, even though that broker has already been decommissioned.
Everything happens within minute 06:27 UTC.
(Logs are available internally)
(The following times are seconds within 06:27)
52.579: Cleared decom in status
52.651: Set currentReplicas to 2 (down from 3)
52.686: Finished reconcile loop
52.689: Start of a new loop; Get Cluster returns currentReplicas 3 (stale read - it's supposed to be 2). Initiates a new scale-down, because it read currentReplicas 3 (outdated) while replicas is 2. That scale-down was already performed previously and has finished (decommission finished), but the pod was not removed yet.
53.154: Patch status (to update observedGen) fails, because it would change currentReplicas back to 3 - which is wrong, as the value 3 is based on a stale read - and the kube API server rejects it as a conflict.
Variants for how to fix this
(In this PR we're picking A, but I am very open to other variants.)
A)
At the start of the reconcile loop, always perform an uncached Get to read the Cluster resource. This ensures that all actions are based on up-to-date data. Even more, it establishes a happens-before relationship between the write (the Patch call in the previous reconcile) and the next iteration.
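A minimal sketch of what this could look like, assuming controller-runtime's manager wiring; the Cluster import path and the field names are assumptions, not the PR's actual code. mgr.GetAPIReader() returns a client.Reader that bypasses the informer cache:

```go
package example

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Import path assumed for illustration; use the operator's own Cluster API package.
	vectorizedv1alpha1 "github.com/redpanda-data/redpanda-operator/operator/api/vectorized/v1alpha1"
)

// ClusterReconciler mirrors the shape of a controller-runtime reconciler;
// the apiReader field would be populated from mgr.GetAPIReader() at setup time.
type ClusterReconciler struct {
	client.Client               // cached client, used for everything else
	apiReader     client.Reader // uncached reader, talks directly to the API server
}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cluster vectorizedv1alpha1.Cluster
	// One uncached Get per Reconcile: this read bypasses the informer cache,
	// so it always reflects the status written by the previous iteration.
	if err := r.apiReader.Get(ctx, req.NamespacedName, &cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// ... the rest of the reconcile works off the fresh object ...
	return ctrl.Result{}, nil
}
```

The cached client stays in use for everything else, so the extra API-server load is limited to this single read per Reconcile.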
B)
After every Patch(), loop running Get() until we see the values change. Then the next reconcile loop will not see a stale cache, because we've already read back the data.
I am not sure about corner cases; IMHO it feels quite brittle.
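For completeness, a rough sketch of what such a read-back loop could look like; the helper name, interval, and timeout are made up, and the Cluster import path is an assumption:

```go
package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Import path assumed for illustration; use the operator's own Cluster API package.
	vectorizedv1alpha1 "github.com/redpanda-data/redpanda-operator/operator/api/vectorized/v1alpha1"
)

// waitForCurrentReplicas polls the (cached) client until the value written by
// the preceding status Patch becomes visible, so the next reconcile cannot
// observe the old currentReplicas. Interval and timeout are arbitrary.
func waitForCurrentReplicas(ctx context.Context, c client.Client, key client.ObjectKey, want int32) error {
	return wait.PollUntilContextTimeout(ctx, 100*time.Millisecond, 5*time.Second, true,
		func(ctx context.Context) (bool, error) {
			var cluster vectorizedv1alpha1.Cluster
			if err := c.Get(ctx, key, &cluster); err != nil {
				return false, err
			}
			// Done once the cache reflects the value we just wrote.
			return cluster.Status.CurrentReplicas == want, nil
		})
}
```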
C)
Change status writes to use Update, or include the resource version in Patch, so that update calls based on old data fail.
This would work as well, but it fails late - only when writing status changes while already working off stale data.
I fear that other code paths can make bad decisions if they act on old data; status.currentReplicas in particular is very dangerous - code can easily make a decision we will very much regret (e.g. eliminating a node). Currently this would be fine, I think, because a Patch/Update call will set DecomBrokerID in status and would fail. However, it becomes unpredictable if the code changes and some actions are first performed outside the kube API (e.g. a call to the RP admin API).
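A hedged sketch of what variant C could look like if status writes went through Update; it assumes the ClusterReconciler embeds client.Client as in the variant A sketch above, and updateStatus is a hypothetical helper, not existing operator code:

```go
package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"

	// Import path assumed for illustration; use the operator's own Cluster API package.
	vectorizedv1alpha1 "github.com/redpanda-data/redpanda-operator/operator/api/vectorized/v1alpha1"
)

// updateStatus writes the status with Update, which carries the object's
// resourceVersion; a write based on a stale copy is rejected with a Conflict,
// which we turn into a requeue instead of an error.
func (r *ClusterReconciler) updateStatus(ctx context.Context, cluster *vectorizedv1alpha1.Cluster) (ctrl.Result, error) {
	if err := r.Status().Update(ctx, cluster); err != nil {
		if apierrors.IsConflict(err) {
			// Our copy of the Cluster was stale; requeue and retry with a fresh read.
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```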
D)
Update the code so it does not get stuck forever, e.g. check whether a broker has already been decommissioned before setting status.decommissionBrokerID. Most of the problems of C) still apply; it does not solve this generally, only for the specifically observed case.
E)
We could store the last seen resource version in memory in the ClusterReconciler (e.g. as returned by Patch), and use this ResourceVersion in reconcile's Get of the Cluster. This would guarantee we get a version "not older than" it. See https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-get-and-list
Note: controller-runtime does not locally invalidate the cache when a Patch/Update is done; it relies on watches to do this, so it's asynchronous and there is no happens-before relationship between a Patch/Update and a subsequent Get.
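A speculative sketch of variant E, assuming a controller-runtime version whose Get accepts client.GetOptions and a direct (uncached) client.Reader; the helper and parameter names are illustrative, and lastSeenRV would be remembered from cluster.ResourceVersion after each successful Patch/Update:

```go
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Import path assumed for illustration; use the operator's own Cluster API package.
	vectorizedv1alpha1 "github.com/redpanda-data/redpanda-operator/operator/api/vectorized/v1alpha1"
)

// getClusterNotOlderThan reads the Cluster with a minimum resourceVersion.
// An empty lastSeenRV falls back to a plain Get.
func getClusterNotOlderThan(ctx context.Context, reader client.Reader, key client.ObjectKey, lastSeenRV string) (*vectorizedv1alpha1.Cluster, error) {
	opts := &client.GetOptions{}
	if lastSeenRV != "" {
		// A GET with resourceVersion set has "not older than" semantics (see the
		// api-concepts link above), so the API server may serve it from its watch
		// cache instead of performing a quorum read.
		opts.Raw = &metav1.GetOptions{ResourceVersion: lastSeenRV}
	}
	var cluster vectorizedv1alpha1.Cluster
	if err := reader.Get(ctx, key, &cluster, opts); err != nil {
		return nil, err
	}
	return &cluster, nil
}
```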
Refs