
Operator v1: use uncached client for Get Cluster, to avoid races #314

Closed
wants to merge 4 commits

Conversation


@birdayz (Contributor) commented on Nov 17, 2024

This PR targets jb/nodepools-requeue until that one is merged.

Problem statement

I've been debugging a rare condition, hitting roughly 1-2% of cloud certification runs (which need about 50-100 clusters, with 6 decommissions each), in which we decommission brokers and move them over to another NodePool, and the cluster ends up with status.decommissionBrokerID set even though that broker has already been decommissioned.

  • The initial Get Cluster in the reconcile loop returns a stale value that does not reflect a status Patch made only single-digit milliseconds earlier.
  • We changed status Update calls to Patch calls; this was requested in code review comments on multiple PRs (and I understand, it's generally more desirable).
  • This has a negative side effect: Patch does not fail with a Conflict if the specific fields being updated are not themselves conflicting, whereas Update always fails if there was a concurrent change. In practice, we set the decommission broker ID based on an incorrect currentReplicas from a stale Get Cluster (a bad decision; that broker has already finished decommissioning).
  • All of this behavior was shadowed by other bugs. In one concrete case we returned both a ctrl.Result AND an error, which is not allowed; controller-runtime then performs an exponential backoff, and in practice the stale read does not happen then.

Everything happens within the minute 06:27 UTC.

(Logs are available internally)

(The following times are seconds within 06:27.)

52.579: Cleared the decommission in status.
52.651: Set currentReplicas to 2 (down from 3).
52.686: Finished the reconcile loop.
52.689: Start of a new loop. Get Cluster returns currentReplicas 3 (a stale read; it should be 2). The loop initiates a new scale-down because it read currentReplicas 3 (outdated) while replicas is 2. That scale-down was already performed and has finished (the decommission is complete), but the pod has not been removed yet.
53.154: Patch status (to update observedGen) fails because it would change currentReplicas back to 3, which is wrong since the value 3 is based on the stale read, and the kube API server rejects it as a conflict.

Variants for fixing this

(In this PR we're picking A, but I am very open to other variants.)

A)

At the start of the reconcile loop, always perform an uncached Get to read the Cluster resource. This ensures that all actions are based on up-to-date data. Even more so, it ensures a happens-before relationship between the write (the Patch call in the previous reconcile) and the next iteration.
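A minimal sketch of what this could look like, assuming the reconciler is handed the manager's uncached reader (mgr.GetAPIReader()) at setup time; ClusterReconciler and the vectorizedv1alpha1 import path are placeholders for the operator's own types, not its actual code:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Placeholder import path for the operator's Cluster CRD types.
	vectorizedv1alpha1 "example.com/operator/api/v1alpha1"
)

type ClusterReconciler struct {
	client.Client                // cached client, still used for everything else
	APIReader client.Reader      // mgr.GetAPIReader(): reads go straight to the API server
}

func (r *ClusterReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var cluster vectorizedv1alpha1.Cluster
	// The single uncached Get per reconcile: whatever we patched in the
	// previous iteration (e.g. status.currentReplicas) is visible here, so no
	// decision is based on a stale cache entry.
	if err := r.APIReader.Get(ctx, req.NamespacedName, &cluster); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// ... the rest of the reconcile keeps using the cached r.Client ...
	return ctrl.Result{}, nil
}
```

The reader would be wired in during setup, e.g. r.APIReader = mgr.GetAPIReader() in SetupWithManager.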

B)

After every Patch(), loop on Get() until we see the changed values. Then the next reconcile loop will not have a stale cache, because we've already read the data back.
I am not sure about corner cases; IMHO it feels quite brittle.
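For illustration, a rough sketch of such a read-back loop; waitForObservedWrite and its check callback are hypothetical helpers, not existing operator code:

```go
package controllers

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForObservedWrite re-reads obj through the (cached) client until check
// reports that the just-patched values are visible, or the timeout expires.
func waitForObservedWrite(ctx context.Context, c client.Client, key client.ObjectKey, obj client.Object, check func(client.Object) bool) error {
	return wait.PollUntilContextTimeout(ctx, 100*time.Millisecond, 5*time.Second, true,
		func(ctx context.Context) (bool, error) {
			if err := c.Get(ctx, key, obj); err != nil {
				// Ignore a transient NotFound and keep polling; bail out on other errors.
				return false, client.IgnoreNotFound(err)
			}
			return check(obj), nil
		})
}
```

A caller would invoke this right after the status Patch, with a check closure comparing e.g. status.currentReplicas against the value just written.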

C)

Change status writes to Update, or include the resource version in Patch, so that writes based on old data fail.

This would work as well, but it fails late: only when making status changes while already working off stale data.
I fear that other code paths can make bad decisions if they act on old data; status.currentReplicas in particular is very dangerous - code can easily make a decision we will very much regret (e.g. eliminating a node). Currently this would be fine, I think, because a Patch/Update call would set decommissionBrokerID in status and would fail. However, it's unpredictable if the code changes and some actions are first performed outside of the kube API (e.g. a call to the RP admin API).
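A hedged sketch of the resource-version variant, using controller-runtime's optimistic-lock patch option; the import path and the status field standing in for status.decommissionBrokerID are illustrative, not the operator's actual code:

```go
package controllers

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Placeholder import path for the operator's Cluster CRD types.
	vectorizedv1alpha1 "example.com/operator/api/v1alpha1"
)

func setDecommissionBrokerID(ctx context.Context, c client.Client, cluster *vectorizedv1alpha1.Cluster, brokerID int32) (ctrl.Result, error) {
	base := cluster.DeepCopy()
	cluster.Status.DecommissioningNode = &brokerID // illustrative field name

	// MergeFromWithOptimisticLock embeds the resourceVersion into the patch,
	// so a write based on a stale copy is rejected with a Conflict instead of
	// being silently merged.
	patch := client.MergeFromWithOptions(base, client.MergeFromWithOptimisticLock{})
	if err := c.Status().Patch(ctx, cluster, patch); err != nil {
		if apierrors.IsConflict(err) {
			// Our copy was stale: requeue and re-read before deciding again.
			return ctrl.Result{Requeue: true}, nil
		}
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```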

D)
Update the code so it does not get stuck forever, e.g. check whether a broker has already been decommissioned before setting status.decommissionBrokerID. Most of the problems of C) still apply; it does not solve this generally, only for the specifically observed case.
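A minimal sketch of that pre-check, written against a hypothetical adminBrokerLister abstraction over the (uncached) Redpanda admin API rather than the operator's real admin client:

```go
package controllers

import "context"

// adminBrokerLister is a hypothetical abstraction over the uncached Redpanda
// admin API; the operator's real admin client looks different.
type adminBrokerLister interface {
	BrokerIDs(ctx context.Context) ([]int32, error)
}

// shouldSetDecommissionBrokerID reports whether the candidate broker is still
// known to Redpanda. If it is not, setting status.decommissionBrokerID again
// would repeat a decommission that has already finished.
func shouldSetDecommissionBrokerID(ctx context.Context, admin adminBrokerLister, candidate int32) (bool, error) {
	ids, err := admin.BrokerIDs(ctx)
	if err != nil {
		return false, err // surface the error so the reconcile requeues instead of guessing
	}
	for _, id := range ids {
		if id == candidate {
			return true, nil // broker still present: the scale-down is genuinely pending
		}
	}
	return false, nil
}
```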

E)

We could store the last seen resource version in memory in ClusterReconciler (e.g. as returned by Patch) and use this ResourceVersion in reconcile's Get Cluster. This would guarantee we get data "not older than" this version. See https://kubernetes.io/docs/reference/using-api/api-concepts/#semantics-for-get-and-list

Note: controller-runtime does not locally invalidate the cache after a Patch/Update; it relies on watches to do this, so it's asynchronous and no happens-before relationship exists between a Patch/Update and a subsequent Get.

Refs

Controller-runtime forbids returning both a result and an error.
However, the ar.Ensure() code provoked such cases, because it reconciles
multiple "inner" resources, and some of them might return an error in
addition to others returning a ctrl.Result.

To overcome this, ar.Ensure now makes the deliberate choice to
prioritize errors: if any non-requeue error is returned by an attached
resource, it returns just the error and ignores the ctrl.Result.

In addition, one case - where a STS cannot reconcile because "the other
STS" is being decommissioned - was changed to not return an error. It's
not an error we need to report; it's really just an ask for a requeue to
make sure it gets reconciled after the condition has settled.
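The error-prioritization described above might look roughly like the following; attachedResource and ensureAll are illustrative names, not the actual ar.Ensure implementation:

```go
package controllers

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
)

// attachedResource stands in for one of the "inner" resources that ar.Ensure reconciles.
type attachedResource interface {
	Ensure(ctx context.Context) (ctrl.Result, error)
}

func ensureAll(ctx context.Context, resources []attachedResource) (ctrl.Result, error) {
	var combined ctrl.Result
	var firstErr error
	for _, res := range resources {
		result, err := res.Ensure(ctx)
		if err != nil {
			if firstErr == nil {
				firstErr = err
			}
			continue
		}
		// Remember any requeue request from resources that succeeded.
		if result.Requeue || result.RequeueAfter > 0 {
			combined = result
		}
	}
	if firstErr != nil {
		// Never return both a ctrl.Result and an error; errors win, and the
		// requeue requests are dropped for this iteration.
		return ctrl.Result{}, firstErr
	}
	return combined, nil
}
```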
It's possible that Get returns a cached, stale value - even if the
missed change was performed in the same process. This is because of
various trade-offs made in controller-runtime's cache layers.

This puts the controller in a dangerous situation: in practice it
happens that a previous run reduced status.currentReplicas from 3 to 2,
but the next run reads a stale object and sees 3.

To avoid this bug, and other bugs we cannot yet foresee, we will
perform this one Get with an uncached read. It is a trade-off, where we
accept a performance hit and increased load on the kube-api server, but we
value the consistency higher. This is limited to this one, most
important Get call, of the Cluster resource only (one per entire Reconcile).

Since controller-runtime has throttling built in, and it's only that
one Cluster Get call, we believe this trade-off is acceptable and the
benefits outweigh the cost.
@birdayz force-pushed the jb/opv1-get-uncached-race branch from f9ae0e2 to 12db325 on November 17, 2024 16:57
@RafalKorepta (Contributor)

In my experience, this kind of problem should be solved in the reconciliation logic, not by removing the cache. I understand that there might be more code paths that do not correctly handle a re-queue after a Patch call.

I would stick to option D)

This might be a case where Status is not the ideal place to store the observed state.

@andrewstucki (Contributor)

I'll take a look in a bit so I can fully opine as well (since I haven't read the code/problem statement carefully), but just as a gut instinct: "let's remove the cache" is, IMO, not the correct decision 99+% of the time, because of the additional networking overhead you'll be paying instead of explicitly handling potential stale cache reads and simply re-queueing reconciliation when the cache is updated.

@birdayz changed the base branch from jb/nodepools-requeue to main on November 18, 2024 16:49
@chrisseto (Contributor)

Based on some issues I had been running into with the decommission controller, I do lean towards "let's not use a cache". However, not using a cache is not a magic bullet, as Kubernetes is, by nature, an eventually consistent system. There's no getting around that.

IMO, fetching the same resource more than once at the start of reconciliation is an anti-pattern. It will constantly lead us into situations like this, and it will only be exacerbated by having multiple controllers acting on the same resource. We will not solve this in a single PR, but we should work towards it where possible.

My preference is to not read from .Status, aside from inconsequential data (like timestamps used for debounce) and data that we know has been freshly written to an instance in memory. So my vote would be D), but I too need to spend some more time thinking through this.

IIUC, the informers interface exposes some method of forcing a synchronization for cases like this, but I also need to dig into the details.

We're not going to have two instances of client.Client floating around in a single controller. Full stop.

@chrisseto (Contributor)

I actually wouldn't be opposed to E) assuming we generalize it as a more reliable form of cache that's usable by client.Client.

Is there a reason we can't just consult the Admin API to inform the controller instead of reading from .Status?

@birdayz (Contributor, Author) commented on Nov 18, 2024

status.currentReplicas is a key element that steers the removal of pods; it is not the same as the number of existing brokers.

Regarding D):
I fear there are lots of corner cases we do not test yet; testing against "bad/stale" currentReplicas is a bigger undertaking. This is just one example of things that can go wrong. We're seeing it now because, as part of NodePools testing, we're creating tons of clusters.

E): This will require some more refactoring, so the resource version created by Update/Patch can be returned "somewhere" where it gets used.

Basically, I would like to know whether you think a single uncached Get call causes a performance hit bad enough to justify the extra work, given this is the deprecated operator.
Having tested with this commit for the last ~40h, I can at least say it's working, and I need a fix by the end of the week (this is an unexpected problem that came up).

Requiring read-after-write semantics for a local change is not unreasonable, eventually consistent system or not. It's indeed not optimal that currentReplicas steers how many brokers we want to have, but it's a core fact of v1, whether we like it or not.

@andrewstucki (Contributor)

Hey @birdayz, before any sort of merge I think I'd really like to fully understand the problem you're trying to solve.

I understand that stale cache reads are a thing, but specifically I'm trying to understand why they are problematic in your case, because as @chrisseto mentioned, controllers, by definition are meant to be eventually consistent.

I've been debugging a rare condition - in about 1-2% of cloud certification runs (need about 50-100 clusters, with 6 decommissions each) where we decom brokers and move them over to another NodePool, where the cluster ends up with status.decommissionBrokerID being set, but that broker has been decommissioned already.

So, I'm trying to figure out what the exact problem is here. Is decommissionBrokerID being set a problem due to something like:

  1. Reconciliation not being re-triggered if a broker pod dies and so this will never get updated since the reconciler wasn't being called? (i.e. add a Watch on the pod)
  2. We don't properly handle checking if the pod is still alive when we have a status.decommissionBrokerID field?

If either of those isn't the case and we're handling brokers going away during whatever operation decommissionBrokerID represents, is the problem that:

  1. our operations are non-deterministic (i.e. choosing something like said broker id and running an operation on it just chooses some pod in the set at random)?
  2. the operations are non-idempotent (i.e. we return an error when we try and set a decommissioning state because things are already decommissioned and we don't handle the error)?

If any of the four points above is true, it tells me that those are the real problem, not the stale cache read, and we should handle them accordingly. That said, I could also be missing context, as I'm very unfamiliar with the v1 code at this point.

@birdayz (Contributor, Author) commented on Nov 19, 2024

Thanks, Andrew, for taking a closer look.

Reconciliation not being re-triggered if a broker pod dies and so this will never get updated since the reconciler wasn't being called? (i.e. add a Watch on the pod)

No. This is not about reconciliation not getting triggered.

We don't properly handle checking if the pod is still alive when we have a status.decommissionBrokerID field?

Kind of. It's not about the pod; it's about the RP broker. The pod exists, but the RP broker does not, as it was already removed. We're incorrectly setting decommissionBrokerID AGAIN - a second time - for a broker that was just decommissioned.

Example:

  1. Set decomBrokerID(10).
  2. The decommission finishes some time later.
  3. The next reconciliation sometimes sets decomBrokerID(10) again, because of the stale read of currentReplicas.

our operations are non-deterministic (i.e. choosing something like said broker id and running an operation on it just chooses some pod in the set at random)?

They are deterministic.

the operations are non-idempotent (i.e. we return an error when we try and set a decommissioning state because things are already decommissioned and we don't handle the error)?

Yes, it's not idempotent. Seeing currentReplicas=2, reconciling, then seeing currentReplicas=3 again, and then currentReplicas=2 again does not result in the same outcome, because it is not aligned with Redpanda's view of the world. Please have a look at the patches below; they try to make it more idempotent.

To move this forward quickly, I've sent PRs with two other options.

Option C:
#315
It would force a Conflict error from the k8s API, as it would update the entire status, not just the decomBrokerID field. Since it would use the version from the stale Get, it would fail on Update().

Option D:
#317 Forces a requeue-on-error before even attempting to set the broker ID, if we cannot find that broker ID in Redpanda itself. If we cannot or do not want to enforce read-after-write semantics, this is my favorite. It makes sure we check "both ends" - the Kubernetes world and the Redpanda world - before making the change the code wants to make (downscale again, which is the wrong choice based on stale data). Redpanda's data is always fresh, as admin API calls are uncached, so we're safe. But it only works because the admin client is uncached... so using fresh data does have merit.

I could see another variant that discards the brokerID from status if the broker is not in the Redpanda admin API's broker list here:

brokerPod, err := r.getDecommissioningPod(ctx, *brokerID)

but I've not created a PR for that.

Note: these are really patches for this specific symptom of the overall problem. It's quite possible that other code relies on not seeing an old currentReplicas "come back" after an update to that field. That's why I am interested in getting read-after-write semantics for currentReplicas: it is not feasible to retrofit a lot of tests onto operator v1 to ensure we're covering all cases and that stale reads of currentReplicas don't cause problems. This isn't even strict consistency, just seeing a change the operator made in the local process. That's much easier in the v2 operator, where there's a single status patch at the end of the loop; in v1 it's very ugly because there are status patches/updates all over the place.

@birdayz (Contributor, Author) commented on Nov 20, 2024

Closing in favor of #317

@birdayz closed this on Nov 20, 2024
@chrisseto deleted the jb/opv1-get-uncached-race branch on November 20, 2024 16:12