Improve performance of the FQL processing #215
I strongly agree on this part and want to work on it myself if possible. I'd like to broaden the scope to do most HTTP calls to pods in parallel in the HTTP client. I think we could get some good efficiency gains in larger clusters here.
I don't think I agree with this part. As I've mentioned, a user could change FQL via nodetool (which is how most sources on Google advise doing it, for example). In that case the manifest no longer represents the state of the cluster. I know there is an argument that "if users do crazy things then systems will break", but given that most instructions for enabling FQL talk about using nodetool, I don't think this quite falls into the 'crazy things' bucket.
Most instructions for any given process differ from how the same thing is done with cass-operator, so I don't think this is a good argument in that sense. Also, we do not guard against any other nodetool or hand modifications either, so this would be very different from other processes. One can, for example, go to the pod, modify the configuration to remove the FQL setting from cassandra.yaml, and restart Cassandra. I'm not sure how the FQL code behaves after that in the reconcile process if the FQL settings have been removed from Cassandra, since the reconciler does not enforce those settings (the CassandraDatacenter CRD will show a different state than what the running pod shows). Or drain the node, remove it from the datacenter, and who knows what else. If a user wants to break something, it's difficult to prevent that.
I'm looking at how to execute item (1) here: parallelising the HTTP calls throughout the HTTP client. There is no rational way to write widely-applicable HTTP parallelisation primitives given the lack of generics. I've asked around online about this, but the consensus is "wait for generics". We'd ideally like something that does:
But this just isn't possible, as we can't cast an array of … The alternative I'm thinking of is writing something to do code generation, although that requires minor modifications to add markers to the methods in pkg.
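For what it's worth, here is a minimal sketch of the kind of widely-applicable fan-out helper this would enable once generics land (Go 1.18 is mentioned further down). Everything here is illustrative: the package placement, the `fanOut` name, and the callback shape are my assumptions, not existing cass-operator code.

```go
// Sketch only: assumes Go 1.18+ generics; names and placement are illustrative.
package httphelper

import (
	"sync"

	corev1 "k8s.io/api/core/v1"
)

// fanOut runs fn against every pod concurrently and collects the results and
// errors. R is whatever result type the per-pod management-api call returns.
func fanOut[R any](pods []corev1.Pod, fn func(pod corev1.Pod) (R, error)) ([]R, []error) {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []R
		errs    []error
	)
	for _, pod := range pods {
		pod := pod // capture a per-iteration copy for the goroutine
		wg.Add(1)
		go func() {
			defer wg.Done()
			r, err := fn(pod)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				errs = append(errs, err)
				return
			}
			results = append(results, r)
		}()
	}
	wg.Wait()
	return results, errs
}
```

A caller could then pass any per-pod call into the helper without writing a bespoke concurrent wrapper for each endpoint.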
I don't see why you would need such a structure? In the case of the FQL reconcile, one would simply modify the existing code to:

```go
wg := sync.WaitGroup{}
for _, podPtr := range PodPtrsFromPodList(podList) {
	wg.Add(1)
	podPtr := podPtr
	go func() {
		defer wg.Done()
		// Cut for space
	}()
}
wg.Wait()
```

Obviously, improve cancellation with a context if you wish (to get result.Error sooner if necessary), but not really anything else is needed. Why would you generate anything for this type of process, or try to make another function?
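If early cancellation is wanted, a minimal sketch of that variant could use golang.org/x/sync/errgroup (my choice here, not something the existing code relies on); `callFQLEndpoint` is a hypothetical stand-in for the real per-pod management-api call, and the package placement is illustrative:

```go
// Sketch only: package placement and callFQLEndpoint are illustrative.
package reconciliation

import (
	"context"

	"golang.org/x/sync/errgroup"
	corev1 "k8s.io/api/core/v1"
)

// parallelFQLCalls fans the per-pod FQL call out across goroutines. errgroup
// cancels the shared context on the first error, so the remaining calls can
// return early instead of finishing the whole loop.
func parallelFQLCalls(ctx context.Context, pods []*corev1.Pod,
	callFQLEndpoint func(ctx context.Context, pod *corev1.Pod) error) error {

	g, gCtx := errgroup.WithContext(ctx)
	for _, pod := range pods {
		pod := pod // per-iteration copy for the goroutine
		g.Go(func() error {
			return callFQLEndpoint(gCtx, pod)
		})
	}
	// Wait returns the first non-nil error, i.e. the failure is surfaced as
	// soon as any pod fails rather than after a full serial pass.
	return g.Wait()
}
```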
I created #349. I don't think working on this should be blocked on #349, assuming it would be adequate to just add logging (even if temporary) to report execution times. I suggest creating a 15- or even 30-node C* cluster in EKS or GKE, spread across 3 zones, to get some baseline numbers.
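To make "just add logging to report execution times" concrete, here is a rough sketch of the kind of temporary instrumentation that could produce those baseline numbers; the helper name, placement, and the way it would be wired in are all assumptions on my part:

```go
// Sketch only: placement and names are illustrative, not existing code.
package reconciliation

import (
	"time"

	"github.com/go-logr/logr"
)

// timeSection runs one reconcile step and logs how long it took, so we can
// gather baseline numbers from a real 15-30 node cluster without needing a
// full performance-testing framework first.
func timeSection(logger logr.Logger, name string, step func() error) error {
	start := time.Now()
	err := step()
	logger.Info("reconcile section finished",
		"section", name,
		"durationMs", time.Since(start).Milliseconds(),
		"failed", err != nil)
	return err
}
```

The FQL step could then be wrapped as, e.g., `timeSection(logger, "fullQueryLogging", someFQLStep)`, where `someFQLStep` stands in for whatever function currently performs the FQL reconcile.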
@Miles-Garnsey @burmanm please provide a poker planning estimate.
Considering we don't have any way of measuring the value of this ticket (given we have no performance tests), I'd say this is blocked pending the development of a performance-testing approach. As for the implementation itself: I would rescope this to be a broader investigation into how we can parallelise all cluster-wide operations in the HTTP client. I've previously suggested that doing this via generics might be feasible, but we'd need an update to Go 1.18.
I disagree that this is blocked due to the lack of performance testing. From a quick static analysis we can see that the FQL check is an O(N) operation across the pods in the cluster. It is worth noting that there are other O(N) operations that the operator performs, like starting nodes when initializing the cluster, which cannot safely be parallelized. Also keep in mind that we only use a single goroutine for performing reconciliation, so a slow FQL check delays everything else in the reconcile loop. As for the implementation, I am in agreement with @burmanm. I would be inclined to simply wrap the code in a goroutine as he was suggesting.
Could be, but the question is how we prioritise this w.r.t. implementation effort vs benefits. Without knowing the benefits (even if that were after implementation) it is tough to know where to prioritise it relative to other work or understand the value delivered. If we were going to parallelise, I'd recommend doing this in a generic way so that we aren't delivering a point solution, since this isn't a localised problem.
Yep, exactly. Our non-use of concurrency isn't a localised problem... It would be good to get a design doc together on how to remediate. |
Hey team! Please add your planning poker estimate with ZenHub @burmanm @Miles-Garnsey |
What is missing?
FQL processing is going to be very slow for larger clusters. At the moment the processing is done by querying each pod on every reconcile, regardless of whether the user has enabled FQL or not. This is also done serially, so the process could take a while on larger clusters.
We should track whether a pod has had FQL enabled so there's no need to poll them. We know when a pod changes state, so it shouldn't be an issue to cache this information. That would let us see whether there's any need to touch the FQL settings and poll the nodes.
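One possible shape for that tracking, purely as a sketch: remember the last FQL state the operator applied on the pod itself and skip the per-pod poll when it already matches the desired state. The annotation key and helper below are hypothetical, not an existing cass-operator convention.

```go
// Sketch only: the annotation key and function are illustrative.
package reconciliation

import (
	corev1 "k8s.io/api/core/v1"
)

// fqlStateAnnotation is a hypothetical marker recording the last FQL state
// the operator applied to a pod.
const fqlStateAnnotation = "cassandra.datastax.com/fql-enabled"

// needsFQLUpdate reports whether we still need to hit the pod's management-api
// to reconcile full query logging, based on the cached annotation value.
func needsFQLUpdate(pod *corev1.Pod, wantEnabled bool) bool {
	want := "false"
	if wantEnabled {
		want = "true"
	}
	// Reading from a nil Annotations map safely returns "", which forces an
	// update the first time we see the pod.
	return pod.Annotations[fqlStateAnnotation] != want
}
```

The reconciler would then set the annotation whenever it applies an FQL change, so only pods whose recorded state differs from the CassandraDatacenter spec get polled.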
Also, this process could probably be done in parallel using goroutines, as there are no dependencies between the pods.
Additional stuff
I understand we don't have any way of measuring the impact on larger clusters. There are no timers in cass-operator to trace the amount of time spent on each section of the reconcile (perhaps something for another ticket?), and we can't even test in envtest how long a large reconcile would take, since we would need to emulate the management-api somehow.