Improve performance of the FQL processing #215

Open
burmanm opened this issue Oct 22, 2021 · 10 comments
Labels
assess (Issues in the state 'assess') · enhancement (New feature or request) · zh:Assess/Investigate

Comments

@burmanm
Contributor

burmanm commented Oct 22, 2021

What is missing?
FQL processing is going to be very slow for larger clusters. At the moment it is done by querying each pod on every reconcile, regardless of whether the user has enabled FQL at all. The queries are also made serially, so the process can take a while on larger clusters.

We should track whether a pod has had FQL enabled so that there is no need to poll it. We know when a pod changes state, so it should not be an issue to cache this information. That would let us decide whether there is any need to touch the FQL settings and poll the nodes.
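
A minimal sketch of such a cache, assuming a simple in-memory map keyed by pod UID inside the reconciler; the names fqlEnabledByPod, needsFQLPoll and invalidateFQLCache are hypothetical placeholders, not existing cass-operator code:

import (
  corev1 "k8s.io/api/core/v1"
  "k8s.io/apimachinery/pkg/types"
)

// fqlEnabledByPod caches the last observed FQL state per pod (keyed by pod UID)
// so the reconcile loop only polls pods whose state is unknown or has changed.
// Hypothetical placeholder, not existing cass-operator code.
var fqlEnabledByPod = map[types.UID]bool{}

// needsFQLPoll reports whether a pod still needs to be queried via the management-api.
func needsFQLPoll(pod *corev1.Pod, desiredFQL bool) bool {
  observed, known := fqlEnabledByPod[pod.UID]
  return !known || observed != desiredFQL
}

// invalidateFQLCache drops the cached entry whenever a pod changes state
// (restart, delete), since the running process may have lost the setting.
func invalidateFQLCache(pod *corev1.Pod) {
  delete(fqlEnabledByPod, pod.UID)
}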

Also, this processing could probably be done in parallel using goroutines, as there are no dependencies between pods.

Additional stuff
I understand we don't have any way of measuring the impact on larger clusters. There are no timers in cass-operator to trace the amount of time spent in each section of the reconcile (perhaps something for another ticket?), and we can't even test in envtest how long a large reconcile would take, since we would need to emulate the management-api somehow.

┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: CASS-54

@burmanm added the enhancement (New feature or request) label Oct 22, 2021
@Miles-Garnsey
Member

Miles-Garnsey commented Oct 28, 2021

Also, this processing could probably be done in parallel using goroutines, as there are no dependencies between pods.

I strongly agree on this part and want to work on it myself if possible. I'd like to broaden the scope to do most HTTP calls to pods in parallel in the HTTP client. I think we could get some good efficiency gains in larger clusters here.

We should track whether a pod has had FQL enabled so that there is no need to poll it.

I don't think I agree with this part. As I've mentioned, a user could change FQL via nodetool (which is how most sources found via a Google search advise doing it, for example). In that case the manifest no longer represents the state of the cluster.

I know that there is an argument "if users do crazy things then systems will break", but I think that given most instructions for enabling FQL talk about using nodetool, this doesn't quite fall into the 'crazy things' bucket.

@burmanm
Contributor Author

burmanm commented Oct 28, 2021

I know that there is an argument "if users do crazy things then systems will break", but I think that given most instructions for enabling FQL talk about using nodetool, this doesn't quite fall into the 'crazy things' bucket.

Most instructions for any given process differ from how things are done with cass-operator, so I don't think that is a good argument here. Also, we do not guard against any other nodetool or hand-made modifications either, so this would be very different from our other processes. One can, for example, go to the pod, modify the configuration to remove the FQL setting from cassandra.yaml, and restart Cassandra. I'm not sure how the FQL code behaves in the reconcile process after that, if the FQL settings have been removed from Cassandra, as the reconciler does not enforce those settings (the CassandraDatacenter CRD will show a different state than what the running pod shows).

Or drain the node, remove it from the datacenter, and who knows what else. If a user wants to break something, it's difficult to prevent that.

@Miles-Garnsey
Member

Miles-Garnsey commented Nov 5, 2021

I'm looking at how to execute item (1) here - parallelising the HTTP calls throughout the HTTP client.

There is no rational way to write widely-applicable HTTP parallelisation primitives given the lack of generics. I've asked around online about this but the consensus is "wait for generics". We'd ideally like something that does:

func ParMapVoid(fn func(element interface{}) error, onArray []interface{}) []error {...}

But this just isn't possible, as we can't convert a slice of a concrete element type into a slice of interface{}. This appears to be a real weakness in Go right now.
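
To illustrate the limitation with a concrete (assumed) element type:

// This does not compile in Go: a []*corev1.Pod cannot be used where a
// []interface{} is expected, because the two slice types have different layouts.
// var elems []interface{} = pods

// The only pre-generics workaround is an element-by-element copy,
// which has to be repeated for every concrete type:
pods := []*corev1.Pod{ /* ... */ }
elems := make([]interface{}, len(pods))
for i, p := range pods {
  elems[i] = p
}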

The alternative I'm thinking of is writing something to do code generation. This does require minor modifications to add markers to the methods in the httphelper package, but it may be the only solution. However, I don't know whether the code generation frameworks are user-friendly enough to make this reasonable.

@burmanm
Contributor Author

burmanm commented Nov 5, 2021

I don't see why you would need such a structure. In the case of the FQL reconcile, one would simply modify the existing code to something like:

wg := sync.WaitGroup{}
for _, podPtr := range PodPtrsFromPodList(podList) {
  wg.Add(1)
  podPtr := podPtr // capture the loop variable for the goroutine
  go func() {
    defer wg.Done()
    // Cut for space
  }()
}
wg.Wait()

Obviously, cancellation can be improved with a context if you wish (to surface result.Error sooner if necessary), but nothing else is really needed.

Why would you generate anything for this type of process or try to make another function?
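
For reference, a minimal sketch of the same idea using golang.org/x/sync/errgroup, which also gives context cancellation and returns the first error; pollFQLStatus is a hypothetical stand-in for the per-pod management-api call, not an existing httphelper function:

import (
  "context"

  "golang.org/x/sync/errgroup"
  corev1 "k8s.io/api/core/v1"
)

// pollPodsInParallel fans the per-pod call out into one goroutine per pod.
func pollPodsInParallel(ctx context.Context, pods []*corev1.Pod,
  pollFQLStatus func(ctx context.Context, pod *corev1.Pod) error) error {
  g, ctx := errgroup.WithContext(ctx)
  for _, pod := range pods {
    pod := pod // capture the loop variable
    g.Go(func() error {
      return pollFQLStatus(ctx, pod)
    })
  }
  // Wait returns the first non-nil error; the shared context is cancelled
  // as soon as any goroutine fails.
  return g.Wait()
}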

@sync-by-unito bot changed the title from "Improve performance of the FQL processing" to "K8SSAND-1415 ⁃ Improve performance of the FQL processing" Apr 4, 2022
@jsanda
Contributor

jsanda commented Jun 15, 2022

There are no timers in cass-operator to trace the amount of time spent in each section of the reconcile (perhaps something for another ticket?)

I created #349. I don't think working on this should be blocked on #349, assuming it would be adequate to just add logging (even if temporary) to report execution times.

I suggest creating a 15- or even 30-node C* cluster in EKS or GKE, spread across 3 zones, to get some baseline numbers.
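
As an illustration of the temporary-logging idea (not a definitive implementation), a section could be timed like this; the logger and the exact call shape of CheckFullQueryLogging are assumed, not copied from cass-operator:

// Temporary instrumentation around one reconcile section, assuming a
// logr-style logger as used with controller-runtime.
start := time.Now()
recResult := rc.CheckFullQueryLogging() // call shape is illustrative
logger.Info("reconcile section finished",
  "section", "CheckFullQueryLogging",
  "durationMs", time.Since(start).Milliseconds())
// ...handle recResult as the existing code already does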

@jsanda
Contributor

jsanda commented Jun 15, 2022

@Miles-Garnsey @burmanm please provide a poker planning estimate.

@Miles-Garnsey
Member

Considering we don't have any way of measuring the value of this ticket (given we have no performance tests), I'd say this is blocked pending the development of a performance testing approach.

As for the implementation itself: I would rescope this to a broader investigation into how we can parallelise all cluster-wide operations in the HTTP client. I've previously suggested that doing this via generics might be feasible, but we'd need an update to Go 1.18.
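
For illustration, a minimal sketch of what such a helper could look like with Go 1.18 type parameters; ParMapVoid is the hypothetical name from the earlier comment, not an existing function:

import "sync"

// ParMapVoid applies fn to every element concurrently and returns the errors
// in the same order as the input slice (nil entries where fn succeeded).
func ParMapVoid[T any](fn func(element T) error, onSlice []T) []error {
  var wg sync.WaitGroup
  errs := make([]error, len(onSlice))
  for i, element := range onSlice {
    wg.Add(1)
    i, element := i, element // capture loop variables
    go func() {
      defer wg.Done()
      errs[i] = fn(element)
    }()
  }
  wg.Wait()
  return errs
}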

@jsanda
Contributor

jsanda commented Jun 16, 2022

I disagree that this is blocked due to lack of performance testing. From a quick static analysis we can see that the CheckFullQueryLogging method is an O(N) operation. The execution time is going to increase linearly with cluster size. I think we are all in agreement that this is an operation that could safely be parallelized.

It is worth noting that there are other O(N) operations that the operator performs, like starting nodes when initializing the cluster, which cannot safely be parallelized.

Also keep in mind that we only use a single goroutine for performing reconciliation. If CheckFullQueryLogging takes a long time for a particular CassandraDatacenter, then the operator is effectively blocked with respect to reconciling any other CassandraDatacenters.

As for the implementation, I am in agreement with @burmanm. I would be inclined to simply run the code in goroutines as he was suggesting.
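
On the single-reconcile-goroutine point, controller-runtime can run reconciles for different objects concurrently via MaxConcurrentReconciles; a minimal sketch, assuming the usual builder-based controller setup (the api import path and reconciler name are assumptions, not copied from cass-operator):

import (
  ctrl "sigs.k8s.io/controller-runtime"
  "sigs.k8s.io/controller-runtime/pkg/controller"

  api "github.com/k8ssandra/cass-operator/apis/cassandra/v1beta1" // assumed path
)

func setupController(mgr ctrl.Manager, r *CassandraDatacenterReconciler) error {
  return ctrl.NewControllerManagedBy(mgr).
    For(&api.CassandraDatacenter{}).
    WithOptions(controller.Options{
      // Allow several CassandraDatacenters to be reconciled in parallel,
      // so one slow reconcile does not block the others.
      MaxConcurrentReconciles: 4,
    }).
    Complete(r)
}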

@Miles-Garnsey
Member

Miles-Garnsey commented Jun 16, 2022

I think we are all in agreement that this is an operation that could safely be parallelised.

Could be, but the question is how we prioritise this w.r.t. implementation effort vs. benefits. Without knowing the benefits (even if they were measured only after implementation), it is tough to know where to prioritise it relative to other work, or to understand the value delivered.

If we were going to parallelise, I'd recommend doing this in a generic way so that we aren't delivering a point solution, since this isn't a localised problem.

Also keep in mind that we only use a single goroutine for performing reconciliation. If CheckFullQueryLogging take a long time for a particular CassandraDatacenter, then the operator is effectively blocked with respect to reconciling any other CassandraDatacenters.

Yep, exactly. Our non-use of concurrency isn't a localised problem... It would be good to get a design doc together on how to remediate it.

@jsanda
Contributor

jsanda commented Jun 23, 2022

Hey team! Please add your planning poker estimate with ZenHub @burmanm @Miles-Garnsey

@adejanovski moved this to Assess/Investigate in K8ssandra Nov 8, 2022
@adejanovski added the assess (Issues in the state 'assess') label Sep 3, 2024
@sync-by-unito bot changed the title from "K8SSAND-1415 ⁃ Improve performance of the FQL processing" to "Improve performance of the FQL processing" Oct 11, 2024