Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for synchronous operation #68

Open
jonathan-mayer opened this issue Nov 5, 2024 · 5 comments · May be fixed by #75 or #74
Open

Allow for synchronous operation #68

jonathan-mayer opened this issue Nov 5, 2024 · 5 comments · May be fixed by #75 or #74
Assignees
Labels
enhancement New feature or request

Comments

@jonathan-mayer
Copy link
Member

jonathan-mayer commented Nov 5, 2024

Issue

Currently the downscaler doesn't support multiple replicas/running multiple instances on the same workloads.

Problem to solve

If a workload is modified by another instance after the instance already got the workload from Kubernetes the downscaler will throw an error because the resourceVersion changed.
The same errors occur if the workload was modified manually or by something else during scaling.

Further details

Proposal

I don't think it is possible to handle both multiple instances and manual interference perfectly, at least not without erroring or logging sometimes or in some kind of way.
Because of this we should really change the error handling for this error to give more information to the user (possibly link to documentation on the topic).
The error could also be changed to a info or its logging could be turned off by a cli argument as it isn't broken and will just retry in the next scan cycle.

If we want the downscaler to allow for multiple instances we need to implement some way to make this error less likely/not occur with multiple instances of the downscaler.
One way to implement this would be using Kubernetes' leader election using leases. This would let only one downscaler would scale at a time. This would be redundant and stop the error from occurring because of multiple downscaler instances entirely.

Combining both of these would result in the error only being thrown on outside/manual intervention making it much less likely to occur.
Additionally when it occurs the user doesn't need to search for help with this intermittent error because it explains exactly what happend.

Who can address the issue

Go dev with some Kubernetes know-how

Other links/references

./docs/troubleshooting.md#synchronous-operation
https://github.com/kubernetes/client-go/blob/master/examples/leader-election/main.go#L112
https://kubernetes.io/docs/concepts/architecture/leases/#custom-workload

@jonathan-mayer jonathan-mayer added enhancement New feature or request low priority A issue which can wait and doesn't need to be resolved quickly and removed low priority A issue which can wait and doesn't need to be resolved quickly labels Nov 5, 2024
@jonathan-mayer jonathan-mayer changed the title Allow for synchronous operation/ Allow for synchronous operation Nov 5, 2024
@jonathan-mayer
Copy link
Member Author

jonathan-mayer commented Nov 5, 2024

There should be an option to run in "standalone mode". Or the non-"standalone mode" should be only activated when an option is passed. The helm chart should only give the needed permissions for the Kubernetes lease object in "multi replica mode".

Edit: as an uniqe id would have to be passed in for leader election, this argument could be used for "activating" the mode

@samuel-esp
Copy link
Contributor

samuel-esp commented Nov 14, 2024

I've just found a similar problem on Py-Kube-Downscaler. Basically the problem you are describing could happen not only if you are using more than 1 replicas, but also when the downscaler iteration is too long (other entities inside the cluster could modify objects, like a HPA). I still need to check if the code we have here could have the same problem, I think the concurrency already implemented helps in mitigating this

caas-team/py-kube-downscaler#111

@samuel-esp
Copy link
Contributor

The problem could be present since the resources is retrieved before being updated

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L113

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L128

To solve this problem we should catch the error, fetch the resource once again and then update the refreshed resource

@jonathan-mayer
Copy link
Member Author

jonathan-mayer commented Nov 15, 2024

The problem could be present since the resources is retrieved before being updated

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L113

https://github.com/caas-team/GoKubeDownscaler/blob/main/cmd/kubedownscaler/main.go#L128

To solve this problem we should catch the error, fetch the resource once again and then update the refreshed resource

yes, this is exactly what causes the problem.
The only thing i think is that it isn't worth refetching the resource until the downscaler was able to do it without it being modified in-between when the downscaler runs every 30 seconds anyways.
I think as long as we can stop the error from occurring because of multiple replicas, and the error gets logged better than it is now it should be uncommon enough to not be an issue.

If we want the error to happen even less we could also just refetch the resource before it is being modified in the scan, which would reduce the chance of this happening (although like you said the concurrency already reduces the time from getting the resource to it being modified and this would probably 10x the amount of api calls).

@samuel-esp
Copy link
Contributor

In case of multi replicas, this would be the best approach:

One way to implement this would be using Kubernetes' leader election using leases. This would let only one downscaler would scale at a time. This would be redundant and stop the error from occurring because of multiple downscaler instances entirely.

In case of long iterations:

We could let the user choose how to mitigate this behavior. For example introducing an argument like --retry-on-conflict (int) where the user submits the numbers of retries to perform (0 could be the default). In this way the user can still have control on the amount of api calls. I'll try implementing something like this in Py-Kube-Downscaler first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants