
Feat/retries on conflict error #74

Draft · wants to merge 3 commits into main
Conversation

@samuel-esp (Contributor) commented Dec 14, 2024

Motivation

Give the user the ability to choose how to handle HTTP 409 conflict errors. Such conflicts typically occur when another entity (such as an HPA, a CI/CD pipeline, or manual intervention) modifies a resource just before KubeDownscaler processes it.

See #68 or caas-team/py-kube-downscaler#111

Changes

  • Introduced a --max-retries-on-conflict argument, like in Py-Kube-Downscaler (see the flag sketch after this list)
  • Introduced a GetWorkload() function for the case where the downscaler needs to retrieve a single Kubernetes resource (previously it was only possible to get a list of resources, e.g. kubectl get deploy -n default). The old GetWorkload() was renamed to GetWorkloads() to reflect the change
  • Introduced a new GetResourceType() function that returns the resource type as a string
  • Refactored the main loop to be able to use --max-retries-on-conflict
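
A minimal sketch of how the new argument could be wired up with Go's standard flag package; the actual definition in cmd/kubedownscaler/main.go and its default value may differ from this:

package main

import (
    "flag"
    "log/slog"
)

func main() {
    // The flag name is the one introduced in this PR; a default of 0 retries is an assumption.
    maxRetriesOnConflict := flag.Int(
        "max-retries-on-conflict",
        0,
        "maximum number of retries when a workload update fails with an HTTP 409 conflict",
    )
    flag.Parse()
    slog.Info("parsed arguments", "maxRetriesOnConflict", *maxRetriesOnConflict)
}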

Tests done

  • Unit Tests

TODO

  • I've assigned myself to this PR
  • Refactored docs
  • Added more unit tests on this specific use case

@jonathan-mayer added the "enhancement" (New feature or request) label on Jan 7, 2025
@jonathan-mayer linked an issue on Jan 7, 2025 that may be closed by this pull request
@jonathan-mayer (Member) left a comment

Also, just so you know, the workflows were broken for forks (that's why they were failing). We've fixed it, but it will now run every workflow twice in this and the other PR. If you want, you can rebase the branches onto main and the errors will go away.

cmd/kubedownscaler/main.go (resolved)
for {
    err := scanWorkload(workload, client, ctx, layerCli, layerEnv)
    if err != nil {
        if strings.Contains(err.Error(), "the object has been modified") {
@jonathan-mayer (Member):

I think we should avoid checking the error string here, since it can change at any time. It would be better to see if we can get the underlying error type and check whether the error is an instance of it.
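
For reference, the kind of typed check meant here would look roughly like this (using the errors helpers from k8s.io/apimachinery); whether the original client-go error survives the wrapping in scanWorkload is exactly what the follow-up comment looks into:

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// apierrors.IsConflict reports whether err represents a Kubernetes API HTTP 409 conflict.
if apierrors.IsConflict(err) {
    // retry the scan
}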

@jonathan-mayer (Member):

OK, I've looked, and there is no higher-level underlying error type we could use for this. What I did find is where the message is coming from, which means we can reference that constant instead of a raw string. With that said, I would change this to:

import "k8s.io/apiserver/pkg/registry/generic/registry"

Suggested change:
-  if strings.Contains(err.Error(), "the object has been modified") {
+  if strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
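
(For context: registry.OptimisticLockErrorMsg holds the API server's canonical conflict message, "the object has been modified; please apply your changes to the latest version and try again", so referencing the constant keeps this check in sync if that wording ever changes.)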

if err != nil {
slog.Error("failed to scan workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
return
attempts := 0
@jonathan-mayer (Member):

Suggested change:
-  attempts := 0
+  var attempts int

slog.Error("failed to scan workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
return
attempts := 0
for {
@jonathan-mayer (Member):

I would change this to a for i loop, since we can break out either way and then don't have to handle incrementing or declaring the attempts variable.
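
A rough sketch of the loop shape this suggestion describes; scanWorkload, client, workload and the other identifiers are the ones from this PR, but the exact GetWorkload refresh call and its signature are assumptions:

for i := 0; ; i++ {
    err := scanWorkload(workload, client, ctx, layerCli, layerEnv)
    if err == nil {
        return
    }
    if !strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
        slog.Error("failed to scan workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
        return
    }
    if i >= maxRetriesOnConflict {
        slog.Error("max retries reached, will try again in the next scan", "workload", workload.GetName(), "namespace", workload.GetNamespace())
        return
    }
    // refresh the workload so the next attempt works on the latest resourceVersion
    updatedWorkload, err := client.GetWorkload(workload.GetName(), workload.GetNamespace(), ctx)
    if err != nil {
        slog.Error("failed to refresh workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
        return
    }
    workload = updatedWorkload
}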

if strings.Contains(err.Error(), "the object has been modified") {
    if attempts >= maxRetriesOnConflict {
        if maxRetriesOnConflict > 0 {
            slog.Error("max retries reached, will try again in the next cycle", "workload", workload.GetName(), "namespace", workload.GetNamespace())
@jonathan-mayer (Member):

Suggested change:
-  slog.Error("max retries reached, will try again in the next cycle", "workload", workload.GetName(), "namespace", workload.GetNamespace())
+  slog.Error("max retries reached, will try again in the next scan", "workload", workload.GetName(), "namespace", workload.GetNamespace())

        return
    }
    workload = updatedWorkload
} else {
@jonathan-mayer (Member) commented Jan 7, 2025:

Use guard clauses where applicable.
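
A generic illustration of the guard-clause style being asked for here, reusing identifiers from the snippets above (not the actual code from this PR): handle the non-retryable case first and return, so the retry path is not nested in an else branch.

// instead of:
//     if strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
//         // ...conflict/retry handling...
//     } else {
//         slog.Error("failed to scan workload", "error", err)
//         return
//     }
// invert the condition and return early:
if !strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
    slog.Error("failed to scan workload", "error", err)
    return
}
// conflict/retry handling continues here, one nesting level shallower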

    }
    result = &suspendScaledWorkload{&cronJob{cronjob}}
    return result, nil
}
@jonathan-mayer (Member):

I really think we shouldn't have a whole other function just to get a single instance of the resource.

Suggested change:

}

// getCronJobs is the getResourceFunc for CronJobs
func getCronJobs(name, namespace string, clientsets *Clientsets, ctx context.Context) ([]Workload, error) {
    var results []Workload
    if name != "" {
        cronjob, err := clientsets.Kubernetes.BatchV1().CronJobs(namespace).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return nil, fmt.Errorf("failed to get cronjob: %w", err)
        }
        results = append(results, &suspendScaledWorkload{&cronJob{cronjob}})
        return results, nil
    }
    cronjobs, err := clientsets.Kubernetes.BatchV1().CronJobs(namespace).List(ctx, metav1.ListOptions{TimeoutSeconds: &timeout})
    if err != nil {
        return nil, fmt.Errorf("failed to get cronjobs: %w", err)
    }
    for _, item := range cronjobs.Items {
        results = append(results, &suspendScaledWorkload{&cronJob{&item}})
    }
    return results, nil
}

I think we should keep client.GetWorkload and GetWorkloads backed by single functions. This will also help abstract this configuration away.
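
One possible shape of what is meant here (the map getResourceFuncs, the client receiver and its clientsets field are hypothetical names): both client methods delegate to the same per-resource getter, and GetWorkload just passes a non-empty name and unwraps the single result.

// GetWorkload retrieves a single workload by name, reusing the same
// getResourceFunc signature shown in the getCronJobs suggestion above.
func (c client) GetWorkload(name, namespace, resourceType string, ctx context.Context) (Workload, error) {
    getResource, ok := getResourceFuncs[resourceType] // hypothetical registry of getResourceFuncs
    if !ok {
        return nil, fmt.Errorf("unsupported resource type: %s", resourceType)
    }
    workloads, err := getResource(name, namespace, c.clientsets, ctx)
    if err != nil {
        return nil, err
    }
    if len(workloads) != 1 {
        return nil, fmt.Errorf("expected exactly one %q in namespace %q, got %d", name, namespace, len(workloads))
    }
    return workloads[0], nil
}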

Labels: enhancement (New feature or request)
Development: Successfully merging this pull request may close these issues: Allow for synchronous operation
2 participants