
Feat/retries on conflict error #74

Draft · wants to merge 3 commits into main
Conversation

@samuel-esp (Contributor) commented Dec 14, 2024

Motivation

Give the user the ability to choose how to handle HTTP 409 conflict errors. Such conflicts typically occur when another entity (such as an HPA, a CI/CD pipeline, or manual intervention) modifies a resource just before KubeDownscaler processes it.

See #68 or caas-team/py-kube-downscaler#111

Changes

  • Introduced a --max-retries-on-conflict argument, like in Py-Kube-Downscaler (see the flag sketch after this list)
  • Introduced a GetWorkload() function for the case where the downscaler needs to retrieve a single Kubernetes resource (previously it was only possible to get a list of resources, e.g. kubectl get deploy -n default). The old GetWorkload() was renamed to GetWorkloads() to reflect the change
  • Introduced a new GetResourceType() function that returns the resource type as a string
  • Refactored the main loop to be able to use --max-retries-on-conflict
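
A minimal sketch of how the new argument could be wired up with Go's standard flag package; the actual definition in cmd/kubedownscaler/main.go and its default value may differ from this:

package main

import (
    "flag"
    "log/slog"
)

func main() {
    // The flag name is the one introduced in this PR; a default of 0 retries is an assumption.
    maxRetriesOnConflict := flag.Int(
        "max-retries-on-conflict",
        0,
        "maximum number of retries when a workload update fails with an HTTP 409 conflict",
    )
    flag.Parse()
    slog.Info("parsed arguments", "maxRetriesOnConflict", *maxRetriesOnConflict)
}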

Tests done

  • Unit Tests

TODO

  • I've assigned myself to this PR
  • Refactored docs
  • Added more unit tests on this specific use case

@jonathan-mayer added the "enhancement" (New feature or request) label on Jan 7, 2025
@jonathan-mayer linked an issue on Jan 7, 2025 that may be closed by this pull request
@jonathan-mayer (Member) left a comment

Also, just so you know, the workflows were broken for forks (that's why they were failing). We've fixed it, but it will now run every workflow twice in this and the other PR. If you want, you can rebase the branches onto main and the errors will go away.

cmd/kubedownscaler/main.go (resolved)
for {
    err := scanWorkload(workload, client, ctx, layerCli, layerEnv)
    if err != nil {
        if strings.Contains(err.Error(), "the object has been modified") {
@jonathan-mayer (Member):

I think we should avoid checking the error string here, since it can change at any time. It would be better to see if we can get the underlying error type and check whether the error is an instance of it.
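
For reference, the kind of typed check meant here would look roughly like this (using the errors helpers from k8s.io/apimachinery); whether the original client-go error survives the wrapping in scanWorkload is exactly what the follow-up comment looks into:

import apierrors "k8s.io/apimachinery/pkg/api/errors"

// apierrors.IsConflict reports whether err represents a Kubernetes API HTTP 409 conflict.
if apierrors.IsConflict(err) {
    // retry the scan
}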

@jonathan-mayer (Member):

OK, I've looked, and there is no higher-level underlying error type we could use for this. What I did find is where the message is coming from, which means we can reference that constant instead of a raw string. With that said, I would change this to:

import "k8s.io/apiserver/pkg/registry/generic/registry"

Suggested change:
-  if strings.Contains(err.Error(), "the object has been modified") {
+  if strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
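
(For context: registry.OptimisticLockErrorMsg holds the API server's canonical conflict message, "the object has been modified; please apply your changes to the latest version and try again", so referencing the constant keeps this check in sync if that wording ever changes.)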

if err != nil {
slog.Error("failed to scan workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
return
attempts := 0
@jonathan-mayer (Member):

Suggested change:
-  attempts := 0
+  var attempts int

slog.Error("failed to scan workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
return
attempts := 0
for {
@jonathan-mayer (Member):

I would change this to a for i loop, since we can break out either way and then don't have to handle incrementing or declaring the attempts variable.
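
A rough sketch of the loop shape this suggestion describes; scanWorkload, client, workload and the other identifiers are the ones from this PR, but the exact GetWorkload refresh call and its signature are assumptions:

for i := 0; ; i++ {
    err := scanWorkload(workload, client, ctx, layerCli, layerEnv)
    if err == nil {
        return
    }
    if !strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
        slog.Error("failed to scan workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
        return
    }
    if i >= maxRetriesOnConflict {
        slog.Error("max retries reached, will try again in the next scan", "workload", workload.GetName(), "namespace", workload.GetNamespace())
        return
    }
    // refresh the workload so the next attempt works on the latest resourceVersion
    updatedWorkload, err := client.GetWorkload(workload.GetName(), workload.GetNamespace(), ctx)
    if err != nil {
        slog.Error("failed to refresh workload", "error", err, "workload", workload.GetName(), "namespace", workload.GetNamespace())
        return
    }
    workload = updatedWorkload
}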

if strings.Contains(err.Error(), "the object has been modified") {
    if attempts >= maxRetriesOnConflict {
        if maxRetriesOnConflict > 0 {
            slog.Error("max retries reached, will try again in the next cycle", "workload", workload.GetName(), "namespace", workload.GetNamespace())
@jonathan-mayer (Member):

Suggested change:
-  slog.Error("max retries reached, will try again in the next cycle", "workload", workload.GetName(), "namespace", workload.GetNamespace())
+  slog.Error("max retries reached, will try again in the next scan", "workload", workload.GetName(), "namespace", workload.GetNamespace())

        return
    }
    workload = updatedWorkload
} else {
@jonathan-mayer (Member) commented Jan 7, 2025:

Use guard clauses where applicable.
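
A generic illustration of the guard-clause style being asked for here, reusing identifiers from the snippets above (not the actual code from this PR): handle the non-retryable case first and return, so the retry path is not nested in an else branch.

// instead of:
//     if strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
//         // ...conflict/retry handling...
//     } else {
//         slog.Error("failed to scan workload", "error", err)
//         return
//     }
// invert the condition and return early:
if !strings.Contains(err.Error(), registry.OptimisticLockErrorMsg) {
    slog.Error("failed to scan workload", "error", err)
    return
}
// conflict/retry handling continues here, one nesting level shallower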

    }
    result = &suspendScaledWorkload{&cronJob{cronjob}}
    return result, nil
}
@jonathan-mayer (Member):

I really think we shouldn't have a whole other function just to get a single instance of the resource.

Suggested change:

}

// getCronJobs is the getResourceFunc for CronJobs
func getCronJobs(name, namespace string, clientsets *Clientsets, ctx context.Context) ([]Workload, error) {
    var results []Workload
    if name != "" {
        cronjob, err := clientsets.Kubernetes.BatchV1().CronJobs(namespace).Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return nil, fmt.Errorf("failed to get cronjob: %w", err)
        }
        results = append(results, &suspendScaledWorkload{&cronJob{cronjob}})
        return results, nil
    }
    cronjobs, err := clientsets.Kubernetes.BatchV1().CronJobs(namespace).List(ctx, metav1.ListOptions{TimeoutSeconds: &timeout})
    if err != nil {
        return nil, fmt.Errorf("failed to get cronjobs: %w", err)
    }
    for _, item := range cronjobs.Items {
        results = append(results, &suspendScaledWorkload{&cronJob{&item}})
    }
    return results, nil
}

I think we should keep client.GetWorkload and GetWorkloads backed by single functions. This will also help abstract this configuration away.
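
One possible shape of what is meant here (the map getResourceFuncs, the client receiver and its clientsets field are hypothetical names): both client methods delegate to the same per-resource getter, and GetWorkload just passes a non-empty name and unwraps the single result.

// GetWorkload retrieves a single workload by name, reusing the same
// getResourceFunc signature shown in the getCronJobs suggestion above.
func (c client) GetWorkload(name, namespace, resourceType string, ctx context.Context) (Workload, error) {
    getResource, ok := getResourceFuncs[resourceType] // hypothetical registry of getResourceFuncs
    if !ok {
        return nil, fmt.Errorf("unsupported resource type: %s", resourceType)
    }
    workloads, err := getResource(name, namespace, c.clientsets, ctx)
    if err != nil {
        return nil, err
    }
    if len(workloads) != 1 {
        return nil, fmt.Errorf("expected exactly one %q in namespace %q, got %d", name, namespace, len(workloads))
    }
    return workloads[0], nil
}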

Labels: enhancement (New feature or request)
Development: Successfully merging this pull request may close these issues: Allow for synchronous operation
2 participants