Increase upgrade remediation to 5 retries
The `TestStableUIDAndGeneration` test is flaky in our CI, but the failure
does not reproduce in a local environment. The CI machine may be
overwhelmed by the number of processes that bootstrap the Kubernetes cluster.
This change gives Redpanda more time to become up and ready by increasing
the Flux upgrade remediation retries from 0 (the default) to 5.

In one of the nightly tests, cert-manager appears to take longer to create
the self-signed certificate than the Flux HelmRelease allows.

Reference
```
helmrepository_controller.go:700: "level"=0 "msg"="artifact up-to-date with remote revision: 'sha256:d5b03c5514669e04ecd4793df2b927bde4309ab8d088e4f8aa52d4e7a9ce2e94'" "controller"="helmrepository" "controllerGroup"="source.toolkit.fluxcd.io" "controllerKind"="HelmRepository" "HelmRepository"={"name"="redpanda-repository" "namespace"="testenv-2cqq8"} "namespace"="testenv-2cqq8" "name"="redpanda-repository" "reconcileID"="e47a07d7-34c9-4532-99cd-798b47d25948"
helmchart_template.go:151: "level"=0 "msg"="HelmChart/testenv-2cqq8/testenv-2cqq8-rp-iex3qh with SourceRef 'HelmRepository/testenv-2cqq8/redpanda-repository' is in-sync" "controller"="helmrelease" "controllerGroup"="helm.toolkit.fluxcd.io" "controllerKind"="HelmRelease" "HelmRelease"={"name"="rp-iex3qh" "namespace"="testenv-2cqq8"} "namespace"="testenv-2cqq8" "name"="rp-iex3qh" "reconcileID"="6480ffe6-2b63-4b0a-b0bd-e0f377670154"
controller.go:324: "msg"="Reconciler error" "error"="error fetching server root CA testenv-2cqq8/rp-iex3qh-default-root-certificate: server TLS certificate not found" "controller"="redpanda" "controllerGroup"="cluster.redpanda.com" "controllerKind"="Redpanda" "Redpanda"={"name"="rp-iex3qh" "namespace"="testenv-2cqq8"} "namespace"="testenv-2cqq8" "name"="rp-iex3qh" "reconcileID"="86702d6b-5859-484b-a1b0-9d847afda18e"
atomic_release.go:419: "level"=0 "msg"="release is in a failed state" "controller"="helmrelease" "controllerGroup"="helm.toolkit.fluxcd.io" "controllerKind"="HelmRelease" "HelmRelease"={"name"="rp-iex3qh" "namespace"="testenv-2cqq8"} "namespace"="testenv-2cqq8" "name"="rp-iex3qh" "reconcileID"="6480ffe6-2b63-4b0a-b0bd-e0f377670154"
controller.go:324: "msg"="Reconciler error" "error"="error fetching server root CA testenv-2cqq8/rp-iex3qh-default-root-certificate: server TLS certificate not found" "controller"="redpanda" "controllerGroup"="cluster.redpanda.com" "controllerKind"="Redpanda" "Redpanda"={"name"="rp-iex3qh" "namespace"="testenv-2cqq8"} "namespace"="testenv-2cqq8" "name"="rp-iex3qh" "reconcileID"="3c29782a-7556-40a6-a8b2-9b5fed123672"
controller.go:324: "msg"="Reconciler error" "error"="terminal error: exceeded maximum retries: cannot remediate failed release" "controller"="helmrelease" "controllerGroup"="helm.toolkit.fluxcd.io" "controllerKind"="HelmRelease" "HelmRelease"={"name"="rp-iex3qh" "namespace"="testenv-2cqq8"} "namespace"="testenv-2cqq8" "name"="rp-iex3qh" "reconcileID"="6480ffe6-2b63-4b0a-b0bd-e0f377670154"
controller.go:324: "msg"="Reconciler error" "error"="error fetching server root CA testenv-2cqq8/rp-iex3qh-default-root-certificate: server TLS certificate not found" "controller"="redpanda" "controllerGroup"="cluster.redpanda.com" "controllerKind"="Redpanda" "Redpanda"={"name"="rp-iex3qh" "namespace"="testenv-2cqq8"} "namespace"="testenv-2cqq8" "name"="rp-iex3qh" "reconcileID"="172e63ab-3553-4c01-a3ba-baca0a338207"
```
https://buildkite.com/redpanda/redpanda-operator/builds/3493#0193901d-c9f9-4b84-a4db-274dec903fe9/1221-2076

```
=== NAME  TestRedpandaController/TestStableUIDAndGeneration
    redpanda_controller_test.go:930: waiting for *v1alpha2.Redpanda "rp-iex3qh" to be ready
    redpanda_controller_test.go:921:
        	Error Trace:	/work/operator/internal/controller/redpanda/redpanda_controller_test.go:921
        	            				/work/operator/internal/controller/redpanda/redpanda_controller_test.go:891
        	            				/work/operator/internal/controller/redpanda/redpanda_controller_test.go:101
        	Error:      	Received unexpected error:
        	            	context deadline exceeded
        	Test:       	TestRedpandaController/TestStableUIDAndGeneration
```
https://buildkite.com/redpanda/redpanda-operator/builds/3493#0193901d-c9f9-4b84-a4db-274dec903fe9/1221-2178
RafalKorepta committed Dec 5, 2024
1 parent 9b4f8cd commit 195334b
Showing 1 changed file with 8 additions and 0 deletions.
@@ -819,6 +819,14 @@ func (s *RedpandaControllerSuite) minimalRP(useFlux bool) *redpandav1alpha2.Redp
Spec: redpandav1alpha2.RedpandaSpec{
ChartRef: redpandav1alpha2.ChartRef{
UseFlux: ptr.To(useFlux),
Upgrade: &redpandav1alpha2.HelmUpgrade{
Remediation: &v2beta2.UpgradeRemediation{
// The Flux controller might fail before cert-manager creates the certificate. Because
// the default `retries` value is 0, a single failure is terminal for the HelmRelease
// installation or upgrade. To make the CI test less flaky, allow at most 5 retries.
Retries: 5,
},
},
},
// Any empty structs are to make setting them more ergonomic
// without having to worry about nil pointers.