
Available CRDs check feature #2712

Open
raaizik wants to merge 2 commits into main from availcrds
Conversation

@raaizik (Contributor) commented Jul 24, 2024

Changes

  • Adds a new controller that reacts whenever a create or delete event is enqueued for a CRD of interest, i.e., whenever such a CRD becomes newly available or is removed (a sketch of the underlying availability check follows below).
  • Note: this PR introduces only the structure of the feature, without any specific CRDs; those will be added in their respective PRs (e.g., ClusterClaim, VolumeGroupSnapshotClass and DRClusterConfig).
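For orientation, a minimal sketch of the availability check this feature relies on, assuming a controller-runtime reader; the function name is illustrative and not part of this PR:

package crd

import (
	"context"

	extv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// mapAvailableCrds reports, for each CRD name of interest, whether that CRD is
// currently installed on the cluster. Per the commit message, main computes this
// once at startup and the new controller recomputes it on each reconcile.
func mapAvailableCrds(ctx context.Context, c client.Reader, crdNames []string) (map[string]bool, error) {
	availableCrds := map[string]bool{}
	for _, name := range crdNames {
		crd := &extv1.CustomResourceDefinition{}
		err := c.Get(ctx, types.NamespacedName{Name: name}, crd)
		if err != nil && !apierrors.IsNotFound(err) {
			return nil, err
		}
		availableCrds[name] = err == nil
	}
	return availableCrds, nil
}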

@raaizik (Contributor, Author) commented Jul 24, 2024

/cc @nb-ohad

@openshift-ci openshift-ci bot requested a review from nb-ohad July 24, 2024 13:30
@raaizik raaizik marked this pull request as draft July 24, 2024 13:37
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 24, 2024
@raaizik raaizik force-pushed the availcrds branch 8 times, most recently from 84f0cbe to ddf5c5f Compare July 24, 2024 14:49
@raaizik raaizik marked this pull request as ready for review July 24, 2024 14:52
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 24, 2024
main.go Outdated
}
if len(crds) > 0 {
	if err = (&crd.CustomResourceDefinitionReconciler{
		Client: mgr.GetClient(),
Member:
Should we use an uncached Client instead of cached for this controller?
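For reference, a minimal sketch of the uncached alternative being asked about, assuming the surrounding main.go context (mgr, setupLog); this illustrates the question, not the PR's code:

// The manager's default client reads from a cache that only tracks watched types;
// client.New builds a client that reads directly from the API server instead.
uncachedClient, err := client.New(mgr.GetConfig(), client.Options{Scheme: mgr.GetScheme()})
if err != nil {
	setupLog.Error(err, "unable to create uncached client")
	os.Exit(1)
}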

hack/crdavail.sh Outdated
Member:
Let's rename this here as well as in the Dockerfile. Maybe ocs-operator-entrypoint, or do you have any other suggestions?

@iamniting (Member) left a comment:
/hold

Why do we need this PR? I really discourage making such changes without proper discussion with the maintainers.

Also, I would highly encourage a meaningful commit title and message; the current one does not make sense.

[Screenshot from 2024-07-25 11-08-27]

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 25, 2024
openshift-ci bot commented Jul 25, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: raaizik
Once this PR has been reviewed and has the lgtm label, please ask for approval from iamniting. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@raaizik (Contributor, Author) commented Jul 25, 2024

/test ocs-operator-bundle-e2e-aws

@iamniting (Member) left a comment:
Could you please separate this commit into two commits?

  • The first commit should include the controller changes.
  • The second commit should include the script changes.

I will help test the first commit without the script. I believe we should be sorted with just the first commit.

@umangachapagain (Contributor) left a comment:
This feels like a complex hack to cause a panic without actually using panic().
How did we finalize on this approach?

@iamniting (Member):

> This feels like a complex hack to cause a panic without actually using panic(). How did we finalize on this approach?

@umangachapagain We do not need this approach at all. Rewant and I tested the solution without any script, fixed the bugs we had, and it works fine. They will soon update the PR.

@raaizik raaizik force-pushed the availcrds branch 3 times, most recently from a03388c to f7a3cae Compare August 1, 2024 09:58
@iamniting (Member) left a comment:
> - At the start each reconcile iteration, revalidate which CRD are now
>   available. If a CRD of interest is now avail, panic the op

Can you please rephrase this statement in the commit message? It is not correct. Also, please update the commit title.

@iamniting (Member):

/test ocs-operator-bundle-e2e-aws

@@ -44,6 +45,8 @@ const (
OwnerUIDIndexName = "ownerUID"
)

var CRDList []string
Member:
This is not the same code we tested. Why is this list empty? @rewantsoni @raaizik, did you test these new changes?

@raaizik (Contributor, Author) commented Aug 1, 2024:
> Note: this PR introduces only the structure of the feature without any specific CRDs which will be added from their respective PRs (e.g., ClusterClaim, VolumeGroupSnapshotClass and DRClusterConfig).

We've tested it with the script. It doesn't even create the CRD reconciler, since the CRD list is empty.

@iamniting (Member) commented Aug 1, 2024:
We also need to test it with an empty list. We should not assume that it will work.

Member:
Tested with and without entries in CRDList, and it works.

@@ -44,6 +45,8 @@ const (
OwnerUIDIndexName = "ownerUID"
)

var CRDList []string
Member:
Can we rename this variable, and please add a comment about what it holds?

// Reconcile compares available CRDs maps following either a Create or Delete event
func (r *CustomResourceDefinitionReconciler) Reconcile(ctx context.Context, request reconcile.Request) (reconcile.Result, error) {
	r.ctx = ctx
	r.Log.Info("Reconciling CustomResourceDefinition.", "CRD", klog.KRef(request.Namespace, request.Name))
Member:
I think only the name is enough here.

@nb-ohad (Contributor) left a comment:
Maybe I am missing something, but the current code will panic the entire pod!
Why have we removed the code that recycles the process without failing the pod?

if err != nil {
	return reconcile.Result{}, err
}
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
Contributor:
I do not believe a map guarantees the order of its keys, which means DeepEqual might fail even if both maps contain the exact same keys.

Please move to a "len check + loop on items" for checking equality.

Member:
I tried creating a map, adding keys in a different order, and comparing the two with DeepEqual, and it works:
https://go.dev/play/p/uFkmp57xEIp
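Presumably the playground snippet is along these lines (a standalone repro with made-up keys; Go compares maps by their key/value sets, not by insertion order):

package main

import (
	"fmt"
	"reflect"
)

func main() {
	a := map[string]bool{"first.example.crd": true, "second.example.crd": false}

	// Populate the second map in the opposite insertion order.
	b := map[string]bool{}
	b["second.example.crd"] = false
	b["first.example.crd"] = true

	fmt.Println(reflect.DeepEqual(a, b)) // true
}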

}
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
	r.Log.Info("CustomResourceDefinitions created/deleted. Restarting process.")
	panic("CustomResourceDefinitions created/deleted. Restarting process.")
Contributor:
I don't believe panic will allow us to detect the reason for the exit from outside the operator. Use os.Exit with a specific exit code.

@@ -44,6 +45,8 @@ const (
OwnerUIDIndexName = "ownerUID"
)

var CRDList []string
Contributor:
A global singleton of non-const, non-atomic state inside a utils file is a red flag. Why do we need this?

@nb-ohad (Contributor) commented Aug 4, 2024

@iamniting @umangachapagain
This approach is not the approach that was discussed. We need to talk about this before merging this one.

@nb-ohad (Contributor) commented Aug 4, 2024

/hold
Adding a hold until we have a proper discussion and agreement on the approach.

@raaizik raaizik changed the title Available CRDs check feature Available CRDs check feature (w/o script) Aug 8, 2024
@raaizik raaizik force-pushed the availcrds branch 2 times, most recently from 19fd2ec to 48e6dab Compare August 8, 2024 14:14
raaizik and others added 2 commits August 8, 2024 17:31
Reasons for this enhancement:
- A controller cannot set up a watch for a CRD that is not installed on
  the cluster; trying to set up such a watch panics the operator.
- There is no known way, that we are aware of, to add a watch later
  without client cache issues.

How the enhancement works around the issue:
- A new controller watches creation/deletion of the CRDs of interest,
  to prevent unnecessary reconciles.
- On start of the operator (main), detect which CRDs are available (out
  of a fixed list).
- At the start of each reconcile of the new controller, fetch the
  available CRDs again and compare them with the CRDs fetched in the
  previous step; if there is any change, panic the operator.

Signed-off-by: raaizik <[email protected]>
Co-Authored-By: Rewant Soni <[email protected]>
Adds a script that bypasses pod restarts

Signed-off-by: raaizik <[email protected]>
Co-Authored-By: Rewant Soni <[email protected]>
@raaizik raaizik changed the title Available CRDs check feature (w/o script) Available CRDs check feature Aug 8, 2024
Comment on lines +170 to +173
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
	r.Log.Info("CustomResourceDefinitions created/deleted. Restarting process.")
	os.Exit(42)
}
Contributor:
AvailableCrds is a map, and I am not sure there is an order guarantee on its keys, which means DeepEqual might fail even if all the keys and values are the same. I would suggest changing the check to maps.Equal.

Member:
I tried creating a map, adding keys in a different order, and comparing them using DeepEqual, and it worked; maps.Equal produces the same results:
https://go.dev/play/p/uFkmp57xEIp

Contributor:
Where is the conditional watch that checks the CRD exists before adding the watch?

Member:
Added it in #2745

}
if !reflect.DeepEqual(availableCrds, r.AvailableCrds) {
	r.Log.Info("CustomResourceDefinitions created/deleted. Restarting process.")
	os.Exit(42)
Contributor:
Can we please have the exit code in a constant with a name that explains its usage?
In addition, a comment on this line would help.

Member:
Added it with #2745
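A sketch of what the requested change could look like; the constant and helper names are hypothetical, and the actual change landed in #2745:

package crd

import (
	"os"

	"github.com/go-logr/logr"
)

// RestartExitCode tells the entrypoint wrapper "restart the operator" rather
// than "fail the pod"; the name here is hypothetical.
const RestartExitCode = 42

// exitForRestart logs why the process is going away and exits with the
// dedicated restart code so the wrapper can distinguish it from real errors.
func exitForRestart(log logr.Logger) {
	log.Info("CustomResourceDefinitions created/deleted. Restarting process.")
	os.Exit(RestartExitCode)
}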

	if [ $EXIT_CODE -ne $RESTART_EXIT_CODE ]; then
		exit $EXIT_CODE
	fi
done
Contributor:
There is a missing newline here, and git is complaining.

Member:
Done

@leelavg (Contributor) left a comment:
I believe it might have been discussed already; just out of curiosity, why can't this operation be offloaded to odf-operator? odf-operator could set an env variable and ocs-operator would correspondingly set up the watch. I haven't thought through all the possibilities, though.

RESTART_EXIT_CODE=42

while true; do
	./usr/local/bin/ocs-operator --enable-leader-election --health-probe-bind-address=:8081
Contributor:
We are effectively making this a sub-process rather than PID 1, and I'm not confident in the shell handling signals as well as controller-runtime does.

I can see a good amount of comments on this PR; in brief, may I know why we backed off from having the kubelet restart us?

@@ -44,6 +45,10 @@ const (
OwnerUIDIndexName = "ownerUID"
)

func GetCrds() []string {
return []string{}
Contributor:
Is my understanding correct that in upcoming PRs we register the CRD names here and conditionally watch them in SetupWithManager?

Member:
Added it with #2745
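If that understanding is right, the follow-up wiring might look roughly like this; the CRD name and helper are illustrative only, with the real registrations coming in follow-up PRs such as #2745:

package storagecluster

import (
	clusterv1alpha1 "open-cluster-management.io/api/cluster/v1alpha1"
	"sigs.k8s.io/controller-runtime/pkg/builder"
)

// GetCrds would list the CRD names of interest; this entry is illustrative only.
func GetCrds() []string {
	return []string{
		"clusterclaims.cluster.open-cluster-management.io",
	}
}

// addConditionalWatches adds a watch only for CRDs that were available when the
// operator started, so the cache never tries to watch a type that does not exist.
func addConditionalWatches(bldr *builder.Builder, availableCrds map[string]bool) *builder.Builder {
	if availableCrds["clusterclaims.cluster.open-cluster-management.io"] {
		bldr = bldr.Owns(&clusterv1alpha1.ClusterClaim{})
	}
	return bldr
}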

Comment on lines +224 to +231
DeleteFunc: func(e event.TypedDeleteEvent[client.Object]) bool {
	crdAvailable, keyExist := r.AvailableCrds[e.Object.GetName()]
	if keyExist && crdAvailable {
		r.Log.Info("CustomResourceDefinition %s was Deleted.", e.Object.GetName())
		return true
	}
	return false
},
Contributor:
Is there a possibility that a CRD gets deleted within our operator's lifetime?

Member:
Although it is very unlikely, it is possible, and if that happens it will lead to a pod restart.

@@ -228,6 +254,7 @@ func (r *StorageClusterReconciler) SetupWithManager(mgr ctrl.Manager) error {
Owns(&corev1.Secret{}, builder.WithPredicates(predicate.GenerationChangedPredicate{})).
Owns(&routev1.Route{}).
Owns(&templatev1.Template{}).
Watches(&extv1.CustomResourceDefinition{}, enqueueStorageClusterRequest, builder.WithPredicates(crdPredicate)).
Contributor:
We already made a mistake (with Virt) in watching all CRDs, which increased the memory footprint; my mistake for not pushing hard on the fix in #2539. Seeing this PR, we don't want to watch all CRDs unless and until necessary.

Combined with #2539, we could do something like the below, as we are only interested in the metadata of the CRDs while mapping them.

crd := &metav1.PartialObjectMetadata{}
crd.SetGroupVersionKind(extv1.SchemeGroupVersion.WithKind("CustomResourceDefinition"))
crd.Name = <CRDName>
client.Get(<ctx>, <ns-name>, crd)

Member:
I looked at this, and I couldn't find a way to specify multiple names in the cache using a field selector, but I have updated it to use PartialObjectMetadata here: #2745
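For reference, a standalone sketch of the metadata-only lookup described above; the CRD name is just an example:

package main

import (
	"context"
	"fmt"

	extv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/client/config"
)

func main() {
	c, err := client.New(config.GetConfigOrDie(), client.Options{})
	if err != nil {
		panic(err)
	}

	// Only object metadata is requested; the potentially large CRD spec is
	// never deserialized on the client side.
	crd := &metav1.PartialObjectMetadata{}
	crd.SetGroupVersionKind(extv1.SchemeGroupVersion.WithKind("CustomResourceDefinition"))

	// CRDs are cluster-scoped, so only the name is set on the key.
	key := types.NamespacedName{Name: "storageclusters.ocs.openshift.io"}
	if err := c.Get(context.Background(), key, crd); err != nil {
		fmt.Println("CRD not available:", err)
		return
	}
	fmt.Println("CRD found:", crd.Name)
}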

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 10, 2024
@openshift-merge-robot (Contributor):
PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD.
7 participants