serial tests fixes #1195

shajmakh · 2025-02-19T10:51:41Z

please see commits for more details.

openshift-ci · 2025-02-19T10:51:52Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [shajmakh]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

shajmakh · 2025-02-26T08:52:45Z

proposal for rewriting the test is found here: #1196

shajmakh · 2025-02-26T11:15:48Z

/retest

ffromani

Mixed bag. Some commits are clear improvements, some others raise questions.
I can see the reason for all the changes, but some implementations could use some polishing, probably.

ffromani · 2025-02-26T11:36:59Z

test/e2e/serial/tests/tolerations.go

@@ -841,11 +841,18 @@ func sriovToleration() corev1.Toleration {
 }

 func waitForMcpUpdate(cli client.Client, ctx context.Context, mcpsInfo []mcpInfo, updateType MCPUpdateType) {


This is a good idea, but I'm puzzled by the implementation. We don't have GinkgoHelper() calls anymore, so failures should point to the specific test, shouldn't they?

Without the GinkgoHelper() call in a "helper" function, one can see the exact failing line inside the nested function rather than the line where that function is called.
IOW, if a spec calls a helper function on line 5, and that helper function calls GinkgoHelper(), then when one of the helper's assertions fails on, say, line 91 (let's assume it's in the same file), the report will say that the spec failed on line 5. Whereas if GinkgoHelper() is not called, the report will say that the spec failed on line 91.
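To make the line 5 vs. line 91 distinction concrete, here is a minimal sketch that does not use Ginkgo at all. It only imitates the effect with the standard library's `runtime.Caller`: skipping zero frames attributes the location to the helper body, skipping one frame attributes it to the call site, which is roughly what marking a function as a helper achieves. The name `failureLocation` is hypothetical.

```go
package main

import (
	"fmt"
	"path/filepath"
	"runtime"
)

// failureLocation returns the "file:line" a failure would be attributed to.
// attributeToCaller=false: the line inside this helper (no helper marking).
// attributeToCaller=true: the line in the caller that invoked this helper
// (the effect a helper marking is meant to have).
func failureLocation(attributeToCaller bool) string {
	skip := 0
	if attributeToCaller {
		skip = 1
	}
	_, file, line, ok := runtime.Caller(skip)
	if !ok {
		return "unknown"
	}
	return fmt.Sprintf("%s:%d", filepath.Base(file), line)
}

func main() {
	// Imagine this function is the spec and failureLocation is the helper.
	fmt.Println("reported inside helper:", failureLocation(false))
	fmt.Println("reported at call site: ", failureLocation(true))
}
```

The two printed locations differ: the first points into the helper, the second at the line in `main` that made the call.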

ffromani · 2025-02-26T11:39:24Z

test/e2e/serial/tests/configuration.go

@@ -1510,3 +1523,19 @@ func getLabelRoleWorker() string {
 func getLabelRoleMCPTest() string {
 	return fmt.Sprintf("%s/%s", depnodes.LabelRole, roleMCPTest)
 }
+
+func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool {
+	GinkgoHelper()


so here we do want to call GinkgoHelper?

The main purpose of GinkgoHelper() is to avoid using offset() to compute the location of the running code. It is especially helpful in functions that are heavily reused or called several times in one spec. In that case it is clearer, when debugging failures, to get the location of the failing call rather than the failing line within the helper function.
Considering this helper function is short and well defined, I didn't mind adding the GinkgoHelper() call here. If this causes confusion we can remove it and revisit once the function is more widely used; then we can propose a cleanup PR to add this call if needed.

I'm not completely sure I follow the logic, but this specific application makes sense to me, so let's move on

Thing is: the problem I'm seeing (and which prompted the previous conversation?) was related to bad helpers which had too many expectations inside them and whose failures were hard to debug. This is still kinda related to the helper length, but the key discriminator should be the number and likelihood of failure of the expectations inside the helper rather than the helper length itself. Anyway, overall makes sense to me, so we can discuss nuances later on.

Tal-or

Looks like a lot of good improvements, good work

Tal-or · 2025-03-03T15:10:11Z

test/e2e/serial/tests/configuration.go

+func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool {
+	GinkgoHelper()
+
+	const minCustomSupportingVString = "4.18"


This commit is going to be merged to 4.19 and (probably) backported to 4.18, so why is there a need to check whether this code is running against a version older than 4.18?

Because we use and maintain the same test image for all versions, so this code will also run on < 4.18 versions, and we do not backport changes made to the serial suite.

What test image?
I thought that we are checking out the code according to the cluster version and then run the appropriate tests.

It would be problematic to run tests adjusted for 4.19 on 4.17 because there are significant changes between the two. For example, we no longer have MachineConfig.

I thought that we are checking out the code according to the cluster version and then run the appropriate tests.

yes we do that in the test image. This means the latest (main branch) version of the suite needs to detect the operator capabilities and adjust accordingly.

but if you're checking the branch to release-4.17, this piece of code won't be shown there.
What am I missing?

@Tal-or this was implemented u/s and eventually d/s too, as it turned out to be the best approach to keep the maintenance load as low as possible; check this for the details:
https://github.com/openshift-kni/numaresources-operator/tree/main/doc/features

minCustomSupportingVString is hardcoded indeed, but configuration.PlatVersion is autodetected from the cluster the suite is running against
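For illustration, a minimal sketch of comparing a hardcoded minimum such as "4.18" against an autodetected platform version. It uses only the standard library and hypothetical helper names (`parseMajorMinor`, `atLeast`); the PR itself may well rely on a version package instead. The point is that the comparison has to be numeric, since a plain string comparison would rank "4.9" above "4.18".

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMajorMinor extracts the major and minor numbers from "4.18" or "4.18.3".
func parseMajorMinor(v string) (int, int, error) {
	parts := strings.Split(v, ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("malformed version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("bad major in %q: %w", v, err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("bad minor in %q: %w", v, err)
	}
	return major, minor, nil
}

// atLeast reports whether detected >= minimum, comparing major first, then minor.
func atLeast(detected, minimum string) (bool, error) {
	dMaj, dMin, err := parseMajorMinor(detected)
	if err != nil {
		return false, err
	}
	mMaj, mMin, err := parseMajorMinor(minimum)
	if err != nil {
		return false, err
	}
	if dMaj != mMaj {
		return dMaj > mMaj, nil
	}
	return dMin >= mMin, nil
}

func main() {
	const minSupported = "4.18" // hardcoded, like minCustomSupportingVString
	for _, detected := range []string{"4.17.2", "4.18.0", "4.9"} {
		ok, err := atLeast(detected, minSupported)
		fmt.Println(detected, ok, err)
	}
}
```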

but if you're checking the branch to release-4.17, this piece of code won't be shown there.
What am I missing?

This piece of code will run on all releases and is shipped in our latest quay tests' image. The tl;dr is that the operator controller pod for every release exposes which features are supported per the running version; d/s we scan the serial tests for all the tests that are tagged with the supported feature, that way we run only supported tests. Active features per branch can be found here:
https://github.com/openshift-kni/numaresources-operator/blob/main/internal/api/features/_topics.json
@Tal-or let me know if you have additional concerns.
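The selection described above can be sketched roughly as follows. Everything here is hypothetical (the `taggedTest` type, the feature names, and the rule that a test runs only when all of its tags are supported); it is not the actual d/s tooling, just the shape of the idea.

```go
package main

import "fmt"

// taggedTest is a hypothetical record: a test name plus the feature tags it carries.
type taggedTest struct {
	Name     string
	Features []string
}

// selectSupported keeps the tests whose feature tags are all in the supported
// set. Treating untagged tests as always runnable is an assumption.
func selectSupported(tests []taggedTest, supported []string) []string {
	known := make(map[string]bool, len(supported))
	for _, f := range supported {
		known[f] = true
	}
	var names []string
	for _, t := range tests {
		ok := true
		for _, f := range t.Features {
			if !known[f] {
				ok = false
				break
			}
		}
		if ok {
			names = append(names, t.Name)
		}
	}
	return names
}

func main() {
	tests := []taggedTest{
		{Name: "basic scheduling", Features: []string{"feature-a"}},
		{Name: "newer behavior", Features: []string{"feature-a", "feature-b"}},
	}
	// Pretend the operator running on the cluster advertises only feature-a.
	fmt.Println(selectSupported(tests, []string{"feature-a"}))
}
```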

ffromani · 2025-03-06T13:18:22Z

please rebase before you resubmit

Problem: Reboot tests fail frequently on teardown when comparing the quantity of the device resources. The root cause lies in the sample-device plugin, but it has not been fully investigated yet, considering we **want** to move to real devices/SR-IOV simulation, and due to capacity.

What do you want? Make the serial suite less noisy due to a known issue, until we fix the root cause or implement the alternative: https://issues.redhat.com/browse/CNF-12824

How are you fixing the problem in this PR? Similarly to the memory resources deviation for reboot tests, allow the device quantity to change after reboot, but under the condition that the ratios are the same before and after the test in the context of allocatables and availables. That means capacities of the same device may change before vs. after the test, but unequal resource consumption is not tolerated.

How did you test the change? On a cluster with sample devices, simulate the reboot scenario by running an empty test with only a time.Sleep() command; once the sleep period starts, update the device count manually on the cluster (reapply the CM and restart the DS). The teardown should report that the deviation is noticed, but not fail on it.

Signed-off-by: Shereen Haj <[email protected]>
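One way to read the ratio condition, as a minimal sketch: the available/allocatable ratio of a device must be the same before and after the test, even if the absolute counts changed. The `deviceCounts` type and `sameRatio` function are hypothetical and not taken from the PR; the actual comparison in equality.go may differ.

```go
package main

import "fmt"

// deviceCounts is a hypothetical snapshot of one device resource on one node.
type deviceCounts struct {
	Allocatable int64
	Available   int64
}

// sameRatio reports whether available/allocatable is unchanged between two
// snapshots. Cross-multiplication avoids floating-point rounding.
func sameRatio(before, after deviceCounts) bool {
	// With zero allocatable there is no ratio to compare; require identical
	// snapshots in that case (an assumption of this sketch).
	if before.Allocatable == 0 || after.Allocatable == 0 {
		return before == after
	}
	return before.Available*after.Allocatable == after.Available*before.Allocatable
}

func main() {
	before := deviceCounts{Allocatable: 4, Available: 2}
	grown := deviceCounts{Allocatable: 8, Available: 4}  // capacity changed, same ratio
	leaked := deviceCounts{Allocatable: 8, Available: 3} // consumption differs
	fmt.Println(sameRatio(before, grown))  // true
	fmt.Println(sameRatio(before, leaked)) // false
}
```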

Previously, using the MCP object that we built manually wasn't wrong because the specified config is empty anyway; still, do the right thing: pull the updated object and use it in the MCPInfo object. Signed-off-by: Shereen Haj <[email protected]>

When GinkgoHelper() was present (f56658a), it was hard to follow the logs and understand which step failed while waiting for the MCP update. Adding a unique identifier to each call helps track the start and end of each call and makes it easier to track down the root cause. To guarantee a unique identifier, we use `time.Now()` as a string. Additionally, report the config values and expectations of the call. Signed-off-by: Shereen Haj <[email protected]>
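A minimal sketch of the tagging idea, assuming hypothetical helpers `newCallID` and `tag`. The RFC 3339 formatting is a choice of this sketch; the commit only says `time.Now()` is used as a string.

```go
package main

import (
	"fmt"
	"time"
)

// newCallID returns an identifier derived from the current time, so that two
// calls made at different instants get different identifiers.
func newCallID() string {
	return time.Now().Format(time.RFC3339Nano)
}

// tag prefixes a log message with the call identifier.
func tag(id, format string, args ...any) string {
	return fmt.Sprintf("[%s] ", id) + fmt.Sprintf(format, args...)
}

func main() {
	id := newCallID()
	fmt.Println(tag(id, "waiting for MCP update: start (mcps=%d)", 2))
	// ... the polling loop would run here ...
	fmt.Println(tag(id, "waiting for MCP update: end"))
}
```

With the same identifier on the start and end lines, interleaved log output from several waits can be paired up by searching for the identifier.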

With the latest selinux updates, NROP will no longer require rebooting the nodes on CR updates unless a node group annotation is enabled. The new selinux policy is the default starting with 4.18. However, this wasn't backported to older versions, so in this commit we adapt to this change based on the existence of the annotation and the openshift version. This commit fixes the expected behavior for each case and adds missing waiting loops for the MCP updating condition. The test passed with the default config on 4.18 and on older versions. Signed-off-by: Shereen Haj <[email protected]>
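One reading of the cases above, as a hedged sketch: a reboot is expected on versions older than 4.18 (where the new policy was not backported), and on 4.18 or newer only when the annotation is enabled. The function name `expectReboot` and its boolean parameters are hypothetical; the real test derives both inputs from the cluster and the NROP object.

```go
package main

import "fmt"

// expectReboot encodes this sketch's reading of the commit message:
// older versions always reboot; 4.18+ reboots only with the annotation.
func expectReboot(atLeast418, annotationEnabled bool) bool {
	return !atLeast418 || annotationEnabled
}

func main() {
	fmt.Println(expectReboot(false, false)) // true: older version, old behavior
	fmt.Println(expectReboot(true, false))  // false: 4.18+ default, no reboot
	fmt.Println(expectReboot(true, true))   // true: 4.18+ with the annotation
}
```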

shajmakh · 2025-03-07T08:14:46Z

/unhold

ffromani · 2025-03-07T10:32:24Z

/lgtm

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2025

shajmakh changed the title ~~WIP: serail tests fixes Feb 19~~ WIP: serial tests fixes Feb 19 Feb 19, 2025

openshift-ci bot requested review from mrniranjan and Tal-or February 19, 2025 10:51

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2025

shajmakh force-pushed the serial-fix-2-newversionof47674 branch 2 times, most recently from 067b1b6 to f94b584 Compare February 25, 2025 13:55

shajmakh changed the title ~~WIP: serial tests fixes Feb 19~~ serial tests fixes Feb 19 Feb 26, 2025

openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025

shajmakh changed the title ~~serial tests fixes Feb 19~~ serial tests fixes Feb 26, 2025

ffromani reviewed Feb 26, 2025

Tal-or reviewed Mar 3, 2025

openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 4, 2025

shajmakh force-pushed the serial-fix-2-newversionof47674 branch from f94b584 to 9cf739c Compare March 6, 2025 13:52

openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2025

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 6, 2025

shajmakh added 7 commits March 7, 2025 10:00

serial:test-47674 logs improvments:add MCPInfo ToString()

856cc49

Add and use the MCPInfo.ToString() and print out the built info. Signed-off-by: Shereen Haj <[email protected]>

serial:47674: use NRT nodes instead of workers

59c1214

This shouldn't be an issue if the original MCP that is configured in NROP is targeting worker nodes (MCP=worker); Still don't depend on this assumption and use the NRT nodes. Signed-off-by: Shereen Haj <[email protected]>

serial: generalize function to check different degraded reasons

331d1aa

Upgrade `isDegradedInternalError` to `isDegradedWith` to check different reasons instead of checking that the reason is particularly `status.ReasonInternalError`. Signed-off-by: Shereen Haj <[email protected]>
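A rough sketch of what such a generalized check could look like. The `condition` struct stands in for the Kubernetes condition type, and the signature, the substring match on the message, and the reason strings are assumptions of this sketch rather than the PR's actual code.

```go
package main

import (
	"fmt"
	"strings"
)

// condition is a simplified stand-in for a Kubernetes status condition.
type condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// isDegradedWith reports whether the Degraded condition is true, carries the
// given reason, and mentions msgFragment in its message.
func isDegradedWith(conds []condition, reason, msgFragment string) bool {
	for _, c := range conds {
		if c.Type != "Degraded" {
			continue
		}
		return c.Status == "True" &&
			c.Reason == reason &&
			strings.Contains(c.Message, msgFragment)
	}
	return false
}

func main() {
	conds := []condition{
		{Type: "Available", Status: "False"},
		{Type: "Degraded", Status: "True", Reason: "ExampleReason", Message: "validation failed for node group"},
	}
	fmt.Println(isDegradedWith(conds, "ExampleReason", "validation failed")) // true
	fmt.Println(isDegradedWith(conds, "OtherReason", "validation failed"))   // false
}
```

Passing the reason as a parameter is what lets one helper cover the internal-error case and any other degraded reason.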

serial: use helper function to validate degraded status

4f7d115

The package has already `isDegradedWith` which verifies if the degraded reason and message is as expected, let's use it instead of duplicating the same check. Signed-off-by: Shereen Haj <[email protected]>

shajmakh force-pushed the serial-fix-2-newversionof47674 branch from 9cf739c to 4f7d115 Compare March 7, 2025 08:01

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 7, 2025

openshift-ci bot assigned ffromani Mar 7, 2025

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 7, 2025

openshift-merge-bot bot merged commit 3e11b78 into openshift-kni:main Mar 7, 2025
15 checks passed
