Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

serial tests fixes #1195

Conversation

shajmakh
Copy link
Member

please see commits for more details.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2025
@shajmakh shajmakh changed the title WIP: serail tests fixes Feb 19 WIP: serial tests fixes Feb 19 Feb 19, 2025
@openshift-ci openshift-ci bot requested review from mrniranjan and Tal-or February 19, 2025 10:51
Copy link
Contributor

openshift-ci bot commented Feb 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: shajmakh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2025
@shajmakh shajmakh force-pushed the serial-fix-2-newversionof47674 branch 2 times, most recently from 067b1b6 to f94b584 Compare February 25, 2025 13:55
@shajmakh
Copy link
Member Author

proposal for rewriting the test is found here: #1196

@shajmakh shajmakh changed the title WIP: serial tests fixes Feb 19 serial tests fixes Feb 19 Feb 26, 2025
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 26, 2025
@shajmakh shajmakh changed the title serial tests fixes Feb 19 serial tests fixes Feb 26, 2025
@shajmakh
Copy link
Member Author

/retest

Copy link
Member

@ffromani ffromani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mixed bag. Some commits are clear improvements, some other raises some question.
I can see the reason for all the changes, but some implementations can use some polishing, probably.

@@ -841,11 +841,18 @@ func sriovToleration() corev1.Toleration {
}

func waitForMcpUpdate(cli client.Client, ctx context.Context, mcpsInfo []mcpInfo, updateType MCPUpdateType) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a good idea but I'm puzzled by the implementation. We don't have GinkgoHelper() calls anymore, so failures should point to the specific test, should not they?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without the GinkgoHelper() call in a "helper" function, one will be able to see the exact failing line in the nested function and not the line of where this function is being called.
IOW if inside a spec there is a call for a helper function in line 5, and that helper function has GinkgoHelper(); if the function fails on one of its assertions say line 91 (let's assume it's in the same file), the report will print out that the spec has failed in line 5. Whereas if GinkgoHelper() was to called, the report will print out that the spec has failed in line 91.

@@ -1510,3 +1523,19 @@ func getLabelRoleWorker() string {
func getLabelRoleMCPTest() string {
return fmt.Sprintf("%s/%s", depnodes.LabelRole, roleMCPTest)
}

func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool {
GinkgoHelper()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so here we do want to call GinkgoHelper?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main purpose of GinkgoHelper() is to avoid using offset() for location calculation of the running code. It is especially helpful to be called on functions that are highly reused or are called several times in one spec. In that case it would be more clear to debug failures to get the location of the failing call rather than the failing line within the helper function.
Considering this helper function is short and well defined I didn't mind to add the GinkgoHelper() call here. If this causes confusion we can remove it and address all that whenever the function is highly requested, then we can propose a cleanup PR to add this call if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure I follow the logic, but this specific application makes sense to me, so let's move on

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thing is: the problem I'm seeing (and which prompted the previous conversation?) was related to bad helpers which had too many expectations inside them and whose failure was hard to debug. This is still kinda related to the helper length, but the key discriminatror should be number and likeness of failure of expectations inside the helper rather than the helper length itself. Anyway, overall makes sense to me so we can discuss nuances later on.

Copy link
Collaborator

@Tal-or Tal-or left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a lot of good improvements, good work

func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool {
GinkgoHelper()

const minCustomSupportingVString = "4.18"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This commit is going to be merge to 4.19 and (probably) backport to 4.18, so why there's a need to check if this code is running against a version older than 4.18

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we use and maintain same test image for all versions, so this code will also run on < 4.18 versions, and we do not backport changes done to the serial suite

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What test image?
I thought that we are checking out the code according to the cluster version and then run the appropriate tests.

Copy link
Collaborator

@Tal-or Tal-or Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be problematic to run tests adjusted for 4.19 on 4.17 because there are significant changes between the two. for example we no longer have MachineConfig

Copy link
Member

@ffromani ffromani Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought that we are checking out the code according to the cluster version and then run the appropriate tests.

yes we do that in the test image. This means the latest (main branch) version of the suite needs to detect the operator capabilities and adjust accordingly.

Copy link
Collaborator

@Tal-or Tal-or Mar 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you're checking the branch to release-4.17, this piece of code won't be shown there.
What am I missing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tal-or this was implemented u/s eventually to and one time d/s as it turned out the best approach to avoid maintenance load as much as possible, check this for the details:
https://github.com/openshift-kni/numaresources-operator/tree/main/doc/features

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minCustomSupportingVString is hardcoded indeed, but configuration.PlatVersion is autodetected from the cluster the suite is running against

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but if you're checking the branch to release-4.17, this piece of code won't be shown there.
What am I missing?

This piece of code will run on all releases and is shipped in our latest quay tests' image. The tl;dr is that the operator controller pod for every release exposes which features are supported per the running version; d/s we scan the serial tests for all the tests that are tagged with the supported feature, that way we run only supported tests. Active features per branch can be found here:
https://github.com/openshift-kni/numaresources-operator/blob/main/internal/api/features/_topics.json
@Tal-or let me know if you have additional concerns.

@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 4, 2025
@ffromani
Copy link
Member

ffromani commented Mar 6, 2025

please rebase before you resubmit

@shajmakh shajmakh force-pushed the serial-fix-2-newversionof47674 branch from f94b584 to 9cf739c Compare March 6, 2025 13:52
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2025
@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 6, 2025
shajmakh added 7 commits March 7, 2025 10:00
Problem:
Reboot tests fails frequently on teardown when comparing the quantaty of
the device resources. The root cause is from the sample-device plugin
but haven't yet fully investigated considering we **want** to move to
real devices/ sriov simulation and due to capacity.

What do you want?
Make the serial suite less noisy due to to known
issue, until we fix the root cause or implement the alternative:
https://issues.redhat.com/browse/CNF-12824

How are you fixing the problem in this PR?
similarly to memory resources deviation for reboot tests, allow devices
quantity to change after reboot but under the condition that the ratios
are the same before and after the test in the context of allocatables
and availables; That means capacities of the same device may change
before vs after the test, but unequal resources consumption is not
tolerated.

How did you test the change?
On a cluster with sample devices simulate the reboot scenario by running
an empty test with only a time.Sleep() command, once the sleep period
starts, update the devices count manually on the cluster (reapply CM and
restart the DS). The teardown should reflect that the deviation is
noticed but not failing on that.

Signed-off-by: Shereen Haj <[email protected]>
Add and use the MCPInfo.ToString() and print out the built info.

Signed-off-by: Shereen Haj <[email protected]>
Previously using the MCP object that we built manually wasn't wrong
because the specified config is empty anyway; still, do the correct use
by pulling the updated object and use it in the MCPInfo object.

Signed-off-by: Shereen Haj <[email protected]>
This shouldn't be an issue if the original MCP that is configured in
NROP is targeting worker nodes (MCP=worker); Still don't depend on this
assumption and use the NRT nodes.

Signed-off-by: Shereen Haj <[email protected]>
When GinkgoHelper() was present (f56658a)
It was hard to follow the logs and understand which step failed while
waiting for MCP update. Adding unique identifiers to each call helps
track the start and end of each call and enhanced tracking down the root
cause. To guarantee using a unique identifier, we use `time.Now()` as a string.
Additionally report config values and expectations of the call.

Signed-off-by: Shereen Haj <[email protected]>
With the latest selinux updates, NROP will no longer require to reboot
the nodes on CR updates unless a node group annottaion is enabled. The
new selinux policy is default starting 4.18. However this wasn't
backported to older versions so in this commit we adapt this change
based on the existance of the annotation and the openshift version.
This commit fixes the expected bahavior per each case and adds missing
waiting loops for MCP updating conndition.

The test passed with default config on 4.18 and on older versions.

Signed-off-by: Shereen Haj <[email protected]>
Upgrade `isDegradedInternalError` to `isDegradedWith` to check different
reasons instead of checking that the reason is particularly
`status.ReasonInternalError`.

Signed-off-by: Shereen Haj <[email protected]>
The package has already `isDegradedWith` which verifies if the degraded
reason and message is as expected, let's use it instead of duplicating
the same check.

Signed-off-by: Shereen Haj <[email protected]>
@shajmakh shajmakh force-pushed the serial-fix-2-newversionof47674 branch from 9cf739c to 4f7d115 Compare March 7, 2025 08:01
@shajmakh
Copy link
Member Author

shajmakh commented Mar 7, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 7, 2025
@ffromani
Copy link
Member

ffromani commented Mar 7, 2025

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Mar 7, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 3e11b78 into openshift-kni:main Mar 7, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants