serial tests fixes #1195
Conversation
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: shajmakh. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing
Force-pushed from 067b1b6 to f94b584 (Compare)
proposal for rewriting the test is found here: #1196
/retest
Mixed bag: some commits are clear improvements, while others raise some questions.
I can see the reason for all the changes, but some implementations could probably use some polishing.
test/e2e/serial/tests/tolerations.go (outdated)
@@ -841,11 +841,18 @@ func sriovToleration() corev1.Toleration {
}

func waitForMcpUpdate(cli client.Client, ctx context.Context, mcpsInfo []mcpInfo, updateType MCPUpdateType) {
this is a good idea but I'm puzzled by the implementation. We don't have GinkgoHelper() calls anymore, so failures should point to the specific test, shouldn't they?
Without the GinkgoHelper() call in a "helper" function, one will see the exact failing line inside the nested function, and not the line where the function is being called.
IOW, if inside a spec there is a call to a helper function at line 5, and that helper function has GinkgoHelper(): if the function fails on one of its assertions, say at line 91 (let's assume it's in the same file), the report will print out that the spec failed at line 5. Whereas if GinkgoHelper() was not called, the report will print out that the spec failed at line 91.
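The attribution behavior described above boils down to skipping stack frames when locating a failure. A minimal stdlib-only sketch of that frame-skipping idea (reportCaller, helperWithMarking, and helperWithoutMarking are illustrative names, not Ginkgo APIs; Ginkgo's real GinkgoHelper() marks the current frame so the reporter skips it):

```go
package main

import (
	"fmt"
	"runtime"
)

// reportCaller walks `skip` extra frames up the stack before reporting,
// mimicking how a test framework attributes a failure to a source location.
func reportCaller(skip int) string {
	pc, _, _, _ := runtime.Caller(skip + 1)
	return runtime.FuncForPC(pc).Name()
}

// helperWithoutMarking behaves like a helper without GinkgoHelper():
// the failure is attributed to the helper's own frame.
func helperWithoutMarking() string { return reportCaller(0) }

// helperWithMarking behaves like a helper that called GinkgoHelper():
// its own frame is skipped and the failure is attributed to the call site.
func helperWithMarking() string { return reportCaller(1) }

func main() {
	fmt.Println(helperWithoutMarking()) // main.helperWithoutMarking
	fmt.Println(helperWithMarking())    // main.main
}
```

Run from main, the unmarked helper reports its own frame while the marked one reports the caller's, which is the trade-off this review thread is weighing.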
@@ -1510,3 +1523,19 @@ func getLabelRoleWorker() string {
func getLabelRoleMCPTest() string {
	return fmt.Sprintf("%s/%s", depnodes.LabelRole, roleMCPTest)
}

func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool {
	GinkgoHelper()
so here we do want to call GinkgoHelper()?
The main purpose of GinkgoHelper() is to avoid using Offset() for location calculation of the running code. It is especially helpful in functions that are highly reused or called several times in one spec; in that case it is clearer, when debugging failures, to get the location of the failing call rather than the failing line within the helper function.
Considering this helper function is short and well defined, I didn't mind adding the GinkgoHelper() call here. If it causes confusion we can remove it and address all that whenever the function becomes highly used; then we can propose a cleanup PR to add the call back if needed.
I'm not completely sure I follow the logic, but this specific application makes sense to me, so let's move on
Thing is: the problem I'm seeing (and which prompted the previous conversation?) was related to bad helpers which had too many expectations inside them and whose failures were hard to debug. This is still somewhat related to helper length, but the key discriminator should be the number, and likelihood of failure, of the expectations inside the helper rather than the helper length itself. Anyway, overall it makes sense to me, so we can discuss nuances later on.
Looks like a lot of good improvements, good work
func isCustomPolicySupportEnabled(nro *nropv1.NUMAResourcesOperator) bool {
	GinkgoHelper()

const minCustomSupportingVString = "4.18"
This commit is going to be merged to 4.19 and (probably) backported to 4.18, so why is there a need to check whether this code is running against a version older than 4.18?
Because we use and maintain the same test image for all versions, this code will also run on versions older than 4.18, and we do not backport changes made to the serial suite.
What test image?
I thought that we were checking out the code according to the cluster version and then running the appropriate tests.
It would be problematic to run tests adjusted for 4.19 on 4.17 because there are significant changes between the two; for example, we no longer have MachineConfig.
> I thought that we are checking out the code according to the cluster version and then run the appropriate tests.

Yes, we do that in the test image. This means the latest (main branch) version of the suite needs to detect the operator capabilities and adjust accordingly.
But if you check out the release-4.17 branch, this piece of code won't be there.
What am I missing?
@Tal-or this was eventually implemented u/s, and one time d/s, as it turned out to be the best approach to avoid maintenance load as much as possible; check this for the details:
https://github.com/openshift-kni/numaresources-operator/tree/main/doc/features
minCustomSupportingVString is hardcoded indeed, but configuration.PlatVersion is autodetected from the cluster the suite is running against.
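The gate described here reduces to comparing the autodetected cluster version against the hardcoded minimum. A simplified stdlib-only sketch under assumed names (atLeast is an illustrative helper; the real suite presumably compares configuration.PlatVersion with proper version-parsing APIs):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// minCustomSupportingVString mirrors the constant from the PR: the first
// OpenShift version whose operator supports the custom SELinux policy flow.
const minCustomSupportingVString = "4.18"

// atLeast reports whether "major.minor" version a is >= version b.
// It is a deliberately simplified stand-in for a real semver comparison.
func atLeast(a, b string) bool {
	pa, pb := strings.SplitN(a, ".", 2), strings.SplitN(b, ".", 2)
	if len(pa) < 2 || len(pb) < 2 {
		return false // malformed input; real code should surface an error
	}
	amaj, _ := strconv.Atoi(pa[0])
	bmaj, _ := strconv.Atoi(pb[0])
	if amaj != bmaj {
		return amaj > bmaj
	}
	amin, _ := strconv.Atoi(pa[1])
	bmin, _ := strconv.Atoi(pb[1])
	return amin >= bmin
}

func main() {
	// The suite runs from one test image against clusters of many versions,
	// so the check must happen at runtime rather than per release branch.
	for _, v := range []string{"4.17", "4.18", "4.19"} {
		fmt.Printf("cluster %s supports custom policy: %v\n", v, atLeast(v, minCustomSupportingVString))
	}
}
```

This is why the hardcoded constant is harmless: only the right-hand side of the comparison is fixed, while the left-hand side follows the cluster under test.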
> but if you're checking the branch to release-4.17, this piece of code won't be shown there. What am I missing?

This piece of code will run on all releases and is shipped in our latest quay tests' image. The tl;dr is that the operator controller pod for every release exposes which features are supported for the running version; d/s we scan the serial tests for all the tests that are tagged with a supported feature, so that we run only supported tests. Active features per branch can be found here:
https://github.com/openshift-kni/numaresources-operator/blob/main/internal/api/features/_topics.json
@Tal-or let me know if you have additional concerns.
please rebase before you resubmit
Force-pushed from f94b584 to 9cf739c (Compare)
Problem: Reboot tests fail frequently on teardown when comparing the quantity of the device resources. The root cause is in the sample-device plugin but hasn't yet been fully investigated, considering we **want** to move to real devices / SR-IOV simulation, and due to capacity.

What do you want? Make the serial suite less noisy due to a known issue, until we fix the root cause or implement the alternative: https://issues.redhat.com/browse/CNF-12824

How are you fixing the problem in this PR? Similarly to the memory resources deviation for reboot tests, allow the devices quantity to change after reboot, under the condition that the ratios of allocatable to available resources are the same before and after the test. That means capacities of the same device may change before vs. after the test, but unequal resource consumption is not tolerated.

How did you test the change? On a cluster with sample devices, simulate the reboot scenario by running an empty test with only a time.Sleep() call; once the sleep period starts, update the devices count manually on the cluster (reapply the CM and restart the DS). The teardown should reflect that the deviation is noticed without failing on it.

Signed-off-by: Shereen Haj <[email protected]>
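The ratio condition in this commit message can be checked without floating-point division by cross-multiplying. A sketch under illustrative names (resourceSnapshot and sameConsumptionRatio are not the PR's actual types):

```go
package main

import "fmt"

// resourceSnapshot holds one device resource's counts at a point in time.
// The type and field names are illustrative, not the PR's actual code.
type resourceSnapshot struct {
	allocatable int64
	available   int64
}

// sameConsumptionRatio tolerates a capacity change across a reboot as long as
// the available:allocatable ratio is unchanged, i.e. no devices were consumed
// or leaked during the test. Cross-multiplying avoids float comparisons:
// before.available/before.allocatable == after.available/after.allocatable.
func sameConsumptionRatio(before, after resourceSnapshot) bool {
	return before.available*after.allocatable == after.available*before.allocatable
}

func main() {
	before := resourceSnapshot{allocatable: 4, available: 4}
	resized := resourceSnapshot{allocatable: 8, available: 8} // capacity doubled, nothing consumed
	leaked := resourceSnapshot{allocatable: 4, available: 3}  // one device left consumed

	fmt.Println(sameConsumptionRatio(before, resized)) // true
	fmt.Println(sameConsumptionRatio(before, leaked))  // false
}
```

The first case matches the tolerated scenario (the device count changed but nothing stayed consumed), while the second is the unequal-consumption case the teardown should still flag.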
Add and use the MCPInfo.ToString() and print out the built info. Signed-off-by: Shereen Haj <[email protected]>
Previously, using the MCP object that we built manually wasn't wrong because the specified config is empty anyway; still, do the correct thing by pulling the updated object and using it in the MCPInfo object. Signed-off-by: Shereen Haj <[email protected]>
This shouldn't be an issue if the original MCP configured in NROP targets worker nodes (MCP=worker); still, don't depend on this assumption, and use the NRT nodes instead. Signed-off-by: Shereen Haj <[email protected]>
When GinkgoHelper() was present (f56658a), it was hard to follow the logs and understand which step failed while waiting for an MCP update. Adding unique identifiers to each call helps track the start and end of each call and enhances tracking down the root cause. To guarantee a unique identifier, we use `time.Now()` as a string. Additionally, report the config values and expectations of the call. Signed-off-by: Shereen Haj <[email protected]>
With the latest SELinux updates, NROP will no longer require rebooting the nodes on CR updates unless a node group annotation is enabled. The new SELinux policy is the default starting with 4.18. However, this wasn't backported to older versions, so in this commit we adapt to the change based on the existence of the annotation and the OpenShift version. This commit fixes the expected behavior for each case and adds missing waiting loops for the MCP Updating condition. The test passed with the default config on 4.18 and on older versions. Signed-off-by: Shereen Haj <[email protected]>
Upgrade `isDegradedInternalError` to `isDegradedWith` to check different reasons instead of checking specifically for `status.ReasonInternalError`. Signed-off-by: Shereen Haj <[email protected]>
The package already has `isDegradedWith`, which verifies that the degraded reason and message are as expected; let's use it instead of duplicating the same check. Signed-off-by: Shereen Haj <[email protected]>
Force-pushed from 9cf739c to 4f7d115 (Compare)
/unhold
/lgtm
please see commits for more details.