Skip to content

Conversation

perdasilva
Copy link

Promotes NewOLMWebhookProviderOpenshiftServiceCA to GA (configv1.Default)

Copy link
Contributor

openshift-ci bot commented Sep 22, 2025

Hello @perdasilva! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Sep 22, 2025
@perdasilva perdasilva changed the title Promote OLMv1 Webhook support to GA OPRUN-4156: Promote OLMv1 Webhook support to GA Sep 22, 2025
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Sep 22, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Sep 22, 2025

@perdasilva: This pull request references OPRUN-4156 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

Promotes NewOLMWebhookProviderOpenshiftServiceCA to GA (configv1.Default)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link
Contributor

openshift-ci bot commented Sep 22, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign everettraven for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Signed-off-by: Per G. da Silva <[email protected]>
@everettraven
Copy link
Contributor

@perdasilva Feature promotion criteria looks good aside from the metal-ipv6 (disconnected), which I saw the tests were marked to skip running in disconnected environments.

Is there a particular reason why this feature should/could not be tested in disconnected environments?

@perdasilva
Copy link
Author

perdasilva commented Sep 22, 2025

@perdasilva Feature promotion criteria looks good aside from the metal-ipv6 (disconnected), which I saw the tests were marked to skip running in disconnected environments.

Is there a particular reason why this feature should/could not be tested in disconnected environments?

@everettraven Thanks for 👀'ing =D I can't think of a reason it shouldn't. @joelanford do you remember why we were skipping tests on d/c environments? Is v1 not available on those yet?

@everettraven
Copy link
Contributor

everettraven commented Sep 22, 2025

Ah, I also just noticed that there are 4 tests instead of 5 - are you confident the feature is tested thoroughly and there are no testing gaps?

We have the 5 tests per feature requirement because it generally shouldn't be too difficult to get 5 tests implemented for a feature.

@perdasilva
Copy link
Author

Ah, I also just noticed that there are 4 tests instead of 5 - are you confident the feature is tested thoroughly and there are no testing gaps?

Yeah, that part we're confident with. We couldn't come up with an useful 5th test =S

@joelanford
Copy link
Member

@everettraven, coincidentally, it was you that submitted the PR to have most tests skipped in disconnected environments. It looks like we had tests that were trying to make outbound internet connections (e.g. to registry.redhat.io), so those tests needed to be skipped in disconnected environments.

PR: openshift/origin#29313
Bug: https://issues.redhat.com/browse/OCPBUGS-44810
Slack: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1732125827452889

For the webhook tests, do we rely on the internet to pull catalogs or bundles? If not, we can probably remove those skips.

It looks like the bug was opened, and then subsequently closed. I'm guessing we still have an issue though because, as far as I know, we have not done either of the following:

  • Setup some process to mirror test/catalog bundles from the internet to a disconnected image registry in the disconnected test (not sure this is even feasible)
  • Migrated fully away from reliance on catalogs/bundles in registry.redhat.io.

@perdasilva
Copy link
Author

@everettraven, coincidentally, it was you that submitted the PR to have most tests skipped in disconnected environments. It looks like we had tests that were trying to make outbound internet connections (e.g. to registry.redhat.io), so those tests needed to be skipped in disconnected environments.

PR: openshift/origin#29313 Bug: https://issues.redhat.com/browse/OCPBUGS-44810 Slack: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1732125827452889

For the webhook tests, do we rely on the internet to pull catalogs or bundles? If not, we can probably remove those skips.

It looks like the bug was opened, and then subsequently closed. I'm guessing we still have an issue though because, as far as I know, we have not done either of the following:

* Setup some process to mirror test/catalog bundles from the internet to a disconnected image registry in the disconnected test (not sure this is even feasible)

* Migrated fully away from reliance on catalogs/bundles in registry.redhat.io.

@joelanford oh yeah! doh! come to think of it, we can't run these tests on d/c environments because they pull down the old webhook test operator from v0 atm. We still need that in-tree test operator that we can build and consume within the test itself.

@everettraven
Copy link
Contributor

@everettraven, coincidentally, it was you that submitted the PR to have most tests skipped in disconnected environments. It looks like we had tests that were trying to make outbound internet connections (e.g. to registry.redhat.io), so those tests needed to be skipped in disconnected environments.

PR: openshift/origin#29313 Bug: https://issues.redhat.com/browse/OCPBUGS-44810 Slack: https://redhat-internal.slack.com/archives/C01CQA76KMX/p1732125827452889

Looks like the referenced PR was only for the catalog tests? In hindsight, I'm not sure that was actually the appropriate thing to do, but that change shouldn't have triggered all tests for OLMv1 to be skipped in disconnected environments (although applying the same logic as was done in the past isn't unreasonable :P).

For the webhook tests, do we rely on the internet to pull catalogs or bundles? If not, we can probably remove those skips.

It looks like the bug was opened, and then subsequently closed. I'm guessing we still have an issue though because, as far as I know, we have not done either of the following:

  • Setup some process to mirror test/catalog bundles from the internet to a disconnected image registry in the disconnected test (not sure this is even feasible)
  • Migrated fully away from reliance on catalogs/bundles in registry.redhat.io.

If you know the images you are going to use for catalogs and bundles beforehand, you could potentially use the instructions in https://github.com/openshift/origin/tree/main/test/extended/util/image to get the necessary images mirrored over.

There might be more nuances there, but it should certainly be possible to come up with a test setup that runs in disconnected environments - especially if the feature is going to be supported in said disconnected environments.

Copy link
Contributor

openshift-ci bot commented Sep 22, 2025

@perdasilva: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-scos-e2e-aws-ovn f28546a link false /test okd-scos-e2e-aws-ovn
ci/prow/verify-feature-promotion f28546a link true /test verify-feature-promotion
ci/prow/minor-e2e-upgrade-minor f28546a link true /test minor-e2e-upgrade-minor

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@everettraven
Copy link
Contributor

/assign

@perdasilva
Copy link
Author

There might be more nuances there, but it should certainly be possible to come up with a test setup that runs in disconnected environments - especially if the feature is going to be supported in said disconnected environments.

@everettraven Is this a blocker for this PR? Because this is orthogonal to all OTE tests and one of the goals is to keep this e2e image set small, maybe a case can be made that this ought to be solved across the test suite as a separate action?

A couple of follow-up (and possibly dumb) questions regarding the documentation, in case you can provide some direction:

  • Would you have an example for "Define a standard CI build and publish the image to quay.io in the openshift namespace" for an e2e image?
  • Is "Add the new reference to the init() method in this package" outdated? Should we just add the image to the allowedImages map?

@everettraven
Copy link
Contributor

Is this a blocker for this PR? Because this is orthogonal to all OTE tests and one of the goals is to keep this e2e image set small, maybe a case can be made that this ought to be solved across the test suite as a separate action?

Why should it not be a blocker? Asked in another way - why should this OLM feature receive an exception to this criteria and other features should not?

The case being made for this to be solved as a separate action seemed to be the intention behind the bug that @joelanford had shared (https://issues.redhat.com/browse/OCPBUGS-44810), but instead that ticket was closed and the tests were never updated to work in a disconnected environment. How can we be certain that once this feature is promoted you'll come back and update the testing behavior so we are gathering appropriate signal?

To clarify, I'm not trying to block this just to block it. If there is a clear technical limitation as to why testing in a disconnected environment just cannot be done, or explicit BU override, then the case can be made to override the promotion gate. Otherwise, we need to maintain the promotion bar for all features.

Would you have an example for "Define a standard CI build and publish the image to quay.io in the openshift namespace" for an e2e image?

My interpretation is that this is for an image that doesn't exist upstream but is needed to run your tests. Presumably a standard Prow job to build and push the image to quay.io/openshift is what they are looking for here. If there already exists an image that can be pulled and mirrored by an approver you shouldn't need to do this AFAIK - at least I didn't need to when getting quay.io/keycloak/keycloak:25.0 mirrored recently.

Is "Add the new reference to the init() method in this package" outdated? Should we just add the image to the allowedImages map?

Yeah, I believe this is outdated. I did this recently, so feel free to use openshift/origin#30221 as an example.

@perdasilva
Copy link
Author

Why should it not be a blocker? Asked in another way - why should this OLM feature receive an exception to this criteria and other features should not?

It's fine if it is.

The case being made for this to be solved as a separate action seemed to be the intention behind the bug that @joelanford had shared (https://issues.redhat.com/browse/OCPBUGS-44810), but instead that ticket was closed and the tests were never updated to work in a disconnected environment. How can we be certain that once this feature is promoted you'll come back and update the testing behavior so we are gathering appropriate signal?

I wasn't involved in it, so I cannot comment.

To clarify, I'm not trying to block this just to block it. If there is a clear technical limitation as to why testing in a disconnected environment just cannot be done, or explicit BU override, then the case can be made to override the promotion gate. Otherwise, we need to maintain the promotion bar for all features.

Never meant to imply that you were.

@perdasilva
Copy link
Author

Yeah, I believe this is outdated. I did this recently, so feel free to use openshift/origin#30221 as an example.

We've separated out the OTE tests into our own openshift/operator-framework-operator-controller repository. Do you know how these e2e images are supposed to be registered and used in this case?

@joelanford
Copy link
Member

joelanford commented Sep 23, 2025

My two cents:

  1. We definitely should work to make these tests work in disconnected clusters ASAP. Thanks for identifying this as a concern @everettraven!
  2. Skipping disconnected tests is fairly pervasive in our OTE test suite because tests assume an internet connection. This was found to be acceptable at the time we were GA'ing OLM originally. Just because we accepted something in the past doesn't mean that we should keep accepting it for new things going forward, but our QE has been verifying disconnected use cases outside of origin/OTE, so we do have coverage. It's just outside our OTE test suite.
  3. This particular feature has no disconnected concerns, so risk of skipped disconnected tests for this feature is extremely low.

IMO this concern can be waived so long as:

  1. OLM team agrees to prioritize making ALL tests work in disconnected clusters, starting next sprint.
  2. QE signs off that they are okay with this feature promotion lacking disconnected testing and that they can perform the verification in the meantime until (1) is done.

@everettraven
Copy link
Contributor

everettraven commented Sep 23, 2025

Do you know how these e2e images are supposed to be registered and used in this case?

I do not. The OTE case is new to me. Presumably you can import and use the same library though.

@everettraven
Copy link
Contributor

OLM team agrees to prioritize making ALL tests work in disconnected clusters, starting next sprint.

While it is a nice promise, promises won't protect us from regressions.

QE signs off that they are okay with this feature promotion lacking disconnected testing and that they can perform the verification in the meantime until (1) is done

This is certainly an option, but this exception mechanism is reserved for very important promotions.

If you want to go through this exception process, you'll need to go through the SBAR process and get it approved by the BU that this is a critical enough promotion to warrant this exception. Usually, that also means having a documented plan in place to monitor for regressions equivalently to automated testing.

@joelanford
Copy link
Member

So to clarify my understanding, the policy is:

  • New feature promotions require disconnected testing OR an SBAR + QE verification?
  • All existing GA features that skip disconnected testing are grandfathered in and do not need SBARs?

If so, sounds like we have some unplanned work to do @perdasilva.

For future feature promotions, do we have the policy documented somewhere for which cluster variants are required for promotion? This particular policy for disconnected testing, while it makes sense, is a surprise to me because it was not policy (or at least was not enforced policy) as recently as 4.18.

For next time, I'd like to be able to refer to the official policy when we draft the epic for a feature promotion to make sure that we account for what API reviewers will be looking for before approving promotions.

As an example: right now, OLMv1 is not included in HyperShift at all. That was clearly acceptable in the past, but if the policy changes to "All new features must always promote to SelfManagedHA and Hypershift at the same time", then suddenly we will have a bunch of surprise work to do.

@everettraven
Copy link
Contributor

New feature promotions require disconnected testing OR an SBAR + QE verification?

For the most part, that is my understanding. If there is reasonable justification, such as a true technical limitation or the feature is explicitly not supported in that environment, either Joel or I are probably able to override without an SBAR.

Outside of that, an SBAR will be necessary to have an exception to the promotion criteria. QE verification, or some agreement to test manually, is likely required as part of that SBAR process.

All existing GA features that skip disconnected testing are grandfathered in and do not need SBARs?

We can't retroactively revert the promotion of features that have already GA'd. If you are building features on top of something that circumvented that testing in the past you may end up with additional work to do to retroactively enable that testing for everything.

For future feature promotions, do we have the policy documented somewhere for which cluster variants are required for promotion? This particular policy for disconnected testing, while it makes sense, is a surprise to me because it was not policy (or at least was not enforced policy) as recently as 4.18.

For next time, I'd like to be able to refer to the official policy when we draft the epic for a feature promotion to make sure that we account for what API reviewers will be looking for before approving promotions.

As far as I am aware,

requiredSelfManagedJobVariants = []JobVariant{
is the current canonical source of platform requirements.

I'm actively working on, albeit a bit slowly at the moment, writing up a canonical source for developing features for OpenShift. I'm not aware of any existing comprehensive guide, so I'm hoping to help with that but it will be some time before that is available.

As an example: right now, OLMv1 is not included in HyperShift at all. That was clearly acceptable in the past, but if the policy changes to "All new features must always promote to SelfManagedHA and Hypershift at the same time", then suddenly we will have a bunch of surprise work to do.

This is an entirely separate thing. I'm not aware of any changes, now or in the foreseeable future, for requiring promotion to SelfManagedHA and HCP at the same time (doesn't mean that there won't or can't be).

We are trying to do this from the control plane group side because we have hit some real pain points in having a diff in feature maturity across the two products, but YMMV.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants