-
Notifications
You must be signed in to change notification settings - Fork 572
OCPBUGS-61381: Enable DRA(DynamicResourceAllocation) featuregate by default #2498
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
@sairameshv: This pull request references Jira Issue OCPBUGS-61381, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Skipping CI for Draft Pull Request. |
Hello @sairameshv! Some important instructions when contributing to openshift/api: |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test |
@sairameshv: The
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
productScope(kubernetes). | ||
enhancementPR("https://github.com/kubernetes/enhancements/issues/4381"). | ||
enableIn(configv1.DevPreviewNoUpgrade, configv1.TechPreviewNoUpgrade). | ||
enableIn(configv1.Default, configv1.DevPreviewNoUpgrade, configv1.TechPreviewNoUpgrade). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm kinda leaning on just dropping this and inheriting the built-in kube features, similar to my leanings here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like the only reason I can think to keep it is if we ever want to disable it, which I don't think we even can because it's GA now...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the other reason being we're nesting other beta features under this overarching feature, which could be reason to keep it cc @tkashem
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am in favor of removing the feature gate so the default promotion/mechanism can kick in, this seems easier to maintain and debug imo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed, Could you PTAL again?
/test e2e-gcp |
82843d4
to
fad9674
Compare
FeatureGateDynamicResourceAllocation = newFeatureGate("DynamicResourceAllocation"). | ||
reportProblemsToJiraComponent("scheduling"). | ||
contactPerson("jchaloup"). | ||
productScope(kubernetes). | ||
enhancementPR("https://github.com/kubernetes/enhancements/issues/4381"). | ||
enableIn(configv1.DevPreviewNoUpgrade, configv1.TechPreviewNoUpgrade). | ||
mustRegister() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have downstream fields depending on this gate, can you try instead the promotion process instead of removal?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, makes sense
Is there a way to deprecate the featuregate and finally remove it going forward?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the typical promotion process is to add it in the Default set and remove from tech and dev preview?
I believe, it would be cleaner and easier to maintain for us if we can remove the o/api gate. (In 1.34 the feature is GA and enabled by default)
I did a quick scan of the openshift org with "DynamicResourceAllocation", and found the following references:
- a) https://github.com/openshift/cluster-kube-apiserver-operator/blob/6333489fd7d8d3494372cb830efba40eb28e45c1/pkg/operator/configobservation/apienablement/observe_runtime_config.go#L23
from 4.21, we don't enable any beta apis. before 4.21 DRA was in {tech|dev}preview with beta apis, we don't support upgrade from a tech preview cluster. so we don't have the burden of supporting beta resource apis in 4.21 or onward
c) https://github.com/openshift/multus-cni/blob/cf0f68ec2b5fe9bc72d0da325e02cf63968747fe/e2e/setup_cluster.sh#L37
(this is an argument to kubelet, so no need to remove, we only remove it when the featuregate is removed from upstream)
d) hypershift:
https://github.com/openshift/hypershift/blob/6e412e3e6bc73b7a021edeb48cc8749716d8be4c/api/hypershift/v1beta1/featuregates/featureGate-Hypershift-Default.yaml#L44
https://github.com/openshift/hypershift/blob/6e412e3e6bc73b7a021edeb48cc8749716d8be4c/api/hypershift/v1beta1/zz_generated.featuregated-crd-manifests.yaml#L98
e) https://github.com/openshift/kubernetes-autoscaler/blob/d883d0e6dbb74f0839631ebc7a584669f0e955a3/cluster-autoscaler/config/autoscaling_options.go#L309
(I don't think it's relevant to the feature gate in o/api)
we can run payload/aggregation job with this pr and other PRs that remove the references to get signal, does that sound reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think at some point(may be in future as well), we should be removing the featuregate references from the downstream. I'm fine with raising PRs to other repos now. I'm just looking for any way defined/followed for any featuregate that's deprecated/removed.
cc: @JoelSpeed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is the typical promotion process is to add it in the Default set and remove from tech and dev preview?
Normally, for a downstream only feature we would promote it to default (as well as TP and DP). Once that promotion has shipped, we can then remove the gate completely from the product.
For this, you have created a downstream API, so I'd expect you to follow the same process as if this were a downstream gate.
Is there any DRA testing that shows in sippy that the feature is working?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Joel is out through next Friday I believe.
This sounds like a predicament we put ourselves in by defining a feature gate for an upstream feature so we can control the conditions in which it is enabled in OpenShift and then relying on that feature gate in a downstream-only API (the reference I found:
api/config/v1/types_scheduling.go
Lines 50 to 53 in 8a46f74
// profileCustomizations contains configuration for modifying the default behavior of existing scheduler profiles. | |
// +openshift:enable:FeatureGate=DynamicResourceAllocation | |
// +optional | |
ProfileCustomizations ProfileCustomizations `json:"profileCustomizations"` |
I agree with Joel here that because we clearly have downstream functionality depending on this feature gate, the appropriate approach to promote this would be to have the appropriate automated regression testing to cover the downstream-specific behaviors before promoting this. That would entail putting the appropriate CI tests in openshift/origin and meeting the same promotion criteria that every other feature on OpenShift has to meet.
Regarding the blocking of the 1.34 rebase - is it possible to configure the DRA tests to only be run when we detect the feature gate is enabled? We would essentially skip DRA testing in the default jobs and only run them in TPNU jobs?
If not, an alternative that we might want to discuss is adding a new TPNU gate to keep the downstream-only changes gated so there is a gap where the upstream feature may be enabled on the cluster but nothing downstream will be utilizing that functionality unless $newGate is enabled. That being said, I'm newer to the feature-gate promotion realm and I have no idea what the nuances here may be. I'd probably recommend getting some thoughts from the openshift architects on this approach if we decide we need to consider going this direction.
Regarding https://github.com/openshift/cluster-kube-apiserver-operator/blob/6333489fd7d8d3494372cb830efba40eb28e45c1/pkg/operator/configobservation/apienablement/observe_runtime_config.go#L23-L27 - is there anything special we need to do here when promoting to ensure that we are properly serving the stable API as opposed to the alpha/beta APIs?
I seem to recall we had an issue related to accidentally serving the beta API still when we promoted the ValidatingAdmissionPolicy gate that led to us needing to do some heavier lifting to remove it (see #2306)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I'm not sure if this is how it has historically been done, but it certainly seems like we should try to avoid cases where we have an upstream specific feature gate explicitly tied to downstream-specific functionality. Instead, it seems we should be using separate gates from the start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with Joel here that because we clearly have downstream functionality depending on this feature gate, the appropriate approach to promote this would be to have the appropriate automated regression testing to cover the downstream-specific behaviors before promoting this. That would entail putting the appropriate CI tests in openshift/origin and meeting the same promotion criteria that every other feature on OpenShift has to meet.
Okay
Regarding the blocking of the 1.34 rebase - is it possible to configure the DRA tests to only be run when we detect the feature gate is enabled? We would essentially skip DRA testing in the default jobs and only run them in TPNU jobs?
Sure, as of now the tests can be skipped in the default jobs and can be enabled for TPNU jobs to unblock the rebase. Once we get some sign from the job runs, we can plan to get this PR in. I would work on that
Also I'm not sure if this is how it has historically been done, but it certainly seems like we should try to avoid cases where we have an upstream specific feature gate explicitly tied to downstream-specific functionality. Instead, it seems we should be using separate gates from the start.
Yes, I agree with using atleast a different name from that of the upstream featuregate to avoid such situations. I don't know if it is makes sense and possible, but it would be great to have a way to fetch the upstream featuregates up-front and validate the new featuregate names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should try to avoid cases where we have an upstream specific feature gate explicitly tied to downstream-specific functionality
Our operators that deploy the components that use upstream feature gates need to consume the same gates when operational decisions depend on them. For example, the MutatingAdmissionPolicy feature is consumed by API servers and requires that kube-apiserver serves non-default API group versions to function (https://kubernetes.io/docs/reference/access-authn-authz/mutating-admission-policy/):
To use the feature, enable the MutatingAdmissionPolicy feature gate (which is off by default) and set --runtime-config=admissionregistration.k8s.io/v1beta1=true on the kube-apiserver.
The MutatingAdmissionPolicy feature gate is 1:1 with the operator behavior, and introducing a second feature gate for the operator to consume only introduces the possibility of skew between the operator and operand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the additional context on why we need to use a shared gate - TIL.
Outside of that semantic, the rest of the requirement still holds for promotion.
- DRA has been graduated to GA in upstream 1.34 downstream - This is also required to unblok the e2e test failures encountered during the OCP-4.21 rebase Reference for the failure: https://issues.redhat.com/browse/OCPBUGS-61381 Signed-off-by: Sai Ramesh Vanka <[email protected]>
fad9674
to
6650897
Compare
@sairameshv: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Just FYI,
Following is the output of DRA jobs passing for reference from the output log of the techpreview job
I'm also monitoring the o/k PR's techpreview jobs to get the results of the |
Regarding the multiple different naming conventions, can we use anything similar to the logic outlined in https://github.com/openshift-eng/openshift-tests-extension/blob/d81c090588352f8cd80934dc61e10b443feca7d7/cmd/example-tests/main.go#L89-L120 to do some labelling/renaming so that all the tests follow enablement/disablement based on the state of the feature gate? Our test tooling should already know how to filter tests based on enabled/disabled feature gates as long as the test spec names are correctly annotated with the feature gate. Having the correct annotations injected would also make it so that the tests report into Sippy in a way that |
- "" | ||
- LowNodeUtilization | ||
- HighNodeUtilization | ||
- NoScoring |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they don't seem to be related to DynamicResourceAllocation, is it safe for these to be deleted? does it have any impact on the upgrade? cc @ingvagabund
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/hold Please don't merge this until we merge @tkashem's cluster-kube-apiserver-operator PRs. We could end up in another situation where OpenShift serves default-disabled APIs in a GA build. |
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
DRA has been graduated to GA in upstream 1.34[1] and is required to be enabled for OCP-4.21 rebase to unblock the DRA based e2e test failures[2]
[1] - https://kubernetes.io/blog/2025/09/01/kubernetes-v1-34-dra-updates/
[2] - https://issues.redhat.com/browse/OCPBUGS-61381