feat: Support DRA Admin Access #8063


Open · MenD32 wants to merge 6 commits into master from feat/dra-admin-access

Conversation

@MenD32 (Contributor) commented Apr 27, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. ResourceClaims support the AdminAccess field, which allows cluster administrators to access devices that are already in use. This changes the CA's business logic by introducing the idea that some ResourceClaims don't reserve their allocated devices.
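
For context, a minimal sketch of what an admin-access claim looks like, assuming the resource.k8s.io/v1beta1 Go types; the names and device class below are illustrative, not from this PR:

```go
package example

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// adminAccessClaim sketches a ResourceClaim whose single request asks for
// admin access: the allocation grants access to a device that may already be
// allocated to another claim, without reserving it for this claim.
var adminAccessClaim = &resourceapi.ResourceClaim{
	ObjectMeta: metav1.ObjectMeta{Name: "gpu-admin-claim", Namespace: "kube-system"},
	Spec: resourceapi.ResourceClaimSpec{
		Devices: resourceapi.DeviceClaim{
			Requests: []resourceapi.DeviceRequest{{
				Name:            "admin-gpu",
				DeviceClassName: "gpu.example.com", // illustrative device class
				AdminAccess:     ptr.To(true),      // per the KEP, only allowed in labeled namespaces
			}},
		},
	},
}
```

With this change, devices that are allocated only through such admin-access requests are not treated as consumed when the CA computes node utilization for scale-down.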

Which issue(s) this PR fixes:

Fixes #7685

Special notes for your reviewer:

Does this PR introduce a user-facing change?

ResourceClaims with AdminAccess will now be ignored when calculating node utilization for scale-down.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

- [KEP]: https://github.com/kubernetes/enhancements/blob/a55eefc6051d6684d8cc7521e1f4de6319625e23/keps/sig-auth/5018-dra-adminaccess/README.md

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 27, 2025
@k8s-ci-robot (Contributor)

Hi @MenD32. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: MenD32
Once this PR has been reviewed and has the lgtm label, please assign aleksandra-malinowska for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Apr 27, 2025
@@ -104,9 +104,10 @@ func getAllDevices(slices []*resourceapi.ResourceSlice) []resourceapi.Device {
func groupAllocatedDevices(claims []*resourceapi.ResourceClaim) (map[string]map[string][]string, error) {
	result := map[string]map[string][]string{}
	for _, claim := range claims {
		alloc := claim.Status.Allocation
		claimCopy := ClaimWithoutAdminAccessRequests(claim)
Contributor:

It might be cleaner to do this in CalculateDynamicResourceUtilization? We'd have to wrap an enumerator around ClaimWithoutAdminAccessRequests, but at least we could prune the claims data at the source, so that any additional downstream usages of it in the future get that pruning for free. (I don't think that CA would ever want to operate upon claim devices that are tagged for AdminAccess.)

wdyt?
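
For illustration only (a hypothetical sketch, not part of the diff): pruning at the source could be a thin wrapper, living in the same package as ClaimWithoutAdminAccessRequests, applied before any grouping so downstream consumers get the pruning for free:

```go
// claimsWithoutAdminAccessRequests returns a pruned copy of the claim list,
// with admin-access requests stripped from every claim via the helper
// introduced in this PR. The wrapper itself is only a sketch.
func claimsWithoutAdminAccessRequests(claims []*resourceapi.ResourceClaim) []*resourceapi.ResourceClaim {
	pruned := make([]*resourceapi.ResourceClaim, 0, len(claims))
	for _, claim := range claims {
		pruned = append(pruned, ClaimWithoutAdminAccessRequests(claim))
	}
	return pruned
}
```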

Contributor Author:

sounds good to me

Contributor Author:

Do you think other calls to groupAllocatedDevices would expect adminAccess resource requests to be removed?

@@ -353,6 +353,9 @@ func TestCalculateWithDynamicResources(t *testing.T) {
			wantErr: cmpopts.AnyError,
		},
	} {
		if tc.testName != "DRA slices and claims present, DRA enabled -> DRA util returned despite being lower than CPU" {
Contributor Author:

My bad, that was for debugging the test. I need to remove this.

@MenD32 MenD32 requested a review from jackfrancis April 29, 2025 10:07
@MenD32 MenD32 force-pushed the feat/dra-admin-access branch from 94b849e to bece2b0 Compare April 29, 2025 10:23
@jackfrancis (Contributor)

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 29, 2025
@jackfrancis (Contributor)

/lgtm

/assign @towca

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 29, 2025
@MenD32 (Contributor Author) commented May 26, 2025

@towca, could you please take a look at this PR when you get a chance?

	for i, claim := range claims {
		// remove AdminAccessRequests from the claim before calculating utilization
		claims[i] = ClaimWithoutAdminAccessRequests(claim)
	}
	allocatedDevices, err := groupAllocatedDevices(claims)
Collaborator:

Why not just modify groupAllocatedDevices to skip Devices with AdminAccess=true? Seems much simpler and doesn't require any copying.
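
For illustration only, a rough sketch of that alternative, assuming resourceapi is k8s.io/api/resource/v1beta1 as in the diff above and that the existing function has roughly the shape shown there (the real grouping logic may differ):

```go
// groupAllocatedDevices groups allocated device names by driver and pool,
// skipping allocation results that only grant admin access and therefore
// don't actually reserve the underlying device.
func groupAllocatedDevices(claims []*resourceapi.ResourceClaim) (map[string]map[string][]string, error) {
	result := map[string]map[string][]string{}
	for _, claim := range claims {
		alloc := claim.Status.Allocation
		if alloc == nil {
			continue
		}
		for _, device := range alloc.Devices.Results {
			if device.AdminAccess != nil && *device.AdminAccess {
				// Admin-access results don't block the device for other claims.
				continue
			}
			if result[device.Driver] == nil {
				result[device.Driver] = map[string][]string{}
			}
			result[device.Driver][device.Pool] = append(result[device.Driver][device.Pool], device.Device)
		}
	}
	return result, nil
}
```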

@@ -141,6 +142,20 @@ func TestDynamicResourceUtilization(t *testing.T) {
			wantHighestUtilization:     0.2,
			wantHighestUtilizationName: apiv1.ResourceName(fmt.Sprintf("%s/%s", fooDriver, "pool1")),
		},
		{
Collaborator:

Could you add a test case with both kinds of claims together?

	Status: resourceapi.ResourceClaimStatus{
		Allocation: &resourceapi.AllocationResult{
			Devices: resourceapi.DeviceAllocationResult{
				Results: []resourceapi.DeviceRequestAllocationResult{
					{Request: fmt.Sprintf("request-%d", podDevIndex), Driver: driverName, Pool: poolName, Device: devName},
					{Request: devReqName, Driver: driverName, Pool: poolName, Device: devName},
Collaborator:

The result has an AdminAccess field as well.
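
For reference, setting that field on the test fixture could look like the sketch below; the helper is hypothetical and assumes k8s.io/utils/ptr plus the same resourceapi alias as the snippet above:

```go
// adminAccessResult mirrors the allocation result above but marks it as
// admin-access only, so it should not count against pool capacity in the
// utilization calculation.
func adminAccessResult(request, driver, pool, device string) resourceapi.DeviceRequestAllocationResult {
	return resourceapi.DeviceRequestAllocationResult{
		Request:     request,
		Driver:      driver,
		Pool:        pool,
		Device:      device,
		AdminAccess: ptr.To(true),
	}
}
```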

@@ -72,6 +72,7 @@ func SanitizedPodResourceClaims(newOwner, oldOwner *v1.Pod, claims []*resourceap
		claimCopy.UID = uuid.NewUUID()
		claimCopy.Name = fmt.Sprintf("%s-%s", claim.Name, nameSuffix)
		claimCopy.OwnerReferences = []metav1.OwnerReference{PodClaimOwnerReference(newOwner)}
		claimCopy = ClaimWithoutAdminAccessRequests(claimCopy)
Collaborator:

Hmm, it doesn't feel right modifying the claims to remove the admin-access allocation results. The claims are checked and possibly allocated by the DRA scheduler plugin during Filter. They just don't "block" the allocated devices from being allocated for other claims. If we remove the results we essentially have an "invalid" allocated claim where not all requests have an allocation. Not sure if the DRA scheduler plugin Filters would pass for such a Pod.

IMO we should duplicate the admin-access allocation results that are not Node-local without sanitization. The non-Node-local devices are then "double-booked", but this is fine because admin-access doesn't actually book them. We should still sanitize the Node-local results to avoid pointing to devices that definitely aren't available on the new Node. This should leave the claim in a relatively valid state - Node-local allocations are correctly pointing to the devices from the new Node, non-Node-local allocations point to the same devices as the initial claim did. The only assumption we're making is that if a non-Node-local device is available on oldNode, it will be available on newNode as well.

It seems that we can just slightly modify this function to achieve this:

  • If a result Pool isn't in oldNodePoolNames but it has admin-access set - add it to sanitizedAllocations as-is instead of returning an error.
  • I wonder if we can just remove the nodeSelector check, it's pretty redundant with checking against oldNodePoolNames. Otherwise we'd have to move the check after sanitizing result and do something like "don't error if there were any non-Node-local, admin-access results during sanitization".
  • I'd also definitely add new test cases to unit tests for this function.
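
For illustration only, a rough sketch of the suggested handling. The function name, the single newNodePoolName parameter, and the error text are hypothetical; sanitizedAllocations and oldNodePoolNames are taken from the comment above, and fmt plus the resourceapi alias are assumed. The real sanitization logic in SanitizedPodResourceClaims may differ:

```go
// sanitizeAllocationResults keeps non-Node-local admin-access results as-is
// (they don't book the device, so "double-booking" is fine), rewrites
// Node-local results to point at the new Node's pool, and rejects any other
// non-Node-local result.
func sanitizeAllocationResults(results []resourceapi.DeviceRequestAllocationResult, oldNodePoolNames map[string]bool, newNodePoolName string) ([]resourceapi.DeviceRequestAllocationResult, error) {
	sanitizedAllocations := make([]resourceapi.DeviceRequestAllocationResult, 0, len(results))
	for _, result := range results {
		if !oldNodePoolNames[result.Pool] {
			if result.AdminAccess != nil && *result.AdminAccess {
				// Admin-access result pointing outside the old Node's pools:
				// copy it unchanged instead of returning an error.
				sanitizedAllocations = append(sanitizedAllocations, result)
				continue
			}
			return nil, fmt.Errorf("allocation references pool %q which is not local to the old Node", result.Pool)
		}
		// Node-local result: point it at the corresponding pool on the new Node
		// (simplified here to a single pool name).
		result.Pool = newNodePoolName
		sanitizedAllocations = append(sanitizedAllocations, result)
	}
	return sanitizedAllocations, nil
}
```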

Labels
area/cluster-autoscaler
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
kind/feature (Categorizes issue or PR as related to a new feature.)
lgtm ("Looks good to me", indicates that a PR is ready to be merged.)
ok-to-test (Indicates a non-member PR verified by an org member that is safe to test.)
size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

CA DRA: support DRA AdminAccess
4 participants