feat: Support DRA Admin Access #8063
base: master
Conversation
Signed-off-by: MenD32 <[email protected]>
Hi @MenD32. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: MenD32 The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: MenD32 <[email protected]>
Signed-off-by: MenD32 <[email protected]>
Signed-off-by: MenD32 <[email protected]>
…s empty Signed-off-by: MenD32 <[email protected]>
@@ -104,9 +104,10 @@ func getAllDevices(slices []*resourceapi.ResourceSlice) []resourceapi.Device {
func groupAllocatedDevices(claims []*resourceapi.ResourceClaim) (map[string]map[string][]string, error) {
	result := map[string]map[string][]string{}
	for _, claim := range claims {
		alloc := claim.Status.Allocation
		claimCopy := ClaimWithoutAdminAccessRequests(claim)
It might be cleaner to do this in CalculateDynamicResourceUtilization? We'd have to wrap an enumerator around ClaimWithoutAdminAccessRequests, but at least we could prune the claims data at the source so that any additional downstream usages of it in the future get that pruning for free. (I don't think that CA would ever want to operate upon claim devices that are tagged for AdminAccess.)
wdyt?
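For illustration, a minimal sketch of that prune-at-the-source idea, assuming the v1beta1 resource API and the ClaimWithoutAdminAccessRequests helper introduced in this PR; the wrapper name pruneAdminAccessRequests and its package placement are assumptions, not the actual CA code:

```go
package utilization

import (
	resourceapi "k8s.io/api/resource/v1beta1"
)

// pruneAdminAccessRequests maps ClaimWithoutAdminAccessRequests (the helper added
// in this PR) over the claim list once, so every downstream consumer only ever
// sees allocations that actually reserve devices.
func pruneAdminAccessRequests(claims []*resourceapi.ResourceClaim) []*resourceapi.ResourceClaim {
	pruned := make([]*resourceapi.ResourceClaim, 0, len(claims))
	for _, claim := range claims {
		pruned = append(pruned, ClaimWithoutAdminAccessRequests(claim))
	}
	return pruned
}
```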
sounds good to me
Do you think other calls to groupAllocatedDevices would expect adminAccess resource requests to be removed?
@@ -353,6 +353,9 @@ func TestCalculateWithDynamicResources(t *testing.T) {
		wantErr: cmpopts.AnyError,
	},
} {
	if tc.testName != "DRA slices and claims present, DRA enabled -> DRA util returned despite being lower than CPU" {
my bad, that was for debugging the test, i need to remove this
Signed-off-by: MenD32 <[email protected]>
Force-pushed from 94b849e to bece2b0
/ok-to-test
/lgtm
/assign @towca
@towca, could you please take a look at this PR when you get a chance?
for i, claim := range claims {
	// remove AdminAccessRequests from the claim before calculating utilization
	claims[i] = ClaimWithoutAdminAccessRequests(claim)
}
allocatedDevices, err := groupAllocatedDevices(claims)
Why not just modify groupAllocatedDevices to skip Devices with AdminAccess=true? Seems much simpler and doesn't require any copying.
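For comparison, a sketch of that alternative under the same assumptions (v1beta1 DRA types, the driver/pool/device map layout implied by the quoted diff); the function name and the dropped error return are illustrative only:

```go
package utilization

import (
	resourceapi "k8s.io/api/resource/v1beta1"
)

// groupAllocatedDevicesSkippingAdminAccess skips allocation results whose
// AdminAccess flag is set instead of copying the claims up front: admin-access
// results observe a device without reserving it, so they shouldn't count
// towards utilization.
func groupAllocatedDevicesSkippingAdminAccess(claims []*resourceapi.ResourceClaim) map[string]map[string][]string {
	result := map[string]map[string][]string{}
	for _, claim := range claims {
		alloc := claim.Status.Allocation
		if alloc == nil {
			continue
		}
		for _, device := range alloc.Devices.Results {
			if device.AdminAccess != nil && *device.AdminAccess {
				continue
			}
			if result[device.Driver] == nil {
				result[device.Driver] = map[string][]string{}
			}
			result[device.Driver][device.Pool] = append(result[device.Driver][device.Pool], device.Device)
		}
	}
	return result
}
```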
@@ -141,6 +142,20 @@ func TestDynamicResourceUtilization(t *testing.T) {
		wantHighestUtilization:     0.2,
		wantHighestUtilizationName: apiv1.ResourceName(fmt.Sprintf("%s/%s", fooDriver, "pool1")),
	},
	{
Could you add a test case with both kinds of claims together?
Status: resourceapi.ResourceClaimStatus{
	Allocation: &resourceapi.AllocationResult{
		Devices: resourceapi.DeviceAllocationResult{
			Results: []resourceapi.DeviceRequestAllocationResult{
				{Request: fmt.Sprintf("request-%d", podDevIndex), Driver: driverName, Pool: poolName, Device: devName},
				{Request: devReqName, Driver: driverName, Pool: poolName, Device: devName},
The result has an AdminAccess field as well.
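In other words (a hedged example with placeholder values; only the field names come from the v1beta1 API), the admin-access fixture would carry the flag on the allocation result too:

```go
package utilization_test

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	"k8s.io/utils/ptr"
)

// Illustrative fixture only: an allocation result for an admin-access request
// sets AdminAccess on the result, mirroring the flag on the claim's request.
var adminAccessResult = resourceapi.DeviceRequestAllocationResult{
	Request:     "request-0",
	Driver:      "foo.example.com",
	Pool:        "pool1",
	Device:      "dev1",
	AdminAccess: ptr.To(true),
}
```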
@@ -72,6 +72,7 @@ func SanitizedPodResourceClaims(newOwner, oldOwner *v1.Pod, claims []*resourceap
	claimCopy.UID = uuid.NewUUID()
	claimCopy.Name = fmt.Sprintf("%s-%s", claim.Name, nameSuffix)
	claimCopy.OwnerReferences = []metav1.OwnerReference{PodClaimOwnerReference(newOwner)}
	claimCopy = ClaimWithoutAdminAccessRequests(claimCopy)
Hmm, it doesn't feel right modifying the claims to remove the admin-access allocation results. The claims are checked and possibly allocated by the DRA scheduler plugin during Filter. They just don't "block" the allocated devices from being allocated for other claims. If we remove the results, we essentially have an "invalid" allocated claim where not all requests have an allocation. Not sure if the DRA scheduler plugin Filters would pass for such a Pod.
IMO we should duplicate the admin-access allocation results that are not Node-local without sanitization. The non-Node-local devices are then "double-booked", but this is fine because admin-access doesn't actually book them. We should still sanitize the Node-local results to avoid pointing to devices that definitely aren't available on the new Node. This should leave the claim in a relatively valid state - Node-local allocations are correctly pointing to the devices from the new Node, non-Node-local allocations point to the same devices as the initial claim did. The only assumption we're making is that if a non-Node-local device is available on oldNode, it will be available on newNode as well.
It seems that we can just slightly modify this function to achieve this (see the sketch below):

- If a result's Pool isn't in oldNodePoolNames but it has admin-access set, add it to sanitizedAllocations as-is instead of returning an error.
- I wonder if we can just remove the nodeSelector check, it's pretty redundant with checking against oldNodePoolNames. Otherwise we'd have to move the check after sanitizing the results and do something like "don't error if there were any non-Node-local, admin-access results during sanitization".
- I'd also definitely add new test cases to the unit tests for this function.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. ResourceClaims support the AdminAccess field, which lets cluster administrators access devices that are already in use. This changes the CA's business logic by introducing the idea that some ResourceClaims don't reserve their allocated devices.
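For context, a hedged illustration (v1beta1 field names, placeholder names and values) of what such a claim looks like at the API level; the scheduler allocates a device to it, but the device stays available for ordinary claims:

```go
package example

import (
	resourceapi "k8s.io/api/resource/v1beta1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// adminAccessClaim requests admin access to one device: it gets an allocation,
// but does not block other claims from being allocated the same device.
var adminAccessClaim = &resourceapi.ResourceClaim{
	ObjectMeta: metav1.ObjectMeta{Name: "gpu-monitor", Namespace: "kube-system"},
	Spec: resourceapi.ResourceClaimSpec{
		Devices: resourceapi.DeviceClaim{
			Requests: []resourceapi.DeviceRequest{{
				Name:            "monitor",
				DeviceClassName: "gpu.example.com",
				AllocationMode:  resourceapi.DeviceAllocationModeExactCount,
				Count:           1,
				AdminAccess:     ptr.To(true),
			}},
		},
	},
}
```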
Which issue(s) this PR fixes:
Fixes #7685
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: