fix(model-serving-controller): exclude RoleDeleting entries from acti…#1156
fix(model-serving-controller): exclude RoleDeleting entries from acti…#1156jiahuat wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request fixes a bug where roles in a RoleDeleting status (whose pods are still terminating) were incorrectly counted towards the active replica count, preventing necessary scale-up operations. The fix introduces logic in both manageRoleReplicas and scaleUpRoles to count only active (non-deleting) roles when making scaling decisions. Additionally, a comprehensive suite of unit tests has been added to verify this behavior and prevent future regressions. No review comments were provided, so there is no additional feedback to address.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes a scale-up edge case where RoleDeleting entries in the datastore incorrectly count toward current replicas, preventing new roles from being created while old pods are still terminating.
Changes:
- Update scale-up calculations to use an “active (non-Deleting) role” count when deciding whether/how many new roles to create.
- Add a regression unit test reproducing the
RoleDeletingscale-up block scenario.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/model-serving-controller/controller/model_serving_controller.go | Adjusts replica decision logic to ignore RoleDeleting entries for scale-up calculations and logging. |
| pkg/model-serving-controller/controller/manage_role_replicas_test.go | Adds a regression test ensuring scale-up proceeds when RoleDeleting roles exist in the store. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| } else if len(roleList) > expectedCount { | ||
| klog.V(2).Infof("manageRoleReplicas: scaling DOWN role %s in ServingGroup %s: current=%d, expected=%d", targetRole.Name, groupName, len(roleList), expectedCount) |
| // Count only active (non-Deleting) roles to determine how many new roles to create. | ||
| // roleList may contain RoleDeleting entries whose pods are still terminating; | ||
| // those slots are already accounted for via startingIndex (index conflict avoidance), | ||
| // but must not reduce the number of new roles we need to create. | ||
| activeCount := 0 | ||
| for _, r := range roleList { | ||
| if r.Status != datastore.RoleDeleting { | ||
| activeCount++ | ||
| } | ||
| } |
| // This file contains unit tests for the manageRoleReplicas function, specifically | ||
| // targeting the bug where RoleDeleting entries are counted in len(roleList), which | ||
| // prevents scale-up when the Deleting count makes len(roleList) == expectedCount. | ||
| // | ||
| // Bug location 1 — manageRoleReplicas (model_serving_controller.go ~line 975): | ||
| // | ||
| // roleList, _ := c.store.GetRoleList(...) // returns ALL roles including RoleDeleting | ||
| // if len(roleList) < expectedCount { // BUG: len includes RoleDeleting | ||
| // c.scaleUpRoles(...) // never entered when len==expected | ||
| // } | ||
| // | ||
| // Bug location 2 — scaleUpRoles (~line 901): | ||
| // | ||
| // toCreate := expectedCount - len(roleList) // BUG: same full len, toCreate=0 if called | ||
| // | ||
| // Scenario that triggers the bug: | ||
| // 1. User sets role.Replicas=2, controller scales up: store={infer-0(Running), infer-1(Running)} | ||
| // 2. User sets role.Replicas=1, controller scales down: | ||
| // store={infer-0(Running), infer-1(Deleting)}, infer-1 pod starts terminating | ||
| // 3. User sets role.Replicas=2 again (before infer-1 pod terminates) | ||
| // 4. manageRoleReplicas: len(roleList)=2 == expectedCount=2 → NO scale-up (Bug1) | ||
| // scaleUpRoles: toCreate = 2-2 = 0 → NO pod created (Bug2, blocks even if Bug1 fixed) | ||
| // 5. Scale-up is blocked until infer-1 pod terminates and store.DeleteRole is called | ||
| // 6. After controller restart, syncAll() skips Terminating pods → store={infer-0} | ||
| // → manageRoleReplicas: len=1 < expected=2 → scale-up succeeds | ||
| // | ||
| // Full fix requires: | ||
| // A. In manageRoleReplicas: change `len(roleList) < expectedCount` to `activeCount < expectedCount` | ||
| // where activeCount counts only non-Deleting roles | ||
| // B. In scaleUpRoles: change `toCreate := expectedCount - len(roleList)` to | ||
| // `toCreate := expectedCount - activeCount` (using non-Deleting count from full roleList) | ||
| // while keeping `startingIndex` based on full roleList (to avoid index conflicts with | ||
| // the still-Deleting roles) |
| } else if len(roleList) > expectedCount { | ||
| klog.V(2).Infof("manageRoleReplicas: scaling DOWN role %s in ServingGroup %s: current=%d, expected=%d", targetRole.Name, groupName, len(roleList), expectedCount) |
…ve role count to fix scale-up being silently skipped Signed-off-by: jiahuat <jiahua.zhong@msxf.com>
Test cases have been updated. |
fix(model-serving-controller): exclude RoleDeleting entries from active role count to fix scale-up being silently skipped
What type of PR is this?
/kind bug
What this PR does / why we need it:
When a user increases
role.Replicas(e.g. 1→2) while a previous pod is stillin Terminating state (e.g. stuck with RDMA finalizers), the in-memory store
holds both a
Runningentry and aDeletingentry for the same role.The scale-up logic was using
len(roleList)to compare againstexpectedCount.Because
len()countsRoleDeletingentries, the conditionlen(roleList) < expectedCountevaluated as2 < 2 = false, causing scale-up to be silentlyskipped — the new replica was never created until the controller restarted.
Two locations were affected:
manageRoleReplicas: the scale-up branch condition used the fulllen(roleList).scaleUpRoles:toCreate := expectedCount - len(roleList)computed zero evenif the function was somehow reached.
Fix: introduce
activeRoleCount/activeCountthat excludeRoleDeletingentries, and use these counts for the scale-up condition and
toCreatecalculation.
startingIndexstill uses the fulllen(roleList)to avoidrole-ID collisions with still-Deleting roles.
A regression test is added in
manage_role_replicas_test.gothat directly seedsthe store with one
Running+ oneDeletingentry and asserts thatmanageRoleReplicascreates a third entry; the test fails before the fix andpasses after.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
startingIndexinscaleUpRolesintentionally still useslen(roleList)(fulllist including Deleting entries) to ensure newly created role IDs do not collide
with roles that are still terminating in the cluster.
The scale-down path (
len(roleList) > expectedCount) is unchanged — it alreadycorrectly selects
RoleDeletingroles first for deletion(
PriorityRoleDeleting = 0 < PriorityRoleRunning = 1), andDeleteRolehas anearly-return guard for already-Deleting roles to prevent double-deletion.
Does this PR introduce a user-facing change?: