Skip to content

fix(model-serving-controller): exclude RoleDeleting entries from acti…#1156

Open
jiahuat wants to merge 1 commit into
volcano-sh:mainfrom
jiahuat:fix/replica
Open

fix(model-serving-controller): exclude RoleDeleting entries from acti…#1156
jiahuat wants to merge 1 commit into
volcano-sh:mainfrom
jiahuat:fix/replica

Conversation

@jiahuat

@jiahuat jiahuat commented May 27, 2026

Copy link
Copy Markdown

fix(model-serving-controller): exclude RoleDeleting entries from active role count to fix scale-up being silently skipped

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a user increases role.Replicas (e.g. 1→2) while a previous pod is still
in Terminating state (e.g. stuck with RDMA finalizers), the in-memory store
holds both a Running entry and a Deleting entry for the same role.

The scale-up logic was using len(roleList) to compare against expectedCount.
Because len() counts RoleDeleting entries, the condition len(roleList) < expectedCount evaluated as 2 < 2 = false, causing scale-up to be silently
skipped — the new replica was never created until the controller restarted.

Two locations were affected:

  1. manageRoleReplicas: the scale-up branch condition used the full len(roleList).
  2. scaleUpRoles: toCreate := expectedCount - len(roleList) computed zero even
    if the function was somehow reached.

Fix: introduce activeRoleCount / activeCount that exclude RoleDeleting
entries, and use these counts for the scale-up condition and toCreate
calculation. startingIndex still uses the full len(roleList) to avoid
role-ID collisions with still-Deleting roles.

A regression test is added in manage_role_replicas_test.go that directly seeds
the store with one Running + one Deleting entry and asserts that
manageRoleReplicas creates a third entry; the test fails before the fix and
passes after.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

startingIndex in scaleUpRoles intentionally still uses len(roleList) (full
list including Deleting entries) to ensure newly created role IDs do not collide
with roles that are still terminating in the cluster.

The scale-down path (len(roleList) > expectedCount) is unchanged — it already
correctly selects RoleDeleting roles first for deletion
(PriorityRoleDeleting = 0 < PriorityRoleRunning = 1), and DeleteRole has an
early-return guard for already-Deleting roles to prevent double-deletion.

Does this PR introduce a user-facing change?:

Fixed a bug where scaling up a ModelServing role was silently skipped when a
previous pod was still terminating: the controller now counts only active
(non-Deleting) roles when deciding whether to create new replicas.

Copilot AI review requested due to automatic review settings May 27, 2026 07:28
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes a bug where roles in a RoleDeleting status (whose pods are still terminating) were incorrectly counted towards the active replica count, preventing necessary scale-up operations. The fix introduces logic in both manageRoleReplicas and scaleUpRoles to count only active (non-deleting) roles when making scaling decisions. Additionally, a comprehensive suite of unit tests has been added to verify this behavior and prevent future regressions. No review comments were provided, so there is no additional feedback to address.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a scale-up edge case where RoleDeleting entries in the datastore incorrectly count toward current replicas, preventing new roles from being created while old pods are still terminating.

Changes:

  • Update scale-up calculations to use an “active (non-Deleting) role” count when deciding whether/how many new roles to create.
  • Add a regression unit test reproducing the RoleDeleting scale-up block scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pkg/model-serving-controller/controller/model_serving_controller.go Adjusts replica decision logic to ignore RoleDeleting entries for scale-up calculations and logging.
pkg/model-serving-controller/controller/manage_role_replicas_test.go Adds a regression test ensuring scale-up proceeds when RoleDeleting roles exist in the store.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1025 to 1026
} else if len(roleList) > expectedCount {
klog.V(2).Infof("manageRoleReplicas: scaling DOWN role %s in ServingGroup %s: current=%d, expected=%d", targetRole.Name, groupName, len(roleList), expectedCount)
Comment on lines +924 to +933
// Count only active (non-Deleting) roles to determine how many new roles to create.
// roleList may contain RoleDeleting entries whose pods are still terminating;
// those slots are already accounted for via startingIndex (index conflict avoidance),
// but must not reduce the number of new roles we need to create.
activeCount := 0
for _, r := range roleList {
if r.Status != datastore.RoleDeleting {
activeCount++
}
}
Comment on lines +19 to +51
// This file contains unit tests for the manageRoleReplicas function, specifically
// targeting the bug where RoleDeleting entries are counted in len(roleList), which
// prevents scale-up when the Deleting count makes len(roleList) == expectedCount.
//
// Bug location 1 — manageRoleReplicas (model_serving_controller.go ~line 975):
//
// roleList, _ := c.store.GetRoleList(...) // returns ALL roles including RoleDeleting
// if len(roleList) < expectedCount { // BUG: len includes RoleDeleting
// c.scaleUpRoles(...) // never entered when len==expected
// }
//
// Bug location 2 — scaleUpRoles (~line 901):
//
// toCreate := expectedCount - len(roleList) // BUG: same full len, toCreate=0 if called
//
// Scenario that triggers the bug:
// 1. User sets role.Replicas=2, controller scales up: store={infer-0(Running), infer-1(Running)}
// 2. User sets role.Replicas=1, controller scales down:
// store={infer-0(Running), infer-1(Deleting)}, infer-1 pod starts terminating
// 3. User sets role.Replicas=2 again (before infer-1 pod terminates)
// 4. manageRoleReplicas: len(roleList)=2 == expectedCount=2 → NO scale-up (Bug1)
// scaleUpRoles: toCreate = 2-2 = 0 → NO pod created (Bug2, blocks even if Bug1 fixed)
// 5. Scale-up is blocked until infer-1 pod terminates and store.DeleteRole is called
// 6. After controller restart, syncAll() skips Terminating pods → store={infer-0}
// → manageRoleReplicas: len=1 < expected=2 → scale-up succeeds
//
// Full fix requires:
// A. In manageRoleReplicas: change `len(roleList) < expectedCount` to `activeCount < expectedCount`
// where activeCount counts only non-Deleting roles
// B. In scaleUpRoles: change `toCreate := expectedCount - len(roleList)` to
// `toCreate := expectedCount - activeCount` (using non-Deleting count from full roleList)
// while keeping `startingIndex` based on full roleList (to avoid index conflicts with
// the still-Deleting roles)
Comment on lines 1025 to 1026
} else if len(roleList) > expectedCount {
klog.V(2).Infof("manageRoleReplicas: scaling DOWN role %s in ServingGroup %s: current=%d, expected=%d", targetRole.Name, groupName, len(roleList), expectedCount)

@nXtCyberNet nXtCyberNet left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pr , i want to add a small suggestion , can you update the first test to check explecity for the new role created name , not required strictly , but it will be a good upgrade the regression coverage.

…ve role count to fix scale-up being silently skipped

Signed-off-by: jiahuat <jiahua.zhong@msxf.com>
Copilot AI review requested due to automatic review settings May 28, 2026 01:57
@jiahuat

jiahuat commented May 28, 2026

Copy link
Copy Markdown
Author

Thanks for pr , i want to add a small suggestion , can you update the first test to check explecity for the new role created name , not required strictly , but it will be a good upgrade the regression coverage.

Test cases have been updated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants