fix(model-serving-controller): exclude RoleDeleting entries from acti… by jiahuat · Pull Request #1156 · volcano-sh/kthena

jiahuat · 2026-05-27T07:28:40Z

fix(model-serving-controller): exclude RoleDeleting entries from active role count to fix scale-up being silently skipped

What type of PR is this?

/kind bug

What this PR does / why we need it:

When a user increases role.Replicas (e.g. 1→2) while a previous pod is still
in Terminating state (e.g. stuck with RDMA finalizers), the in-memory store
holds both a Running entry and a Deleting entry for the same role.

The scale-up logic was using len(roleList) to compare against expectedCount.
Because len() counts RoleDeleting entries, the condition len(roleList) < expectedCount evaluated as 2 < 2 = false, causing scale-up to be silently
skipped — the new replica was never created until the controller restarted.

Two locations were affected:

manageRoleReplicas: the scale-up branch condition used the full len(roleList).
scaleUpRoles: toCreate := expectedCount - len(roleList) computed zero even
if the function was somehow reached.

Fix: introduce activeRoleCount / activeCount that exclude RoleDeleting
entries, and use these counts for the scale-up condition and toCreate
calculation. startingIndex still uses the full len(roleList) to avoid
role-ID collisions with still-Deleting roles.

A regression test is added in manage_role_replicas_test.go that directly seeds
the store with one Running + one Deleting entry and asserts that
manageRoleReplicas creates a third entry; the test fails before the fix and
passes after.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

startingIndex in scaleUpRoles intentionally still uses len(roleList) (full
list including Deleting entries) to ensure newly created role IDs do not collide
with roles that are still terminating in the cluster.

The scale-down path (len(roleList) > expectedCount) is unchanged — it already
correctly selects RoleDeleting roles first for deletion
(PriorityRoleDeleting = 0 < PriorityRoleRunning = 1), and DeleteRole has an
early-return guard for already-Deleting roles to prevent double-deletion.

Does this PR introduce a user-facing change?:

Fixed a bug where scaling up a ModelServing role was silently skipped when a
previous pod was still terminating: the controller now counts only active
(non-Deleting) roles when deciding whether to create new replicas.

volcano-sh-bot · 2026-05-27T07:28:51Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Code Review

This pull request fixes a bug where roles in a RoleDeleting status (whose pods are still terminating) were incorrectly counted towards the active replica count, preventing necessary scale-up operations. The fix introduces logic in both manageRoleReplicas and scaleUpRoles to count only active (non-deleting) roles when making scaling decisions. Additionally, a comprehensive suite of unit tests has been added to verify this behavior and prevent future regressions. No review comments were provided, so there is no additional feedback to address.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Fixes a scale-up edge case where RoleDeleting entries in the datastore incorrectly count toward current replicas, preventing new roles from being created while old pods are still terminating.

Changes:

Update scale-up calculations to use an “active (non-Deleting) role” count when deciding whether/how many new roles to create.
Add a regression unit test reproducing the RoleDeleting scale-up block scenario.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
pkg/model-serving-controller/controller/model_serving_controller.go	Adjusts replica decision logic to ignore `RoleDeleting` entries for scale-up calculations and logging.
pkg/model-serving-controller/controller/manage_role_replicas_test.go	Adds a regression test ensuring scale-up proceeds when `RoleDeleting` roles exist in the store.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

 	} else if len(roleList) > expectedCount {
 		klog.V(2).Infof("manageRoleReplicas: scaling DOWN role %s in ServingGroup %s: current=%d, expected=%d", targetRole.Name, groupName, len(roleList), expectedCount)


+	// Count only active (non-Deleting) roles to determine how many new roles to create.
+	// roleList may contain RoleDeleting entries whose pods are still terminating;
+	// those slots are already accounted for via startingIndex (index conflict avoidance),
+	// but must not reduce the number of new roles we need to create.
+	activeCount := 0
+	for _, r := range roleList {
+		if r.Status != datastore.RoleDeleting {
+			activeCount++
+		}
+	}


+// This file contains unit tests for the manageRoleReplicas function, specifically
+// targeting the bug where RoleDeleting entries are counted in len(roleList), which
+// prevents scale-up when the Deleting count makes len(roleList) == expectedCount.
+//
+// Bug location 1 — manageRoleReplicas (model_serving_controller.go ~line 975):
+//
+//	roleList, _ := c.store.GetRoleList(...)  // returns ALL roles including RoleDeleting
+//	if len(roleList) < expectedCount {       // BUG: len includes RoleDeleting
+//	    c.scaleUpRoles(...)                  // never entered when len==expected
+//	}
+//
+// Bug location 2 — scaleUpRoles (~line 901):
+//
+//	toCreate := expectedCount - len(roleList) // BUG: same full len, toCreate=0 if called
+//
+// Scenario that triggers the bug:
+//  1. User sets role.Replicas=2, controller scales up: store={infer-0(Running), infer-1(Running)}
+//  2. User sets role.Replicas=1, controller scales down:
+//     store={infer-0(Running), infer-1(Deleting)}, infer-1 pod starts terminating
+//  3. User sets role.Replicas=2 again (before infer-1 pod terminates)
+//  4. manageRoleReplicas: len(roleList)=2 == expectedCount=2 → NO scale-up (Bug1)
+//     scaleUpRoles: toCreate = 2-2 = 0 → NO pod created (Bug2, blocks even if Bug1 fixed)
+//  5. Scale-up is blocked until infer-1 pod terminates and store.DeleteRole is called
+//  6. After controller restart, syncAll() skips Terminating pods → store={infer-0}
+//     → manageRoleReplicas: len=1 < expected=2 → scale-up succeeds
+//
+// Full fix requires:
+//   A. In manageRoleReplicas: change `len(roleList) < expectedCount` to `activeCount < expectedCount`
+//      where activeCount counts only non-Deleting roles
+//   B. In scaleUpRoles: change `toCreate := expectedCount - len(roleList)` to
+//      `toCreate := expectedCount - activeCount` (using non-Deleting count from full roleList)
+//      while keeping `startingIndex` based on full roleList (to avoid index conflicts with
+//      the still-Deleting roles)


 	} else if len(roleList) > expectedCount {
 		klog.V(2).Infof("manageRoleReplicas: scaling DOWN role %s in ServingGroup %s: current=%d, expected=%d", targetRole.Name, groupName, len(roleList), expectedCount)


nXtCyberNet

Thanks for pr , i want to add a small suggestion , can you update the first test to check explecity for the new role created name , not required strictly , but it will be a good upgrade the regression coverage.

…ve role count to fix scale-up being silently skipped Signed-off-by: jiahuat <jiahua.zhong@msxf.com>

jiahuat · 2026-05-28T01:59:18Z

Thanks for pr , i want to add a small suggestion , can you update the first test to check explecity for the new role created name , not required strictly , but it will be a good upgrade the regression coverage.

Test cases have been updated.

Copilot AI review requested due to automatic review settings May 27, 2026 07:28

volcano-sh-bot added the kind/bug label May 27, 2026

volcano-sh-bot requested review from YaoZengzeng and hzxuzhonghu May 27, 2026 07:28

volcano-sh-bot added the size/L label May 27, 2026

gemini-code-assist Bot reviewed May 27, 2026

View reviewed changes

Copilot AI reviewed May 27, 2026

View reviewed changes

jiahuat force-pushed the fix/replica branch from debe1bf to 57c3389 Compare May 27, 2026 07:54

nXtCyberNet reviewed May 27, 2026

View reviewed changes

fix(model-serving-controller): exclude RoleDeleting entries from acti…

55f3fe5

…ve role count to fix scale-up being silently skipped Signed-off-by: jiahuat <jiahua.zhong@msxf.com>

Copilot AI review requested due to automatic review settings May 28, 2026 01:57

jiahuat force-pushed the fix/replica branch from 57c3389 to 55f3fe5 Compare May 28, 2026 01:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(model-serving-controller): exclude RoleDeleting entries from acti…#1156

fix(model-serving-controller): exclude RoleDeleting entries from acti…#1156
jiahuat wants to merge 1 commit into
volcano-sh:mainfrom
jiahuat:fix/replica

jiahuat commented May 27, 2026

Uh oh!

volcano-sh-bot commented May 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

nXtCyberNet left a comment •

edited

Loading

Uh oh!

jiahuat commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

		} else if len(roleList) > expectedCount {
		klog.V(2).Infof("manageRoleReplicas: scaling DOWN role %s in ServingGroup %s: current=%d, expected=%d", targetRole.Name, groupName, len(roleList), expectedCount)

Conversation

jiahuat commented May 27, 2026

Uh oh!

volcano-sh-bot commented May 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

nXtCyberNet left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jiahuat commented May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

nXtCyberNet left a comment •

edited

Loading