[Feature] Support RoleMaxUnavailable in RoleRollingUpdate#1186

Open
LiZhenCheng9527 wants to merge 9 commits into
volcano-sh:mainfrom
LiZhenCheng9527:role-rolling

Conversation

@LiZhenCheng9527

@LiZhenCheng9527 LiZhenCheng9527 commented Jun 8, 2026

Collaborator

What type of PR is this?
/kind feature

What this PR does / why we need it:

Previously, during a RoleRollingUpdate, all outdated roles in the servingGroup were deleted outright. This resulted in the service becoming unavailable when servingGroup.Replicas=1.

This PR introduces the RoleMaxUnavailable field, which limits how many role replicas may be unavailable at a time during a RoleRollingUpdate. Outdated roles are then replaced in bounded steps instead of all being deleted in one go.

Which issue(s) this PR fixes:
Fixes #1188

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Copilot AI review requested due to automatic review settings June 8, 2026 12:36
@volcano-sh-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request introduces the roleMaxUnavailable configuration field to the RollingUpdateConfiguration for RoleRollingUpdate rollout strategies. This bounds the number of outdated Role replicas that can be simultaneously unavailable within a single ServingGroup during a rolling update, preventing service pauses. The changes include updates to API definitions, controller reconciliation logic, validation webhooks, unit tests, and E2E tests. A critical feedback item was raised regarding a potential nil pointer dereference in collectGroupRoleUpdateState when role.Replicas is omitted, which should be resolved by adding a nil check and defaulting to 1.

expectedReplicasByRole := make(map[string]int, len(ms.Spec.Template.Roles))
expectedHashByRole := make(map[string]string, len(ms.Spec.Template.Roles))
for _, role := range ms.Spec.Template.Roles {
	replicas := int(*role.Replicas)

Contributor

high

The role.Replicas field is optional and can be nil. Directly dereferencing it with *role.Replicas without a nil check will cause a panic (nil pointer dereference) if a user omits this field in the spec. Please add a nil check and default to 1 (similar to how it is handled in the validator).

Suggested change
- replicas := int(*role.Replicas)
+ replicas := 1
+ if role.Replicas != nil {
+ 	replicas = int(*role.Replicas)
+ }

Copilot AI left a comment

Contributor

Pull request overview

Adds a new rollout budget (roleMaxUnavailable) for RoleRollingUpdate to prevent all outdated role replicas in a ServingGroup from being terminated at once (avoiding service downtime when spec.replicas=1), and wires it through API, validation, controller behavior, and tests.

Changes:

  • Introduces spec.rolloutStrategy.rollingUpdateConfiguration.roleMaxUnavailable (API type, deepcopy/applyconfig, CRD, and docs).
  • Adds webhook validation for roleMaxUnavailable (int/percent, non-zero, only valid for RoleRollingUpdate).
  • Updates the controller’s RoleRollingUpdate eviction logic and adds/updates unit + e2e coverage for bounded role updates.
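
To make the int/percent and non-zero rules concrete, here is a rough sketch of how such a value could be resolved to an absolute replica count. The function `resolveRoleMaxUnavailable` is hypothetical and uses plain strings instead of the Kubernetes `intstr` type; rounding percentages down is an assumption borrowed from how Deployment `maxUnavailable` behaves, not something this PR states.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// resolveRoleMaxUnavailable is a hypothetical helper that turns "2" or "50%"
// into an absolute count for a role with the given number of replicas.
// Assumption: percentages are rounded down.
func resolveRoleMaxUnavailable(value string, replicas int) (int, error) {
	isPercent := strings.HasSuffix(value, "%")
	raw, err := strconv.Atoi(strings.TrimSuffix(value, "%"))
	if err != nil {
		return 0, fmt.Errorf("roleMaxUnavailable %q is neither an integer nor a percentage", value)
	}
	n := raw
	if isPercent {
		n = replicas * raw / 100
	}
	// A budget of zero would stall the rollout, so it is rejected after scaling.
	if n <= 0 {
		return 0, fmt.Errorf("roleMaxUnavailable %q resolves to %d for %d replicas; it must be at least 1", value, n, replicas)
	}
	return n, nil
}

func main() {
	cases := []struct {
		value    string
		replicas int
	}{{"1", 1}, {"50%", 4}, {"25%", 2}}
	for _, tc := range cases {
		n, err := resolveRoleMaxUnavailable(tc.value, tc.replicas)
		fmt.Println(tc.value, tc.replicas, n, err)
	}
}
```

In this sketch `25%` of 2 replicas scales to 0 and is rejected, which is the kind of case the "non-zero after scaling" rule is meant to catch.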

Reviewed changes

Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| test/e2e/controller-manager/model_serving_test.go | Adds e2e tests asserting readiness never drops below the RoleMaxUnavailable-derived minimum during RoleRollingUpdate (single-role and multi-role). |
| pkg/model-serving-controller/webhook/validator.go | Validates roleMaxUnavailable (format, non-zero after scaling, and only allowed for RoleRollingUpdate). |
| pkg/model-serving-controller/webhook/validator_test.go | Adds unit tests for the new validation rules. |
| pkg/model-serving-controller/utils/utils.go | Adds GetRoleMaxUnavailable helper and “unlimited” sentinel. |
| pkg/model-serving-controller/utils/utils_test.go | Adds unit tests for GetRoleMaxUnavailable. |
| pkg/model-serving-controller/controller/model_serving_controller.go | Implements bounded deletion of outdated role replicas during RoleRollingUpdate and adds role status transition to Creating when pods are missing. |
| pkg/model-serving-controller/controller/model_serving_controller_test.go | Reworks tests around RoleRollingUpdate deletions and adds coverage for role-level budget interactions. |
| pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go | Deepcopy support for RoleMaxUnavailable. |
| pkg/apis/workload/v1alpha1/model_serving_types.go | API type/docs for RoleMaxUnavailable. |
| docs/kthena/docs/reference/crd/workload.serving.volcano.sh.md | Documents the new CRD field. |
| client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go | Adds apply-configuration field and builder for RoleMaxUnavailable. |
| charts/kthena/charts/workload/crds/workload.serving.volcano.sh_modelservings.yaml | Updates Helm-packaged CRD schema to include roleMaxUnavailable. |
Files not reviewed (2)
  • client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go: Language not supported
  • pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Comment on lines +6833 to +6843
name: "budget reduced by an unready outdated replica",
roleMaxUnavailable: ptr.To(intstr.FromInt(3)),
instances: []roleInstanceFixture{
	{roleName: "decode", roleID: "decode-0", outdated: true, ready: true},
	{roleName: "decode", roleID: "decode-1", outdated: true, ready: true},
	{roleName: "decode", roleID: "decode-2", outdated: true, ready: true},
	{roleName: "decode", roleID: "decode-3", outdated: true, ready: false},
	{roleName: "prefill", roleID: "prefill-0", ready: true},
},
// unavailable=1 (decode-3 not ready) => budget=3-1=2, only ready outdated are deletable.
expectedDeletions: 2,
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Copilot AI review requested due to automatic review settings June 9, 2026 01:58

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 3 comments.

Files not reviewed (2)
  • client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go: Language not supported
  • pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go
Comment thread test/e2e/controller-manager/model_serving_test.go Outdated
Comment thread pkg/model-serving-controller/webhook/validator.go Outdated
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Copilot AI review requested due to automatic review settings June 9, 2026 12:36

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.

Files not reviewed (2)
  • client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go: Language not supported
  • pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go
Comment thread test/e2e/controller-manager/model_serving_test.go
Comment thread test/e2e/controller-manager/model_serving_test.go
Comment thread test/e2e/controller-manager/model_serving_test.go
Comment thread pkg/model-serving-controller/webhook/validator.go Outdated
Copilot AI review requested due to automatic review settings June 10, 2026 09:30

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.

Files not reviewed (2)
  • client-go/applyconfiguration/workload/v1alpha1/role.go: Language not supported
  • pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported

Comment on lines +1444 to +1448
allRoles, err := c.store.GetRolesByGroup(utils.GetNamespaceName(ms), sg.Name)
if err != nil {
	klog.Errorf("failed to get all roles for ServingGroup %s: %v", sg.Name, err)
	return nil, 0, false, nil
}
@hzxuzhonghu

Member

Please also file an issue.

@LiZhenCheng9527

Collaborator Author

Please also file an issue.

#1188

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go
Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
maxScaleDown := len(servingGroupList) - minAvailable - newServingGroupUnavailableCount
if maxScaleDown <= 0 {

// For RoleRollingUpdate we must keep driving the rollout even when maxScaleDown <= 0: a group

Member

Why would maxScaleDown be <= 0?

Collaborator Author

It is possible that maxScaleDown == 0.

Fixed

Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Comment thread pkg/apis/workload/v1alpha1/model_serving_types.go Outdated
Comment thread pkg/apis/workload/v1alpha1/servinggroup_types.go Outdated
Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
// Iterate from end to start so the largest ordinals are planned (and later deleted) first.
for i := len(allOutdatedGroups) - 1; i >= 0; i-- {
sg := allOutdatedGroups[i]
states, totalUnavailable, started, err := c.collectGroupRoleUpdateState(ms, sg, revision)

Member

Please make the variable names meaningful and readable.

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go
// role was removed from the spec), or has at least one new-revision replica coexisting with an
// outdated one. Counting new-revision replicas per role (rather than per group) avoids treating an
// unchanged role as progress when another role is merely drained on removal.
deletingByRole := make(map[string]bool)

Member

What does this mean?

Collaborator Author

It records that an outdated replica of this role is being deleted because of a RoleRollingUpdate.

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Copilot AI review requested due to automatic review settings June 15, 2026 12:50

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.

Files not reviewed (2)
  • client-go/applyconfiguration/workload/v1alpha1/role.go: Generated file
  • pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Generated file

Comment on lines +1431 to +1435
allRoles, err := c.store.GetRolesByGroup(utils.GetNamespaceName(ms), group.Name)
if err != nil {
	klog.Errorf("failed to get all roles for ServingGroup %s: %v", group.Name, err)
	return nil, 0, false, nil
}
Comment thread test/e2e/controller-manager/model_serving_test.go Outdated
Comment thread pkg/apis/workload/v1alpha1/model_serving_types.go Outdated
Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Comment on lines +1037 to +1040
if c.store.GetRoleStatus(utils.GetNamespaceName(ms), groupName, roleName, roleID) != datastore.RoleRunning {
	return false
}
if err := c.store.UpdateRoleStatus(utils.GetNamespaceName(ms), groupName, roleName, roleID, datastore.RoleCreating); err != nil {

Member

I'm not quite sure why you need this check.
Also, it is not atomic: another thread could update the status between these two steps.

Collaborator Author

There is a lock in UpdateRoleStatus.

The check is needed because RoleRollingUpdate looks at each Role's status to decide whether a rolling update is in progress, so the Role status has to be kept in sync with its actual state in real time.
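
The atomicity concern is about the gap between reading the status and writing it: a lock taken only inside the write does not cover the earlier read. One way to close that gap is to do the comparison and the update under the same lock. The sketch below is an illustration with a hypothetical in-memory `store` and a hypothetical `CompareAndSetStatus` method; it is not the datastore API of this repository.

```go
package main

import (
	"fmt"
	"sync"
)

type RoleStatus string

const (
	RoleRunning  RoleStatus = "Running"
	RoleCreating RoleStatus = "Creating"
)

// store is a hypothetical in-memory stand-in for the role datastore.
type store struct {
	mu     sync.Mutex
	status map[string]RoleStatus
}

// CompareAndSetStatus moves a role from one status to another only if it is
// still in the expected status. The check and the write share one critical
// section, so no other goroutine can change the status in between.
func (s *store) CompareAndSetStatus(roleID string, from, to RoleStatus) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.status[roleID] != from {
		return false
	}
	s.status[roleID] = to
	return true
}

func main() {
	s := &store{status: map[string]RoleStatus{"decode-0": RoleRunning}}
	fmt.Println(s.CompareAndSetStatus("decode-0", RoleRunning, RoleCreating)) // true
	fmt.Println(s.CompareAndSetStatus("decode-0", RoleRunning, RoleCreating)) // false
}
```

The boolean result tells the caller whether the transition actually happened, which is the same information the two-step version tries to derive from the separate read.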

Comment on lines +1281 to +1285
type groupPlan struct {
	group datastore.ServingGroup
	states []roleUpdateState
	totalOutdated int
	started bool

Member

Not understandable

Collaborator Author

It records whether the rolling update of this group to the new revision has already begun.

That is the case when either an outdated replica of a role is being deleted and the deletion is driven by a rolling update (the template hash has changed or the role has been removed from the spec), or a new-revision replica of a role already exists while an outdated replica of the same role remains (the update is halfway through).
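
Read as code, the explanation above amounts to a per-role predicate. The sketch below is one possible reading: `roleUpdateState` is a type name from the PR, but its fields here (`newRevision`, `outdated`, `deletingByRollout`) and the `rolloutStarted` function are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// roleUpdateState holds per-role counters; the field names are hypothetical.
type roleUpdateState struct {
	roleName          string
	newRevision       int  // replicas already at the new revision
	outdated          int  // replicas still at an old revision
	deletingByRollout bool // an outdated replica is being deleted because of the rollout
}

// rolloutStarted reports whether the group has begun rolling to the new revision.
// The check is per role, so an unchanged role that is fully at the new revision
// does not count as progress for another role.
func rolloutStarted(states []roleUpdateState) bool {
	for _, s := range states {
		if s.deletingByRollout {
			return true
		}
		if s.newRevision > 0 && s.outdated > 0 {
			return true
		}
	}
	return false
}

func main() {
	notStarted := []roleUpdateState{
		{roleName: "prefill", newRevision: 1},
		{roleName: "decode", outdated: 2},
	}
	halfway := []roleUpdateState{
		{roleName: "prefill", newRevision: 1},
		{roleName: "decode", newRevision: 1, outdated: 1},
	}
	fmt.Println(rolloutStarted(notStarted), rolloutStarted(halfway)) // false true
}
```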

Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Copilot AI review requested due to automatic review settings June 16, 2026 07:38

Copilot AI left a comment

Contributor

⚠️ Human review recommended

It significantly changes core rollout logic in the controller (budgeting/planning/state derivation), which warrants a final human review despite the added unit/e2e coverage.

Copilot's findings

Files not reviewed (2)

  • client-go/applyconfiguration/workload/v1alpha1/role.go: Generated file
  • pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Generated file
  • Files reviewed: 12/14 changed files
  • Comments generated: 3


Comment thread pkg/model-serving-controller/controller/model_serving_controller.go Outdated
Comment thread pkg/apis/workload/v1alpha1/model_serving_types.go
Comment thread pkg/apis/workload/v1alpha1/servinggroup_types.go
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Copilot AI review requested due to automatic review settings June 17, 2026 10:19

Copilot AI left a comment

Contributor

⚠️ Not ready to approve

There are critical correctness/compilation issues in the updated controller logic (unused parameter causing build failure and a maxUnavailable budget edge case that can be misinterpreted as “unlimited”).

Copilot's findings

Files not reviewed (2)

  • client-go/applyconfiguration/workload/v1alpha1/role.go: Generated file
  • pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Generated file
  • Files reviewed: 13/15 changed files
  • Comments generated: 2


Comment on lines +1272 to +1277
// promoteServingGroupIfRolledOut flips a ServingGroup that is mid RoleRollingUpdate to Running once
// every role replica is at the new revision (its RoleTemplateHash matches the spec) and Running, the
// replica counts match the spec, and no role removed from the spec is left behind.
// This is a no-op unless the group is currently in RoleRolling status. It returns true when a promotion
func (c *ModelServingController) promoteServingGroupIfRolledOut(ms *workloadv1alpha1.ModelServing, groupName, revision string) bool {
ns := utils.GetNamespaceName(ms)
Comment on lines +1551 to +1553
if maxUnavailable != utils.RoleMaxUnavailableUnlimited {
	maxScaleDown = have - (want - maxUnavailable) - newRevUnavailable
}
// available (a pod failed, became NotReady, or a replica is missing). It is a no-op unless the role
// is currently Running. It returns true when a demotion actually happened, so callers can re-enqueue
// the ModelServing to refresh readiness-based accounting such as RoleRollingUpdate.
func (c *ModelServingController) demoteRunningRoleToCreating(ms *workloadv1alpha1.ModelServing, groupName, roleName, roleID string) bool {

Member

The return value is not used at all.

Member

"demote" is not a common term here; prefer "update".

// early return is preserved as an optimization.
isRoleRollingUpdate := ms.Spec.RolloutStrategy != nil &&
ms.Spec.RolloutStrategy.Type == workloadv1alpha1.RoleRollingUpdate
if maxScaleDown <= 0 && !isRoleRollingUpdate {

Member

Why does RoleRollingUpdate still continue when maxScaleDown <= 0?

// +kubebuilder:validation:Enum={ServingGroupRollingUpdate,RoleRollingUpdate}
Type RolloutStrategyType `json:"type"`

// RollingUpdateConfiguration defines the parameters to be used when type is RollingUpdateStrategyType.

Member

"when type is RollingUpdateStrategyType."? That does not seem right.

Comment on lines +1304 to +1306
for roleName, instances := range allRoles {
	if _, ok := expectedRoleNames[roleName]; !ok && len(instances) > 0 {
		return false

Member

This check can be simplified considerably by checking whether the role instance count differs from the expected replicas. It should also be moved in front of the loop at L1285.
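
One reading of this suggestion is a single count comparison that also covers roles removed from the spec, since a removed role simply has an expected count of zero. The sketch below uses a hypothetical `replicaCountsMatch` function and plain string slices in place of the real role instances.

```go
package main

import "fmt"

// replicaCountsMatch is a hypothetical helper: it reports whether every role has
// exactly the expected number of instances. A role missing from expectedReplicas
// has an expected count of 0, so leftover instances of a removed role fail the check.
func replicaCountsMatch(instancesByRole map[string][]string, expectedReplicas map[string]int) bool {
	for roleName, instances := range instancesByRole {
		if len(instances) != expectedReplicas[roleName] {
			return false
		}
	}
	// Roles that are expected but have no entry at all must also fail.
	for roleName, want := range expectedReplicas {
		if _, ok := instancesByRole[roleName]; !ok && want > 0 {
			return false
		}
	}
	return true
}

func main() {
	expected := map[string]int{"prefill": 1, "decode": 2}
	rolledOut := map[string][]string{"prefill": {"prefill-0"}, "decode": {"decode-0", "decode-1"}}
	leftover := map[string][]string{"prefill": {"prefill-0"}, "decode": {"decode-0", "decode-1"}, "router": {"router-0"}}
	fmt.Println(replicaCountsMatch(rolledOut, expected), replicaCountsMatch(leftover, expected)) // true false
}
```

Running such a check before the per-replica loop lets the function return early without inspecting hashes or statuses.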

@hzxuzhonghu

Member

The TestModelServingRoleRollingUpdateRoleMaxUnavailable failure is related to this change, and there are many code smells.

@hzxuzhonghu hzxuzhonghu left a comment

Member

Thanks for the RoleRollingUpdate work. I found two issues that should be addressed before merge:

  1. [blocking] Legacy RoleTemplateHash compatibility was dropped

    In pkg/model-serving-controller/controller/model_serving_controller.go, the new rollout paths compare role.RoleTemplateHash directly against the expected hash in promoteServingGroupIfRolledOut and collectGroupRoleUpdateState.

    The previous implementation used resolveRoleTemplateHashForComparison, which handled legacy datastore roles with an empty RoleTemplateHash by resolving the hash from the ServingGroup revision. With the new direct comparison, upgraded controllers can treat existing legacy roles as outdated, unnecessarily delete them, or keep a finished group from being promoted back to Running.

    Please reuse the existing fallback helper in both paths, preserving the previous unresolved-hash behavior.

  2. [important] GetRolesByGroup returns shared mutable *Role pointers

    pkg/model-serving-controller/datastore/store.go copies the role maps in GetRolesByGroup, but it still returns the original *Role values. The new rollout code reads those pointers after the store lock is released, while informer/worker paths can mutate the same objects via UpdateRoleStatus under the store lock.

    That creates an unsynchronized read/write race in rollout-critical state. Please deep-copy each Role in GetRolesByGroup or change the API to return value snapshots so callers never read internal mutable store objects outside the lock.
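
The second point can be illustrated with a small snapshot-returning store. This is a sketch under assumptions, not the code in store.go: the `Store` and `Role` types are simplified stand-ins, and the method signatures are shortened compared with the real `GetRolesByGroup` and `UpdateRoleStatus`.

```go
package main

import (
	"fmt"
	"sync"
)

// Role is a simplified stand-in with value-only fields, so a plain copy is a deep copy.
type Role struct {
	Name   string
	Status string
}

// Store is a hypothetical datastore keeping mutable *Role values behind a lock.
type Store struct {
	mu    sync.RWMutex
	roles map[string][]*Role // role name -> instances
}

// GetRolesByGroup returns value snapshots, so callers never hold pointers
// into the store after the lock is released.
func (s *Store) GetRolesByGroup() map[string][]Role {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make(map[string][]Role, len(s.roles))
	for name, instances := range s.roles {
		copies := make([]Role, 0, len(instances))
		for _, r := range instances {
			copies = append(copies, *r)
		}
		out[name] = copies
	}
	return out
}

// UpdateRoleStatus mutates the stored role under the write lock.
func (s *Store) UpdateRoleStatus(name string, idx int, status string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.roles[name][idx].Status = status
}

func main() {
	s := &Store{roles: map[string][]*Role{"decode": {{Name: "decode-0", Status: "Running"}}}}
	snapshot := s.GetRolesByGroup()
	s.UpdateRoleStatus("decode", 0, "Creating")
	fmt.Println(snapshot["decode"][0].Status)            // Running: the snapshot is unaffected
	fmt.Println(s.GetRolesByGroup()["decode"][0].Status) // Creating
}
```

If the real Role type contains maps, slices, or pointers, a plain value copy is not enough and those fields would need to be copied as well.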

Restore legacy role template hash fallback during RoleRollingUpdate and return copied role snapshots from the datastore to avoid exposing mutable internal Role pointers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Control the number of unavailable Role replicas in RoleRollingUpdate.

4 participants