[Feature] Support RoleMaxUnavailable in RoleRollingUpdate#1186
[Feature] Support RoleMaxUnavailable in RoleRollingUpdate#1186LiZhenCheng9527 wants to merge 9 commits into
Conversation
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request introduces the roleMaxUnavailable configuration field to the RollingUpdateConfiguration for RoleRollingUpdate rollout strategies. This bounds the number of outdated Role replicas that can be simultaneously unavailable within a single ServingGroup during a rolling update, preventing service pauses. The changes include updates to API definitions, controller reconciliation logic, validation webhooks, unit tests, and E2E tests. A critical feedback item was raised regarding a potential nil pointer dereference in collectGroupRoleUpdateState when role.Replicas is omitted, which should be resolved by adding a nil check and defaulting to 1.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| expectedReplicasByRole := make(map[string]int, len(ms.Spec.Template.Roles)) | ||
| expectedHashByRole := make(map[string]string, len(ms.Spec.Template.Roles)) | ||
| for _, role := range ms.Spec.Template.Roles { | ||
| replicas := int(*role.Replicas) |
There was a problem hiding this comment.
The role.Replicas field is optional and can be nil. Directly dereferencing it with *role.Replicas without a nil check will cause a panic (nil pointer dereference) if a user omits this field in the spec. Please add a nil check and default to 1 (similar to how it is handled in the validator).
| replicas := int(*role.Replicas) | |
| replicas := 1 | |
| if role.Replicas != nil { | |
| replicas = int(*role.Replicas) | |
| } |
There was a problem hiding this comment.
Pull request overview
Adds a new rollout budget (roleMaxUnavailable) for RoleRollingUpdate to prevent all outdated role replicas in a ServingGroup from being terminated at once (avoiding service downtime when spec.replicas=1), and wires it through API, validation, controller behavior, and tests.
Changes:
- Introduces
spec.rolloutStrategy.rollingUpdateConfiguration.roleMaxUnavailable(API type, deepcopy/applyconfig, CRD, and docs). - Adds webhook validation for
roleMaxUnavailable(int/percent, non-zero, only valid forRoleRollingUpdate). - Updates the controller’s RoleRollingUpdate eviction logic and adds/updates unit + e2e coverage for bounded role updates.
Reviewed changes
Copilot reviewed 10 out of 12 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
test/e2e/controller-manager/model_serving_test.go |
Adds e2e tests asserting readiness never drops below the RoleMaxUnavailable-derived minimum during RoleRollingUpdate (single-role and multi-role). |
pkg/model-serving-controller/webhook/validator.go |
Validates roleMaxUnavailable (format, non-zero after scaling, and only allowed for RoleRollingUpdate). |
pkg/model-serving-controller/webhook/validator_test.go |
Adds unit tests for the new validation rules. |
pkg/model-serving-controller/utils/utils.go |
Adds GetRoleMaxUnavailable helper and “unlimited” sentinel. |
pkg/model-serving-controller/utils/utils_test.go |
Adds unit tests for GetRoleMaxUnavailable. |
pkg/model-serving-controller/controller/model_serving_controller.go |
Implements bounded deletion of outdated role replicas during RoleRollingUpdate and adds role status transition to Creating when pods are missing. |
pkg/model-serving-controller/controller/model_serving_controller_test.go |
Reworks tests around RoleRollingUpdate deletions and adds coverage for role-level budget interactions. |
pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go |
Deepcopy support for RoleMaxUnavailable. |
pkg/apis/workload/v1alpha1/model_serving_types.go |
API type/docs for RoleMaxUnavailable. |
docs/kthena/docs/reference/crd/workload.serving.volcano.sh.md |
Documents the new CRD field. |
client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go |
Adds apply-configuration field and builder for RoleMaxUnavailable. |
charts/kthena/charts/workload/crds/workload.serving.volcano.sh_modelservings.yaml |
Updates Helm-packaged CRD schema to include roleMaxUnavailable. |
Files not reviewed (2)
- client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go: Language not supported
- pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| name: "budget reduced by an unready outdated replica", | ||
| roleMaxUnavailable: ptr.To(intstr.FromInt(3)), | ||
| instances: []roleInstanceFixture{ | ||
| {roleName: "decode", roleID: "decode-0", outdated: true, ready: true}, | ||
| {roleName: "decode", roleID: "decode-1", outdated: true, ready: true}, | ||
| {roleName: "decode", roleID: "decode-2", outdated: true, ready: true}, | ||
| {roleName: "decode", roleID: "decode-3", outdated: true, ready: false}, | ||
| {roleName: "prefill", roleID: "prefill-0", ready: true}, | ||
| }, | ||
| // unavailable=1 (decode-3 not ready) => budget=3-1=2, only ready outdated are deletable. | ||
| expectedDeletions: 2, |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated 3 comments.
Files not reviewed (2)
- client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go: Language not supported
- pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
9f04368 to
403e6b7
Compare
403e6b7 to
f83b0ec
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 12 changed files in this pull request and generated 5 comments.
Files not reviewed (2)
- client-go/applyconfiguration/workload/v1alpha1/rollingupdateconfiguration.go: Language not supported
- pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported
31054c8 to
7cd4ec7
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated 1 comment.
Files not reviewed (2)
- client-go/applyconfiguration/workload/v1alpha1/role.go: Language not supported
- pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Language not supported
| allRoles, err := c.store.GetRolesByGroup(utils.GetNamespaceName(ms), sg.Name) | ||
| if err != nil { | ||
| klog.Errorf("failed to get all roles for ServingGroup %s: %v", sg.Name, err) | ||
| return nil, 0, false, nil | ||
| } |
|
PLease also file an issue |
|
| maxScaleDown := len(servingGroupList) - minAvailable - newServingGroupUnavailableCount | ||
| if maxScaleDown <= 0 { | ||
|
|
||
| // For RoleRollingUpdate we must keep driving the rollout even when maxScaleDown <= 0: a group |
There was a problem hiding this comment.
why should maxScaleDown <=0?
There was a problem hiding this comment.
It is possible that maxScaleDown == 0.
Fixed
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
7cd4ec7 to
dc872ee
Compare
| // Iterate from end to start so the largest ordinals are planned (and later deleted) first. | ||
| for i := len(allOutdatedGroups) - 1; i >= 0; i-- { | ||
| sg := allOutdatedGroups[i] | ||
| states, totalUnavailable, started, err := c.collectGroupRoleUpdateState(ms, sg, revision) |
There was a problem hiding this comment.
Please make the vars name meaningful and readable
| // role was removed from the spec), or has at least one new-revision replica coexisting with an | ||
| // outdated one. Counting new-revision replicas per role (rather than per group) avoids treating an | ||
| // unchanged role as progress when another role is merely drained on removal. | ||
| deletingByRole := make(map[string]bool) |
There was a problem hiding this comment.
Record the deletion of this role is being logged due to a Role RollingUpdate
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 14 changed files in this pull request and generated 3 comments.
Files not reviewed (2)
- client-go/applyconfiguration/workload/v1alpha1/role.go: Generated file
- pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Generated file
| allRoles, err := c.store.GetRolesByGroup(utils.GetNamespaceName(ms), group.Name) | ||
| if err != nil { | ||
| klog.Errorf("failed to get all roles for ServingGroup %s: %v", group.Name, err) | ||
| return nil, 0, false, nil | ||
| } |
3fb2cb8 to
65bc070
Compare
| if c.store.GetRoleStatus(utils.GetNamespaceName(ms), groupName, roleName, roleID) != datastore.RoleRunning { | ||
| return false | ||
| } | ||
| if err := c.store.UpdateRoleStatus(utils.GetNamespaceName(ms), groupName, roleName, roleID, datastore.RoleCreating); err != nil { |
There was a problem hiding this comment.
Though not quite sure why do you need such check
And i find this is not atomic, say if another thread update the status between these two steps.
There was a problem hiding this comment.
There is a lock in updateRoleStatus.
This is because, within the Role RollingUpdate, we check the status of the Role to determine whether the rolling update is in progress. Therefore, we need to update the Role in real time based on its current state.
| type groupPlan struct { | ||
| group datastore.ServingGroup | ||
| states []roleUpdateState | ||
| totalOutdated int | ||
| started bool |
There was a problem hiding this comment.
It records whether the rolling update for this group regarding the new revision has already begun.
An outdated replica of a role is being deleted, and the deletion is driven by a rolling update (the template hash has changed or the role has been removed from the spec), or a new revision of a role already exists, but an outdated copy remains (the update is halfway through).
65bc070 to
af07c68
Compare
There was a problem hiding this comment.
⚠️ Human review recommended
It significantly changes core rollout logic in the controller (budgeting/planning/state derivation), which warrants a final human review despite the added unit/e2e coverage.
Copilot's findings
Files not reviewed (2)
- client-go/applyconfiguration/workload/v1alpha1/role.go: Generated file
- pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Generated file
- Files reviewed: 12/14 changed files
- Comments generated: 3
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
Signed-off-by: LiZhenCheng9527 <lizhencheng6@huawei.com>
af07c68 to
aa87175
Compare
There was a problem hiding this comment.
⚠️ Not ready to approve
There are critical correctness/compilation issues in the updated controller logic (unused parameter causing build failure and a maxUnavailable budget edge case that can be misinterpreted as “unlimited”).
Copilot's findings
Files not reviewed (2)
- client-go/applyconfiguration/workload/v1alpha1/role.go: Generated file
- pkg/apis/workload/v1alpha1/zz_generated.deepcopy.go: Generated file
- Files reviewed: 13/15 changed files
- Comments generated: 2
Note
Your feedback helps us improve the quality of this feature.
Please use 👍 or 👎 to tell us whether this assessment is correct.
| // promoteServingGroupIfRolledOut flips a ServingGroup that is mid RoleRollingUpdate to Running once | ||
| // every role replica is at the new revision (its RoleTemplateHash matches the spec) and Running, the | ||
| // replica counts match the spec, and no role removed from the spec is left behind. | ||
| // This is a no-op unless the group is currently in RoleRolling status. It returns true when a promotion | ||
| func (c *ModelServingController) promoteServingGroupIfRolledOut(ms *workloadv1alpha1.ModelServing, groupName, revision string) bool { | ||
| ns := utils.GetNamespaceName(ms) |
| if maxUnavailable != utils.RoleMaxUnavailableUnlimited { | ||
| maxScaleDown = have - (want - maxUnavailable) - newRevUnavailable | ||
| } |
| // available (a pod failed, became NotReady, or a replica is missing). It is a no-op unless the role | ||
| // is currently Running. It returns true when a demotion actually happened, so callers can re-enqueue | ||
| // the ModelServing to refresh readiness-based accounting such as RoleRollingUpdate. | ||
| func (c *ModelServingController) demoteRunningRoleToCreating(ms *workloadv1alpha1.ModelServing, groupName, roleName, roleID string) bool { |
There was a problem hiding this comment.
The return value is not used at all
There was a problem hiding this comment.
demote not common, prefer update
| // early return is preserved as an optimization. | ||
| isRoleRollingUpdate := ms.Spec.RolloutStrategy != nil && | ||
| ms.Spec.RolloutStrategy.Type == workloadv1alpha1.RoleRollingUpdate | ||
| if maxScaleDown <= 0 && !isRoleRollingUpdate { |
There was a problem hiding this comment.
why isRoleRollingUpdate still continue when maxScaleDown <= 0
| // +kubebuilder:validation:Enum={ServingGroupRollingUpdate,RoleRollingUpdate} | ||
| Type RolloutStrategyType `json:"type"` | ||
|
|
||
| // RollingUpdateConfiguration defines the parameters to be used when type is RollingUpdateStrategyType. |
There was a problem hiding this comment.
when type is RollingUpdateStrategyType. ? seems not right
| for roleName, instances := range allRoles { | ||
| if _, ok := expectedRoleNames[roleName]; !ok && len(instances) > 0 { | ||
| return false |
There was a problem hiding this comment.
This check can be largely simplified by checking the role instances count not equal to the expected replicas. And it should be moved in front of the loop in L1285
|
TestModelServingRoleRollingUpdateRoleMaxUnavailable failure is related. And there are many bad smells. |
hzxuzhonghu
left a comment
There was a problem hiding this comment.
Thanks for the RoleRollingUpdate work. I found two issues that should be addressed before merge:
-
[blocking] Legacy
RoleTemplateHashcompatibility was droppedIn
pkg/model-serving-controller/controller/model_serving_controller.go, the new rollout paths comparerole.RoleTemplateHashdirectly against the expected hash inpromoteServingGroupIfRolledOutandcollectGroupRoleUpdateState.The previous implementation used
resolveRoleTemplateHashForComparison, which handled legacy datastore roles with an emptyRoleTemplateHashby resolving the hash from the ServingGroup revision. With the new direct comparison, upgraded controllers can treat existing legacy roles as outdated, unnecessarily delete them, or keep a finished group from being promoted back toRunning.Please reuse the existing fallback helper in both paths, preserving the previous unresolved-hash behavior.
-
[important]
GetRolesByGroupreturns shared mutable*Rolepointerspkg/model-serving-controller/datastore/store.gocopies the role maps inGetRolesByGroup, but it still returns the original*Rolevalues. The new rollout code reads those pointers after the store lock is released, while informer/worker paths can mutate the same objects viaUpdateRoleStatusunder the store lock.That creates an unsynchronized read/write race in rollout-critical state. Please deep-copy each
RoleinGetRolesByGroupor change the API to return value snapshots so callers never read internal mutable store objects outside the lock.
Restore legacy role template hash fallback during RoleRollingUpdate and return copied role snapshots from the datastore to avoid exposing mutable internal Role pointers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Previously, during a RoleRollingUpdate, all outdated roles in the servingGroup were deleted outright. This resulted in the service becoming unavailable when servingGroup.Replicas=1.
This PR introduces the
RoleMaxUnavailablefield, which sets the increment for role updates during aRoleRollingUpdate. This prevents allOutDatedRoleentries from being deleted in one go.Which issue(s) this PR fixes:
Fixes #1188
Special notes for your reviewer:
Does this PR introduce a user-facing change?: