Skip to content

Fix ModelBooster status update conflicts and e2e cleanup race#1053

Open
nXtCyberNet wants to merge 2 commits into
volcano-sh:mainfrom
nXtCyberNet:issue/e2e_flakyness
Open

Fix ModelBooster status update conflicts and e2e cleanup race#1053
nXtCyberNet wants to merge 2 commits into
volcano-sh:mainfrom
nXtCyberNet:issue/e2e_flakyness

Conversation

@nXtCyberNet
Copy link
Copy Markdown
Contributor

What type of PR is this?

/kind bug
/kind cleanup

What this PR does / why we need it:

This PR fixes a race condition in ModelBooster status updates.

Previously, the controller updated status using the existing in-memory ModelBooster object, which could become stale if the resource changed during reconciliation. This could cause Kubernetes conflict errors during status updates.

This change updates the status flow to fetch the latest ModelBooster from the API server, apply the intended status fields, and retry on conflicts using RetryOnConflict. It also updates ObservedGeneration from the latest object and cleans up the LoRA cache only after the status update succeeds.

The e2e cleanup flow is also improved by uninstalling Kthena first, waiting for the controller-manager pod to stop, and using a bounded timeout for the rolling update watcher.

Which issue(s) this PR fixes:

Fixes #1038

Does this PR introduce a user-facing change?:

NONE

Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
Copilot AI review requested due to automatic review settings May 14, 2026 17:35
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hzxuzhonghu for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@nXtCyberNet nXtCyberNet changed the title added retry on update Fix ModelBooster status update conflicts and e2e cleanup race May 14, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@nXtCyberNet nXtCyberNet requested a review from Copilot May 14, 2026 17:37
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements RetryOnConflict for ModelBooster status updates to handle API conflicts and updates E2E tests with better timeout management and a controller shutdown synchronization helper. Reviewers identified several issues: invalid pseudo-code was left in test_suite_test.go, and the status update implementation uses shallow copies and incorrect ObservedGeneration values, which could cause cache corruption or missed reconciliations.

Comment thread test/e2e/controller-manager/test_suite_test.go Outdated
Comment thread pkg/model-booster-controller/controller/model_booster_controller.go
Comment thread pkg/model-booster-controller/controller/model_booster_controller.go Outdated
Comment thread pkg/model-booster-controller/controller/model_booster_controller.go Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
Comment on lines +81 to +84
if err := waitForControllerManagerToStop(kubeClient, kthenaNamespace, 2*time.Minute); err != nil {
fmt.Printf("Warning: controller-manager did not fully stop before namespace deletion: %v\n", err)
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why after uninstallkthena?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because before there is a race condition, in which the the goroutine was trying to delete the namespace but because it's already above in lifecycle it creates a panic - in which the pod is terminating and the request was send to it that we have seen in logs - the name dev namespace is already deleted

forbidden: unable to create new content in namespace kthena-e2e-controller-ay0vx because it is being terminated

Copy link
Copy Markdown
Contributor

@FAUST-BENCHOU FAUST-BENCHOU May 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i dont think so.U can check other e2e like gateway-inference-extension and gateway-api

	if err := testCtx.DeleteTestNamespace(); err != nil {
		fmt.Printf("Failed to delete test namespace: %v\n", err)
	}

	if err := framework.UninstallKthena(config.Namespace); err != nil {
		fmt.Printf("Failed to uninstall kthena: %v\n", err)
	}

and they never panic.I dont think its the key reason. btw could u explain about it's already above in lifecycle. I don't understand the scene here.

Copy link
Copy Markdown
Contributor Author

@nXtCyberNet nXtCyberNet May 16, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the misunderstanding , https://github.com/FAUST-BENCHOU/kthena/actions/runs/25807322965/job/75813641668?pr=24, in this I can't able to find this error in the artifact logs

forbidden: unable to create new content in namespace kthena-e2e-controller-ay0vx because it is being terminated

That you have specified in the comment , in the issue , so I think this is just a run specific error and didnot seen in any other one , so I think it not needed to be considered. So. I will revert this changes. Thanks for pointing out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

flaky e2e controller-manager

4 participants