Fix ModelBooster status update conflicts and e2e cleanup race #1053
nXtCyberNet wants to merge 2 commits
Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED
Code Review
This pull request implements RetryOnConflict for ModelBooster status updates to handle API conflicts and updates E2E tests with better timeout management and a controller shutdown synchronization helper. Reviewers identified several issues: invalid pseudo-code was left in test_suite_test.go, and the status update implementation uses shallow copies and incorrect ObservedGeneration values, which could cause cache corruption or missed reconciliations.
Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
if err := waitForControllerManagerToStop(kubeClient, kthenaNamespace, 2*time.Minute); err != nil {
	fmt.Printf("Warning: controller-manager did not fully stop before namespace deletion: %v\n", err)
}
Why after UninstallKthena?
Because previously there was a race condition: the goroutine was trying to delete the namespace, but because it's already above in lifecycle, it caused a panic. The pod was terminating while a request was still sent to it, as we saw in the logs, and the dev namespace had already been deleted:
forbidden: unable to create new content in namespace kthena-e2e-controller-ay0vx because it is being terminated
I don't think so. You can check other e2e suites like gateway-inference-extension and gateway-api:
if err := testCtx.DeleteTestNamespace(); err != nil {
	fmt.Printf("Failed to delete test namespace: %v\n", err)
}
if err := framework.UninstallKthena(config.Namespace); err != nil {
	fmt.Printf("Failed to uninstall kthena: %v\n", err)
}
and they never panic. I don't think it's the key reason. By the way, could you explain what "it's already above in lifecycle" means? I don't understand the scenario here.
Sorry for the misunderstanding. In https://github.com/FAUST-BENCHOU/kthena/actions/runs/25807322965/job/75813641668?pr=24 I am unable to find this error in the artifact logs:
forbidden: unable to create new content in namespace kthena-e2e-controller-ay0vx because it is being terminated
That is the error you specified in the comment on the issue, so I think it is a run-specific error that was not seen in any other run and does not need to be considered. I will revert these changes. Thanks for pointing this out.
What type of PR is this?
/kind bug
/kind cleanup
What this PR does / why we need it:
This PR fixes a race condition in ModelBooster status updates.
Previously, the controller updated status using the existing in-memory ModelBooster object, which could become stale if the resource changed during reconciliation. This could cause Kubernetes conflict errors during status updates.
This change updates the status flow to fetch the latest ModelBooster from the API server, apply the intended status fields, and retry on conflicts using RetryOnConflict. It also updates ObservedGeneration from the latest object and cleans up the LoRA cache only after the status update succeeds.
The e2e cleanup flow is also improved by uninstalling Kthena first, waiting for the controller-manager pod to stop, and using a bounded timeout for the rolling update watcher.
Which issue(s) this PR fixes:
Fixes #1038
Does this PR introduce a user-facing change?: