add RetryOnConflict to modelbooster #1039
Pull request overview
This PR addresses flakiness in the controller-manager e2e suite (issue #1038) by increasing the Go test timeout so longer-running tests (e.g., rolling update scenarios) are less likely to hit the global test deadline.
Changes:
- Increased the go test timeout for test-e2e-controller-manager from 10m to 15m.
The number of e2e tests keeps growing these days, so we need to increase the timeout.
Code Review
This pull request increases the timeout for controller-manager E2E tests from 10 to 15 minutes in the Makefile. The reviewer suggests adding the -p 1 flag to the test command to ensure tests run sequentially and prevent potential flakiness in the CI environment.
  @command -v kind >/dev/null 2>&1 || { echo "Kind is not installed."; exit 1; }
  @TEST_CATEGORY=controller-manager ./test/e2e/setup.sh
- @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=10m ./test/e2e/controller-manager/...
+ @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=15m ./test/e2e/controller-manager/...
To ensure that E2E tests run sequentially and do not interfere with each other while sharing the same Kind cluster, consider adding the -p 1 flag. This maintains consistency with the main test-e2e target and helps prevent flakiness in the CI environment.
@KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 -v -timeout=15m ./test/e2e/controller-manager/...
I second this. I have hit this problem in many PRs.
I faced this issue in
Increasing the timeout makes it pass, but I remember this test failing before. No idea why it has come back.
Could you link the related failed PRs or CI runs? Maybe we can increase the timeouts for all of these CI tests in one go.
The failure reasons are not the same as for controller-manager; those fail because of some specific tests.
/hold
Force-pushed: d896890 to 9bcf04f
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (2)
pkg/model-booster-controller/controller/model_booster_controller.go:269
- Inside the RetryOnConflict closure, ObservedGeneration is now set to updated.Generation (the generation from the latest object in the lister) rather than the generation of the modelBooster actually being reconciled. If the lister has already observed a newer spec generation than the one whose conditions are being written, this will prematurely advance ObservedGeneration to a generation that has not actually been reconciled. That, in turn, interacts badly with the gating in updateModelBooster (if oldModel.Status.ObservedGeneration != newModel.Generation) and the new early-return in setModelProcessingCondition (model.Status.ObservedGeneration == model.Generation), both of which can cause the controller to skip reconciling the newer generation. It is safer to set updated.Status.ObservedGeneration = modelBooster.Generation so that the value reflects the generation whose state was actually computed.
updated.Status.ObservedGeneration = updated.Generation
pkg/model-booster-controller/controller/condition.go:58
- The early-return short-circuit is only applied to setModelProcessingCondition, but setModelActiveCondition (called every reconcile after success) and setModelInitCondition are still unconditionally invoking updateModelBoosterStatus on every reconcile. That keeps generating unnecessary UpdateStatus calls (and corresponding conflicts/retries) once the ModelBooster is already Active with matching ObservedGeneration. For consistency and to actually eliminate the update churn that motivated this PR, the same "no-op if already in target state" guard should be applied to the other condition setters as well.
func (mc *ModelBoosterController) setModelProcessingCondition(ctx context.Context, model *workloadv1alpha1.ModelBooster) error {
	if meta.IsStatusConditionPresentAndEqual(model.Status.Conditions, string(workloadv1alpha1.ModelStatusConditionTypeActive), metav1.ConditionTrue) &&
		model.Status.ObservedGeneration == model.Generation {
		return nil
	}
	meta.SetStatusCondition(&model.Status.Conditions, newCondition(string(workloadv1alpha1.ModelStatusConditionTypeActive),
		metav1.ConditionFalse, ModelProcessingReason, "ModelBooster not ready yet"))
	if err := mc.updateModelBoosterStatus(ctx, model); err != nil {
		klog.Errorf("update ModelBooster status failed: %v", err)
		return err
	}
	return nil
}
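A "no-op if already in target state" guard could look roughly like the sketch below. It is a simplified illustration rather than the project's code: `condition` and `alreadyInTargetState` are hypothetical, and the reason strings are assumed values. The guard compares the stored condition against the state the setter is about to write, so a setter that wants Active=False is not skipped merely because Active is currently True.

```go
package main

import "fmt"

// condition is a simplified stand-in for metav1.Condition.
type condition struct {
	Type   string
	Status string
	Reason string
}

// alreadyInTargetState reports whether writing target would be a no-op:
// the generation is already observed and the stored condition of the same
// type has the same status and reason as target.
func alreadyInTargetState(conds []condition, target condition, observedGen, gen int64) bool {
	if observedGen != gen {
		return false
	}
	for _, c := range conds {
		if c.Type == target.Type {
			return c.Status == target.Status && c.Reason == target.Reason
		}
	}
	return false
}

func main() {
	// Reason strings below are assumptions made for this example.
	current := []condition{{Type: "Active", Status: "True", Reason: "ModelActive"}}
	wantActive := condition{Type: "Active", Status: "True", Reason: "ModelActive"}
	wantProcessing := condition{Type: "Active", Status: "False", Reason: "ModelProcessing"}

	// Already Active with a matching generation: the update can be skipped.
	fmt.Println(alreadyInTargetState(current, wantActive, 3, 3)) // true

	// Moving back to processing is a real change and must be written.
	fmt.Println(alreadyInTargetState(current, wantProcessing, 3, 3)) // false

	// A newer generation has not been observed yet: always write.
	fmt.Println(alreadyInTargetState(current, wantActive, 2, 3)) // false
}
```

With a helper of this shape, each condition setter would pass its own target condition, which keeps the skip logic identical across setters.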
latest, err := mc.modelBoosterLister.ModelBoosters(modelBooster.Namespace).Get(modelBooster.Name)
if err != nil {
	return err
}
  @command -v kind >/dev/null 2>&1 || { echo "Kind is not installed."; exit 1; }
  @TEST_CATEGORY=controller-manager ./test/e2e/setup.sh
- @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=10m ./test/e2e/controller-manager/...
+ @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=15m ./test/e2e/controller-manager/...
/unhold
  @command -v kind >/dev/null 2>&1 || { echo "Kind is not installed."; exit 1; }
  @TEST_CATEGORY=controller-manager ./test/e2e/setup.sh
- @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=10m ./test/e2e/controller-manager/...
+ @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=15m ./test/e2e/controller-manager/...
the Makefile timeout bump (10m→15m) seems unrelated to the RetryOnConflict change — worth splitting into a separate commit so the history stays clean and the CI change is easy to revert independently if needed.
It is a separate commit, in fact...
func (mc *ModelBoosterController) setModelProcessingCondition(ctx context.Context, model *workloadv1alpha1.ModelBooster) error {
	if meta.IsStatusConditionPresentAndEqual(model.Status.Conditions, string(workloadv1alpha1.ModelStatusConditionTypeActive), metav1.ConditionTrue) &&
		model.Status.ObservedGeneration == model.Generation {
		return nil
this guard feels off to me — setModelProcessingCondition is supposed to flip Active to False, but here we bail early if it's already True with a current generation. What if the model genuinely transitions back into processing (e.g. a pod was replaced)? We'd silently skip the status update and leave it looking Active when it isn't. Unless the call site already filters that out?
I see, my bad. This change needs to be reverted.
Signed-off-by: zhoujinyu <2319109590@qq.com>
Force-pushed: c408395 to 3c22e25
/retest
@FAUST-BENCHOU: Cannot trigger testing until a trusted user reviews the PR and leaves an
Hey @FAUST-BENCHOU, did the failing test get resolved with this PR?
/gemini review
Maybe wait until tomorrow for the maintainers' feedback. Today is the weekend.
Code Review
This pull request increases the e2e test timeout for the controller manager and implements a retry mechanism for updating ModelBooster status using RetryOnConflict. Feedback suggests improving the retry logic by fetching the latest object version directly from the API client instead of the lister to avoid stale cache issues, correctly assigning the ObservedGeneration to reflect the reconciled version, and ensuring object metadata is synchronized after the update.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: hzxuzhonghu
What type of PR is this?
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #1038
Special notes for your reviewer:
Does this PR introduce a user-facing change?: