add RetryOnConflict to modelbooster #1039

Merged
volcano-sh-bot merged 2 commits into volcano-sh:main from FAUST-BENCHOU:test/controller-time-more
May 18, 2026

Conversation

@FAUST-BENCHOU
Contributor

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #1038

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


Copilot AI review requested due to automatic review settings May 13, 2026 15:38
@FAUST-BENCHOU FAUST-BENCHOU marked this pull request as ready for review May 13, 2026 15:38
Contributor

Copilot AI left a comment

Pull request overview

This PR addresses flakiness in the controller-manager e2e suite (issue #1038) by increasing the Go test timeout so longer-running tests (e.g., rolling update scenarios) are less likely to hit the global test deadline.

Changes:

  • Increased the go test timeout for test-e2e-controller-manager from 10m to 15m.

@FAUST-BENCHOU
Contributor Author

The number of e2e tests keeps growing, so the timeout needs to be increased.
I only increased the controller-manager e2e timeout for now, since the other suites seem to be working well.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request increases the timeout for controller-manager E2E tests from 10 to 15 minutes in the Makefile. The reviewer suggests adding the -p 1 flag to the test command to ensure tests run sequentially and prevent potential flakiness in the CI environment.

Comment thread Makefile
@command -v kind >/dev/null 2>&1 || { echo "Kind is not installed."; exit 1; }
@TEST_CATEGORY=controller-manager ./test/e2e/setup.sh
- @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=10m ./test/e2e/controller-manager/...
+ @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=15m ./test/e2e/controller-manager/...
Contributor

medium

To ensure that E2E tests run sequentially and do not interfere with each other while sharing the same Kind cluster, consider adding the -p 1 flag. This maintains consistency with the main test-e2e target and helps prevent flakiness in the CI environment.

	@KUBECONFIG=/tmp/kubeconfig-e2e go test -p 1 -v -timeout=15m ./test/e2e/controller-manager/...

@pm-ju
Contributor

pm-ju commented May 13, 2026

I second this. I have run into this problem in many PRs.

@pm-ju
Contributor

pm-ju commented May 13, 2026

I faced this issue in router and gateway-api too. @FAUST-BENCHOU

@FAUST-BENCHOU
Contributor Author

Increasing the timeout worked, but:

=== RUN   TestModelCR
    model_booster_test.go:43: Created Model CR: kthena-e2e-controller-ay0vx/test-model
    model_booster_test.go:45: 
        	Error Trace:	/home/runner/work/kthena/kthena/test/e2e/controller-manager/model_booster_test.go:45
        	Error:      	Condition never satisfied
        	Test:       	TestModelCR
        	Messages:   	Model did not become Active
--- FAIL: TestModelCR (300.03s)

I remember this test failing before as well. No idea why it has come back.

@FAUST-BENCHOU
Contributor Author

> I faced this issue in router and gateway-api too. @FAUST-BENCHOU

Could you link the relevant failed PRs or CI runs? Maybe we can increase all of these CI test timeouts in one go.

@pm-ju
Contributor

pm-ju commented May 14, 2026

> > I faced this issue in router and gateway-api too. @FAUST-BENCHOU
>
> Could you link the relevant failed PRs or CI runs? Maybe we can increase all of these CI test timeouts in one go.

#1025
#1021
#1030
#1028

@FAUST-BENCHOU
Contributor Author

> > > I faced this issue in router and gateway-api too. @FAUST-BENCHOU
> >
> > Could you link the relevant failed PRs or CI runs? Maybe we can increase all of these CI test timeouts in one go.
>
> #1025 #1021 #1030 #1028

The reasons for those failures are not the same as for controller-manager; they fail because of specific tests.

@FAUST-BENCHOU
Contributor Author

/hold
@pm-ju let's discuss it in the issue. I don't think it is a timeout problem. I'm checking the controller-manager's logs.

@FAUST-BENCHOU FAUST-BENCHOU force-pushed the test/controller-time-more branch from d896890 to 9bcf04f Compare May 14, 2026 03:18
@FAUST-BENCHOU FAUST-BENCHOU changed the title from "increase timeout in controller-manager e2e" to "add RetryOnConflict to modelbooster" May 14, 2026
Copilot AI review requested due to automatic review settings May 14, 2026 04:00
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (2)

pkg/model-booster-controller/controller/model_booster_controller.go:269

  • Inside the RetryOnConflict closure, ObservedGeneration is now set to updated.Generation (the generation from the latest object in the lister) rather than the generation of the modelBooster actually being reconciled. If the lister has already observed a newer spec generation than the one whose conditions are being written, this will prematurely advance ObservedGeneration to a generation that has not actually been reconciled. That, in turn, interacts badly with the gating in updateModelBooster (if oldModel.Status.ObservedGeneration != newModel.Generation) and the new early‑return in setModelProcessingCondition (model.Status.ObservedGeneration == model.Generation), both of which can cause the controller to skip reconciling the newer generation. It is safer to set updated.Status.ObservedGeneration = modelBooster.Generation so that the value reflects the generation whose state was actually computed.
		updated.Status.ObservedGeneration = updated.Generation
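
The ObservedGeneration concern above can be illustrated with a small stdlib-only sketch. It does not use client-go: `store`, `booster`, `retryOnConflict`, and `writeObservedGeneration` are hypothetical stand-ins for the API server, the ModelBooster object, client-go's `retry.RetryOnConflict`, and the controller's status update. Assuming the update refetches the object on every attempt, the status should still record the generation that was actually reconciled:

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the conflict error an API server returns when
// the resourceVersion sent with an update is stale.
var errConflict = errors.New("conflict: object has been modified")

// booster is a hypothetical, trimmed-down stand-in for ModelBooster.
type booster struct {
	ResourceVersion    int
	Generation         int64
	ObservedGeneration int64
}

// store is a toy in-memory API server with optimistic concurrency.
type store struct{ obj booster }

func (s *store) Get() booster { return s.obj }

// UpdateStatus rejects writes that are based on a stale resourceVersion.
func (s *store) UpdateStatus(b booster) error {
	if b.ResourceVersion != s.obj.ResourceVersion {
		return errConflict
	}
	s.obj.ObservedGeneration = b.ObservedGeneration
	s.obj.ResourceVersion++
	return nil
}

// retryOnConflict re-runs fn while it reports a conflict, up to maxAttempts.
func retryOnConflict(maxAttempts int, fn func() error) error {
	var err error
	for i := 0; i < maxAttempts; i++ {
		if err = fn(); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// writeObservedGeneration records the generation that was reconciled, not
// whatever generation the freshly fetched copy happens to carry.
func writeObservedGeneration(s *store, reconciled booster) error {
	return retryOnConflict(5, func() error {
		latest := s.Get() // refetch on every attempt
		latest.ObservedGeneration = reconciled.Generation
		return s.UpdateStatus(latest)
	})
}

func main() {
	s := &store{obj: booster{ResourceVersion: 1, Generation: 1}}
	reconciled := s.Get() // the controller starts reconciling generation 1

	// Meanwhile the spec is edited: generation and resourceVersion move on.
	s.obj.Generation = 2
	s.obj.ResourceVersion = 2

	if err := writeObservedGeneration(s, reconciled); err != nil {
		fmt.Println("update failed:", err)
		return
	}
	fmt.Println(s.obj.ObservedGeneration, s.obj.Generation) // prints "1 2"
}
```

Because ObservedGeneration stays at 1 while Generation is 2, a check of the form `ObservedGeneration != Generation` would still see generation 2 as not yet reconciled.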

pkg/model-booster-controller/controller/condition.go:58

  • The early-return short-circuit is only applied to setModelProcessingCondition, but setModelActiveCondition (called every reconcile after success) and setModelInitCondition are still unconditionally invoking updateModelBoosterStatus on every reconcile. That keeps generating unnecessary UpdateStatus calls (and corresponding conflicts/retries) once the ModelBooster is already Active with matching ObservedGeneration. For consistency and to actually eliminate the update churn that motivated this PR, the same "no-op if already in target state" guard should be applied to the other condition setters as well.
func (mc *ModelBoosterController) setModelProcessingCondition(ctx context.Context, model *workloadv1alpha1.ModelBooster) error {
	if meta.IsStatusConditionPresentAndEqual(model.Status.Conditions, string(workloadv1alpha1.ModelStatusConditionTypeActive), metav1.ConditionTrue) &&
		model.Status.ObservedGeneration == model.Generation {
		return nil
	}
	meta.SetStatusCondition(&model.Status.Conditions, newCondition(string(workloadv1alpha1.ModelStatusConditionTypeActive),
		metav1.ConditionFalse, ModelProcessingReason, "ModelBooster not ready yet"))
	if err := mc.updateModelBoosterStatus(ctx, model); err != nil {
		klog.Errorf("update ModelBooster status failed: %v", err)
		return err
	}
	return nil
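
A guard of the kind suggested here could compare the stored status against the target condition, rather than returning early whenever the object is already Active. The sketch below is stdlib-only and rests on assumptions: `condition`, `status`, `setCondition`, and the reason strings are hypothetical stand-ins, not the project's actual types or constants.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for metav1.Condition.
type condition struct {
	Type, Status, Reason string
}

// status is a hypothetical stand-in for the ModelBooster status.
type status struct {
	Conditions         []condition
	ObservedGeneration int64
}

// alreadyInState reports whether the status already carries exactly the
// target condition for the given generation, so a write would change nothing.
func alreadyInState(st status, target condition, generation int64) bool {
	if st.ObservedGeneration != generation {
		return false
	}
	for _, c := range st.Conditions {
		if c.Type == target.Type {
			return c.Status == target.Status && c.Reason == target.Reason
		}
	}
	return false
}

// setCondition applies target and calls write only when something changes.
// It returns true when a write was issued.
func setCondition(st *status, target condition, generation int64, write func(status) error) (bool, error) {
	if alreadyInState(*st, target, generation) {
		return false, nil
	}
	replaced := false
	for i := range st.Conditions {
		if st.Conditions[i].Type == target.Type {
			st.Conditions[i] = target
			replaced = true
		}
	}
	if !replaced {
		st.Conditions = append(st.Conditions, target)
	}
	st.ObservedGeneration = generation
	return true, write(*st)
}

func main() {
	writes := 0
	write := func(status) error { writes++; return nil }

	st := &status{}
	active := condition{Type: "Active", Status: "True", Reason: "ModelActive"}
	processing := condition{Type: "Active", Status: "False", Reason: "ModelProcessing"}

	setCondition(st, active, 1, write)     // first write
	setCondition(st, active, 1, write)     // no-op: already in the target state
	setCondition(st, processing, 1, write) // real transition, so it is written
	fmt.Println(writes)                    // prints 2
}
```

The design choice is that each setter only skips the write when the status already equals what it would write, so a transition from Active back to processing is never dropped.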

Comment on lines +261 to +264
latest, err := mc.modelBoosterLister.ModelBoosters(modelBooster.Namespace).Get(modelBooster.Name)
if err != nil {
	return err
}
Comment thread Makefile
@command -v kind >/dev/null 2>&1 || { echo "Kind is not installed."; exit 1; }
@TEST_CATEGORY=controller-manager ./test/e2e/setup.sh
- @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=10m ./test/e2e/controller-manager/...
+ @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=15m ./test/e2e/controller-manager/...
@FAUST-BENCHOU
Contributor Author

/unhold

Comment thread Makefile
@command -v kind >/dev/null 2>&1 || { echo "Kind is not installed."; exit 1; }
@TEST_CATEGORY=controller-manager ./test/e2e/setup.sh
- @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=10m ./test/e2e/controller-manager/...
+ @KUBECONFIG=/tmp/kubeconfig-e2e go test -v -timeout=15m ./test/e2e/controller-manager/...
Member

the Makefile timeout bump (10m→15m) seems unrelated to the RetryOnConflict change — worth splitting into a separate commit so the history stays clean and the CI change is easy to revert independently if needed.

Contributor Author

It is a separate commit, in fact...

func (mc *ModelBoosterController) setModelProcessingCondition(ctx context.Context, model *workloadv1alpha1.ModelBooster) error {
	if meta.IsStatusConditionPresentAndEqual(model.Status.Conditions, string(workloadv1alpha1.ModelStatusConditionTypeActive), metav1.ConditionTrue) &&
		model.Status.ObservedGeneration == model.Generation {
		return nil
Member

this guard feels off to me — setModelProcessingCondition is supposed to flip Active to False, but here we bail early if it's already True with a current generation. What if the model genuinely transitions back into processing (e.g. a pod was replaced)? We'd silently skip the status update and leave it looking Active when it isn't. Unless the call site already filters that out?

Contributor Author

I see, my bad. This change needs to be reverted.

Signed-off-by: zhoujinyu <2319109590@qq.com>
Signed-off-by: zhoujinyu <2319109590@qq.com>
@FAUST-BENCHOU FAUST-BENCHOU force-pushed the test/controller-time-more branch from c408395 to 3c22e25 Compare May 14, 2026 11:13
@FAUST-BENCHOU
Contributor Author

/retest

@volcano-sh-bot
Contributor

@FAUST-BENCHOU: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pm-ju
Contributor

pm-ju commented May 17, 2026

Hey @FAUST-BENCHOU, did the failing test get resolved with this PR?

@FAUST-BENCHOU
Contributor Author

/gemini review

@FAUST-BENCHOU
Contributor Author

> Hey @FAUST-BENCHOU, did the failing test get resolved with this PR?

Maybe wait until tomorrow for the maintainers' feedback. Today is the weekend.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request increases the e2e test timeout for the controller manager and implements a retry mechanism for updating ModelBooster status using RetryOnConflict. Feedback suggests improving the retry logic by fetching the latest object version directly from the API client instead of the lister to avoid stale cache issues, correctly assigning the ObservedGeneration to reflect the reconciled version, and ensuring object metadata is synchronized after the update.
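
The stale-cache concern in this review can be sketched without client-go. In the toy model below, `server`, `object`, and `updatePhase` are hypothetical names, and the cache is frozen one resourceVersion behind the live object. The real `retry.RetryOnConflict` backs off between attempts, which may give an informer cache time to catch up; this sketch has no backoff and the cache never refreshes, so it only shows the worst case.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for an API server conflict on a stale resourceVersion.
var errConflict = errors.New("conflict")

// object is a hypothetical stand-in for the resource being updated.
type object struct {
	ResourceVersion int
	Phase           string
}

// server is a toy API server; cache is a hypothetical informer cache that
// lags behind the live object and is never refreshed in this sketch.
type server struct{ live, cache object }

// update applies optimistic concurrency against the live object.
func (s *server) update(o object) error {
	if o.ResourceVersion != s.live.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.live = o
	return nil
}

// updatePhase retries up to attempts times, reading the base object via get.
// It returns the number of attempts used and the final error.
func updatePhase(s *server, get func() object, attempts int) (int, error) {
	var err error
	for i := 1; i <= attempts; i++ {
		o := get()
		o.Phase = "Active"
		if err = s.update(o); !errors.Is(err, errConflict) {
			return i, err
		}
	}
	return attempts, err
}

func main() {
	s := &server{live: object{ResourceVersion: 7}, cache: object{ResourceVersion: 6}}

	// Reading from the stale cache conflicts on every attempt.
	n, err := updatePhase(s, func() object { return s.cache }, 3)
	fmt.Println(n, err) // prints "3 conflict"

	// Reading the live object succeeds on the first attempt.
	n, err = updatePhase(s, func() object { return s.live }, 3)
	fmt.Println(n, err) // prints "1 <nil>"
}
```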

Comment thread pkg/model-booster-controller/controller/model_booster_controller.go
@hzxuzhonghu
Member

/lgtm
/approve

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot merged commit 1acf175 into volcano-sh:main May 18, 2026
20 of 21 checks passed

Development

Successfully merging this pull request may close these issues.

flaky e2e controller-manager

5 participants