test/e2e: fix flaky TestKthenaRouterValidatingWebhook caused by webhook race after pod restart #1065

Open
nXtCyberNet wants to merge 4 commits into volcano-sh:main from nXtCyberNet:issue/routee2e

@nXtCyberNet

What type of PR is this?
/kind bug

What this PR does / why we need it:
TestKthenaRouterValidatingWebhook was intermittently failing in CI due to a race
condition between TestRouterConfigUpdate and TestKthenaRouterValidatingWebhook.

TestRouterConfigUpdate deliberately deletes and restarts the kthena-router pod to
verify config reload behaviour, and it always runs immediately before
TestKthenaRouterValidatingWebhook. The validating webhook is not a separate deployment;
it is served by the same kthena-router pod. After the restart, Kubernetes can mark the
pod Ready before the webhook HTTP handler is fully initialised, so the next test starts
in an unstable window and hits transient connection errors.

Two minimal fixes:

  1. Add EOF, connection reset by peer, and no endpoints available to the retryable
    error list in waitForKthenaRouterValidatingWebhook. EOF is the primary failure
    mode: without it, the test exits immediately on the most common error, with no retry.
  2. Call waitForKthenaRouterValidatingWebhook at the end of TestRouterConfigUpdate,
    after the pod restart, so the webhook is stable before the next test starts. This
    fixes the problem at its source rather than defending against its symptoms.
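
The two fixes above can be pictured with a small standalone sketch. It is not the code from this PR: `waitForStable`, `isTransient`, `simulateRestart`, and the `probe` callback are hypothetical names, and `probe` stands in for whatever request the real helper sends to the webhook. Only the six error substrings come from the PR itself.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// transientSubstrings lists error fragments treated as retryable while the
// webhook is still starting. The first three were already handled; the last
// three are the ones this PR adds.
var transientSubstrings = []string{
	"connect: connection refused",
	"i/o timeout",
	"context deadline exceeded",
	"EOF",
	"connection reset by peer",
	"no endpoints available",
}

// isTransient reports whether err looks like a startup-window failure.
func isTransient(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	for _, s := range transientSubstrings {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}

// waitForStable (hypothetical) calls probe until it succeeds, returns a
// non-transient error, or ctx expires.
func waitForStable(ctx context.Context, interval time.Duration, probe func(context.Context) error) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		err := probe(ctx)
		if err == nil {
			return nil
		}
		if !isTransient(err) {
			return err // not a startup symptom: surface it immediately
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("webhook not stable, last error %v: %w", err, ctx.Err())
		case <-ticker.C:
		}
	}
}

// simulateRestart (hypothetical) fakes a webhook that fails `failures` times
// with msg and then succeeds. It returns how many probes were sent.
func simulateRestart(failures int, msg string) (int, error) {
	calls := 0
	probe := func(context.Context) error {
		calls++
		if calls <= failures {
			return errors.New(msg)
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	err := waitForStable(ctx, 5*time.Millisecond, probe)
	return calls, err
}

func main() {
	calls, err := simulateRestart(2, "EOF")
	fmt.Println("probes:", calls, "err:", err)
}
```

In this sketch, calling the wait at the end of the test that restarts the pod corresponds to fix 2, and the extended substring list corresponds to fix 1. An error that matches none of the substrings is returned on the first probe.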

Which issue(s) this PR fixes:
Fixes #1050

Copilot AI review requested due to automatic review settings May 15, 2026 08:44
@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign yaozengzeng for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request exports the WaitForKthenaRouterValidatingWebhook function and enhances its retry logic by including EOF, connection reset by peer, and no endpoints available as transient errors. These changes aim to reduce flakiness during E2E tests when the router pod restarts. Feedback suggests updating the function's documentation to match its new name and refactoring other test components to use this exported implementation instead of maintaining duplicates. Additionally, it is recommended to remove change markers from comments and include 'failed calling webhook' in the retryable error list for improved robustness.

Comment thread test/e2e/router/webhook_test.go Outdated
// deployment. TestRouterConfigUpdate deliberately restarts the kthena-router pod before
// this test runs. Kubernetes can mark the pod Ready before the webhook handler is fully
// initialised, so we retry all transient connection errors until the webhook is stable.
func WaitForKthenaRouterValidatingWebhook(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, namespace string) {
Contributor

medium

The function has been renamed to WaitForKthenaRouterValidatingWebhook, but the doc comment on line 35 still refers to the old name waitForKthenaRouterValidatingWebhook. Please update the comment to match. Additionally, now that this function is exported, consider refactoring test/e2e/router/context/context.go to use this implementation instead of its own duplicate waitForRouterValidatingWebhook to ensure consistent behavior across the test suite.

Comment thread test/e2e/router/webhook_test.go Outdated
Comment thread test/e2e/router/webhook_test.go Outdated
Comment on lines +76 to +81
  if strings.Contains(errStr, "connect: connection refused") ||
      strings.Contains(errStr, "i/o timeout") ||
-     strings.Contains(errStr, "context deadline exceeded") {
+     strings.Contains(errStr, "context deadline exceeded") ||
+     strings.Contains(errStr, "EOF") ||
+     strings.Contains(errStr, "connection reset by peer") ||
+     strings.Contains(errStr, "no endpoints available") {
Contributor

medium

Consider adding failed calling webhook to the list of retryable error strings. This is a common prefix for errors returned by the Kubernetes API server when a webhook call fails, and it is already handled in the duplicate implementation in test/e2e/router/context/context.go.

Suggested change
- if strings.Contains(errStr, "connect: connection refused") ||
-     strings.Contains(errStr, "i/o timeout") ||
-     strings.Contains(errStr, "context deadline exceeded") ||
-     strings.Contains(errStr, "EOF") ||
-     strings.Contains(errStr, "connection reset by peer") ||
-     strings.Contains(errStr, "no endpoints available") {
+ if strings.Contains(errStr, "failed calling webhook") ||
+     strings.Contains(errStr, "connect: connection refused") ||
+     strings.Contains(errStr, "i/o timeout") ||
+     strings.Contains(errStr, "context deadline exceeded") ||
+     strings.Contains(errStr, "EOF") ||
+     strings.Contains(errStr, "connection reset by peer") ||
+     strings.Contains(errStr, "no endpoints available") {
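
To see what the extra prefix would change, here is a small standalone sketch that compares the two substring lists against a few sample messages. The messages are illustrative strings written for this example, not captured CI logs, and `matchesAny`, `withoutPrefix`, `withPrefix`, and the sample names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// withoutPrefix is the retryable list as it stands in this PR.
var withoutPrefix = []string{
	"connect: connection refused",
	"i/o timeout",
	"context deadline exceeded",
	"EOF",
	"connection reset by peer",
	"no endpoints available",
}

// withPrefix adds the substring proposed in the review suggestion.
var withPrefix = append([]string{"failed calling webhook"}, withoutPrefix...)

// matchesAny reports whether msg contains at least one of subs.
func matchesAny(msg string, subs []string) bool {
	for _, s := range subs {
		if strings.Contains(msg, s) {
			return true
		}
	}
	return false
}

// Illustrative messages only (assumed shapes, not real logs).
const (
	sampleEOF    = `Internal error occurred: failed calling webhook "validator.example": failed to call webhook: Post "https://router.example.svc:443/validate": EOF`
	sampleX509   = `Internal error occurred: failed calling webhook "validator.example": failed to call webhook: Post "https://router.example.svc:443/validate": x509: certificate signed by unknown authority`
	sampleDenied = `admission webhook "validator.example" denied the request: invalid spec`
)

func main() {
	for name, msg := range map[string]string{
		"EOF": sampleEOF, "x509": sampleX509, "denied": sampleDenied,
	} {
		fmt.Printf("%-7s without prefix: %-5v with prefix: %v\n",
			name, matchesAny(msg, withoutPrefix), matchesAny(msg, withPrefix))
	}
}
```

Under these assumed message shapes, the prefix also catches call failures whose underlying cause is not listed individually. The trade-off is that a persistent failure, such as a certificate problem, would then be retried until the timeout instead of failing on the first attempt.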

Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Rohan Dev <86916212+nXtCyberNet@users.noreply.github.com>
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
@volcano-sh-bot
Contributor

Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages.

The list of commits with invalid commit messages:

  • 4e3ec10 Update test/e2e/router/webhook_test.go
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Development

Successfully merging this pull request may close these issues.

flaky test TestKthenaRouterValidatingWebhook

3 participants