test/e2e: fix flaky TestKthenaRouterValidatingWebhook caused by webhook race after pod restart#1065
test/e2e: fix flaky TestKthenaRouterValidatingWebhook caused by webhook race after pod restart#1065nXtCyberNet wants to merge 4 commits into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request exports the WaitForKthenaRouterValidatingWebhook function and enhances its retry logic by including EOF, connection reset by peer, and no endpoints available as transient errors. These changes aim to reduce flakiness during E2E tests when the router pod restarts. Feedback suggests updating the function's documentation to match its new name and refactoring other test components to use this exported implementation instead of maintaining duplicates. Additionally, it is recommended to remove change markers from comments and include 'failed calling webhook' in the retryable error list for improved robustness.
| // deployment. TestRouterConfigUpdate deliberately restarts the kthena-router pod before | ||
| // this test runs. Kubernetes can mark the pod Ready before the webhook handler is fully | ||
| // initialised, so we retry all transient connection errors until the webhook is stable. | ||
| func WaitForKthenaRouterValidatingWebhook(t *testing.T, ctx context.Context, kthenaClient *clientset.Clientset, namespace string) { |
There was a problem hiding this comment.
The function has been renamed to WaitForKthenaRouterValidatingWebhook, but the doc comment on line 35 still refers to the old name waitForKthenaRouterValidatingWebhook. Please update the comment to match. Additionally, now that this function is exported, consider refactoring test/e2e/router/context/context.go to use this implementation instead of its own duplicate waitForRouterValidatingWebhook to ensure consistent behavior across the test suite.
| if strings.Contains(errStr, "connect: connection refused") || | ||
| strings.Contains(errStr, "i/o timeout") || | ||
| strings.Contains(errStr, "context deadline exceeded") { | ||
| strings.Contains(errStr, "context deadline exceeded") || | ||
| strings.Contains(errStr, "EOF") || | ||
| strings.Contains(errStr, "connection reset by peer") || | ||
| strings.Contains(errStr, "no endpoints available") { |
There was a problem hiding this comment.
Consider adding failed calling webhook to the list of retryable error strings. This is a common prefix for errors returned by the Kubernetes API server when a webhook call fails, and it is already handled in the duplicate implementation in test/e2e/router/context/context.go.
| if strings.Contains(errStr, "connect: connection refused") || | |
| strings.Contains(errStr, "i/o timeout") || | |
| strings.Contains(errStr, "context deadline exceeded") { | |
| strings.Contains(errStr, "context deadline exceeded") || | |
| strings.Contains(errStr, "EOF") || | |
| strings.Contains(errStr, "connection reset by peer") || | |
| strings.Contains(errStr, "no endpoints available") { | |
| if strings.Contains(errStr, "failed calling webhook") || | |
| strings.Contains(errStr, "connect: connection refused") || | |
| strings.Contains(errStr, "i/o timeout") || | |
| strings.Contains(errStr, "context deadline exceeded") || | |
| strings.Contains(errStr, "EOF") || | |
| strings.Contains(errStr, "connection reset by peer") || | |
| strings.Contains(errStr, "no endpoints available") { |
Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
a02efc9 to
927ed33
Compare
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Rohan Dev <86916212+nXtCyberNet@users.noreply.github.com>
Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
Signed-off-by: nXtCyberNet <rohantech2005@gmail.com>
|
Keywords which can automatically close issues and at(@) or hashtag(#) mentions are not allowed in commit messages. The list of commits with invalid commit messages:
DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
What type of PR is this?
/kind bug
What this PR does / why we need it:
TestKthenaRouterValidatingWebhookwas intermittently failing in CI due to a racecondition between
TestRouterConfigUpdateandTestKthenaRouterValidatingWebhook.TestRouterConfigUpdatedeliberately deletes and restarts thekthena-routerpod toverify config reload behaviour and always runs immediately before
TestKthenaRouterValidatingWebhook. The validating webhook is not a separate deployment— it is served by the same
kthena-routerpod. After the restart Kubernetes marks thepod
Readybefore the webhook HTTP handler is fully initialised, so the next test startsin an unstable window and hits transient connection errors.
Two minimal fixes:
EOF,connection reset by peer, andno endpoints availableto the retryableerror list in
waitForKthenaRouterValidatingWebhook.EOFis the primary failuremode — without it the test exits instantly on the most common error with no retry.
waitForKthenaRouterValidatingWebhookat the end ofTestRouterConfigUpdateafter the pod restart so the webhook is stable before the next test starts, fixing
the problem at its source rather than defending against its symptoms.
Which issue(s) this PR fixes:
Fixes #1050