OCPBUGS-38120: Improve error messages for project Delete errors #520
Conversation
@vrutkovs: This pull request references Jira Issue OCPBUGS-56736, which is invalid.

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/41a86440-3bcb-11f0-9228-f720f943c789-0
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vrutkovs. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing /approve in a comment.
Force-pushed from b1385d3 to 6983f0d.
/payload-job periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b4464210-3c57-11f0-8efa-bdf704c2be55-0
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/023d2050-3c86-11f0-9770-8c3264ee69bc-0
Force-pushed from 6983f0d to 7d73533.
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/795a9cd0-3ca4-11f0-8236-ddb4864c0c81-0
Force-pushed from 7d73533 to 5f18128.
Force-pushed from 5f18128 to a5ca6cb.
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/48de3cd0-3d1f-11f0-9dae-2794b1500ac8-0
/jira refresh
@vrutkovs: This pull request references Jira Issue OCPBUGS-56736, which is valid. 3 validation(s) were run on this bug. Requesting review from QA contact.
Force-pushed from a5ca6cb to 3fdbc85.
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/2a4a5710-3d6d-11f0-898d-a24e04f40707-0
Force-pushed from d429563 to cc4321e.
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a2233090-3f89-11f0-954b-910aecaa9a62-0
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-aws-ovn-techpreview-serial 10
@vrutkovs: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/21ccadd0-3f9e-11f0-81cd-6839ae5e2905-0
```go
opts.Preconditions.UID = &projectObj.UID
opts.Preconditions.ResourceVersion = &projectObj.ResourceVersion
```
What if the client already specified either precondition in its request options? We need to check for that and return a conflict response directly instead of ignoring them. Please add a test case for this too.
```go
}
return &metav1.Status{Status: metav1.StatusSuccess}, false, s.client.Delete(ctx, name, opts)

var lastErr error
err := wait.ExponentialBackoffWithContext(ctx, wait.Backoff{Steps: maxRetriesOnConflict, Duration: maxDuration}, func(ctx context.Context) (bool, error) {
```
If I'm understanding the Godoc for Duration correctly, this means that we will sleep for one second between retries. That seems high to me. I bet it is a lot longer than a typical total latency of both namespace requests combined.
We can configure the other fields for exponential backoff so that initial retry is fairly fast.
Oh, right, somehow I thought "Duration" is the max duration we're allowed to spend. I think `wait.Backoff{Steps: maxRetriesOnConflict, Factor: 1/maxRetriesOnConflict, Cap: maxDuration, Duration: maxDuration/maxRetriesOnConflict}` would make it "up to 1 second" and ensure it has several retries.
```go
case err == nil:
	return true, nil
case kerrors.IsConflict(err):
	lastErr = err
```
Can you add tests showing that retry happens on conflict, no retry happens on non-conflict, and one where retries are exhausted please?
```go
if err != nil && wait.ErrorInterrupted(err) != nil {
	return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr
}
```
The response should indicate timeout if our wait loop times out. If you use https://github.com/kubernetes/kubernetes/blob/62f72addf26d2fd25e060554bcd8cf5bdc10e50c/staging/src/k8s.io/apimachinery/pkg/api/errors/errors.go#L364 as the returned error and nil for the returned runtime.Object, does the client see what we want?
```go
		return false, err
	}
})
if err != nil && wait.ErrorInterrupted(err) != nil {
```
Won't wait.ErrorInterrupted(err) != nil always be true? I think you meant wait.Interrupted(err).
Good catch, updated
```go
	}
})
if err != nil && wait.ErrorInterrupted(err) != nil {
	return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr
```
I don't think this will plumb non-conflict errors to the client. For example, if the namespace is not found then we should return project not found -- is there a test for that?
Added a test for that
Force-pushed from cc4321e to 15d0b7d.
/test e2e-aws-ovn-serial
```go
if err != nil && wait.Interrupted(err) {
	return &metav1.Status{Status: metav1.StatusFailure}, false, lastErr
}
return &metav1.Status{Status: metav1.StatusSuccess}, false, nil
```
What happens when err != nil && !wait.Interrupted(err)?
Right, missed that part: we should replace err with lastErr if it's an interrupt error.
```go
if opts.Preconditions.UID == nil {
	opts.Preconditions.UID = &projectObj.UID
}
if opts.Preconditions.ResourceVersion == nil {
	opts.Preconditions.ResourceVersion = &projectObj.ResourceVersion
}
```
We don't want to retry conflicts that are caused by client-provided preconditions (they are probably doomed unless the request changes).
If we might have propagated one precondition from the request, and added a second precondition here, it becomes hard to robustly determine which precondition caused a conflict. One way to solve this might be to inspect the fresh namespace returned from Get and enforce any client-provided preconditions immediately. After that, we know that both preconditions passed to the namespace Delete came from this code and that a retry might succeed with a newer UID/RV.
Check that fetched project matches RV and/or UID before proceeding with delete
…g an object" This reverts commit 28133f9.
Force-pushed from e0fc634 to 72c35a5.
Force-pushed from 72c35a5 to 450c5eb.
@vrutkovs: The following test failed, say /retest to rerun all failed tests:

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/retitle OCPBUGS-38120: Improve error messages for project Delete errors
@vrutkovs: This pull request references Jira Issue OCPBUGS-38120, which is invalid.

The bug has been updated to refer to the pull request using the external bug tracker.
/jira refresh
@vrutkovs: This pull request references Jira Issue OCPBUGS-38120, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug. Requesting review from QA contact.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting /remove-lifecycle stale. If this issue is safe to close now please do so with /close.

/lifecycle stale
No description provided.