-
Notifications
You must be signed in to change notification settings - Fork 869
refactor: [shard-distributor]Handle context.Cancelled errors #7493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor: [shard-distributor]Handle context.Cancelled errors #7493
Conversation
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…#7490) <!-- Describe what has changed in this PR --> **What changed?** Reverting the trimprefix since we are using constants to compare the values that include that <!-- Tell your future self why have you made these changes --> **Why?** Constants that include the prefix are used to <!-- How have you verified this change? Tested locally? Added a unit test? Checked in staging env? --> **How did you test it?** Deployed in staging <!-- Assuming the worst case, what can be broken when deploying this change to production? --> **Potential risks** Corruption of db, which is already the case. <!-- Is it notable for release? e.g. schema updates, configuration or data migration required? If so, please mention it, and also update CHANGELOG.md --> **Release notes** <!-- Is there any documentation updates should be made for config, https://cadenceworkflow.io/docs/operation-guide/setup/ ? If so, please open an PR in https://github.com/cadence-workflow/cadence-docs --> **Documentation Changes** --------- Signed-off-by: edigregorio <[email protected]> Signed-off-by: Gaziza Yestemirova <[email protected]>
…hardOwner (cadence-workflow#7476) **What changed?** Changed `GetShardOwner` to return an `ExecutorOwnership` struct containing both executor ID and metadata map, instead of just the executor ID string. Also adds a Spectators group so we can easily pass around all spectators. **Why?** Enables callers to access additional executor information like gRPC address for peer routing, without requiring separate lookups. This is needed for implementing canary peer chooser that routes requests to executors based on their addresses. **How did you test it?** Updated all tests to verify metadata is included in responses. Verified locally that ownership information includes metadata. **Potential risks** Low - this is an API enhancement that maintains backward compatibility by returning the same executor ID, just with additional metadata. **Release notes** **Documentation Changes** None --------- Signed-off-by: Jakob Haahr Taankvist <[email protected]> Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
2ab94ef to
af3b6fa
Compare
| if isCancelledOrDeadlineExceeded(err) { | ||
| return | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is expected to be caught in case <-ctx.Done(): instead.
Which means, if ctx will be Done during the GetState we won't see Shard stats cleanup loop cancelled., which will is very unexpected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree it’s a bit counterintuitive at first glance.
In practice the case <-ctx.Done() only fires if the cancellation happens before we enter the branch. In the noisy logs we’re seeing the ticker fire, we drop into the cleanup work, and then the context gets cancelled while GetState is still in flight.
At that point we’re already executing the branch, so the select won’t re-evaluate
the only place we can notice the cancellation is via the error returned from the store.
I added the inline guard so that path exits quietly instead of logging "get executor data: context canceled" since the message is expected any time leadership or shutdown interrupts the iteration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand how it works technically. My point is - this can lead to absense of (which looks like an important) log message of shutting down corresponding to "starting".
We better to preserve that, if it's important. Maybe by logging this outside of select in any way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, we can either log it also in this case, just to make sure we are not missing the implementation, or go to the loop again, in this case there is a risk that we never hit the first case , it should be minimal but exists.
| if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) { | ||
| return nil, ctx.Err() | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose moving isCancelledOrDeadlineExceeded as it is a very pattern and re-use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved IsContextCancellation to store.go
| if ctxErr := ctx.Err(); errors.Is(ctxErr, context.Canceled) || errors.Is(ctxErr, context.DeadlineExceeded) { | ||
| return nil, ctxErr | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this is required? I think it's already handles in the above if err != nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the call succeeds, the context might be cancelled immediately after Get returns, before we iterate the KVs.
By checking ctx.Err() again after the call, we avoid logging or processing stale data when the caller already abandoned the request.
Though I don't have a strong opinion here, I was just adding extra guards to prevent noisy logging.
If you think this is unnecessary I can remove that part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be cancelled in a middle of the KVs iteration as well.
Signed-off-by: Gaziza Yestemirova <[email protected]>
|
|
||
| // IsContextCancellation reports whether the provided error indicates the caller's context | ||
| // has been cancelled or its deadline has been exceeded. | ||
| func IsContextCancellation(err error) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to be some common/utils package
store/ looks like related to persistance, while context cancellation has nothing to do about it in general
| if ctxErr := ctx.Err(); errors.Is(ctxErr, context.Canceled) || errors.Is(ctxErr, context.DeadlineExceeded) { | ||
| return nil, ctxErr | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can be cancelled in a middle of the KVs iteration as well.
| if isCancelledOrDeadlineExceeded(err) { | ||
| return | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand how it works technically. My point is - this can lead to absense of (which looks like an important) log message of shutting down corresponding to "starting".
We better to preserve that, if it's important. Maybe by logging this outside of select in any way.
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
What changed?
Why?
context cancellations are expected when leadership is changed or service is stopped, but they are treated as internal errors and it is polluting the logs. So we are ignoring cancellation errors while maintaining the genuine errors visibility.
How did you test it?
unit-tests
Potential risks
Release notes
Documentation Changes