
Conversation

@gazi-yestemirova
Contributor

@gazi-yestemirova gazi-yestemirova commented Nov 26, 2025

What changed?

  • executorstore.GetState now returns ctx.Err() when etcd calls are aborted by context cancellation, preventing "get executor data" from being logged as an internal failure.
  • runRebalancingLoop, runShardStatsCleanupLoop, and rebalanceShardsImpl now check whether the error is context.Canceled or context.DeadlineExceeded and exit without logging it as a failure.

Why?
Context cancellations are expected when leadership changes or the service is stopped, but they were being treated as internal errors, which polluted the logs. This change ignores cancellation errors while keeping genuine errors visible.

How did you test it?
Unit tests.

Potential risks

Release notes

Documentation Changes

@gazi-yestemirova gazi-yestemirova changed the title Store err logs refactor: [shard-distributor]Remove error logs from store level Nov 28, 2025
gazi-yestemirova and others added 5 commits November 28, 2025 10:44
Signed-off-by: Gaziza Yestemirova <[email protected]>
…#7490)

**What changed?**
Reverting the trimprefix, since the constants used for the comparison include that prefix.

**Why?**
Constants that include the prefix are used for the comparison.

**How did you test it?**
Deployed in staging

**Potential risks**
Corruption of the DB, which is already a pre-existing risk.

**Release notes**

**Documentation Changes**

---------

Signed-off-by: edigregorio <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…hardOwner (cadence-workflow#7476)

**What changed?**
Changed `GetShardOwner` to return an `ExecutorOwnership` struct
containing both executor ID and metadata map, instead of just the
executor ID string.
Also adds a Spectators group so we can easily pass around all
spectators.
**Why?**
Enables callers to access additional executor information like gRPC
address for peer routing, without requiring separate lookups. This is
needed for implementing canary peer chooser that routes requests to
executors based on their addresses.

**How did you test it?**
Updated all tests to verify metadata is included in responses. Verified
locally that ownership information includes metadata.

**Potential risks**
Low - this is an API enhancement that maintains backward compatibility
by returning the same executor ID, just with additional metadata.

**Release notes**

**Documentation Changes**
None

---------

Signed-off-by: Jakob Haahr Taankvist <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
@gazi-yestemirova gazi-yestemirova changed the title refactor: [shard-distributor]Remove error logs from store level refactor: [shard-distributor]Handle context.Cancelled errors Nov 28, 2025
Comment on lines 240 to 242
if isCancelledOrDeadlineExceeded(err) {
return
}
Member

I expected this to be caught by the case <-ctx.Done(): branch instead. This means that if the context is cancelled during GetState, we won't see "Shard stats cleanup loop cancelled.", which is very unexpected.

Contributor Author

Agreed, it's a bit counterintuitive at first glance.
In practice the case <-ctx.Done(): branch only fires if the cancellation happens before we enter the other branch. In the noisy logs we're seeing, the ticker fires, we drop into the cleanup work, and then the context gets cancelled while GetState is still in flight. At that point we're already executing the tick branch, so the select won't re-evaluate; the only place we can notice the cancellation is the error returned from the store.

I added the inline guard so that path exits quietly instead of logging "get executor data: context canceled", since that message is expected any time a leadership change or shutdown interrupts the iteration.

Member

I understand how it works technically. My point is that this can lead to the absence of what looks like an important shutdown log message corresponding to the "starting" one.
If it's important, we'd better preserve it, maybe by logging it outside the select in some way.

Contributor

I agree. We can either log it in this case as well, just to make sure we are not missing it, or go back into the loop so the <-ctx.Done() case fires; in the latter case there is a risk, minimal but real, that we never hit that case.

Comment on lines 236 to 238
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
return nil, ctx.Err()
}
Member

I propose extracting isCancelledOrDeadlineExceeded, as this is a very common pattern, and reusing it here.

Contributor Author

moved IsContextCancellation to store.go

Comment on lines 241 to 243
if ctxErr := ctx.Err(); errors.Is(ctxErr, context.Canceled) || errors.Is(ctxErr, context.DeadlineExceeded) {
return nil, ctxErr
}
Member

Why is this required? I think it's already handled by the if err != nil check above.

Contributor Author

Even though the call succeeds, the context might be cancelled immediately after Get returns, before we iterate the KVs. By checking ctx.Err() again after the call, we avoid logging or processing stale data when the caller has already abandoned the request.

That said, I don't have a strong opinion here; I was just adding extra guards to prevent noisy logging. If you think this is unnecessary, I can remove that part.

Member

It can be cancelled in the middle of the KVs iteration as well.


// IsContextCancellation reports whether the provided error indicates the caller's context
// has been cancelled or its deadline has been exceeded.
func IsContextCancellation(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}
Member

I think it's better placed in some common/utils package.
store/ looks like it relates to persistence, while context cancellation has nothing to do with persistence in general.



Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>