
Conversation

@gazi-yestemirova
Contributor

@gazi-yestemirova gazi-yestemirova commented Nov 26, 2025

What changed?

  • executorstore.GetState now returns ctx.Err() when etcd calls are aborted by context cancellation, preventing "get executor data" from being logged as an internal failure.
  • runRebalancingLoop, runShardStatsCleanupLoop, and rebalanceShardsImpl now check whether the error is context.Canceled or context.DeadlineExceeded and exit without logging it as a failure.

Why?
Context cancellations are expected when leadership changes or the service is stopped, but they were being treated as internal errors, which polluted the logs. This change ignores cancellation errors while keeping genuine errors visible.

How did you test it?
Unit tests.

Potential risks

Release notes

Documentation Changes

@gazi-yestemirova gazi-yestemirova changed the title Store err logs refactor: [shard-distributor]Remove error logs from store level Nov 28, 2025
gazi-yestemirova and others added 5 commits November 28, 2025 10:44
Signed-off-by: Gaziza Yestemirova <[email protected]>
…#7490)

**What changed?**
Reverting the trimprefix, since the constants used for the comparison include that prefix.

**Why?**
Constants that include the prefix are used for the comparison.

**How did you test it?**
Deployed in staging

**Potential risks**
Corruption of the DB, which is already a pre-existing risk.

**Release notes**

**Documentation Changes**

---------

Signed-off-by: edigregorio <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
…hardOwner (cadence-workflow#7476)

**What changed?**
Changed `GetShardOwner` to return an `ExecutorOwnership` struct
containing both executor ID and metadata map, instead of just the
executor ID string.
Also adds a Spectators group so we can easily pass around all
spectators.
**Why?**
Enables callers to access additional executor information like gRPC
address for peer routing, without requiring separate lookups. This is
needed for implementing canary peer chooser that routes requests to
executors based on their addresses.

**How did you test it?**
Updated all tests to verify metadata is included in responses. Verified
locally that ownership information includes metadata.

**Potential risks**
Low - this is an API enhancement that maintains backward compatibility
by returning the same executor ID, just with additional metadata.

**Release notes**

**Documentation Changes**
None

---------

Signed-off-by: Jakob Haahr Taankvist <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>
@gazi-yestemirova gazi-yestemirova changed the title refactor: [shard-distributor]Remove error logs from store level refactor: [shard-distributor]Handle context.Cancelled errors Nov 28, 2025
Comment on lines 240 to 242
if isCancelledOrDeadlineExceeded(err) {
return
}
Member

I expected this to be caught by the case <-ctx.Done(): branch instead. This means that if the context is cancelled during GetState, we won't see "Shard stats cleanup loop cancelled.", which is very unexpected.

Contributor Author

Agreed, it's a bit counterintuitive at first glance.
In practice the case <-ctx.Done(): branch only fires if the cancellation happens before we enter the other branch. In the noisy logs we're seeing, the ticker fires, we drop into the cleanup work, and then the context gets cancelled while GetState is still in flight. At that point we're already executing the tick branch, so the select won't re-evaluate; the only place we can notice the cancellation is the error returned from the store.

I added the inline guard so that path exits quietly instead of logging "get executor data: context canceled", since that message is expected any time a leadership change or shutdown interrupts the iteration.

Member

I understand how it works technically. My point is that this can lead to the absence of what looks like an important shutdown log message corresponding to the "starting" one.
If it's important, we'd better preserve it, maybe by logging it outside the select in some way.

Contributor

I agree. We can either log it in this case as well, just to make sure we are not missing it, or go back into the loop so the <-ctx.Done() case fires; in the latter case there is a risk, minimal but real, that we never hit that case.

Comment on lines 236 to 238
if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
return nil, ctx.Err()
}
Member

I propose extracting isCancelledOrDeadlineExceeded, as this is a very common pattern, and reusing it here.

Contributor Author

moved IsContextCancellation to store.go

Comment on lines 241 to 243
if ctxErr := ctx.Err(); errors.Is(ctxErr, context.Canceled) || errors.Is(ctxErr, context.DeadlineExceeded) {
return nil, ctxErr
}
Member

Why is this required? I think it's already handled by the if err != nil check above.

Contributor Author

Even though the call succeeds, the context might be cancelled immediately after Get returns, before we iterate the KVs. By checking ctx.Err() again after the call, we avoid logging or processing stale data when the caller has already abandoned the request.

That said, I don't have a strong opinion here; I was just adding extra guards to prevent noisy logging. If you think this is unnecessary, I can remove that part.

Member

It can be cancelled in the middle of the KVs iteration as well.


// IsContextCancellation reports whether the provided error indicates the caller's context
// has been cancelled or its deadline has been exceeded.
func IsContextCancellation(err error) bool {
	return errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded)
}
Member

I think it's better placed in some common/utils package.
store/ looks like it relates to persistence, while context cancellation has nothing to do with persistence in general.



Signed-off-by: Gaziza Yestemirova <[email protected]>
Signed-off-by: Gaziza Yestemirova <[email protected]>