[CHASM] Outbound SideEffect task executor #7951
	ctx context.Context,
	task *tasks.ChasmTask,
) error {
	ctx, cancel := context.WithTimeout(ctx, taskTimeout)
We probably need a higher timeout, since Nexus will eventually use CHASM too. Maybe let's do 10s here? The outbound queue has a different worker pool implementation, so a higher timeout should be fine.
We can change that later as well; no strong opinion.
Sure, updated to 10s.
## What changed?
- Revert history task timeout from 10s to 3s for non-outbound tasks.

## Why?
- The original change was made due to a misunderstanding of my comment [here](#7951 (comment)). I meant to suggest using 10s as the timeout only for outbound chasm tasks. Other tasks, like transfer/timer, should still use 3s as the timeout.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
commit 9f30a1d Author: Lina Jodoin <[email protected]> Date: Wed Aug 27 16:08:12 2025 -0700 [Scheduled Actions] Use the Execution returned from PollMutableState when calling GetWorkflowExecutionHistory (#8207) ## What changed? - In WatchWorkflow, we'll now use the Execution returned as a result from `PollMutableState`, instead of the Execution we used as part of the `PollMutableState` request. ## Why? - We have a likely race condition if a workflow starts and completes during the `PollMutableState` call, where our originally-requested Execution is no longer the latest. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8b717ec Author: Roman Dmytrenko <[email protected]> Date: Wed Aug 27 19:34:05 2025 +0000 chore(deps): upgrade go from 1.24.5 to 1.25.0 (#8209) ## What changed? Upgrade go to the 1.25.0 ## How did you test it? - [x] built - [x] run locally and tested manually ~Blocked by #8174~ --------- Signed-off-by: Roman Dmytrenko <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> commit 3f18d90 Author: Stephan Behnke <[email protected]> Date: Tue Aug 26 18:28:26 2025 -0700 GetWorkflowExecutionHistory long poll soft timeout (#8238) ## What changed? Added a "soft timeout" (language used by Workflow Update) to `GetWorkflowExecutionHistory` long polls. ## Why? We don't want to terminate the long poll connection but instead keep it alive by sending a response back just before the timeout. The idea is that this will prevent connections from opening/terminating repeatedly (ie connection churn). ## How did you test it? 
- [ ] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks I was only able to verify this manually by forcing a timeout in the server and verifying that instead of a deadline exceeded I saw a result. I'll assume this will work since the existing code already tried doing just exactly that, but it didn't do it well. commit 64884b1 Author: Sean Kane <[email protected]> Date: Tue Aug 26 14:26:02 2025 -0600 fix: handle nil ptr in legacy batch processing (#8244) ## What changed? `BatchWorkflow` is not yet deprecated, fix was not properly applied on previous PR ## Why? nilptr exception ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks na commit 96c3aae Author: Kent Gruber <[email protected]> Date: Tue Aug 26 13:23:40 2025 -0400 Use better string splitting techniques where possible (#8226) ## What changed? This PR aims to avoid usage of [`strings.Split`](https://pkg.go.dev/strings#Split) where possible in favor of better string splitting techniques, speficially: [`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) where appropriate. There was also a [`strings.Fields`](https://pkg.go.dev/strings#Fields) change I made to use [`strings.FieldsSeq`](https://pkg.go.dev/strings#FieldsSeq) instead, and another for S3 to use the [`path`](https://pkg.go.dev/path) package instead of [`strings.Split`](https://pkg.go.dev/strings#Split). ## Why? 
[`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) are often better options in many cases, and can be _partially_ detected using [`modernize`](https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize): > `stringsseq`: replace Split in "for range strings.Split(...)" by go1.24's more efficient `SplitSeq`, or `Fields` with `FieldSeq`. ## How did you test it? - [X] built - [X] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks There are lots of potentially subtle behaviors from the `strings.Split` (and `strings.Fields`) usage that should be accounted for. If our existing tests don't cover those subtleties, there's risk for introducing an unintended bug. More intricate handling/parsing previously using the `strings` package should get extra attention from reviewers. I've attempted to break up my changes into logical commit chunks to aid in review / help spot potentially concerning changes. commit ebcc3fd Author: Sean Kane <[email protected]> Date: Tue Aug 26 09:15:23 2025 -0600 fix: handle nil ptr in batch processing (#8240) ## What changed? Batch workflows were panicking because executions can be nil, but there is no check to prevent nil pointer exception. ## Why? Prevent nil-ptrs ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks NA commit dfa8b3e Author: Yu Xia <[email protected]> Date: Mon Aug 25 23:40:08 2025 -0700 Adding minimum timeout require in system workflow (#8231) ## What changed? Adding minimum timeout require in system workflow ## Why? The system workflow needs to have sufficient time to execute the defer logic. ## How did you test it? 
- [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8f1ad7c Author: Yu Xia <[email protected]> Date: Mon Aug 25 15:52:12 2025 -0700 Wire up api health monitor component (#8217) ## What changed? Wire up api health monitor component ## Why? The health monitor component did not wire correctly in fx ## How did you test it? - [x] built - [x] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 45da266 Author: pdoerner <[email protected]> Date: Mon Aug 25 11:40:10 2025 -0700 Remove dynamic config warnings for shared structures (#8236) ## What changed? Removed warning logs for shared dynamic config structures ## Why? Was failing integration tests commit 2d74130 Author: Hai Zhao <[email protected]> Date: Mon Aug 25 09:12:03 2025 -0700 Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace (#8234) ## What changed? Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace. ## Why? We want to check replication state quickly from cli. ## How did you test it? - [x] built - [x] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks commit 54893bc Author: David Reiss <[email protected]> Date: Fri Aug 22 17:04:29 2025 -0700 Only force-load child partitions after successful initialization (#8230) ## What changed? The force-load child partitions mechanism should only happen after successful initialization of the root. ## Why? If the root fails to load, things can get stuck in a loop where the root loads the children and the children cause the root to be loaded again (from userdata polling). ## How did you test it? 
- [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 251e20a Author: sivagirish81 <[email protected]> Date: Thu Aug 21 19:08:02 2025 -0700 TaskQueue Fairness Rate Limit (#8135) ## What changed? - Move the rate limit logic for fairness from priMatcher to taskQueuePartitionManagerLevel - Attach the fairness queue rate limit and the per-key rate limit to the simple rate limiter implementation. ## Why? - Implementation of UpdateTaskQueueConfig api for fairness tasks. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [x] added new functional test(s) ## Potential risks N/A commit 16f7688 Author: Will Duan <[email protected]> Date: Fri Aug 22 07:16:52 2025 +0800 Add log for slow replication tasks (#8225) ## What changed? Log replication task details when processing takes too long ## Why? For operator investigation ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks no risk. commit 934c58d Author: Will Duan <[email protected]> Date: Fri Aug 22 06:51:57 2025 +0800 Fix VerifyVersionedTransition Task (#8227) ## What changed? Fix VerifyVersionedTransition Task ## Why? Without fix, there is risk of success the task without verifying. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks no risk. commit f454f2f Author: Yichao Yang <[email protected]> Date: Thu Aug 21 14:32:35 2025 -0700 Revert history task processing timeout change (#8228) ## What changed? - Revert history task timeout from 10s to 3s for non-outbound tasks. ## Why? - The original change was made due to a misunderstanding of my comment [here](#7951 (comment)). 
I was meant to suggest only use 10s as the timeout for outbound chasm tasks. But for other tasks, like transfer/timer, it should still use 3s as the timeout. ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 14d52ce Author: Roman Dmytrenko <[email protected]> Date: Thu Aug 21 15:16:29 2025 +0000 ci: bump golangci-lint from v1.64.8 to v2.4.0 (#8174) ## What changed? Upgrade golangci-lint to v2 ## How did you test it? - [x] run locally and tested manually --------- Signed-off-by: Roman Dmytrenko <[email protected]> commit fb863dc Author: Stephan Behnke <[email protected]> Date: Wed Aug 20 18:19:02 2025 -0700 Explain test build tags and env variables (#7991) ## What changed? Added documentation for test-related build tags and env variables. ## Why? So help developers with their test setup. commit 17c4c07 Author: pdoerner <[email protected]> Date: Wed Aug 20 16:05:45 2025 -0700 Add dynamic config for forwarded Nexus request dispatch type (#8224) ## What changed? Added a new dynamic config to control whether forwarded Nexus HTTP requests should use the same dispatch type as the original request or always use dispatch by namespace + task queue. ## Why? Endpoints do not support replication, so forwarding by endpoint will not work out of the box because the two clusters will have a different ID for the endpoint. commit 1d340f5 Author: pdoerner <[email protected]> Date: Wed Aug 20 15:33:30 2025 -0700 Pass through original HTTP headers for forwarded Nexus requests (#8204) ## What changed? When forwarding Nexus Start/Cancel requests, the original HTTP headers will be passed through without sanitization. ## Why? Some headers that are still needed for the forwarded request may be sanitized during original request processing (e.g. authorization information headers). ## How did you test it? 
existing tests commit 3b1b8d0 Author: Roey Berman <[email protected]> Date: Wed Aug 20 11:07:49 2025 -0600 Upgrade Go SDK to 1.35.0 (#8216) Had to change WorkerDeploymentOptions and VersioningOverride to work with new SDK. Changed the worker setup for the versioning internal workflow replay tests, so I generated a new set of workflow histories to test with. Normally we generate new workflow histories only when we've made a change to the workflow definitions that we want to test (ie. a new patch), but since the operations to generate the workflow histories had to change slightly, I think it makes sense to generate a fresh set of histories to replay-test with next time there is a change. --------- Co-authored-by: Carly de Frondeville <[email protected]> commit 4d1212a Author: Stephan Behnke <[email protected]> Date: Tue Aug 19 11:52:55 2025 -0700 Fix typo in unprocessedUpdateFailure (#8212) WISOTT commit f38c88a Author: David Reiss <[email protected]> Date: Tue Aug 19 06:51:19 2025 -0700 Allow more retries for matching client polls (#8155) ## What changed? Allow frontend->matching poll requests to retry up to their context timeout instead of just once. ## Why? On matching service deployments, a busy new matching node may hit its persistence rps limit trying to acquire new task queues and be unable to accept polls. ## How did you test it? - [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit fdb7f31 Author: Yu Xia <[email protected]> Date: Mon Aug 18 16:24:42 2025 -0700 Change sys background low to use the correct level (#8208) ## What changed? Change sys background low to use the correct level ## Why? Fix this based on the variable name ## How did you test it? 
- [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit e4b378f Author: David Reiss <[email protected]> Date: Mon Aug 18 15:34:34 2025 -0700 Support subscriptions to settings with constrained defaults (#8180) ## What changed? Fill in support for subscriptions to dynamic config values with constrained defaults. ## Why? We'd like to use this combination of functionality. ## How did you test it? - [x] added new unit test(s) commit 8ed0361 Author: David Reiss <[email protected]> Date: Mon Aug 18 15:29:17 2025 -0700 Allow empty data in DataBlob (#8181) ## What changed? Remove check for zero-length data in NewDataBlob. ## Why? Zero-length data is a valid encoding for some encodings, e.g. proto3. NewDataBlob should not have an opinion on the length of data. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Some code may be making assumptions about this behavior. commit 40ac028 Author: David Reiss <[email protected]> Date: Mon Aug 18 12:41:41 2025 -0700 Warn on dynamic config default values with shared structure (#8176) ## What changed? Log softassert warnings if dynamic config settings are registered with default values with shared structure. ## Why? This is very likely unintended and may lead to unexpected behavior of settings (values will be parsed on top of a copy of the default). ## How did you test it? - [x] run locally and tested manually - [x] added new unit test(s) commit 9c75cd6 Author: pdoerner <[email protected]> Date: Fri Aug 15 12:43:47 2025 -0700 Forward Nexus requests using same dispatch type as original request (#8199) ## What changed? 
When forwarding Nexus requests that were originally sent to the `DispatchByEndpoint` URL, the forwarding URL will also be constructed to send the request to the `DispatchByEndpoint` URL on the remote cluster. Previously, we were always sending forwarding requests using `DispatchByNamespaceAndTaskQueue` ## Why? bug fix ## How did you test it? existing tests commit 21f556c Author: Roey Berman <[email protected]> Date: Fri Aug 15 13:13:28 2025 -0600 Commit generated scheduler protos (#8200) ## What - Commit generated scheduler protos. - Improve `make ensure-no-changes` to detect untracked files. ## Why? The protos were not generated since the tool was committed in a separate PR from where the protos were added. commit 4c59cd1 Author: Roey Berman <[email protected]> Date: Fri Aug 15 12:09:58 2025 -0600 Add support for protos in chasm libs (#8182) ## What changed? Added support for defining protos in chasm libs. ## Why? Keep everything local to the library. ## How did you test it? - [x] built - [x] run locally and tested manually commit 08e2dfd Author: pdoerner <[email protected]> Date: Thu Aug 14 16:49:44 2025 -0700 Reconstruct failure for forwarded Nexus completion requests (#8198) ## What changed? When forwarding a `CompleteNexusOperation` HTTP request that contains a failure, the completion will be reconstructed instead of reusing the original request body. ## Why? The Nexus SDK reads and closes the HTTP request body when the operation state is `failed` or `canceled` so we cannot reuse it for the forwarded request. For `successful` operations, the SDK just passes on the result content in the form of a `nexus.LazyValue` which we can forward directly since it is not read or closed. ## How did you test it? new functional xdc tests commit 9d82cae Author: Lina Jodoin <[email protected]> Date: Thu Aug 14 16:06:01 2025 -0700 Fix BufferedStart reference in chasm scheduler proto (#8197) ## What changed? _Describe what has changed in this PR._ ## Why? 
_Tell your future self why have you made these changes._ ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks _Any change is risky. Identify all risks you are aware of. If none, remove this section._ commit 41cce70 Author: Vladyslav Simonenko <[email protected]> Date: Wed Aug 13 15:34:38 2025 -0700 Produce workflow_duration metric on completion (#8185) ## What changed? This PR produces the metric workflow_duration, when the workflow execution completes. ## Why? Currently there is no metric that captures the duration of the workflow execution. It's also valuable to have the duration broken down by task queue, namespace, workflow type, which this PR enables ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s) commit a1df862 Author: Lina Jodoin <[email protected]> Date: Wed Aug 13 15:30:48 2025 -0700 [CHASM Scheduler] Move scheduler protobufs to scheduler/proto package (#8189) ## What changed? - CHASM scheduler protos are moved to live alongside the scheduler implementation code, within the `chasm` package. ## Why? - See #8182. Sending this PR in advance, as that PR asserts protobufs were generated. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8fe5cee Author: Sean Kane <[email protected]> Date: Thu Aug 14 00:10:06 2025 +0200 improvement: remove waits before fetching activities (#8144) ## What changed? optimize the batch operation processing in `BatchActivity` and `BatchActivityWithProtobuf` by removing the need to wait for entire pages to complete before fetching the next page. 
- Implemented proactive page fetching once a worker becomes available - common `processWorkflowsWithProactiveFetching` function to reduce code duplication ## Why? The previous implementation had workers wait for entire pages to complete. This optimization improves resource utilization. The refactoring also eliminates duplicated functions in the `BatchParams` struct and `BatchOperation` protobuf. Addresses issue #8098. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) The changes maintain backward compatibility. ## Potential risks While this change improves performance, it does modify the concurrency model of batch processing: 1. **Timing changes**: The optimization changes when pages are fetched relative to task completion, which could expose edge cases in error handling or heartbeat timing 2. **Memory usage**: Pages may be fetched earlier, potentially increasing peak memory usage if the next page is large 3. **Rate limiting interaction**: The more aggressive task scheduling could interact differently with rate limiting, though the same per-worker limits are maintained 4. **Heartbeat behavior**: heartbeats track the progress of an entire page and are applied after an entire page finishes The changes preserve all existing error handling, retry logic, and rate limiting behavior, but the different execution timing could surface previously hidden race conditions. --------- Co-authored-by: Roey Berman <[email protected]> commit 469526e Author: pdoerner <[email protected]> Date: Wed Aug 13 09:58:03 2025 -0700 Change default for `component.nexusoperations.recordCancelRequestCompletionEvents` (#8191) ## What changed? Changed default for `component.nexusoperations.recordCancelRequestCompletionEvents` to `true` ## Why? Flag was added to ensure backwards compatibility. Now that 1.28 is released, can change the default. 
Flag will be removed after 1.29 is released. commit 69e6b6c Author: Rodrigo Zhou <[email protected]> Date: Tue Aug 12 11:31:27 2025 -0700 Bump Temporal API to v1.52.0 (#8187) ## What changed? Bump Temporal API to v1.52.0 ## Why? Bump Temporal API to v1.52.0 ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks commit da90f62 Author: Stephan Behnke <[email protected]> Date: Fri Aug 8 14:43:15 2025 -0700 Decode of nil data (#8179) ## What changed? Don't catch `Data: nil` in test; let it fall through to decoder. The decoder will return an error. An error is the better choice than a `nil` response since that signals to the user that the decoded data is usable/valid. ## Why? Follow-up to #8111; an internal test expects an error instead of `nil`. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Hard to believe that returning nil and using un-decoded data is a good/valid alternative. commit b8497fa Author: David Reiss <[email protected]> Date: Fri Aug 8 08:32:58 2025 -0700 Dynamic config conversion improvements (#7052) ## What changed? - Split implementation of "constrained default" settings from "plain default" settings. This is more code and the diff looks complex, but the individual paths are both simpler than the mixed version. - Add conversion cache using a weak map. - Remove GlobalCachedTypedValue. - Use "raw" values for subscription dispatch deduping to avoid unnecessary conversions. - Deep copy default values when using mapstructure, to avoid problems with merging over shared default values. ## Why? 
- Fixes #6756 - Performance improvement for "plain default" settings (almost all of them) - Performance improvement for settings with complex converters - Remove footgun in defaults that aren't scalar values ## How did you test it? existing+new unit tests commit f9bd083 Author: Lina Jodoin <[email protected]> Date: Thu Aug 7 17:11:53 2025 -0700 [Scheduled Actions] Update Scheduler protos for CHASM (#8163) ## What changed? - Added protos for the new Scheduler task types. - Added TODOs for cleanup when the HSM component is removed. ## Why? - A few fields and messages were made obsolete with the CHASM port. commit f8b97e5 Author: Vladyslav Simonenko <[email protected]> Date: Thu Aug 7 16:24:54 2025 -0700 Break out of pagination in scavenger on errors (#8133) ## What changed? Break out of the loop, when iteration through mutable states fails ## Why? Previously, we continued to iterate, leading to the panic: #8037 ## How did you test it? - [X] run locally and tested manually - [X] added new unit test(s)
commit 9f30a1d Author: Lina Jodoin <[email protected]> Date: Wed Aug 27 16:08:12 2025 -0700 [Scheduled Actions] Use the Execution returned from PollMutableState when calling GetWorkflowExecutionHistory (#8207) ## What changed? - In WatchWorkflow, we'll now use the Execution returned as a result from `PollMutableState`, instead of the Execution we used as part of the `PollMutableState` request. ## Why? - We have a likely race condition if a workflow starts and completes during the `PollMutableState` call, where our originally-requested Execution is no longer the latest. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8b717ec Author: Roman Dmytrenko <[email protected]> Date: Wed Aug 27 19:34:05 2025 +0000 chore(deps): upgrade go from 1.24.5 to 1.25.0 (#8209) ## What changed? Upgrade go to the 1.25.0 ## How did you test it? - [x] built - [x] run locally and tested manually ~Blocked by #8174~ --------- Signed-off-by: Roman Dmytrenko <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> commit 3f18d90 Author: Stephan Behnke <[email protected]> Date: Tue Aug 26 18:28:26 2025 -0700 GetWorkflowExecutionHistory long poll soft timeout (#8238) ## What changed? Added a "soft timeout" (language used by Workflow Update) to `GetWorkflowExecutionHistory` long polls. ## Why? We don't want to terminate the long poll connection but instead keep it alive by sending a response back just before the timeout. The idea is that this will prevent connections from opening/terminating repeatedly (ie connection churn). ## How did you test it? 
- [ ] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks I was only able to verify this manually by forcing a timeout in the server and verifying that instead of a deadline exceeded I saw a result. I'll assume this will work since the existing code already tried doing just exactly that, but it didn't do it well. commit 64884b1 Author: Sean Kane <[email protected]> Date: Tue Aug 26 14:26:02 2025 -0600 fix: handle nil ptr in legacy batch processing (#8244) ## What changed? `BatchWorkflow` is not yet deprecated, fix was not properly applied on previous PR ## Why? nilptr exception ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks na commit 96c3aae Author: Kent Gruber <[email protected]> Date: Tue Aug 26 13:23:40 2025 -0400 Use better string splitting techniques where possible (#8226) ## What changed? This PR aims to avoid usage of [`strings.Split`](https://pkg.go.dev/strings#Split) where possible in favor of better string splitting techniques, speficially: [`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) where appropriate. There was also a [`strings.Fields`](https://pkg.go.dev/strings#Fields) change I made to use [`strings.FieldsSeq`](https://pkg.go.dev/strings#FieldsSeq) instead, and another for S3 to use the [`path`](https://pkg.go.dev/path) package instead of [`strings.Split`](https://pkg.go.dev/strings#Split). ## Why? 
[`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) are often better options in many cases, and can be _partially_ detected using [`modernize`](https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize): > `stringsseq`: replace Split in "for range strings.Split(...)" by go1.24's more efficient `SplitSeq`, or `Fields` with `FieldSeq`. ## How did you test it? - [X] built - [X] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks There are lots of potentially subtle behaviors from the `strings.Split` (and `strings.Fields`) usage that should be accounted for. If our existing tests don't cover those subtleties, there's risk for introducing an unintended bug. More intricate handling/parsing previously using the `strings` package should get extra attention from reviewers. I've attempted to break up my changes into logical commit chunks to aid in review / help spot potentially concerning changes. commit ebcc3fd Author: Sean Kane <[email protected]> Date: Tue Aug 26 09:15:23 2025 -0600 fix: handle nil ptr in batch processing (#8240) ## What changed? Batch workflows were panicking because executions can be nil, but there is no check to prevent nil pointer exception. ## Why? Prevent nil-ptrs ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks NA commit dfa8b3e Author: Yu Xia <[email protected]> Date: Mon Aug 25 23:40:08 2025 -0700 Adding minimum timeout require in system workflow (#8231) ## What changed? Adding minimum timeout require in system workflow ## Why? The system workflow needs to have sufficient time to execute the defer logic. ## How did you test it? 
- [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8f1ad7c Author: Yu Xia <[email protected]> Date: Mon Aug 25 15:52:12 2025 -0700 Wire up api health monitor component (#8217) ## What changed? Wire up api health monitor component ## Why? The health monitor component did not wire correctly in fx ## How did you test it? - [x] built - [x] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 45da266 Author: pdoerner <[email protected]> Date: Mon Aug 25 11:40:10 2025 -0700 Remove dynamic config warnings for shared structures (#8236) ## What changed? Removed warning logs for shared dynamic config structures ## Why? Was failing integration tests commit 2d74130 Author: Hai Zhao <[email protected]> Date: Mon Aug 25 09:12:03 2025 -0700 Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace (#8234) ## What changed? Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace. ## Why? We want to check replication state quickly from cli. ## How did you test it? - [x] built - [x] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks commit 54893bc Author: David Reiss <[email protected]> Date: Fri Aug 22 17:04:29 2025 -0700 Only force-load child partitions after successful initialization (#8230) ## What changed? The force-load child partitions mechanism should only happen after successful initialization of the root. ## Why? If the root fails to load, things can get stuck in a loop where the root loads the children and the children cause the root to be loaded again (from userdata polling). ## How did you test it? 
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 251e20a
Author: sivagirish81
Date: Thu Aug 21 19:08:02 2025 -0700

TaskQueue Fairness Rate Limit (#8135)

## What changed?
- Move the rate limit logic for fairness from priMatcher to taskQueuePartitionManagerLevel
- Attach the fairness queue rate limit and the per-key rate limit to the simple rate limiter implementation.

## Why?
- Implementation of UpdateTaskQueueConfig api for fairness tasks.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
N/A

commit 16f7688
Author: Will Duan
Date: Fri Aug 22 07:16:52 2025 +0800

Add log for slow replication tasks (#8225)

## What changed?
Log replication task details when processing takes too long

## Why?
For operator investigation

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
no risk.

commit 934c58d
Author: Will Duan
Date: Fri Aug 22 06:51:57 2025 +0800

Fix VerifyVersionedTransition Task (#8227)

## What changed?
Fix VerifyVersionedTransition Task

## Why?
Without the fix, there is a risk that the task succeeds without verifying.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
no risk.

commit f454f2f
Author: Yichao Yang
Date: Thu Aug 21 14:32:35 2025 -0700

Revert history task processing timeout change (#8228)

## What changed?
- Revert history task timeout from 10s to 3s for non-outbound tasks.

## Why?
- The original change was made due to a misunderstanding of my comment [here](#7951 (comment)).
  I meant to suggest using 10s as the timeout only for outbound chasm tasks. Other tasks, like transfer/timer, should still use 3s as the timeout.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 14d52ce
Author: Roman Dmytrenko
Date: Thu Aug 21 15:16:29 2025 +0000

ci: bump golangci-lint from v1.64.8 to v2.4.0 (#8174)

## What changed?
Upgrade golangci-lint to v2

## How did you test it?
- [x] run locally and tested manually

---------

Signed-off-by: Roman Dmytrenko

commit fb863dc
Author: Stephan Behnke
Date: Wed Aug 20 18:19:02 2025 -0700

Explain test build tags and env variables (#7991)

## What changed?
Added documentation for test-related build tags and env variables.

## Why?
To help developers with their test setup.

commit 17c4c07
Author: pdoerner
Date: Wed Aug 20 16:05:45 2025 -0700

Add dynamic config for forwarded Nexus request dispatch type (#8224)

## What changed?
Added a new dynamic config to control whether forwarded Nexus HTTP requests should use the same dispatch type as the original request or always use dispatch by namespace + task queue.

## Why?
Endpoints do not support replication, so forwarding by endpoint will not work out of the box because the two clusters will have a different ID for the endpoint.

commit 1d340f5
Author: pdoerner
Date: Wed Aug 20 15:33:30 2025 -0700

Pass through original HTTP headers for forwarded Nexus requests (#8204)

## What changed?
When forwarding Nexus Start/Cancel requests, the original HTTP headers will be passed through without sanitization.

## Why?
Some headers that are still needed for the forwarded request may be sanitized during original request processing (e.g. authorization information headers).

## How did you test it?
existing tests

commit 3b1b8d0
Author: Roey Berman
Date: Wed Aug 20 11:07:49 2025 -0600

Upgrade Go SDK to 1.35.0 (#8216)

Had to change WorkerDeploymentOptions and VersioningOverride to work with new SDK.

Changed the worker setup for the versioning internal workflow replay tests, so I generated a new set of workflow histories to test with. Normally we generate new workflow histories only when we've made a change to the workflow definitions that we want to test (ie. a new patch), but since the operations to generate the workflow histories had to change slightly, I think it makes sense to generate a fresh set of histories to replay-test with next time there is a change.

---------

Co-authored-by: Carly de Frondeville

commit 4d1212a
Author: Stephan Behnke
Date: Tue Aug 19 11:52:55 2025 -0700

Fix typo in unprocessedUpdateFailure (#8212)

WISOTT

commit f38c88a
Author: David Reiss
Date: Tue Aug 19 06:51:19 2025 -0700

Allow more retries for matching client polls (#8155)

## What changed?
Allow frontend->matching poll requests to retry up to their context timeout instead of just once.

## Why?
On matching service deployments, a busy new matching node may hit its persistence rps limit trying to acquire new task queues and be unable to accept polls.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit fdb7f31
Author: Yu Xia
Date: Mon Aug 18 16:24:42 2025 -0700

Change sys background low to use the correct level (#8208)

## What changed?
Change sys background low to use the correct level

## Why?
Fix this based on the variable name

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit e4b378f
Author: David Reiss
Date: Mon Aug 18 15:34:34 2025 -0700

Support subscriptions to settings with constrained defaults (#8180)

## What changed?
Fill in support for subscriptions to dynamic config values with constrained defaults.

## Why?
We'd like to use this combination of functionality.

## How did you test it?
- [x] added new unit test(s)

commit 8ed0361
Author: David Reiss
Date: Mon Aug 18 15:29:17 2025 -0700

Allow empty data in DataBlob (#8181)

## What changed?
Remove check for zero-length data in NewDataBlob.

## Why?
Zero-length data is a valid encoding for some encodings, e.g. proto3. NewDataBlob should not have an opinion on the length of data.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Some code may be making assumptions about this behavior.

commit 40ac028
Author: David Reiss
Date: Mon Aug 18 12:41:41 2025 -0700

Warn on dynamic config default values with shared structure (#8176)

## What changed?
Log softassert warnings if dynamic config settings are registered with default values with shared structure.

## Why?
This is very likely unintended and may lead to unexpected behavior of settings (values will be parsed on top of a copy of the default).

## How did you test it?
- [x] run locally and tested manually
- [x] added new unit test(s)

commit 9c75cd6
Author: pdoerner
Date: Fri Aug 15 12:43:47 2025 -0700

Forward Nexus requests using same dispatch type as original request (#8199)

## What changed?
When forwarding Nexus requests that were originally sent to the `DispatchByEndpoint` URL, the forwarding URL will also be constructed to send the request to the `DispatchByEndpoint` URL on the remote cluster. Previously, we were always sending forwarding requests using `DispatchByNamespaceAndTaskQueue`.

## Why?
bug fix

## How did you test it?
existing tests

commit 21f556c
Author: Roey Berman
Date: Fri Aug 15 13:13:28 2025 -0600

Commit generated scheduler protos (#8200)

## What
- Commit generated scheduler protos.
- Improve `make ensure-no-changes` to detect untracked files.

## Why?
The protos were not generated since the tool was committed in a separate PR from where the protos were added.

commit 4c59cd1
Author: Roey Berman
Date: Fri Aug 15 12:09:58 2025 -0600

Add support for protos in chasm libs (#8182)

## What changed?
Added support for defining protos in chasm libs.

## Why?
Keep everything local to the library.

## How did you test it?
- [x] built
- [x] run locally and tested manually

commit 08e2dfd
Author: pdoerner
Date: Thu Aug 14 16:49:44 2025 -0700

Reconstruct failure for forwarded Nexus completion requests (#8198)

## What changed?
When forwarding a `CompleteNexusOperation` HTTP request that contains a failure, the completion will be reconstructed instead of reusing the original request body.

## Why?
The Nexus SDK reads and closes the HTTP request body when the operation state is `failed` or `canceled` so we cannot reuse it for the forwarded request. For `successful` operations, the SDK just passes on the result content in the form of a `nexus.LazyValue` which we can forward directly since it is not read or closed.

## How did you test it?
new functional xdc tests

commit 9d82cae
Author: Lina Jodoin
Date: Thu Aug 14 16:06:01 2025 -0700

Fix BufferedStart reference in chasm scheduler proto (#8197)

## What changed?
_Describe what has changed in this PR._

## Why?
_Tell your future self why have you made these changes._

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
_Any change is risky. Identify all risks you are aware of. If none, remove this section._

commit 41cce70
Author: Vladyslav Simonenko
Date: Wed Aug 13 15:34:38 2025 -0700

Produce workflow_duration metric on completion (#8185)

## What changed?
This PR produces the metric workflow_duration, when the workflow execution completes.

## Why?
Currently there is no metric that captures the duration of the workflow execution. It's also valuable to have the duration broken down by task queue, namespace, workflow type, which this PR enables.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)

commit a1df862
Author: Lina Jodoin
Date: Wed Aug 13 15:30:48 2025 -0700

[CHASM Scheduler] Move scheduler protobufs to scheduler/proto package (#8189)

## What changed?
- CHASM scheduler protos are moved to live alongside the scheduler implementation code, within the `chasm` package.

## Why?
- See #8182. Sending this PR in advance, as that PR asserts protobufs were generated.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 8fe5cee
Author: Sean Kane
Date: Thu Aug 14 00:10:06 2025 +0200

improvement: remove waits before fetching activities (#8144)

## What changed?
optimize the batch operation processing in `BatchActivity` and `BatchActivityWithProtobuf` by removing the need to wait for entire pages to complete before fetching the next page.
- Implemented proactive page fetching once a worker becomes available
- common `processWorkflowsWithProactiveFetching` function to reduce code duplication

## Why?
The previous implementation had workers wait for entire pages to complete. This optimization improves resource utilization. The refactoring also eliminates duplicated functions in the `BatchParams` struct and `BatchOperation` protobuf. Addresses issue #8098.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

The changes maintain backward compatibility.

## Potential risks
While this change improves performance, it does modify the concurrency model of batch processing:

1. **Timing changes**: The optimization changes when pages are fetched relative to task completion, which could expose edge cases in error handling or heartbeat timing
2. **Memory usage**: Pages may be fetched earlier, potentially increasing peak memory usage if the next page is large
3. **Rate limiting interaction**: The more aggressive task scheduling could interact differently with rate limiting, though the same per-worker limits are maintained
4. **Heartbeat behavior**: heartbeats track the progress of an entire page and are applied after an entire page finishes

The changes preserve all existing error handling, retry logic, and rate limiting behavior, but the different execution timing could surface previously hidden race conditions.

---------

Co-authored-by: Roey Berman

commit 469526e
Author: pdoerner
Date: Wed Aug 13 09:58:03 2025 -0700

Change default for `component.nexusoperations.recordCancelRequestCompletionEvents` (#8191)

## What changed?
Changed default for `component.nexusoperations.recordCancelRequestCompletionEvents` to `true`

## Why?
Flag was added to ensure backwards compatibility. Now that 1.28 is released, can change the default.
Flag will be removed after 1.29 is released.

commit 69e6b6c
Author: Rodrigo Zhou
Date: Tue Aug 12 11:31:27 2025 -0700

Bump Temporal API to v1.52.0 (#8187)

## What changed?
Bump Temporal API to v1.52.0

## Why?
Bump Temporal API to v1.52.0

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks

commit da90f62
Author: Stephan Behnke
Date: Fri Aug 8 14:43:15 2025 -0700

Decode of nil data (#8179)

## What changed?
Don't catch `Data: nil` in test; let it fall through to decoder. The decoder will return an error. An error is a better choice than a `nil` response, since a `nil` response signals to the user that the decoded data is usable/valid.

## Why?
Follow-up to #8111; an internal test expects an error instead of `nil`.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Hard to believe that returning nil and using un-decoded data is a good/valid alternative.

commit b8497fa
Author: David Reiss
Date: Fri Aug 8 08:32:58 2025 -0700

Dynamic config conversion improvements (#7052)

## What changed?
- Split implementation of "constrained default" settings from "plain default" settings. This is more code and the diff looks complex, but the individual paths are both simpler than the mixed version.
- Add conversion cache using a weak map.
- Remove GlobalCachedTypedValue.
- Use "raw" values for subscription dispatch deduping to avoid unnecessary conversions.
- Deep copy default values when using mapstructure, to avoid problems with merging over shared default values.

## Why?
- Fixes #6756
- Performance improvement for "plain default" settings (almost all of them)
- Performance improvement for settings with complex converters
- Remove footgun in defaults that aren't scalar values

## How did you test it?
existing+new unit tests

commit f9bd083
Author: Lina Jodoin
Date: Thu Aug 7 17:11:53 2025 -0700

[Scheduled Actions] Update Scheduler protos for CHASM (#8163)

## What changed?
- Added protos for the new Scheduler task types.
- Added TODOs for cleanup when the HSM component is removed.

## Why?
- A few fields and messages were made obsolete with the CHASM port.

commit f8b97e5
Author: Vladyslav Simonenko
Date: Thu Aug 7 16:24:54 2025 -0700

Break out of pagination in scavenger on errors (#8133)

## What changed?
Break out of the loop, when iteration through mutable states fails

## Why?
Previously, we continued to iterate, leading to the panic: #8037

## How did you test it?
- [X] run locally and tested manually
- [X] added new unit test(s)
commit 9f30a1d Author: Lina Jodoin <[email protected]> Date: Wed Aug 27 16:08:12 2025 -0700 [Scheduled Actions] Use the Execution returned from PollMutableState when calling GetWorkflowExecutionHistory (#8207) ## What changed? - In WatchWorkflow, we'll now use the Execution returned as a result from `PollMutableState`, instead of the Execution we used as part of the `PollMutableState` request. ## Why? - We have a likely race condition if a workflow starts and completes during the `PollMutableState` call, where our originally-requested Execution is no longer the latest. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8b717ec Author: Roman Dmytrenko <[email protected]> Date: Wed Aug 27 19:34:05 2025 +0000 chore(deps): upgrade go from 1.24.5 to 1.25.0 (#8209) ## What changed? Upgrade go to the 1.25.0 ## How did you test it? - [x] built - [x] run locally and tested manually ~Blocked by #8174~ --------- Signed-off-by: Roman Dmytrenko <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> commit 3f18d90 Author: Stephan Behnke <[email protected]> Date: Tue Aug 26 18:28:26 2025 -0700 GetWorkflowExecutionHistory long poll soft timeout (#8238) ## What changed? Added a "soft timeout" (language used by Workflow Update) to `GetWorkflowExecutionHistory` long polls. ## Why? We don't want to terminate the long poll connection but instead keep it alive by sending a response back just before the timeout. The idea is that this will prevent connections from opening/terminating repeatedly (ie connection churn). ## How did you test it? 
- [ ] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks I was only able to verify this manually by forcing a timeout in the server and verifying that instead of a deadline exceeded I saw a result. I'll assume this will work since the existing code already tried doing just exactly that, but it didn't do it well. commit 64884b1 Author: Sean Kane <[email protected]> Date: Tue Aug 26 14:26:02 2025 -0600 fix: handle nil ptr in legacy batch processing (#8244) ## What changed? `BatchWorkflow` is not yet deprecated, fix was not properly applied on previous PR ## Why? nilptr exception ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks na commit 96c3aae Author: Kent Gruber <[email protected]> Date: Tue Aug 26 13:23:40 2025 -0400 Use better string splitting techniques where possible (#8226) ## What changed? This PR aims to avoid usage of [`strings.Split`](https://pkg.go.dev/strings#Split) where possible in favor of better string splitting techniques, speficially: [`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) where appropriate. There was also a [`strings.Fields`](https://pkg.go.dev/strings#Fields) change I made to use [`strings.FieldsSeq`](https://pkg.go.dev/strings#FieldsSeq) instead, and another for S3 to use the [`path`](https://pkg.go.dev/path) package instead of [`strings.Split`](https://pkg.go.dev/strings#Split). ## Why? 
[`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) are often better options in many cases, and can be _partially_ detected using [`modernize`](https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize): > `stringsseq`: replace Split in "for range strings.Split(...)" by go1.24's more efficient `SplitSeq`, or `Fields` with `FieldSeq`. ## How did you test it? - [X] built - [X] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks There are lots of potentially subtle behaviors from the `strings.Split` (and `strings.Fields`) usage that should be accounted for. If our existing tests don't cover those subtleties, there's risk for introducing an unintended bug. More intricate handling/parsing previously using the `strings` package should get extra attention from reviewers. I've attempted to break up my changes into logical commit chunks to aid in review / help spot potentially concerning changes. commit ebcc3fd Author: Sean Kane <[email protected]> Date: Tue Aug 26 09:15:23 2025 -0600 fix: handle nil ptr in batch processing (#8240) ## What changed? Batch workflows were panicking because executions can be nil, but there is no check to prevent nil pointer exception. ## Why? Prevent nil-ptrs ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks NA commit dfa8b3e Author: Yu Xia <[email protected]> Date: Mon Aug 25 23:40:08 2025 -0700 Adding minimum timeout require in system workflow (#8231) ## What changed? Adding minimum timeout require in system workflow ## Why? The system workflow needs to have sufficient time to execute the defer logic. ## How did you test it? 
- [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8f1ad7c Author: Yu Xia <[email protected]> Date: Mon Aug 25 15:52:12 2025 -0700 Wire up api health monitor component (#8217) ## What changed? Wire up api health monitor component ## Why? The health monitor component did not wire correctly in fx ## How did you test it? - [x] built - [x] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 45da266 Author: pdoerner <[email protected]> Date: Mon Aug 25 11:40:10 2025 -0700 Remove dynamic config warnings for shared structures (#8236) ## What changed? Removed warning logs for shared dynamic config structures ## Why? Was failing integration tests commit 2d74130 Author: Hai Zhao <[email protected]> Date: Mon Aug 25 09:12:03 2025 -0700 Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace (#8234) ## What changed? Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace. ## Why? We want to check replication state quickly from cli. ## How did you test it? - [x] built - [x] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks commit 54893bc Author: David Reiss <[email protected]> Date: Fri Aug 22 17:04:29 2025 -0700 Only force-load child partitions after successful initialization (#8230) ## What changed? The force-load child partitions mechanism should only happen after successful initialization of the root. ## Why? If the root fails to load, things can get stuck in a loop where the root loads the children and the children cause the root to be loaded again (from userdata polling). ## How did you test it? 
- [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 251e20a Author: sivagirish81 <[email protected]> Date: Thu Aug 21 19:08:02 2025 -0700 TaskQueue Fairness Rate Limit (#8135) ## What changed? - Move the rate limit logic for fairness from priMatcher to taskQueuePartitionManagerLevel - Attach the fairness queue rate limit and the per-key rate limit to the simple rate limiter implementation. ## Why? - Implementation of UpdateTaskQueueConfig api for fairness tasks. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [x] added new functional test(s) ## Potential risks N/A commit 16f7688 Author: Will Duan <[email protected]> Date: Fri Aug 22 07:16:52 2025 +0800 Add log for slow replication tasks (#8225) ## What changed? Log replication task details when processing takes too long ## Why? For operator investigation ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks no risk. commit 934c58d Author: Will Duan <[email protected]> Date: Fri Aug 22 06:51:57 2025 +0800 Fix VerifyVersionedTransition Task (#8227) ## What changed? Fix VerifyVersionedTransition Task ## Why? Without fix, there is risk of success the task without verifying. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks no risk. commit f454f2f Author: Yichao Yang <[email protected]> Date: Thu Aug 21 14:32:35 2025 -0700 Revert history task processing timeout change (#8228) ## What changed? - Revert history task timeout from 10s to 3s for non-outbound tasks. ## Why? - The original change was made due to a misunderstanding of my comment [here](#7951 (comment)). 
I was meant to suggest only use 10s as the timeout for outbound chasm tasks. But for other tasks, like transfer/timer, it should still use 3s as the timeout. ## How did you test it? - [x] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 14d52ce Author: Roman Dmytrenko <[email protected]> Date: Thu Aug 21 15:16:29 2025 +0000 ci: bump golangci-lint from v1.64.8 to v2.4.0 (#8174) ## What changed? Upgrade golangci-lint to v2 ## How did you test it? - [x] run locally and tested manually --------- Signed-off-by: Roman Dmytrenko <[email protected]> commit fb863dc Author: Stephan Behnke <[email protected]> Date: Wed Aug 20 18:19:02 2025 -0700 Explain test build tags and env variables (#7991) ## What changed? Added documentation for test-related build tags and env variables. ## Why? So help developers with their test setup. commit 17c4c07 Author: pdoerner <[email protected]> Date: Wed Aug 20 16:05:45 2025 -0700 Add dynamic config for forwarded Nexus request dispatch type (#8224) ## What changed? Added a new dynamic config to control whether forwarded Nexus HTTP requests should use the same dispatch type as the original request or always use dispatch by namespace + task queue. ## Why? Endpoints do not support replication, so forwarding by endpoint will not work out of the box because the two clusters will have a different ID for the endpoint. commit 1d340f5 Author: pdoerner <[email protected]> Date: Wed Aug 20 15:33:30 2025 -0700 Pass through original HTTP headers for forwarded Nexus requests (#8204) ## What changed? When forwarding Nexus Start/Cancel requests, the original HTTP headers will be passed through without sanitization. ## Why? Some headers that are still needed for the forwarded request may be sanitized during original request processing (e.g. authorization information headers). ## How did you test it? 
existing tests commit 3b1b8d0 Author: Roey Berman <[email protected]> Date: Wed Aug 20 11:07:49 2025 -0600 Upgrade Go SDK to 1.35.0 (#8216) Had to change WorkerDeploymentOptions and VersioningOverride to work with new SDK. Changed the worker setup for the versioning internal workflow replay tests, so I generated a new set of workflow histories to test with. Normally we generate new workflow histories only when we've made a change to the workflow definitions that we want to test (ie. a new patch), but since the operations to generate the workflow histories had to change slightly, I think it makes sense to generate a fresh set of histories to replay-test with next time there is a change. --------- Co-authored-by: Carly de Frondeville <[email protected]> commit 4d1212a Author: Stephan Behnke <[email protected]> Date: Tue Aug 19 11:52:55 2025 -0700 Fix typo in unprocessedUpdateFailure (#8212) WISOTT commit f38c88a Author: David Reiss <[email protected]> Date: Tue Aug 19 06:51:19 2025 -0700 Allow more retries for matching client polls (#8155) ## What changed? Allow frontend->matching poll requests to retry up to their context timeout instead of just once. ## Why? On matching service deployments, a busy new matching node may hit its persistence rps limit trying to acquire new task queues and be unable to accept polls. ## How did you test it? - [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit fdb7f31 Author: Yu Xia <[email protected]> Date: Mon Aug 18 16:24:42 2025 -0700 Change sys background low to use the correct level (#8208) ## What changed? Change sys background low to use the correct level ## Why? Fix this based on the variable name ## How did you test it? 
- [x] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit e4b378f Author: David Reiss <[email protected]> Date: Mon Aug 18 15:34:34 2025 -0700 Support subscriptions to settings with constrained defaults (#8180) ## What changed? Fill in support for subscriptions to dynamic config values with constrained defaults. ## Why? We'd like to use this combination of functionality. ## How did you test it? - [x] added new unit test(s) commit 8ed0361 Author: David Reiss <[email protected]> Date: Mon Aug 18 15:29:17 2025 -0700 Allow empty data in DataBlob (#8181) ## What changed? Remove check for zero-length data in NewDataBlob. ## Why? Zero-length data is a valid encoding for some encodings, e.g. proto3. NewDataBlob should not have an opinion on the length of data. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Some code may be making assumptions about this behavior. commit 40ac028 Author: David Reiss <[email protected]> Date: Mon Aug 18 12:41:41 2025 -0700 Warn on dynamic config default values with shared structure (#8176) ## What changed? Log softassert warnings if dynamic config settings are registered with default values with shared structure. ## Why? This is very likely unintended and may lead to unexpected behavior of settings (values will be parsed on top of a copy of the default). ## How did you test it? - [x] run locally and tested manually - [x] added new unit test(s) commit 9c75cd6 Author: pdoerner <[email protected]> Date: Fri Aug 15 12:43:47 2025 -0700 Forward Nexus requests using same dispatch type as original request (#8199) ## What changed? 
When forwarding Nexus requests that were originally sent to the `DispatchByEndpoint` URL, the forwarding URL will also be constructed to send the request to the `DispatchByEndpoint` URL on the remote cluster. Previously, we were always sending forwarding requests using `DispatchByNamespaceAndTaskQueue` ## Why? bug fix ## How did you test it? existing tests commit 21f556c Author: Roey Berman <[email protected]> Date: Fri Aug 15 13:13:28 2025 -0600 Commit generated scheduler protos (#8200) ## What - Commit generated scheduler protos. - Improve `make ensure-no-changes` to detect untracked files. ## Why? The protos were not generated since the tool was committed in a separate PR from where the protos were added. commit 4c59cd1 Author: Roey Berman <[email protected]> Date: Fri Aug 15 12:09:58 2025 -0600 Add support for protos in chasm libs (#8182) ## What changed? Added support for defining protos in chasm libs. ## Why? Keep everything local to the library. ## How did you test it? - [x] built - [x] run locally and tested manually commit 08e2dfd Author: pdoerner <[email protected]> Date: Thu Aug 14 16:49:44 2025 -0700 Reconstruct failure for forwarded Nexus completion requests (#8198) ## What changed? When forwarding a `CompleteNexusOperation` HTTP request that contains a failure, the completion will be reconstructed instead of reusing the original request body. ## Why? The Nexus SDK reads and closes the HTTP request body when the operation state is `failed` or `canceled` so we cannot reuse it for the forwarded request. For `successful` operations, the SDK just passes on the result content in the form of a `nexus.LazyValue` which we can forward directly since it is not read or closed. ## How did you test it? new functional xdc tests commit 9d82cae Author: Lina Jodoin <[email protected]> Date: Thu Aug 14 16:06:01 2025 -0700 Fix BufferedStart reference in chasm scheduler proto (#8197) ## What changed? _Describe what has changed in this PR._ ## Why? 
_Tell your future self why have you made these changes._

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
_Any change is risky. Identify all risks you are aware of. If none, remove this section._

commit 41cce70
Author: Vladyslav Simonenko <[email protected]>
Date: Wed Aug 13 15:34:38 2025 -0700

Produce workflow_duration metric on completion (#8185)

## What changed?
This PR produces the metric workflow_duration, when the workflow execution completes.

## Why?
Currently there is no metric that captures the duration of the workflow execution. It's also valuable to have the duration broken down by task queue, namespace, workflow type, which this PR enables.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)

commit a1df862
Author: Lina Jodoin <[email protected]>
Date: Wed Aug 13 15:30:48 2025 -0700

[CHASM Scheduler] Move scheduler protobufs to scheduler/proto package (#8189)

## What changed?
- CHASM scheduler protos are moved to live alongside the scheduler implementation code, within the `chasm` package.

## Why?
- See #8182. Sending this PR in advance, as that PR asserts protobufs were generated.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 8fe5cee
Author: Sean Kane <[email protected]>
Date: Thu Aug 14 00:10:06 2025 +0200

improvement: remove waits before fetching activities (#8144)

## What changed?
optimize the batch operation processing in `BatchActivity` and `BatchActivityWithProtobuf` by removing the need to wait for entire pages to complete before fetching the next page.
- Implemented proactive page fetching once a worker becomes available
- common `processWorkflowsWithProactiveFetching` function to reduce code duplication

## Why?
The previous implementation had workers wait for entire pages to complete. This optimization improves resource utilization. The refactoring also eliminates duplicated functions in the `BatchParams` struct and `BatchOperation` protobuf. Addresses issue #8098.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

The changes maintain backward compatibility.

## Potential risks
While this change improves performance, it does modify the concurrency model of batch processing:

1. **Timing changes**: The optimization changes when pages are fetched relative to task completion, which could expose edge cases in error handling or heartbeat timing
2. **Memory usage**: Pages may be fetched earlier, potentially increasing peak memory usage if the next page is large
3. **Rate limiting interaction**: The more aggressive task scheduling could interact differently with rate limiting, though the same per-worker limits are maintained
4. **Heartbeat behavior**: heartbeats track the progress of an entire page and are applied after an entire page finishes

The changes preserve all existing error handling, retry logic, and rate limiting behavior, but the different execution timing could surface previously hidden race conditions.

---------

Co-authored-by: Roey Berman <[email protected]>

commit 469526e
Author: pdoerner <[email protected]>
Date: Wed Aug 13 09:58:03 2025 -0700

Change default for `component.nexusoperations.recordCancelRequestCompletionEvents` (#8191)

## What changed?
Changed default for `component.nexusoperations.recordCancelRequestCompletionEvents` to `true`

## Why?
Flag was added to ensure backwards compatibility. Now that 1.28 is released, can change the default.
Flag will be removed after 1.29 is released.

commit 69e6b6c
Author: Rodrigo Zhou <[email protected]>
Date: Tue Aug 12 11:31:27 2025 -0700

Bump Temporal API to v1.52.0 (#8187)

## What changed?
Bump Temporal API to v1.52.0

## Why?
Bump Temporal API to v1.52.0

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks

commit da90f62
Author: Stephan Behnke <[email protected]>
Date: Fri Aug 8 14:43:15 2025 -0700

Decode of nil data (#8179)

## What changed?
Don't catch `Data: nil` in test; let it fall through to decoder. The decoder will return an error. An error is the better choice than a `nil` response, since a `nil` response signals to the user that the decoded data is usable/valid.

## Why?
Follow-up to #8111; an internal test expects an error instead of `nil`.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Hard to believe that returning nil and using un-decoded data is a good/valid alternative.

commit b8497fa
Author: David Reiss <[email protected]>
Date: Fri Aug 8 08:32:58 2025 -0700

Dynamic config conversion improvements (#7052)

## What changed?
- Split implementation of "constrained default" settings from "plain default" settings. This is more code and the diff looks complex, but the individual paths are both simpler than the mixed version.
- Add conversion cache using a weak map.
- Remove GlobalCachedTypedValue.
- Use "raw" values for subscription dispatch deduping to avoid unnecessary conversions.
- Deep copy default values when using mapstructure, to avoid problems with merging over shared default values.

## Why?
- Fixes #6756
- Performance improvement for "plain default" settings (almost all of them)
- Performance improvement for settings with complex converters
- Remove footgun in defaults that aren't scalar values

## How did you test it?
existing+new unit tests

commit f9bd083
Author: Lina Jodoin <[email protected]>
Date: Thu Aug 7 17:11:53 2025 -0700

[Scheduled Actions] Update Scheduler protos for CHASM (#8163)

## What changed?
- Added protos for the new Scheduler task types.
- Added TODOs for cleanup when the HSM component is removed.

## Why?
- A few fields and messages were made obsolete with the CHASM port.

commit f8b97e5
Author: Vladyslav Simonenko <[email protected]>
Date: Thu Aug 7 16:24:54 2025 -0700

Break out of pagination in scavenger on errors (#8133)

## What changed?
Break out of the loop when iteration through mutable states fails.

## Why?
Previously, we continued to iterate, leading to the panic: #8037

## How did you test it?
- [X] run locally and tested manually
- [X] added new unit test(s)
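
For readers skimming the log, the pattern behind "Break out of pagination in scavenger on errors (#8133)" can be sketched roughly as follows. This is only an illustration, not the scavenger's actual code: `pageIterator`, `scan`, and the `failAt` field are hypothetical names invented here, and the real mutable-state iterator is not shown in this log. The point is simply that the loop returns on the first iterator error instead of continuing.

```go
package main

import (
	"errors"
	"fmt"
)

// pageIterator is a hypothetical stand-in for a paginated iterator.
// It is not the scavenger's real type.
type pageIterator struct {
	items  []string
	failAt int // index at which Next starts failing; -1 disables failures
	pos    int
}

// HasNext reports whether there are items left to visit.
func (it *pageIterator) HasNext() bool { return it.pos < len(it.items) }

// Next returns the next item, or an error once the failure index is reached.
// The position is not advanced on failure, so a failed iterator keeps failing.
func (it *pageIterator) Next() (string, error) {
	if it.pos == it.failAt {
		return "", errors.New("persistence unavailable")
	}
	item := it.items[it.pos]
	it.pos++
	return item, nil
}

// scan visits items until the iterator is exhausted or returns an error.
// On the first error it stops and reports what was processed so far,
// rather than continuing to call an iterator that has already failed.
func scan(it *pageIterator) ([]string, error) {
	var seen []string
	for it.HasNext() {
		item, err := it.Next()
		if err != nil {
			return seen, fmt.Errorf("scan aborted after %d items: %w", len(seen), err)
		}
		seen = append(seen, item)
	}
	return seen, nil
}

func main() {
	it := &pageIterator{items: []string{"a", "b", "c"}, failAt: 2}
	seen, err := scan(it)
	fmt.Println(seen, err)
}
```

In this toy version, replacing the early return with `continue` would spin forever because the failed iterator never advances; the commit message reports that the real code panicked instead (#8037).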
commit 9f30a1d Author: Lina Jodoin <[email protected]> Date: Wed Aug 27 16:08:12 2025 -0700 [Scheduled Actions] Use the Execution returned from PollMutableState when calling GetWorkflowExecutionHistory (#8207) ## What changed? - In WatchWorkflow, we'll now use the Execution returned as a result from `PollMutableState`, instead of the Execution we used as part of the `PollMutableState` request. ## Why? - We have a likely race condition if a workflow starts and completes during the `PollMutableState` call, where our originally-requested Execution is no longer the latest. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8b717ec Author: Roman Dmytrenko <[email protected]> Date: Wed Aug 27 19:34:05 2025 +0000 chore(deps): upgrade go from 1.24.5 to 1.25.0 (#8209) ## What changed? Upgrade go to the 1.25.0 ## How did you test it? - [x] built - [x] run locally and tested manually ~Blocked by #8174~ --------- Signed-off-by: Roman Dmytrenko <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> Co-authored-by: Stephan Behnke <[email protected]> commit 3f18d90 Author: Stephan Behnke <[email protected]> Date: Tue Aug 26 18:28:26 2025 -0700 GetWorkflowExecutionHistory long poll soft timeout (#8238) ## What changed? Added a "soft timeout" (language used by Workflow Update) to `GetWorkflowExecutionHistory` long polls. ## Why? We don't want to terminate the long poll connection but instead keep it alive by sending a response back just before the timeout. The idea is that this will prevent connections from opening/terminating repeatedly (ie connection churn). ## How did you test it? 
- [ ] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks I was only able to verify this manually by forcing a timeout in the server and verifying that instead of a deadline exceeded I saw a result. I'll assume this will work since the existing code already tried doing just exactly that, but it didn't do it well. commit 64884b1 Author: Sean Kane <[email protected]> Date: Tue Aug 26 14:26:02 2025 -0600 fix: handle nil ptr in legacy batch processing (#8244) ## What changed? `BatchWorkflow` is not yet deprecated, fix was not properly applied on previous PR ## Why? nilptr exception ## How did you test it? - [ ] built - [ ] run locally and tested manually - [X] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks na commit 96c3aae Author: Kent Gruber <[email protected]> Date: Tue Aug 26 13:23:40 2025 -0400 Use better string splitting techniques where possible (#8226) ## What changed? This PR aims to avoid usage of [`strings.Split`](https://pkg.go.dev/strings#Split) where possible in favor of better string splitting techniques, speficially: [`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) where appropriate. There was also a [`strings.Fields`](https://pkg.go.dev/strings#Fields) change I made to use [`strings.FieldsSeq`](https://pkg.go.dev/strings#FieldsSeq) instead, and another for S3 to use the [`path`](https://pkg.go.dev/path) package instead of [`strings.Split`](https://pkg.go.dev/strings#Split). ## Why? 
[`strings.SplitN`](https://pkg.go.dev/strings#SplitN) and [`strings.SplitSeq`](https://pkg.go.dev/strings#SplitSeq) are often better options, and such cases can be _partially_ detected using [`modernize`](https://pkg.go.dev/golang.org/x/tools/gopls/internal/analysis/modernize):

> `stringsseq`: replace Split in "for range strings.Split(...)" by go1.24's more efficient `SplitSeq`, or `Fields` with `FieldSeq`.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
The existing `strings.Split` (and `strings.Fields`) usage has many potentially subtle behaviors that need to be accounted for. If our existing tests don't cover those subtleties, there is a risk of introducing an unintended bug. More intricate handling/parsing that previously used the `strings` package should get extra attention from reviewers. I've attempted to break up my changes into logical commit chunks to aid in review and help spot potentially concerning changes.

commit ebcc3fd
Author: Sean Kane <[email protected]>
Date: Tue Aug 26 09:15:23 2025 -0600

fix: handle nil ptr in batch processing (#8240)

## What changed?
Batch workflows were panicking because executions can be nil, but there was no check to prevent a nil pointer exception.

## Why?
Prevent nil-ptrs

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [X] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
NA

commit dfa8b3e
Author: Yu Xia <[email protected]>
Date: Mon Aug 25 23:40:08 2025 -0700

Adding minimum timeout require in system workflow (#8231)

## What changed?
Adding a minimum timeout requirement in system workflow.

## Why?
The system workflow needs to have sufficient time to execute the defer logic.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 8f1ad7c
Author: Yu Xia <[email protected]>
Date: Mon Aug 25 15:52:12 2025 -0700

Wire up api health monitor component (#8217)

## What changed?
Wire up api health monitor component

## Why?
The health monitor component was not wired correctly in fx.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 45da266
Author: pdoerner <[email protected]>
Date: Mon Aug 25 11:40:10 2025 -0700

Remove dynamic config warnings for shared structures (#8236)

## What changed?
Removed warning logs for shared dynamic config structures.

## Why?
The warnings were failing integration tests.

commit 2d74130
Author: Hai Zhao <[email protected]>
Date: Mon Aug 25 09:12:03 2025 -0700

Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace (#8234)

## What changed?
Add replication state to response of DescribeNamespace/ListNamespaces/UpdateNamespace.

## Why?
We want to check replication state quickly from the CLI.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks

commit 54893bc
Author: David Reiss <[email protected]>
Date: Fri Aug 22 17:04:29 2025 -0700

Only force-load child partitions after successful initialization (#8230)

## What changed?
The force-load child partitions mechanism should only happen after successful initialization of the root.

## Why?
If the root fails to load, things can get stuck in a loop where the root loads the children and the children cause the root to be loaded again (from userdata polling).

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 251e20a
Author: sivagirish81 <[email protected]>
Date: Thu Aug 21 19:08:02 2025 -0700

TaskQueue Fairness Rate Limit (#8135)

## What changed?
- Move the rate limit logic for fairness from priMatcher to taskQueuePartitionManagerLevel
- Attach the fairness queue rate limit and the per-key rate limit to the simple rate limiter implementation.

## Why?
- Implementation of UpdateTaskQueueConfig api for fairness tasks.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [x] added new functional test(s)

## Potential risks
N/A

commit 16f7688
Author: Will Duan <[email protected]>
Date: Fri Aug 22 07:16:52 2025 +0800

Add log for slow replication tasks (#8225)

## What changed?
Log replication task details when processing takes too long.

## Why?
For operator investigation.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
no risk.

commit 934c58d
Author: Will Duan <[email protected]>
Date: Fri Aug 22 06:51:57 2025 +0800

Fix VerifyVersionedTransition Task (#8227)

## What changed?
Fix VerifyVersionedTransition Task

## Why?
Without the fix, there is a risk of the task succeeding without verification.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
no risk.

commit f454f2f
Author: Yichao Yang <[email protected]>
Date: Thu Aug 21 14:32:35 2025 -0700

Revert history task processing timeout change (#8228)

## What changed?
- Revert history task timeout from 10s to 3s for non-outbound tasks.

## Why?
- The original change was made due to a misunderstanding of my comment [here](#7951 (comment)).
  I meant to suggest using 10s as the timeout only for outbound chasm tasks. Other tasks, like transfer/timer, should still use 3s as the timeout.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 14d52ce
Author: Roman Dmytrenko <[email protected]>
Date: Thu Aug 21 15:16:29 2025 +0000

ci: bump golangci-lint from v1.64.8 to v2.4.0 (#8174)

## What changed?
Upgrade golangci-lint to v2

## How did you test it?
- [x] run locally and tested manually

---------

Signed-off-by: Roman Dmytrenko <[email protected]>

commit fb863dc
Author: Stephan Behnke <[email protected]>
Date: Wed Aug 20 18:19:02 2025 -0700

Explain test build tags and env variables (#7991)

## What changed?
Added documentation for test-related build tags and env variables.

## Why?
To help developers with their test setup.

commit 17c4c07
Author: pdoerner <[email protected]>
Date: Wed Aug 20 16:05:45 2025 -0700

Add dynamic config for forwarded Nexus request dispatch type (#8224)

## What changed?
Added a new dynamic config to control whether forwarded Nexus HTTP requests should use the same dispatch type as the original request or always use dispatch by namespace + task queue.

## Why?
Endpoints do not support replication, so forwarding by endpoint will not work out of the box because the two clusters will have a different ID for the endpoint.

commit 1d340f5
Author: pdoerner <[email protected]>
Date: Wed Aug 20 15:33:30 2025 -0700

Pass through original HTTP headers for forwarded Nexus requests (#8204)

## What changed?
When forwarding Nexus Start/Cancel requests, the original HTTP headers will be passed through without sanitization.

## Why?
Some headers that are still needed for the forwarded request may be sanitized during original request processing (e.g. authorization information headers).

## How did you test it?
existing tests

commit 3b1b8d0
Author: Roey Berman <[email protected]>
Date: Wed Aug 20 11:07:49 2025 -0600

Upgrade Go SDK to 1.35.0 (#8216)

Had to change WorkerDeploymentOptions and VersioningOverride to work with new SDK.

Changed the worker setup for the versioning internal workflow replay tests, so I generated a new set of workflow histories to test with. Normally we generate new workflow histories only when we've made a change to the workflow definitions that we want to test (i.e. a new patch), but since the operations to generate the workflow histories had to change slightly, I think it makes sense to generate a fresh set of histories to replay-test with next time there is a change.

---------

Co-authored-by: Carly de Frondeville <[email protected]>

commit 4d1212a
Author: Stephan Behnke <[email protected]>
Date: Tue Aug 19 11:52:55 2025 -0700

Fix typo in unprocessedUpdateFailure (#8212)

WISOTT

commit f38c88a
Author: David Reiss <[email protected]>
Date: Tue Aug 19 06:51:19 2025 -0700

Allow more retries for matching client polls (#8155)

## What changed?
Allow frontend->matching poll requests to retry up to their context timeout instead of just once.

## Why?
On matching service deployments, a busy new matching node may hit its persistence rps limit trying to acquire new task queues and be unable to accept polls.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit fdb7f31
Author: Yu Xia <[email protected]>
Date: Mon Aug 18 16:24:42 2025 -0700

Change sys background low to use the correct level (#8208)

## What changed?
Change sys background low to use the correct level

## Why?
Fix this based on the variable name

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit e4b378f
Author: David Reiss <[email protected]>
Date: Mon Aug 18 15:34:34 2025 -0700

Support subscriptions to settings with constrained defaults (#8180)

## What changed?
Fill in support for subscriptions to dynamic config values with constrained defaults.

## Why?
We'd like to use this combination of functionality.

## How did you test it?
- [x] added new unit test(s)

commit 8ed0361
Author: David Reiss <[email protected]>
Date: Mon Aug 18 15:29:17 2025 -0700

Allow empty data in DataBlob (#8181)

## What changed?
Remove check for zero-length data in NewDataBlob.

## Why?
Zero-length data is a valid encoding for some encodings, e.g. proto3. NewDataBlob should not have an opinion on the length of data.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Some code may be making assumptions about this behavior.

commit 40ac028
Author: David Reiss <[email protected]>
Date: Mon Aug 18 12:41:41 2025 -0700

Warn on dynamic config default values with shared structure (#8176)

## What changed?
Log softassert warnings if dynamic config settings are registered with default values with shared structure.

## Why?
This is very likely unintended and may lead to unexpected behavior of settings (values will be parsed on top of a copy of the default).

## How did you test it?
- [x] run locally and tested manually
- [x] added new unit test(s)

commit 9c75cd6
Author: pdoerner <[email protected]>
Date: Fri Aug 15 12:43:47 2025 -0700

Forward Nexus requests using same dispatch type as original request (#8199)

## What changed?
When forwarding Nexus requests that were originally sent to the `DispatchByEndpoint` URL, the forwarding URL will also be constructed to send the request to the `DispatchByEndpoint` URL on the remote cluster. Previously, we were always sending forwarding requests using `DispatchByNamespaceAndTaskQueue`.

## Why?
bug fix

## How did you test it?
existing tests

commit 21f556c
Author: Roey Berman <[email protected]>
Date: Fri Aug 15 13:13:28 2025 -0600

Commit generated scheduler protos (#8200)

## What
- Commit generated scheduler protos.
- Improve `make ensure-no-changes` to detect untracked files.

## Why?
The protos were not generated since the tool was committed in a separate PR from where the protos were added.

commit 4c59cd1
Author: Roey Berman <[email protected]>
Date: Fri Aug 15 12:09:58 2025 -0600

Add support for protos in chasm libs (#8182)

## What changed?
Added support for defining protos in chasm libs.

## Why?
Keep everything local to the library.

## How did you test it?
- [x] built
- [x] run locally and tested manually

commit 08e2dfd
Author: pdoerner <[email protected]>
Date: Thu Aug 14 16:49:44 2025 -0700

Reconstruct failure for forwarded Nexus completion requests (#8198)

## What changed?
When forwarding a `CompleteNexusOperation` HTTP request that contains a failure, the completion will be reconstructed instead of reusing the original request body.

## Why?
The Nexus SDK reads and closes the HTTP request body when the operation state is `failed` or `canceled` so we cannot reuse it for the forwarded request. For `successful` operations, the SDK just passes on the result content in the form of a `nexus.LazyValue` which we can forward directly since it is not read or closed.

## How did you test it?
new functional xdc tests

commit 9d82cae
Author: Lina Jodoin <[email protected]>
Date: Thu Aug 14 16:06:01 2025 -0700

Fix BufferedStart reference in chasm scheduler proto (#8197)

## What changed?
_Describe what has changed in this PR._

## Why?
_Tell your future self why have you made these changes._

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
_Any change is risky. Identify all risks you are aware of. If none, remove this section._

commit 41cce70
Author: Vladyslav Simonenko <[email protected]>
Date: Wed Aug 13 15:34:38 2025 -0700

Produce workflow_duration metric on completion (#8185)

## What changed?
This PR produces the `workflow_duration` metric when the workflow execution completes.

## Why?
Currently there is no metric that captures the duration of the workflow execution. It's also valuable to have the duration broken down by task queue, namespace, and workflow type, which this PR enables.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [X] covered by existing tests
- [X] added new unit test(s)
- [ ] added new functional test(s)

commit a1df862
Author: Lina Jodoin <[email protected]>
Date: Wed Aug 13 15:30:48 2025 -0700

[CHASM Scheduler] Move scheduler protobufs to scheduler/proto package (#8189)

## What changed?
- CHASM scheduler protos are moved to live alongside the scheduler implementation code, within the `chasm` package.

## Why?
- See #8182. Sending this PR in advance, as that PR asserts protobufs were generated.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

commit 8fe5cee
Author: Sean Kane <[email protected]>
Date: Thu Aug 14 00:10:06 2025 +0200

improvement: remove waits before fetching activities (#8144)

## What changed?
Optimize the batch operation processing in `BatchActivity` and `BatchActivityWithProtobuf` by removing the need to wait for entire pages to complete before fetching the next page.
- Implemented proactive page fetching once a worker becomes available
- common `processWorkflowsWithProactiveFetching` function to reduce code duplication

## Why?
The previous implementation had workers wait for entire pages to complete. This optimization improves resource utilization. The refactoring also eliminates duplicated functions in the `BatchParams` struct and `BatchOperation` protobuf. Addresses issue #8098.

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [x] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

The changes maintain backward compatibility.

## Potential risks
While this change improves performance, it does modify the concurrency model of batch processing:

1. **Timing changes**: The optimization changes when pages are fetched relative to task completion, which could expose edge cases in error handling or heartbeat timing
2. **Memory usage**: Pages may be fetched earlier, potentially increasing peak memory usage if the next page is large
3. **Rate limiting interaction**: The more aggressive task scheduling could interact differently with rate limiting, though the same per-worker limits are maintained
4. **Heartbeat behavior**: heartbeats track the progress of an entire page and are applied after an entire page finishes

The changes preserve all existing error handling, retry logic, and rate limiting behavior, but the different execution timing could surface previously hidden race conditions.

---------

Co-authored-by: Roey Berman <[email protected]>

commit 469526e
Author: pdoerner <[email protected]>
Date: Wed Aug 13 09:58:03 2025 -0700

Change default for `component.nexusoperations.recordCancelRequestCompletionEvents` (#8191)

## What changed?
Changed default for `component.nexusoperations.recordCancelRequestCompletionEvents` to `true`

## Why?
Flag was added to ensure backwards compatibility. Now that 1.28 is released, can change the default.
Flag will be removed after 1.29 is released.

commit 69e6b6c
Author: Rodrigo Zhou <[email protected]>
Date: Tue Aug 12 11:31:27 2025 -0700

Bump Temporal API to v1.52.0 (#8187)

## What changed?
Bump Temporal API to v1.52.0

## Why?
Bump Temporal API to v1.52.0

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks

commit da90f62
Author: Stephan Behnke <[email protected]>
Date: Fri Aug 8 14:43:15 2025 -0700

Decode of nil data (#8179)

## What changed?
Don't catch `Data: nil` in test; let it fall through to the decoder. The decoder will return an error. An error is a better choice than a `nil` response, since a `nil` response signals to the user that the decoded data is usable/valid.

## Why?
Follow-up to #8111; an internal test expects an error instead of `nil`.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Hard to believe that returning nil and using un-decoded data is a good/valid alternative.

commit b8497fa
Author: David Reiss <[email protected]>
Date: Fri Aug 8 08:32:58 2025 -0700

Dynamic config conversion improvements (#7052)

## What changed?
- Split implementation of "constrained default" settings from "plain default" settings. This is more code and the diff looks complex, but the individual paths are both simpler than the mixed version.
- Add conversion cache using a weak map.
- Remove GlobalCachedTypedValue.
- Use "raw" values for subscription dispatch deduping to avoid unnecessary conversions.
- Deep copy default values when using mapstructure, to avoid problems with merging over shared default values.

## Why?
- Fixes #6756
- Performance improvement for "plain default" settings (almost all of them)
- Performance improvement for settings with complex converters
- Remove footgun in defaults that aren't scalar values

## How did you test it?
existing+new unit tests

commit f9bd083
Author: Lina Jodoin <[email protected]>
Date: Thu Aug 7 17:11:53 2025 -0700

[Scheduled Actions] Update Scheduler protos for CHASM (#8163)

## What changed?
- Added protos for the new Scheduler task types.
- Added TODOs for cleanup when the HSM component is removed.

## Why?
- A few fields and messages were made obsolete with the CHASM port.

commit f8b97e5
Author: Vladyslav Simonenko <[email protected]>
Date: Thu Aug 7 16:24:54 2025 -0700

Break out of pagination in scavenger on errors (#8133)

## What changed?
Break out of the loop when iteration through mutable states fails.

## Why?
Previously, we continued to iterate, leading to the panic: #8037

## How did you test it?
- [X] run locally and tested manually
- [X] added new unit test(s)
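
The pattern behind the scavenger fix in #8133 (stop paginating as soon as the iterator reports an error) can be sketched roughly as follows. This is a minimal illustration and not the scavenger's actual code: `pageIterator`, `failAt`, and `scan` are hypothetical names invented for this example, and the real iterator and the exact cause of the panic in #8037 are not shown in this log.

```go
package main

import (
	"errors"
	"fmt"
)

// pageIterator is a hypothetical stand-in for a paginated store iterator.
// It is not a Temporal type; it exists only for this sketch.
type pageIterator struct {
	pages  [][]string
	failAt int // index of the page whose fetch fails; -1 means never fail
	next   int
}

// HasNext reports whether there are pages left to fetch.
func (it *pageIterator) HasNext() bool { return it.next < len(it.pages) }

// Next returns the next page, or an error if fetching that page fails.
// A failed fetch does not advance the iterator, so it would keep failing.
func (it *pageIterator) Next() ([]string, error) {
	if it.next == it.failAt {
		return nil, errors.New("fetch failed")
	}
	page := it.pages[it.next]
	it.next++
	return page, nil
}

// scan visits every item until the iterator is exhausted or fails.
// On the first error it returns immediately instead of continuing to iterate.
func scan(it *pageIterator) ([]string, error) {
	var seen []string
	for it.HasNext() {
		page, err := it.Next()
		if err != nil {
			// Break out: continuing to call Next on a failed iterator is the
			// kind of behavior the commit message says was removed.
			return seen, fmt.Errorf("scan aborted: %w", err)
		}
		seen = append(seen, page...)
	}
	return seen, nil
}

func main() {
	it := &pageIterator{pages: [][]string{{"a", "b"}, {"c"}, {"d"}}, failAt: 1}
	seen, err := scan(it)
	fmt.Println(seen, err)
}
```

Returning the items seen so far together with the wrapped error lets the caller decide whether partial progress is usable; whether the real scavenger does this is an assumption of the sketch.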
_Tell your future self why have you made these changes._ ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks _Any change is risky. Identify all risks you are aware of. If none, remove this section._ commit 41cce70 Author: Vladyslav Simonenko <[email protected]> Date: Wed Aug 13 15:34:38 2025 -0700 Produce workflow_duration metric on completion (#8185) ## What changed? This PR produces the metric workflow_duration, when the workflow execution completes. ## Why? Currently there is no metric that captures the duration of the workflow execution. It's also valuable to have the duration broken down by task queue, namespace, workflow type, which this PR enables ## How did you test it? - [X] built - [X] run locally and tested manually - [X] covered by existing tests - [X] added new unit test(s) - [ ] added new functional test(s) commit a1df862 Author: Lina Jodoin <[email protected]> Date: Wed Aug 13 15:30:48 2025 -0700 [CHASM Scheduler] Move scheduler protobufs to scheduler/proto package (#8189) ## What changed? - CHASM scheduler protos are moved to live alongside the scheduler implementation code, within the `chasm` package. ## Why? - See #8182. Sending this PR in advance, as that PR asserts protobufs were generated. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) commit 8fe5cee Author: Sean Kane <[email protected]> Date: Thu Aug 14 00:10:06 2025 +0200 improvement: remove waits before fetching activities (#8144) ## What changed? optimize the batch operation processing in `BatchActivity` and `BatchActivityWithProtobuf` by removing the need to wait for entire pages to complete before fetching the next page. 
- Implemented proactive page fetching once a worker becomes available - common `processWorkflowsWithProactiveFetching` function to reduce code duplication ## Why? The previous implementation had workers wait for entire pages to complete. This optimization improves resource utilization. The refactoring also eliminates duplicated functions in the `BatchParams` struct and `BatchOperation` protobuf. Addresses issue #8098. ## How did you test it? - [x] built - [x] run locally and tested manually - [x] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) The changes maintain backward compatibility. ## Potential risks While this change improves performance, it does modify the concurrency model of batch processing: 1. **Timing changes**: The optimization changes when pages are fetched relative to task completion, which could expose edge cases in error handling or heartbeat timing 2. **Memory usage**: Pages may be fetched earlier, potentially increasing peak memory usage if the next page is large 3. **Rate limiting interaction**: The more aggressive task scheduling could interact differently with rate limiting, though the same per-worker limits are maintained 4. **Heartbeat behavior**: heartbeats track the progress of an entire page and are applied after an entire page finishes The changes preserve all existing error handling, retry logic, and rate limiting behavior, but the different execution timing could surface previously hidden race conditions. --------- Co-authored-by: Roey Berman <[email protected]> commit 469526e Author: pdoerner <[email protected]> Date: Wed Aug 13 09:58:03 2025 -0700 Change default for `component.nexusoperations.recordCancelRequestCompletionEvents` (#8191) ## What changed? Changed default for `component.nexusoperations.recordCancelRequestCompletionEvents` to `true` ## Why? Flag was added to ensure backwards compatibility. Now that 1.28 is released, can change the default. 
Flag will be removed after 1.29 is released. commit 69e6b6c Author: Rodrigo Zhou <[email protected]> Date: Tue Aug 12 11:31:27 2025 -0700 Bump Temporal API to v1.52.0 (#8187) ## What changed? Bump Temporal API to v1.52.0 ## Why? Bump Temporal API to v1.52.0 ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [ ] added new unit test(s) - [ ] added new functional test(s) ## Potential risks commit da90f62 Author: Stephan Behnke <[email protected]> Date: Fri Aug 8 14:43:15 2025 -0700 Decode of nil data (#8179) ## What changed? Don't catch `Data: nil` in test; let it fall through to decoder. The decoder will return an error. An error is the better choice than a `nil` response since that signals to the user that the decoded data is usable/valid. ## Why? Follow-up to #8111; an internal test expects an error instead of `nil`. ## How did you test it? - [ ] built - [ ] run locally and tested manually - [ ] covered by existing tests - [x] added new unit test(s) - [ ] added new functional test(s) ## Potential risks Hard to believe that returning nil and using un-decoded data is a good/valid alternative. commit b8497fa Author: David Reiss <[email protected]> Date: Fri Aug 8 08:32:58 2025 -0700 Dynamic config conversion improvements (#7052) ## What changed? - Split implementation of "constrained default" settings from "plain default" settings. This is more code and the diff looks complex, but the individual paths are both simpler than the mixed version. - Add conversion cache using a weak map. - Remove GlobalCachedTypedValue. - Use "raw" values for subscription dispatch deduping to avoid unnecessary conversions. - Deep copy default values when using mapstructure, to avoid problems with merging over shared default values. ## Why? 
- Fixes #6756 - Performance improvement for "plain default" settings (almost all of them) - Performance improvement for settings with complex converters - Remove footgun in defaults that aren't scalar values ## How did you test it? existing+new unit tests commit f9bd083 Author: Lina Jodoin <[email protected]> Date: Thu Aug 7 17:11:53 2025 -0700 [Scheduled Actions] Update Scheduler protos for CHASM (#8163) ## What changed? - Added protos for the new Scheduler task types. - Added TODOs for cleanup when the HSM component is removed. ## Why? - A few fields and messages were made obsolete with the CHASM port. commit f8b97e5 Author: Vladyslav Simonenko <[email protected]> Date: Thu Aug 7 16:24:54 2025 -0700 Break out of pagination in scavenger on errors (#8133) ## What changed? Break out of the loop when iteration through mutable states fails. ## Why? Previously, we continued to iterate, which led to the panic in #8037. ## How did you test it? - [X] run locally and tested manually - [X] added new unit test(s)