DRIVERS-1934: withTransaction API retries too frequently #1851

sleepyStick · 2025-10-17T23:04:58Z

Please complete the following before merging:

Is the relevant DRIVERS ticket in the PR title?

Update changelog.
Test changes in at least one language driver.
Test these changes against all server versions and topologies (including standalone, replica set, and sharded
clusters).

sleepyStick · 2025-10-17T23:06:30Z

source/transactions-convenient-api/transactions-convenient-api.md

 callbacks.

-A previous design had no limits for retrying commits or entire transactions. The callback is always able indicate that
+A previous design had no limits for retrying commits or entire transactions. The callback is always able to indicate that


Maybe I read it wrong, or maybe it's a typo? Not sure.

sleepyStick · 2025-10-17T23:08:32Z

source/transactions-convenient-api/transactions-convenient-api.md

        [abortTransaction](../transactions/transactions.md#aborttransaction) on the session.
    2. If the callback's error includes a "TransientTransactionError" label and the elapsed time of `withTransaction` is
-        less than 120 seconds, jump back to step two.
+        less than 120 seconds, sleep for `jitter * min(BACKOFF_INITIAL * (1.25**retry), BACKOFF_MAX)` where: 


I know Python uses `**` as the exponentiation operator, but I don't believe that is standard across other programming languages. Is it clear that it's an exponent? Is there a different preferred symbol for exponentiation? (I know math commonly uses `^`, but that typically means bitwise XOR in code, so I felt it could be confusing.)

** seems fine to me, but I might also be biased because that's what Node does 🙂. But I don't think people will have trouble understanding what this means.
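
For readers less used to `**`, the sleep expression from the diff above can be written out as a small function. This is only a sketch: `BackoffParams`, `backoffMs`, and the example values 5 and 500 are illustrative assumptions, not names or constants confirmed by the spec; only the 1.25 growth factor and the shape of the formula come from the diff.

```typescript
// Hypothetical parameter bag; the names are illustrative, not taken from the spec.
interface BackoffParams {
  initialMs: number; // BACKOFF_INITIAL in the formula
  maxMs: number;     // BACKOFF_MAX in the formula
  growth: number;    // 1.25 in the diff under review
}

// jitter * min(BACKOFF_INITIAL * (growth ** retry), BACKOFF_MAX)
function backoffMs(retry: number, jitter: number, p: BackoffParams): number {
  // `growth ** retry` is exponentiation; Math.pow(p.growth, retry) is equivalent.
  const uncapped = p.initialMs * p.growth ** retry;
  return jitter * Math.min(uncapped, p.maxMs);
}

// Example values are assumptions chosen for illustration only.
const exampleParams: BackoffParams = { initialMs: 5, maxMs: 500, growth: 1.25 };
const firstRetry = backoffMs(0, 1, exampleParams);    // full jitter, no growth yet
const cappedRetry = backoffMs(100, 1, exampleParams); // large retry counts hit the cap
```

Writing the exponent as `Math.pow` in the spec would sidestep the operator question entirely, at the cost of being a little more verbose.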

ShaneHarvey · 2025-10-20T20:47:35Z

source/transactions-convenient-api/tests/README.md

+### Retry Backoff is Enforced
+
+Drivers should test that retries within `withTransaction` do not occur immediately. Ideally, set BACKOFF_INITIAL 500ms
+and configure a failpoint that forces one retry. Ensure that the operation took more than 500ms so succeed.


This test would not work as described because jitter is non-deterministic.

An alternative test would be to use a failpoint to fail the transaction X times and then assert the overall time is larger than some threshold.

Yeah, I realized that when implementing it. I set the fail point to fail 3 times and that seems to be working consistently (and without jitter, failing 3 times would still cause the success to happen within 500ms).

Update: it was consistent locally but super flaky on Atlas. I modified the `withTransaction` code to append backoff values to make this easier to test, and the test is now more stable.

sleepyStick · 2025-10-23T17:30:03Z

source/transactions-convenient-api/tests/README.md


+### Retry Backoff is Enforced
+
+Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3


3 here was a bit of an arbitrary number. The most important part of this value is that it's greater than 1.
3 just felt small enough to be a quick test but big enough to conclude backoff is consistently happening.
If folks have more opinions on this number, I'm not attached to 3.

I'm not sure the `_transaction_retry_backoffs` concept is viable for testing here since:

- Many languages don't have a way to implement it without making the attribute public.
- It's an unbounded list that can grow forever.
- There's no prior art for something like this in the driver as far as I know.

Instead I'd suggest my test from earlier where we fail the transaction X times and assert the run time is greater than some threshold T. X should be large enough to reduce false positives where the test fails due to jitter resulting in a small delay for every retry.

We can calculate T by recording the command failed+succeeded events, summing their duration, and adding a fixed constant.

Makes sense. It took me a bit, but I figured out what X and T are for Python -- I don't know if they'll be the same for the other languages tho? Should the test description leave it as X and T? Or should I put in the numbers that I have and see what others have to say about it?

An alternative I'm considering in Node is to configure the random number generator to be deterministic for testing purposes, because this would make the tests both simpler to reason about and deterministic. For example, make random() always return 1; then we can make assertions on the timing of retries deterministically.

Ohhh! I like that idea! Would you want that to replace the current proposed test idea, or is this in addition to the previously defined test?
I'm imagining this as a second test where we fail the transactions a handful of times, calculate the backoff times assuming random() always returns 1, and ensure that the transaction takes longer than the sum of the backoff times.

The main reason to keep the first test (where we don't configure random) is to just ensure that jitter is applied? If so, then should that first test be modified to assert that the transaction succeeds in < T where T is the sum of the maximum backoff times?

Then together we're effectively checking a minimum and maximum threshold on the backoff? Did that make sense?

It depends exactly what we think we need to test here, I think. If the goal is only to test that retry backoff is enforced, I think your current implementation works.

Although I'd be worried about flakiness, because anything that relies on random values is inherently non-deterministic, and just due to the distribution of the delays it seems like flakes might be likely:

- assuming jitter = 1, the max delay for the test is 3064ms
- the last 4 retries account for almost 1750ms of delay in the test (330.8ms, 413.5ms, 500ms, 500ms)

So, short delays in the last few retries could make the test run a lot shorter than expected. A deterministic jitter would solve this problem. That, and the test is a bit simpler to reason about imo (I spent a bit of time trying to calculate the probability that this test fails assuming random() returns uniformly distributed values in [0,1], which goes away if random() is deterministic).

I like the idea of keeping both tests if we do want to make sure jitter is applied, but a deterministic alternative would be to make random() return a non-1 value (like .5) and assert that the test takes between [total sleep time with jitter = .5, total sleep time with jitter = 1 (or some other value)].

I also know this is feasible in Node, I'm not sure how feasible configuring the random() function would be in other drivers.
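
The 3064ms figure can be reproduced with a short calculation. The parameters below are assumptions inferred from the numbers quoted in this comment and the pseudocode at the time (1ms initial backoff, 1.25 growth, a 500ms cap, and 30 retries); none of them is stated explicitly here, and the helper names are made up for illustration.

```typescript
// Delay before each retry when jitter is pinned to a fixed value.
function backoffSchedule(
  retries: number,
  initialMs: number,
  growth: number,
  maxMs: number,
  jitter: number,
): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < retries; retry++) {
    delays.push(jitter * Math.min(initialMs * growth ** retry, maxMs));
  }
  return delays;
}

const sumMs = (xs: number[]): number => xs.reduce((acc, x) => acc + x, 0);

// Assumed parameters: 30 retries, 1ms initial, 1.25 growth, 500ms cap, jitter = 1.
const maxSchedule = backoffSchedule(30, 1, 1.25, 500, 1);
const totalMs = sumMs(maxSchedule);              // about 3064ms
const lastFourMs = sumMs(maxSchedule.slice(-4)); // about 1744ms
```

Under these assumptions the last four retries contribute more than half of the total, which is why a few small random jitter values late in the run can shorten the test so much.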

I think theoretically the test described by Shane works, but I agree that flakiness is likely an issue. I tried to account for that with the "optionally change initial_backoff to a higher value", but I don't know how feasible that is across various drivers (hence the optional).

As you've stated, jitter on the later attempts has a higher impact than on the earlier ones, but I generally hope that over time the average of random should be ~0.5, which sanity-checked my initial 1.25 second timeout (discovered through guess and check, for better or worse) with random jitter -- but clearly we've been observing that it's not consistent enough across languages.

Noting my journey in trying to get values that were consistent in python and seeing that they don't carry over to Node seems to imply that if this test were to actually be implemented, each driver may have to find their own values of X and T? which feels silly imo?

All of that is to say, I think at this point I'm convinced a deterministic approach is the way to go for tests. Just to be clear, you are suggesting that the two tests be:

- random always returns 1, and assert that the test takes more than the total sleep time (with jitter = 1)
- random always returns a non-1 value, a, and assert that the test takes between [total sleep time with jitter = b, total sleep time with jitter = c (or some other value)].

Does a = b, or can it be such that b <= a? Obviously c > a, correct?
How do we want to decide on a, b, and c? There is a part of me that worries that if the time difference between total sleep time with c vs total sleep time with a isn't big enough, the rest of the driver operations could push the total test time > total sleep time with jitter = c? idk, I think I'm rambling now.
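
One way to picture the deterministic approach discussed in this thread is to pass the random source into the retry loop so a test can pin it. The sketch below is not the spec's algorithm or any driver's API: `RetryDeps`, `runWithBackoff`, and `recordSleeps` are hypothetical names, the constants are placeholder assumptions, and the injected `sleep` (which lets the example avoid real waiting) goes beyond what was proposed here.

```typescript
// Hypothetical dependencies a test could override.
interface RetryDeps {
  random: () => number;        // jitter source, expected to return a value in [0, 1]
  sleep: (ms: number) => void; // a real driver would block or await here
}

// Placeholder constants, assumed for illustration only.
const INITIAL_MS = 1;
const GROWTH = 1.25;
const MAX_MS = 500;

// Calls `attempt` until it returns true, backing off between failed attempts.
function runWithBackoff(attempt: () => boolean, deps: RetryDeps, maxRetries: number): boolean {
  for (let retry = 0; ; retry++) {
    if (attempt()) return true;
    if (retry >= maxRetries) return false;
    deps.sleep(deps.random() * Math.min(INITIAL_MS * GROWTH ** retry, MAX_MS));
  }
}

// Runs an operation that fails `failures` times with jitter pinned to `jitter`,
// and returns the sleeps that were requested instead of actually waiting.
function recordSleeps(failures: number, jitter: number): number[] {
  const sleeps: number[] = [];
  let remaining = failures;
  const ok = runWithBackoff(
    () => remaining-- <= 0,
    { random: () => jitter, sleep: (ms) => { sleeps.push(ms); } },
    failures,
  );
  if (!ok) throw new Error("operation did not succeed within the retry budget");
  return sleeps;
}

const fullJitter = recordSleeps(3, 1);   // [1, 1.25, 1.5625]
const halfJitter = recordSleeps(3, 0.5); // [0.5, 0.625, 0.78125]
```

With the random source pinned, the totals for jitter = 0.5 and jitter = 1 give a lower and an upper bound that do not depend on how the random values happen to fall.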

ShaneHarvey · 2025-10-24T17:48:13Z

source/transactions-convenient-api/tests/README.md


+### Retry Backoff is Enforced
+
+Drivers should test that retries within `withTransaction` do not occur immediately. Configure a fail point that forces 3


I'm not sure the `_transaction_retry_backoffs` concept is viable for testing here since:

- Many languages don't have a way to implement it without making the attribute public.
- It's an unbounded list that can grow forever.
- There's no prior art for something like this in the driver as far as I know.

Instead I'd suggest my test from earlier where we fail the transaction X times and assert the run time is greater than some threshold T. X should be large enough to reduce false positives where the test fails due to jitter resulting in a small delay for every retry.

We can calculate T by recording the command failed+succeeded events, summing their duration, and adding a fixed constant.

ShaneHarvey · 2025-11-13T23:12:15Z

source/transactions-convenient-api/tests/README.md

+    },
+    "data": {
+        "failCommands": ["commitTransaction"],
+        "errorCode": 24,


Can we use this error code instead for these tests?:

code: 251, name: NoSuchTransaction

Changed in f40529f
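
For context, a `failCommand` fail point using the suggested code might be built like this. The `failCommand` fail point and its `mode` and `data` fields are the server's existing testing mechanism, while the helper name `commitFailPoint` and the count passed to it are illustrative assumptions rather than values taken from the test file.

```typescript
// Shape of the command document sent via `configureFailPoint`.
interface FailCommandFailPoint {
  configureFailPoint: "failCommand";
  mode: { times: number };
  data: { failCommands: string[]; errorCode: number };
}

const NO_SUCH_TRANSACTION = 251; // code: 251, name: NoSuchTransaction

// Hypothetical helper: fail `commitTransaction` the given number of times.
function commitFailPoint(times: number): FailCommandFailPoint {
  return {
    configureFailPoint: "failCommand",
    mode: { times },
    data: { failCommands: ["commitTransaction"], errorCode: NO_SUCH_TRANSACTION },
  };
}

// The count here is arbitrary and only for illustration.
const failPoint = commitFailPoint(3);
```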

ShaneHarvey · 2025-11-13T23:19:31Z

source/transactions-convenient-api/tests/README.md

+```
+
+Let the callback for the transaction be a simple `insertOne` command. Check that the total time for all retries exceeded
+3.5 seconds.


These two tests are identical. What if we run this test twice, first with backoff disabled (e.g. jitter set to always be 0%) and second with it enabled and jitter set to always be 100%? Then we can assert the second attempt takes 3 seconds longer.

Ooo I like this idea! Changed in f40529f

I decreased the number of failures to 13 so the test wouldn't take as long (resulting in a total backoff time of 2.2 seconds) and added a +/- 1 second window for the difference between the two attempts.
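
The comparison described here can be sketched as a simple predicate. The function and parameter names are hypothetical; the 2.2 second expected backoff and the 1 second window are the figures given in this reply.

```typescript
const EXPECTED_BACKOFF_MS = 2200; // total backoff quoted for 13 retries with jitter = 1
const WINDOW_MS = 1000;           // +/- tolerance between the two runs

// True when the run with backoff took roughly `expectedBackoffMs` longer than the baseline.
function backoffWasApplied(
  noBackoffTimeMs: number,
  withBackoffTimeMs: number,
  expectedBackoffMs: number = EXPECTED_BACKOFF_MS,
  windowMs: number = WINDOW_MS,
): boolean {
  const difference = withBackoffTimeMs - noBackoffTimeMs;
  return Math.abs(difference - expectedBackoffMs) <= windowMs;
}

// Made-up timings for illustration: a 300ms baseline and a 2600ms run with backoff.
const plausible = backoffWasApplied(300, 2600); // difference of 2300ms is inside the window
const tooFast = backoffWasApplied(300, 400);    // difference of 100ms is outside the window
```

Comparing against a measured baseline rather than a fixed threshold keeps server and network latency out of the assertion, since both runs pay roughly the same cost for the 13 failed commits.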

ShaneHarvey · 2025-11-18T21:05:41Z

source/transactions-convenient-api/tests/README.md

+```
+
+Let the callback for the transaction be a simple `insertOne` command. Let `no_backoff_time` be the time it took for the
+command to succeed.


"command" -> "withTransaction call"

changed in 56e88f0

ShaneHarvey · 2025-11-18T21:09:12Z

source/transactions-convenient-api/tests/README.md

+return `1`. Set the fail point to force 13 retries using the same command as before. Using the same callback as before,
+check that the total time for the withTransaction command is within +/-1 second of `no_backoff_time` plus 2.2 seconds.
+Note that 2.2 seconds is the sum of backoff 13 consecutive backoff values and the 1-second window is just to account for
+potential networking differences between the two runs.


minor nit for tone:

"the 1-second window is just to account for potential networking differences between the two runs."
->
"the 1-second window accounts for potential variance between the two runs."

changed in 56e88f0

ShaneHarvey · 2025-11-18T21:10:54Z

source/transactions-convenient-api/transactions-convenient-api.md

 This method can be expressed by the following pseudo-code:

 ```typescript
+var BACKOFF_INITIAL = 1  // 1ms initial backoff


Let's update this to use 5ms as described in the tech doc. And the growth rate as well.

changed in 56e88f0

baileympearson · 2025-11-19T22:56:48Z

source/transactions-convenient-api/transactions-convenient-api.md

 withTransaction(callback, options) {
    // Note: drivers SHOULD use a monotonic clock to determine elapsed time
    var startTime = Date.now(); // milliseconds since Unix epoch
+    var TIMEOUT_MS = timeoutMS is None ? 120000 : TIMEOUT_MS


I didn't actually realize this pseudocode is my language 😅. None is not valid TS. I also forgot that it's slightly more complicated; a timeoutMS of 0 = no timeout. So it's actually gotta look something like:

Suggested change

-    var TIMEOUT_MS = timeoutMS is None ? 120000 : TIMEOUT_MS
+    var TIMEOUT_MS = typeof timeoutMS === 'number'
+      ? timeoutMS === 0
+        ? Infinity
+        : timeoutMS
+      : 120;

aand I'm now just remembering that it's additionally complicated, because we can inherit the timeoutMS from the session.

I'll need to give this dedicated focus tomorrow just to make sure it's right lol. But if you have a chance, want to give it a shot? It probably requires cross-referencing the CSOT spec for timeout rules for withTransaction with the pseudocode here and the convenient API spec.

LOL I did find it a bit odd that "pseudocode" seemed to basically just be TS in this doc, but I'll take a closer look -- I'm not super familiar with the CSOT spec so I'll have to give that a closer read. I'll drop anything I find here tho
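
The nested ternary in the suggestion can also be read as a small function. This sketch covers only the two rules mentioned in this thread (an unset `timeoutMS` falls back to the existing limit, and `0` means no timeout); it deliberately ignores inheritance from the session, which is still an open question above. The name `resolveTimeoutMs` is hypothetical, and the 120000 default is an assumption taken from the line being replaced, since the `120` in the suggestion looks like seconds rather than milliseconds.

```typescript
const LEGACY_TIMEOUT_MS = 120000; // 120 seconds, from the line being replaced

// Hypothetical helper; session-level timeoutMS inheritance is not modeled.
function resolveTimeoutMs(timeoutMS?: number): number {
  if (typeof timeoutMS !== "number") {
    return LEGACY_TIMEOUT_MS; // timeoutMS not set
  }
  // A timeoutMS of 0 means "no timeout".
  return timeoutMS === 0 ? Infinity : timeoutMS;
}

const unsetTimeout = resolveTimeoutMs();        // 120000
const unlimitedTimeout = resolveTimeoutMs(0);   // Infinity
const explicitTimeout = resolveTimeoutMs(5000); // 5000
```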

baileympearson · 2025-11-19T23:02:39Z

source/transactions-convenient-api/tests/README.md

+        > [!NOTE]
+        > errorCode 251 is NoSuchTransaction.


This isn't formatting for me? Maybe it's something to do with the indentation?

hmm wacky? It seemed to render in my preview in PyCharm when I was editing this, so I assumed it worked. I'll play around with the indentation after looking into CSOT. Worst case I'll be boring and just have this note be normal text lol

add exponential backoff to withTransactions API

25759c4

sleepyStick commented Oct 17, 2025

run pre-commit

bdcd2ef

ShaneHarvey requested changes Oct 20, 2025

sleepyStick added 2 commits October 20, 2025 16:07

add design rational for backoff

48890a2

fix prose test

71ba1ba

sleepyStick commented Oct 23, 2025

sleepyStick requested a review from ShaneHarvey October 23, 2025 17:30

ShaneHarvey requested changes Oct 24, 2025

baileympearson requested changes Oct 27, 2025

sleepyStick added 2 commits October 27, 2025 15:07

fix pseudocode

b602606

fix test

057fbbf

baileympearson reviewed Oct 29, 2025

sleepyStick added 3 commits October 29, 2025 15:38

account for CSOT / timeoutMS in algorithm

42e4d94

add more details to tests

c11aef8

add second test that is deterministic

a6b7b95

sleepyStick marked this pull request as ready for review November 12, 2025 21:44

sleepyStick requested a review from a team as a code owner November 12, 2025 21:44

sleepyStick requested review from alcaeus and removed request for a team November 12, 2025 21:44

alcaeus requested review from durran and removed request for alcaeus November 13, 2025 08:18

sleepyStick requested review from ShaneHarvey and baileympearson November 13, 2025 21:59

ShaneHarvey reviewed Nov 13, 2025

change test to use no backoff as baseline time and ensure with backoff takes longer

f40529f

sleepyStick requested a review from ShaneHarvey November 14, 2025 00:49

ShaneHarvey requested changes Nov 18, 2025

baileympearson requested changes Nov 18, 2025

sleepyStick added 2 commits November 18, 2025 16:43

address comments pt 1

56e88f0

address comments pt 2

bc9153e

sleepyStick requested review from ShaneHarvey and baileympearson November 19, 2025 01:42

baileympearson requested changes Nov 19, 2025
