-
Notifications
You must be signed in to change notification settings - Fork 1.3k
CSHARP-3550: CSOT: Server Selection #1705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
||
public TimeSpan Elapsed => _stopwatch.Elapsed; | ||
|
||
public TimeSpan Timeout { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need TWO properties:
public TimeSpan OperationTimeout { get; }
public TimeSpan StepTimeout { get; }
to keep track of BOTH kinds of timeouts.
That probably also means we need two stopwatches. One for the operation as a whole, and the other for whatever sub step we are in (server selection, etc..).
The reason I say this is that when a timeout occurs we should be able to say in the message whether the whole operation timed out or whether a sub step timed out.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting idea. What about having single timeout as is and ParentContext
property instead? Somehow I would like to keep the Context kind of immutable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a ParentContext
might be a good idea in its own right.
You still need a way to detect when a timeout occurs whether it was the OperationTimeout
or the StepTimeout
that timed out. Does a ParentContext
help you with that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pseudo-code to check which the timeout occurred:
if (operationContext.IsTimedOut())
{
var isLocalTimeout = operationContext?.ParentContext.IsTimedOut() == false;
}
So the logic is: timeout is "local" when there is no parent context or parent context is not timed out yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code works, because child context is always more strict then parent. It means there should not be a case when ParentContext is timed out, but child is not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In a way we are talkinig about the same thing, since a context represents (among other things) a timeout.
d) whole operation is timed out, but server selection is not - this is impossible state as per spec.
I still think this is possible. I just have to configure the operation timeout to be smaller than the server selection timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spec says we should use more strict timeout in such situation:
If timeoutMS is set, drivers MUST use min(serverSelectionTimeoutMS, remaining timeoutMS), referred to as computedServerSelectionTimeout as the timeout for server selection and connection checkout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we aren't talking about the same thing again.
I'm saying that the operation timeout can be less than the server selection timeout. Of course I can do that.
That's not the same thing as the "computedServerSelectionTimeout".
When the "computedServerSelectionTimeout" (NOT a configured value, but a value that is computed and is different evry time) times out, we probably need to know WHY it timed out. Did it timeout because the operation timed out or because server selection timed out? I think the error message should say why. Seems like the user would want to know which timeout timed out because they may want to know which timeout value to reconfigure if they actually didn't want to time out so quickly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
d) whole operation is timed out, but server selection is not - this is impossible state as per spec.
I don't think the spec says this is impossible. The spec describes how to calculate the "computedServerSelectionTimeout", but if the configured operation timeout is less than the configured server selection timeout and we timeout during server selection it is NOT the server selection that timed out, but the WHOLE operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it was discussed offline:
We both were talking about the same thing, but there were small confusion regarding "configured" timeout and "computed" timeout. We confirmed that we DO need to distinguish what exactly timeout was occurred: was it "local" timeout or whole operation timeout. Also we decided to go with ParentContext
approach as it seems to cover our needs and also let us have more the 2 level of nesting. To check whether it was local timeout occurred or whole operation - we have to examine the ParentContext
property and see if it was timed out or not. If there is no ParentContext
and the context is timed out - it means whole operation was timed out.
{ | ||
_stopwatch = stopwatch; | ||
Timeout = timeout; | ||
CancellationToken = cancellationToken; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend not mixing backing fields and auto-properties in the same class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
return remainingTimeout < TimeSpan.Zero; | ||
} | ||
|
||
public OperationCancellationContext WithTimeout(TimeSpan timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe should be called WithStepTimeout
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it was discussed we will not introduce additional properties for now, so will leave the method name as is to match the property name.
|
||
namespace MongoDB.Driver | ||
{ | ||
internal sealed class OperationCancellationContext |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OperationContext
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
public TimeSpan Elapsed => _stopwatch.Elapsed; | ||
|
||
public TimeSpan Timeout { get; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a ParentContext
might be a good idea in its own right.
You still need a way to detect when a timeout occurs whether it was the OperationTimeout
or the StepTimeout
that timed out. Does a ParentContext
help you with that?
@@ -47,7 +47,7 @@ private Exception CreateTimeoutException(Stopwatch stopwatch, string message) | |||
var checkOutsForOtherCount = checkOutsCount - checkOutsForCursorCount - checkOutsForTransactionCount; | |||
|
|||
message = | |||
$"Timed out after {stopwatch.ElapsedMilliseconds}ms waiting for a connection from the connection pool. " + | |||
$"Timed out after {operationContext.Elapsed.TotalMilliseconds}ms waiting for a connection from the connection pool. " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: Looks like Elapsed.TotalMilliseconds
is frequently used, does this award a shortcut property:
ElapsedMilleseconds
?
// TODO: this static field is temporary here and will be removed in a future PRs in scope of CSOT. | ||
public static readonly OperationContext NoTimeout = new(System.Threading.Timeout.InfiniteTimeSpan, CancellationToken.None); | ||
|
||
private readonly Stopwatch _stopwatch; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So we are saving the multiple stopwatches creation on operation execution path, it's minor but still nice.
src/MongoDB.Driver/Core/ConnectionPools/ExclusiveConnectionPool.Helpers.cs
Show resolved
Hide resolved
[Values(false, true)] | ||
bool async) | ||
{ | ||
var subject = CreateSubject(); | ||
var subject = CreateSubject(serverSelectionTimeout: TimeSpan.FromMilliseconds(10)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adjusted serverSelectionTimeout
is an optimization of the test: we probably do not have to wait whole 2 seconds here (the default server selection for this test class, see line 49).
@@ -134,23 +134,13 @@ await Record.ExceptionAsync(() => subject.ExecuteWriteOperationAsync(null, opera | |||
.Subject.ParamName.Should().Be("session"); | |||
} | |||
|
|||
private OperationExecutor CreateSubject(out Mock<IClusterInternal> clusterMock, out Mock<ICoreSessionHandle> implicitSessionMock) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftovers of the implicit session creation, that was factored out in one of the latest commits related to OperationExecutor refactoing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors various server selection and connection pool APIs to use a new OperationContext
abstraction instead of separate CancellationToken
and timeout TimeSpan
parameters. Key changes include:
- Replaced all
CancellationToken
(and internalStopwatch
timeout logic) withOperationContext
throughout connection pool helpers, cluster selection, and binding classes. - Updated
CreateTimeoutException
methods to accept aTimeSpan elapsed
rather than aStopwatch
. - Removed legacy overloads and helper classes associated with
CancellationToken
/TimeSpan
and simplified server selection wait logic usingOperationContext
.
Reviewed Changes
Copilot reviewed 141 out of 141 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
src/MongoDB.Driver/Core/ConnectionPools/ExclusiveConnectionPool.Helpers.cs | Swapped Stopwatch usage and token/timeouts for OperationContext |
src/MongoDB.Driver/Core/Clusters/LoadBalancedCluster.cs | Updated server selection to use OperationContext and WithTimeout |
src/MongoDB.Driver/Core/Clusters/IClusterExtensions.cs | Changed extension methods to accept OperationContext |
src/MongoDB.Driver/Core/Clusters/ICluster.cs | Updated interface signatures to use OperationContext |
src/MongoDB.Driver/Core/Clusters/Cluster.cs | Refactored internal select/wait loops to data in OperationContext |
src/MongoDB.Driver/Core/Bindings/**/*.cs | Replaced all CancellationToken parameters in binding APIs with OperationContext |
Comments suppressed due to low confidence (3)
src/MongoDB.Driver/Core/ConnectionPools/ExclusiveConnectionPool.Helpers.cs:50
- [nitpick] Using TotalMilliseconds can include fractional values (e.g., 12.345ms). Consider formatting with a fixed precision or rounding (e.g.,
{elapsed.TotalMilliseconds:F0}
) to produce a cleaner integer-based message.
message = $"Timed out after {elapsed.TotalMilliseconds}ms waiting for a connection from the connection pool. " +
src/MongoDB.Driver/Core/Clusters/LoadBalancedCluster.cs:179
- The new
WithTimeout
logic andWaitTask
behavior are significant changes to server selection. Please add unit tests that simulate both timeout and cancellation scenarios forSelectServer
/SelectServerAsync
using mocks/stubs for_serverReadyTaskCompletionSource
.
var serverSelectionOperationContext = operationContext.WithTimeout(_settings.ServerSelectionTimeout);
src/MongoDB.Driver/Core/Clusters/ICluster.cs:64
- Removing the old
CancellationToken
overloads is a breaking API change. Consider retaining the old overloads (marked obsolete) to preserve backward compatibility, or provide clear upgrade guidance in release notes.
IServer SelectServer(IServerSelector selector, OperationContext operationContext);
if (index != 0) | ||
try | ||
{ | ||
serverSelectionOperationContext.WaitTask(_serverReadyTaskCompletionSource.Task); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code explicitly threw on cancellation (cancellationToken.ThrowIfCancellationRequested()
), but this method only catches TimeoutException
. Consider catching OperationCanceledException
from WaitTask
or explicitly checking operationContext.CancellationToken
to ensure cancellations are surfaced correctly.
Copilot uses AI. Check for mistakes.
// should be pinned. | ||
var server = cluster.SelectServer(selector, cancellationToken); | ||
var server = cluster.SelectServer(selector, operationContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] We added non-null checks for selector
and operationContext
in core cluster methods; consider adding Ensure.IsNotNull(operationContext, nameof(operationContext));
here as well to maintain consistent validation.
Copilot uses AI. Check for mistakes.
{ | ||
return GetReadChannelSource(null, cancellationToken); | ||
return GetReadChannelSource(null, operationContext); | ||
} | ||
|
||
public Task<IChannelSourceHandle> GetReadChannelSourceAsync(CancellationToken cancellationToken) | ||
public Task<IChannelSourceHandle> GetReadChannelSourceAsync(OperationContext operationContext) | ||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[nitpick] Most public API methods now guard against null operationContext
. Add an Ensure.IsNotNull(operationContext, nameof(operationContext));
at the top of this method to match the pattern in cluster selection.
Copilot uses AI. Check for mistakes.
{ | ||
ThrowIfDisposed(); | ||
return Task.FromResult(GetChannelSourceHelper()); | ||
} | ||
|
||
public IChannelSourceHandle GetReadChannelSource(IReadOnlyCollection<ServerDescription> deprioritizedServers, CancellationToken cancellationToken) | ||
public IChannelSourceHandle GetReadChannelSource(IReadOnlyCollection<ServerDescription> deprioritizedServers, OperationContext operationContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think operationContext
should always be the first parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same everywhere. I stopped adding this comment after a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Here and in all other places.
@@ -145,6 +145,8 @@ public bool IsInTransaction | |||
{ | |||
EnsureAbortTransactionCanBeCalled(nameof(AbortTransaction)); | |||
|
|||
// TODO: CSOT implement proper way to obtain the operationCancellationContext | |||
var operationCancellationContext = new OperationContext(Timeout.InfiniteTimeSpan, cancellationToken); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Variable name should be operationContext
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -537,21 +545,21 @@ private void EnsureTransactionsAreSupported() | |||
} | |||
} | |||
|
|||
private TResult ExecuteEndTransactionOnPrimary<TResult>(IReadOperation<TResult> operation, CancellationToken cancellationToken) | |||
private TResult ExecuteEndTransactionOnPrimary<TResult>(IReadOperation<TResult> operation, OperationContext operationContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While it is true that cancellationToken
is always the last parameter, I think that operationContext
should always be the first parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
IChannelSourceHandle GetReadChannelSource(IReadOnlyCollection<ServerDescription> deprioritizedServers, CancellationToken cancellationToken); | ||
Task<IChannelSourceHandle> GetReadChannelSourceAsync(IReadOnlyCollection<ServerDescription> deprioritizedServers, CancellationToken cancellationToken); | ||
IChannelSourceHandle GetReadChannelSource(IReadOnlyCollection<ServerDescription> deprioritizedServers, OperationContext operationContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like operationContext
should always be the first parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -62,8 +61,8 @@ internal interface IClusterInternal : ICluster | |||
|
|||
void Initialize(); | |||
|
|||
IServer SelectServer(IServerSelector selector, CancellationToken cancellationToken); | |||
Task<IServer> SelectServerAsync(IServerSelector selector, CancellationToken cancellationToken); | |||
IServer SelectServer(IServerSelector selector, OperationContext operationContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like operationContext
should always be the first parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
} | ||
catch (TimeoutException) | ||
{ | ||
var message = BuildTimeoutExceptionMessage(_settings.ServerSelectionTimeout, selector, helper.Description); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BuildTimeoutExceptionMessage
doesn't inform the user whether the operation timed out or the server selection timed out.
@@ -29,7 +28,7 @@ public static IServer SelectServerAndPinIfNeeded( | |||
ICoreSessionHandle session, | |||
IServerSelector selector, | |||
IReadOnlyCollection<ServerDescription> deprioritizedServers, | |||
CancellationToken cancellationToken) | |||
OperationContext operationContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like operationContext
should always be the "first" parameter. In the case of an extension method that would be the "second" parameter, which "looks" like the first parameter when called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -270,27 +269,27 @@ public bool? UseCursor | |||
|
|||
// methods | |||
/// <inheritdoc/> | |||
public IAsyncCursor<TResult> Execute(IReadBinding binding, CancellationToken cancellationToken) | |||
public IAsyncCursor<TResult> Execute(IReadBinding binding, OperationContext operationContext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the operationContext
should always be the first parameter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
{ | ||
var connection = connectionCreator.CreateOpened(cancellationToken); | ||
var connection = connectionCreator.CreateOpened(operationContext); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for tiny 20ms timeout is just try to take turn in the connecting queue without blocking checkouts.
This timeout is applied to every step individually: MaxConnections and MaxConnecting.
Seems that this code changes this and allocates 20ms to
- MaxConnections
- MaxConnecting
- Whole Connection creation operation (excluding open)
If this thought is correct, it's really surprising that this was not caught by tests.
try | ||
{ | ||
StartCheckingOut(stopwatch); | ||
_poolQueueWaitResult = _pool._maxConnectionsQueue.WaitSignaled(_timeout, cancellationToken); | ||
StartCheckingOut(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like our previous ConnectionCheckedOutEvent
duration calculation was a bit different from spec:
It didn't include emitting (and processing) the CheckoutStarted
event.
Probably we should mention this nuance in release notes, as the duration will include client events processing code.
_poolQueueWaitResult switch | ||
{ | ||
SemaphoreSlimSignalable.SemaphoreWaitResult.Signaled => | ||
MongoConnectionPoolPausedException.ForConnectionPool(_pool._endPoint), | ||
SemaphoreSlimSignalable.SemaphoreWaitResult.TimedOut => | ||
_pool.CreateTimeoutException(stopwatch, $"Timed out waiting for a connection after {stopwatch.ElapsedMilliseconds}ms."), | ||
_pool.CreateTimeoutException(operationContext.Elapsed, $"Timed out waiting for a connection after {operationContext.Elapsed.TotalMilliseconds}ms."), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd get the timeout closer to WaitSignaled result. But I don't see a very simple solution for this.
_pool._poolState.ThrowIfNotReady(); | ||
|
||
if (_connectingWaitStatus == SemaphoreSlimSignalable.SemaphoreWaitResult.TimedOut) | ||
{ | ||
_pool.CreateTimeoutException(stopwatch, $"Timed out waiting for in connecting queue after {stopwatch.ElapsedMilliseconds}ms."); | ||
_pool.CreateTimeoutException(operationContext.Elapsed, $"Timed out waiting for in connecting queue after {operationContext.Elapsed.TotalMilliseconds}ms."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one is not accurate, Elapsed includes more waits than just the maxConnecting queue.
SemaphoreSlimSignalable.SemaphoreWaitResult.Entered => CreateOpenedInternal(cancellationToken), | ||
SemaphoreSlimSignalable.SemaphoreWaitResult.TimedOut => throw CreateTimeoutException(stopwatch), | ||
SemaphoreSlimSignalable.SemaphoreWaitResult.Entered => CreateOpenedInternal(operationContext), | ||
SemaphoreSlimSignalable.SemaphoreWaitResult.TimedOut => throw CreateTimeoutException(operationContext.Elapsed), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue, the exception is just for maxConnecting
queue accumulative waits.
Maybe we need to understand what is prescribed by spec here.
} | ||
|
||
[Fact] | ||
public void RenainingTimeout_could_be_negative() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
type: remaining
No description provided.