@@ -284,6 +284,12 @@ Endpoint. The pool has the following properties:
- **Rate-limited:** A Pool MUST limit the number of [Connections](#connection) being
[established](#establishing-a-connection-internal-implementation) concurrently via the **maxConnecting**
[pool option](#connection-pool-options).
- **Backpressure-enabled:** A Pool MUST add the error labels `SystemOverloadedError` and `RetryableError` to network
errors or network timeouts it encounters during connection establishment or the `hello` message. These labels are
used by
[SDAM error handling](../server-discovery-and-monitoring/server-discovery-and-monitoring.md#error-handling-pseudocode)
to avoid clearing the pool. The pool MUST NOT add the backpressure error labels to errors encountered during the
authentication step that follows the `hello` message.
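The labeling rule above can be sketched in Python. This is a minimal sketch under stated assumptions, not a driver implementation: `HandshakePhase`, `EstablishmentError`, and `apply_backpressure_labels` are hypothetical names, and the error type simply records whether it was a network error or a network timeout. It shows the two conditions that must both hold before the labels are added: the failure happened during connection establishment or the `hello` message, and it was a network error or timeout.

```python
from enum import Enum, auto

# Labels named by the "Backpressure-enabled" pool requirement.
BACKPRESSURE_LABELS = ("SystemOverloadedError", "RetryableError")


class HandshakePhase(Enum):
    """Hypothetical phases of establishing a pooled connection."""
    CONNECTION_ESTABLISHMENT = auto()  # socket (and TLS) setup
    HELLO = auto()                     # initial hello / legacy hello exchange
    AUTHENTICATION = auto()            # optional auth step after hello


class EstablishmentError(Exception):
    """Hypothetical error type that carries driver error labels."""

    def __init__(self, message, *, network_error=False, network_timeout=False):
        super().__init__(message)
        self.network_error = network_error
        self.network_timeout = network_timeout
        self.labels = set()

    def has_label(self, label):
        return label in self.labels


def apply_backpressure_labels(error, phase):
    """Add the backpressure labels when the rule requires them; return the error."""
    if phase is HandshakePhase.AUTHENTICATION:
        # The pool MUST NOT label errors from the auth step after hello.
        return error
    if error.network_error or error.network_timeout:
        error.labels.update(BACKPRESSURE_LABELS)
    return error
```

A pool following this sketch would call `apply_backpressure_labels` just before surfacing an establishment failure, so that SDAM error handling sees the labels on the error it receives.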

```typescript
interface ConnectionPool {
  // ...
}
```

@@ -1375,6 +1381,8 @@ to close and remove from its pool a [Connection](#connection) which has unread e

## Changelog

- 2025-XX-YY: Add handling of backpressure error labels.

- 2025-01-22: Clarify durationMS in logs may be Int32/Int64/Double.

- 2024-11-27: Relaxed the WaitQueue fairness requirement.

@@ -11,7 +11,7 @@ failPoint:

```yaml
  mode: { times: 50 }
  data:
    failCommands: ["isMaster","hello"]
    errorCode: 91
    appName: "poolCreateMinSizeErrorTest"
poolOptions:
  minPoolSize: 1
```
@@ -172,3 +172,39 @@ This test requires failCommand appName support which is only available in MongoD
5. Then verify that a ServerHeartbeatSucceededEvent and a ConnectionPoolReadyEvent (CMAP) are emitted.

6. Disable the failpoint.

## Connection Pool Backpressure

This test will be used to ensure that connection establishment failures during the TLS handshake do not result in a pool
clear event. We create a setup client to enable the ingress connection establishment rate limiter, and then induce a
connection storm. After the storm, we verify that some of the connections failed to checkout, but that the pool was not
cleared.

This test requires MongoDB 7.0+.

1. Create a setup client and run the following commands to set up the rate limiter:

```python
db = admin_client.admin
db.command("setParameter", 1, ingressConnectionEstablishmentRateLimiterEnabled=True)
db.command("setParameter", 1, ingressConnectionEstablishmentRatePerSec=20)
db.command("setParameter", 1, ingressConnectionEstablishmentBurstCapacitySecs=1)
db.command("setParameter", 1, ingressConnectionEstablishmentMaxQueueDepth=1)
```

2. Create a separate client that listens to CMAP events, with maxConnecting=100.

3. Add a document to the test collection so that the sleep operations will actually block:
`client.test.test.insert_one({})`.

4. Run the following find command on the collection in 10 parallel threads/coroutines:
`client.test.test.find_one({"$where": "function() { sleep(2000); return true; }"})`

5. Run the same find command on the collection in 100 parallel threads/coroutines.

6. Assert that at least 10 ConnectionCheckOutFailedEvents occurred.

7. Assert that 0 PoolClearedEvents occurred.

8. Ensure that the following command runs at test teardown even if the test fails:
`admin_client.admin.command("setParameter", 1, ingressConnectionEstablishmentRateLimiterEnabled=False)`.
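Steps 1 and 4 through 8 can be sketched as a few Python test helpers. This is a hedged sketch, not a reference implementation: `ingress_rate_limiter`, `CmapEventCounter`, `run_in_parallel`, and `assert_backpressure_outcome` are hypothetical names, and `admin_db` is assumed to be any object with a PyMongo-style `command(name, value, **kwargs)` method, such as `admin_client.admin`. With PyMongo, a `pymongo.monitoring.ConnectionPoolListener` subclass could call `record` from its `connection_check_out_failed` and `pool_cleared` callbacks.

```python
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

# Values from step 1.
RATE_LIMITER_SETTINGS = {
    "ingressConnectionEstablishmentRatePerSec": 20,
    "ingressConnectionEstablishmentBurstCapacitySecs": 1,
    "ingressConnectionEstablishmentMaxQueueDepth": 1,
}


@contextmanager
def ingress_rate_limiter(admin_db):
    """Enable the rate limiter (step 1) and always disable it again (step 8)."""
    try:
        admin_db.command("setParameter", 1, ingressConnectionEstablishmentRateLimiterEnabled=True)
        for name, value in RATE_LIMITER_SETTINGS.items():
            admin_db.command("setParameter", 1, **{name: value})
        yield
    finally:
        # Runs even if setup or the test body raises.
        admin_db.command("setParameter", 1, ingressConnectionEstablishmentRateLimiterEnabled=False)


class CmapEventCounter:
    """Thread-safe tally of CMAP event names (hypothetical helper)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = Counter()

    def record(self, event_name):
        with self._lock:
            self._counts[event_name] += 1

    def count(self, event_name):
        with self._lock:
            return self._counts[event_name]


def run_in_parallel(operation, n):
    """Run `operation` n times concurrently and return the exceptions raised."""
    with ThreadPoolExecutor(max_workers=n) as executor:
        futures = [executor.submit(operation) for _ in range(n)]
    # Leaving the `with` block waits for every future to finish.
    return [f.exception() for f in futures if f.exception() is not None]


def assert_backpressure_outcome(counter):
    """Steps 6 and 7: enough check-out failures, but no pool clears."""
    failed = counter.count("ConnectionCheckOutFailedEvent")
    cleared = counter.count("PoolClearedEvent")
    assert failed >= 10, f"expected at least 10 check-out failures, got {failed}"
    assert cleared == 0, f"expected no PoolClearedEvents, got {cleared}"
```

Wrapping steps 2 through 7 in `with ingress_rate_limiter(admin_client.admin):` gives the teardown guarantee of step 8 without a separate fixture.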
@@ -434,18 +434,18 @@ correspond to [replica set member states](https://www.mongodb.com/docs/manual/re
some replica set member states like STARTUP and RECOVERING are identical from the client's perspective, so they are
merged into "RSOther". Additionally, states like Standalone and Mongos are not replica set member states at all.

| State | Symptoms |
| --------------- | -------------------------------------------------------------------------------------------------------- |
| Unknown | Initial, or after a failed hello or legacy hello call, or "ok: 1" not in hello or legacy hello response. |
| Standalone | No "msg: isdbgrid", no setName, and no "isreplicaset: true". |
| Mongos | "msg: isdbgrid". |
| PossiblePrimary | Not yet checked, but another member thinks it is the primary. |
| RSPrimary | "isWritablePrimary: true" or "ismaster: true", "setName" in response. |
| RSSecondary | "secondary: true", "setName" in response. |
| RSArbiter | "arbiterOnly: true", "setName" in response. |
| RSOther | "setName" in response, "hidden: true" or not primary, secondary, nor arbiter. |
| RSGhost | "isreplicaset: true" in response. |
| LoadBalanced | "loadBalanced=true" in URI. |

A server can transition from any state to any other. For example, an administrator could shut down a secondary and bring
up a mongos in its place.
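The state table can be read as an ordered series of checks on a hello or legacy hello response. A minimal sketch, assuming the response is a Python dict (or `None` when the call failed) and using the hypothetical function `classify_server`; PossiblePrimary is omitted because it comes from another member's response rather than the server's own. The check order is one reasonable reading of the table: `hidden: true` is tested before `secondary` so that a hidden member that also reports `secondary: true` lands in RSOther.

```python
from typing import Any, Mapping, Optional


def classify_server(response: Optional[Mapping[str, Any]], load_balanced_uri: bool = False) -> str:
    """Map a hello / legacy hello response to a state name from the table (sketch)."""
    if load_balanced_uri:
        # "loadBalanced=true" in the URI; the response is not inspected.
        return "LoadBalanced"
    if response is None or response.get("ok") != 1:
        # Initial state, a failed call, or "ok: 1" missing from the response.
        return "Unknown"
    if response.get("msg") == "isdbgrid":
        return "Mongos"
    if "setName" in response:
        if response.get("hidden"):
            return "RSOther"
        if response.get("isWritablePrimary") or response.get("ismaster"):
            return "RSPrimary"
        if response.get("secondary"):
            return "RSSecondary"
        if response.get("arbiterOnly"):
            return "RSArbiter"
        # Has setName but is not primary, secondary, nor arbiter.
        return "RSOther"
    if response.get("isreplicaset"):
        return "RSGhost"
    return "Standalone"
```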
@@ -1056,6 +1056,9 @@ def handleError(error):
if isNotWritablePrimary(error):
check failing server
elif isNetworkError(error) or (not error.completedHandshake and (isNetworkTimeout(error) or isAuthError(error))):
# Ignore errors that have a backpressure error label applied.
if error.hasLabel("SystemOverloadedError"):
return
if type != LoadBalanced:
# Mark the server Unknown
unknown = new ServerDescription(type=Unknown, error=error)
@@ -1139,16 +1142,20 @@ errors, network timeout errors, state change errors, and authentication errors.

##### Network error when reading or writing

To describe how the client responds to network errors during application operations, we distinguish three phases of
connecting to a server and using it for application operations:

- *Connection establishment and hello*: the client establishes a new connection to the server and completes an initial
handshake by calling "hello" or legacy hello and reading the response
- *Authentication step*: the client optionally completes an authentication step
- *After the handshake completes*: the client uses the established connection for application operations

If there is a network error or timeout during connection establishment or the hello, the client MUST NOT change the
server's description.

If there is a network error or timeout during the authentication step, the client MUST replace the server's
description with a default ServerDescription of type Unknown when the TopologyType is not LoadBalanced, and fill the
ServerDescription's error field with useful information.

If there is a network error or timeout on the connection before the handshake completes, and the TopologyType is
LoadBalanced, the client MUST keep the ServerDescription as LoadBalancer.
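The rules for the phases before the handshake completes reduce to a small decision function. A minimal sketch, assuming server and topology types are plain strings and using the hypothetical names `FailurePhase` and `server_type_after_handshake_network_error`; it decides only the resulting ServerType and leaves out filling the error field and the post-handshake phase.

```python
from enum import Enum, auto


class FailurePhase(Enum):
    """Hypothetical names for the two phases before the handshake completes."""
    ESTABLISHMENT_OR_HELLO = auto()
    AUTHENTICATION = auto()


def server_type_after_handshake_network_error(current_type, topology_type, phase):
    """Return the ServerType to record after a network error or timeout
    that occurs before the handshake completes (sketch)."""
    if phase is FailurePhase.ESTABLISHMENT_OR_HELLO:
        # MUST NOT change the server's description.
        return current_type
    if topology_type == "LoadBalanced":
        # MUST keep the ServerDescription as LoadBalancer.
        return "LoadBalancer"
    # Authentication step outside LoadBalanced mode: default Unknown description.
    return "Unknown"
```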
@@ -1255,9 +1262,10 @@ and [other transient errors](#other-transient-errors) and

##### Authentication and Handshake errors

If the driver encounters an error while establishing an application connection (this includes the initial handshake
and authentication) and that error does not carry the backpressure error label (`SystemOverloadedError`), the driver
MUST mark the server Unknown and clear the server's connection pool if the TopologyType is not LoadBalanced. (See
[Why mark a server Unknown after an auth error?](#why-mark-a-server-unknown-after-an-auth-error))
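The rule above can be sketched as a state update. A minimal sketch, assuming error labels arrive as a set of strings and using the hypothetical names `ServerState` and `on_connection_establishment_error`; clearing the pool is modeled as incrementing a pool generation counter, which is an illustrative simplification rather than a full pool clear.

```python
from dataclasses import dataclass

BACKPRESSURE_LABEL = "SystemOverloadedError"


@dataclass
class ServerState:
    """Hypothetical per-server state: its ServerType and pool generation."""
    server_type: str
    pool_generation: int = 0


def on_connection_establishment_error(state, error_labels, topology_type):
    """Apply the handshake-error rule to `state` in place and return it."""
    if BACKPRESSURE_LABEL in error_labels:
        # Backpressure: leave both the description and the pool untouched.
        return state
    if topology_type != "LoadBalanced":
        state.server_type = "Unknown"
        # Model "clear the server's connection pool" as a generation bump.
        state.pool_generation += 1
    return state
```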

### Monitoring SDAM events

@@ -2027,6 +2035,8 @@ oversaw the specification process.
- 2025-01-22: Add error messages when a new primary is elected or a primary with a stale electionId or setVersion is
discovered.

- 2025-XX-YY: Add handling of backpressure error labels.

______________________________________________________________________

[^1]: "localThresholdMS" was called "secondaryAcceptableLatencyMS" in the Read Preferences Spec, before it was superseded