PYTHON-5536 Avoid clearing the connection pool when the server connection rate limiter triggers #2509
base: backpressure
Conversation
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…b#2507) Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…ction rate limiter triggers
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Steven Silvester <[email protected]>
pymongo/asynchronous/pool.py
Outdated
    conn.conn.get_conn.read(1)
except Exception as _:
    # TODO: verify the exception
    close_conn = False
2 comments:
- I believe this logic needs to move to connection checkout. Here in connection check-in, we already know the connection is usable because we're checking it back in after a successful command.
- Instead of a 1 ms read, can we reuse the existing _perished() + conn_closed() methods?
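For illustration, this is the kind of liveness probe a `_perished()`/`conn_closed()`-style check performs: a non-blocking one-byte peek that distinguishes "peer closed" from "healthy but idle". The helper name and exact semantics here are a sketch, not pymongo's actual implementation.

```python
import socket


def conn_closed(sock: socket.socket) -> bool:
    """Return True if the peer has closed the connection.

    A non-blocking 1-byte MSG_PEEK read returns b"" at EOF (peer
    closed) and raises BlockingIOError when the socket is healthy
    but has no pending data.
    """
    try:
        sock.settimeout(0)  # switch to non-blocking mode
        data = sock.recv(1, socket.MSG_PEEK)
        return data == b""  # EOF: the peer closed the connection
    except BlockingIOError:
        return False  # no data pending; connection is alive
    except OSError:
        return True  # any other socket error: treat as perished
    finally:
        sock.settimeout(None)  # restore blocking mode
```

Because MSG_PEEK does not consume bytes, the check is safe to run on a connection that may have a response in flight.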
Done
Nice work!
(cherry picked from commit 0d4c84e)
…MiB error codes (mongodb#2515) (cherry picked from commit c0e0554)
This reverts commit 532c1b8.
pymongo/asynchronous/pool.py
Outdated
if not self.is_sdam and type(e) == AutoReconnect:
    self._backoff += 1
    e._add_error_label("SystemOverloaded")
    e._add_error_label("Retryable")
We need to move this logic so that it covers the TCP+TLS handshake, which happens above.
I set a breakpoint in the TCP+TLS handshake error handler and confirmed that handshakes are succeeding. The error only occurs on hello/auth.
Okay, I'm actually surprised by this, since the design SPM-4319 indicates the rate limiter rejection happens before the TLS handshake.
Ideally we'd like to detect
else:
    if self._closing_exception:
        raise self._closing_exception
    if self._closed.done():
Is calling is_closing here better? It'll catch more edge cases in theory.
Hmm let me try that.
No, it is ambiguous as to whether connection_lost has been called yet. Since connection_lost is synchronous, checking for self._closed.done() assures that we have actually lost the connection.
):
    self._backoff += 1
    error._add_error_label("SystemOverloaded")
    error._add_error_label("Retryable")
Could you merge backpressure? Originally I added the incorrect labels here; they should be "SystemOverloadedError" and "RetryableError".
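The exact label strings matter because retry logic branches on them. A minimal sketch of label-based handling, using a hypothetical exception class and helper (pymongo's real exceptions expose `has_error_label()`, but everything else here is illustrative):

```python
class OverloadError(Exception):
    """Hypothetical error carrying error labels, for illustration."""

    def __init__(self, message: str) -> None:
        super().__init__(message)
        self._error_labels: set[str] = set()

    def _add_error_label(self, label: str) -> None:
        self._error_labels.add(label)

    def has_error_label(self, label: str) -> bool:
        return label in self._error_labels


def should_backoff_and_retry(error: Exception) -> bool:
    # The checks use the full label strings, so attaching the shorter
    # "SystemOverloaded"/"Retryable" forms would never match.
    return (
        isinstance(error, OverloadError)
        and error.has_error_label("SystemOverloadedError")
        and error.has_error_label("RetryableError")
    )
```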
self._backoff += 1
error._add_error_label("SystemOverloaded")
error._add_error_label("Retryable")
print(f"Setting backoff in {phase}:", self._backoff)  # noqa: T201
Instead of inspecting the error message after the fact, is it possible we can record some state to determine if the error happened during DNS+TCP or after? Like:
# Assume all non-DNS/TCP/timeout errors mean the server rejected the connection due to overload.
if not errorDuringDnsTcp and not timeoutError:
    error._add_error_label("SystemOverloadedError")
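One way to realize the sketch above is to tag each failure with the establishment phase it occurred in and classify from that state rather than from the message text. The function and phase names below are assumptions for illustration, not pymongo's API:

```python
import socket


def classify_connect_error(phase: str, error: Exception) -> set[str]:
    """Map a connection-establishment failure to error labels.

    `phase` names where the failure happened; "dns", "tcp", "tls",
    "hello", and "auth" are assumed phase names for this sketch.
    Per the comment above, any non-DNS/TCP, non-timeout error is
    presumed to mean the server rejected the connection under
    overload.
    """
    is_timeout = isinstance(error, (socket.timeout, TimeoutError))
    if phase in ("dns", "tcp") or is_timeout:
        return set()  # infrastructure failure, not server load-shedding
    return {"SystemOverloadedError", "RetryableError"}
```

Recording the phase once at each step avoids brittle string matching on the resulting exception.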
Currently testing with this script for async:
and this one for sync: