Winter cleaning for rate limiting (#147)

Our approach to rate limiting has been in need of major refactoring, and this rearranges and fixes a bunch of things: - Replace our previous `rate_limited(calls_per_second, group)` function with a `RateLimit` class. The previous implementation led to different rate limits that were intermingled in non-obvious ways. The new class gives us a more straightforward object that models a single limit, independent of any other limits, but which can be shared as needed. - Update default rate limits based on consultation with Internet Archive staff. These limits are based on what they currently use as standardized limits, but backed off to 80% of the hard limit (which they requested). - Apply rate limits directly where the requests are made (in `WaybackSession.send()`), so they reliably limit requests to the desired rate. They were previously applied in `WaybackClient` methods, which meant they didn’t account correctly for retries or redirects. - Some other minor refactorings came along for the ride, as well as starting to do type annotations. Fixes #137.
edgi-govdata-archiving · Dec 13, 2023 · 2220fc2 · 2220fc2
1 parent 019a0ab
commit 2220fc2
Show file tree

Hide file tree

Showing 8 changed files with 346 additions and 240 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -20,10 +20,10 @@ commands:
     steps:
       - restore_cache:
           keys:
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-
 
       - run:
           name: Install Dependencies
@@ -56,7 +56,7 @@ commands:
                   pip install .[docs]
 
       - save_cache:
-          key: cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
+          key: cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
           paths:
             - ~/venv
 

diff --git a/docs/source/release-history.rst b/docs/source/release-history.rst
@@ -16,13 +16,33 @@ Breaking Changes
 Features
 ^^^^^^^^
 
-- N/A
+- Rate limiting has been significantly updated under the hood. Separate sessions or clients with customized limits now operate completely independently (their limits previously overlapped in a hard-to-predict way).
+
+  If you need to make multiple :class:`wayback.WaybackSession` instances that share limits, you can create an instance of :class:`wayback.RateLimit` and pass it to each session via the ``*_calls_per_second`` arguments instead of passing an integer or float. For example:
+
+  .. code-block:: python
+
+    # Each of these sessions will be limited to 1 search request every 2
+    # seconds. During a period where only one of these sessions is in use, it
+    # will continue to operate at 1 request every 2 seconds:
+    session1 = WaybackSession(search_calls_per_second=0.5)
+    session2 = WaybackSession(search_calls_per_second=0.5)
+
+    # These sessions will be limited to 1 search request each second as a
+    # combined total. During a period where only one of these sessions is in
+    # use, it will make 1 request per second, but if both are actively in use,
+    # each will make 1 request every 2 seconds:
+    rate = RateLimit(per_second=1)
+    session1 = WaybackSession(search_calls_per_second=rate)
+    session2 = WaybackSession(search_calls_per_second=rate)
+
+  Sessions using the default limits share them via this same mechanism.
 
 
 Fixes & Maintenance
 ^^^^^^^^^^^^^^^^^^^
 
-- N/A
+- The default rate limits have been further tweaked since v0.4.4 based on closer collaboration with Internet Archive staff. Rate limits are also now more accurately applied to each individual request (they were previously applied more roughly, without respect to retries and redirects).
 
 
 v0.4.4 (2023-11-27)

diff --git a/docs/source/usage.rst b/docs/source/usage.rst
@@ -129,6 +129,8 @@ Utilities
 
 .. autoclass:: wayback.Mode
 
+.. autoclass:: wayback.RateLimit
+
 Exception Classes
 -----------------
 

diff --git a/wayback/__init__.py b/wayback/__init__.py
@@ -2,7 +2,7 @@
 __version__ = get_versions()['version']
 del get_versions
 
-from ._utils import memento_url_data  # noqa
+from ._utils import memento_url_data, RateLimit  # noqa
 
 from ._models import (  # noqa
     CdxRecord,

diff --git a/wayback/_client.py b/wayback/_client.py
diff --git a/wayback/_utils.py b/wayback/_utils.py
@@ -1,6 +1,5 @@
-from collections import defaultdict, OrderedDict
+from collections import OrderedDict
 from collections.abc import Mapping, MutableMapping
-from contextlib import contextmanager
 from datetime import date, datetime, timezone
 import email.utils
 import logging
@@ -9,6 +8,7 @@
 import requests.adapters
 import threading
 import time
+from typing import Union
 import urllib.parse
 from .exceptions import SessionClosedError
 
@@ -221,38 +221,71 @@ def set_memento_url_mode(url, mode):
     return format_memento_url(captured_url, timestamp, mode)
 
 
-_last_call_by_group = defaultdict(int)
-_rate_limit_lock = threading.Lock()
-
-
-@contextmanager
-def rate_limited(calls_per_second=1, group='default'):
+class RateLimit:
     """
-    A context manager that restricts entries to its body to occur only N times
-    per second (N can be a float). The current thread will be put to sleep in
-    order to delay calls.
+    ``RateLimit`` is a simple locking mechanism that can be used to enforce
+    rate limits and is safe to use across multiple threads. It can also be used
+    as a context manager.
+
+    Calling `rate_limit_instance.wait()` blocks until a minimum time has passed
+    since the last call. Using `with rate_limit_instance:` blocks entries to
+    the context until a minimum time since the last context entry.
 
     Parameters
     ----------
-    calls_per_second : float or int, default: 2
-        Maximum number of calls into this context allowed per second
-    group : string, default: 'default'
-        Unique name to scope rate limiting. If two contexts have different
-        `group` values, their timings will be tracked separately.
+    per_second : int or float
+        The maximum number of calls per second that are allowed. If 0, a call
+        to `wait()` will never block.
+
+    Examples
+    --------
+    Slow down a tight loop to only occur twice per second:
+
+    >>> limit = RateLimit(per_second=2)
+    >>> for x in range(10):
+    >>>     with limit:
+    >>>         print(x)
     """
-    if calls_per_second <= 0:
-        yield
-    else:
-        with _rate_limit_lock:
-            last_call = _last_call_by_group[group]
-            minimum_wait = 1.0 / calls_per_second
+    def __init__(self, per_second: Union[int, float]):
+        if not isinstance(per_second, (int, float)):
+            raise TypeError('The RateLimit per_second argument must be an int '
+                            f'or float, not {type(per_second).__name__}')
+
+        self._lock = threading.RLock()
+        self._last_call_time = 0
+        if per_second <= 0:
+            self._minimum_wait = 0
+        else:
+            self._minimum_wait = 1.0 / per_second
+
+    def wait(self) -> None:
+        if self._minimum_wait == 0:
+            return
+
+        with self._lock:
             current_time = time.time()
-            if current_time - last_call < minimum_wait:
-                seconds = minimum_wait - (current_time - last_call)
-                logger.debug('Hit %s rate limit, sleeping for %s seconds', group, seconds)
-                time.sleep(seconds)
-            _last_call_by_group[group] = time.time()
-        yield
+            idle_time = current_time - self._last_call_time
+            if idle_time < self._minimum_wait:
+                time.sleep(self._minimum_wait - idle_time)
+
+            self._last_call_time = time.time()
+
+    def __enter__(self) -> None:
+        self.wait()
+
+    def __exit__(self, type, value, traceback) -> None:
+        pass
+
+    @classmethod
+    def make_limit(cls, per_second: Union['RateLimit',  int, float]) -> 'RateLimit':
+        """
+        If the given rate is a ``RateLimit`` object, return it unchanged.
+        Otherwise, create a new ``RateLimit`` with the given rate.
+        """
+        if isinstance(per_second, cls):
+            return per_second
+        else:
+            return cls(per_second)
 
 
 class DepthCountedContext:

diff --git a/wayback/tests/test_client.py b/wayback/tests/test_client.py
@@ -792,7 +792,7 @@ def test_search_rate_limits(self):
                 next(client.search('zew.de'))
         duration_with_limits_custom = time.time() - start_time
 
-        assert 1.9 <= duration_with_limits <= 2.1
+        assert 2.4 <= duration_with_limits <= 2.6
         assert 0.0 <= duration_without_limits <= 0.05
         assert 0.0 <= duration_with_limits_custom <= 1.05
 
@@ -828,6 +828,6 @@ def test_memento_rate_limits(self):
                 client.get_memento(cdx)
         duration_with_limits_custom = time.time() - start_time
 
-        assert 0.33 <= duration_with_limits <= 0.43
+        assert 1.15 <= duration_with_limits <= 1.35
         assert 0.0 <= duration_without_limits <= 0.05
         assert 0.5 <= duration_with_limits_custom <= 0.55
diff --git a/wayback/tests/test_utils.py b/wayback/tests/test_utils.py
@@ -2,7 +2,7 @@
 import email.utils
 import pytest
 import time
-from .._utils import memento_url_data, rate_limited, parse_retry_after
+from .._utils import memento_url_data, RateLimit, parse_retry_after
 
 
 class TestMementoUrlData:
@@ -42,38 +42,46 @@ def test_raises_for_non_string_input(self):
             memento_url_data(None)
 
 
-class TestRateLimited:
+class TestRateLimit:
 
     def test_call_per_seconds(self):
-        """Test that the rate limit is accurately applied.
-        It also checks that two rate limits applied sequentially do not interfere with another."""
+        """
+        Test that the rate limit is accurately applied. It also checks that two
+        rate limits applied sequentially do not interfere with another.
+        """
+        limit3 = RateLimit(per_second=3)
+        limit1 = RateLimit(per_second=1)
+        limit0 = RateLimit(per_second=0)
+
         start_time = time.time()
         for i in range(4):
-            with rate_limited(calls_per_second=3, group='cps1'):
+            with limit3:
                 pass
         assert 1.0 <= time.time() - start_time <= 1.1
 
         start_time = time.time()
         for i in range(3):
-            with rate_limited(calls_per_second=1, group='cps2'):
+            with limit1:
                 pass
         assert 2.0 <= time.time() - start_time <= 2.1
 
         start_time = time.time()
         for i in range(3):
-            with rate_limited(calls_per_second=0, group='cps3'):
+            with limit0:
                 pass
         assert 0 <= time.time() - start_time <= 0.1
 
     def test_simultaneous_ratelimits(self):
         """Check that multiple rate limits do not interfere with another."""
+        limit1 = RateLimit(per_second=1)
+        limit3 = RateLimit(per_second=3)
         start_time = time.time()
         # The first loop should take 1 second, as it waits on the sim1 lock,
         # the second loop 0.66 seconds, since it waits twice on sim2.
         for i in range(2):
-            with rate_limited(calls_per_second=1, group='sim1'):
+            with limit1:
                 for j in range(3):
-                    with rate_limited(calls_per_second=3, group='sim2'):
+                    with limit3:
                         pass
         assert 1.66 <= time.time() - start_time <= 1.7
-Original file line number
+Diff line change
@@ Expand Up / @@ -129,6 +129,8 @@ Utilities @@
     .. autoclass:: wayback.Mode
+    .. autoclass:: wayback.RateLimit
     Exception Classes
     -----------------
@@ Expand Down @@