Winter cleaning for rate limiting (#147)
Our approach to rate limiting has been in need of major refactoring, and this rearranges and fixes a bunch of things:

- Replace our previous `rate_limited(calls_per_second, group)` function with a `RateLimit` class. The previous implementation led to different rate limits that were intermingled in non-obvious ways. The new class gives us a more straightforward object that models a single limit, independent of any other limits, but which can be shared as needed.

- Update default rate limits based on consultation with Internet Archive staff. These limits are based on what they currently use as standardized limits, but backed off to 80% of the hard limit (which they requested).

- Apply rate limits directly where the requests are made (in `WaybackSession.send()`), so they reliably limit requests to the desired rate. They were previously applied in `WaybackClient` methods, which meant they didn’t account correctly for retries or redirects.

- Some other minor refactorings came along for the ride, along with a first pass at adding type annotations.

Fixes #137.
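
The third point above (limiting in `send()` rather than in the client methods) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from `wayback/_client.py`, whose diff is not rendered on this page. `SketchRateLimit`, `SketchSession`, the `failures_before_success` parameter, and the simulated failures are all hypothetical stand-ins; only the waiting logic is modeled on the `RateLimit` class added in `wayback/_utils.py`.

```python
import threading
import time


class SketchRateLimit:
    """Hypothetical stand-in modeled on the RateLimit class in this commit."""

    def __init__(self, per_second):
        self._lock = threading.RLock()
        self._last_call_time = 0.0
        self._minimum_wait = 1.0 / per_second if per_second > 0 else 0.0

    def wait(self):
        if self._minimum_wait == 0:
            return
        with self._lock:
            idle_time = time.time() - self._last_call_time
            if idle_time < self._minimum_wait:
                time.sleep(self._minimum_wait - idle_time)
            self._last_call_time = time.time()


class SketchSession:
    """Hypothetical session whose low-level send() is the rate-limited step."""

    def __init__(self, limit, failures_before_success=2):
        self.limit = limit
        self.failures_before_success = failures_before_success
        self.send_times = []

    def send(self, url):
        # Every attempt passes through here, so retries are limited too.
        self.limit.wait()
        self.send_times.append(time.time())
        # Simulate a server that fails the first few attempts.
        return len(self.send_times) > self.failures_before_success

    def request(self, url, retries=3):
        # One logical request may turn into several send() calls.
        for _ in range(retries + 1):
            if self.send(url):
                return True
        return False


session = SketchSession(SketchRateLimit(per_second=20))
succeeded = session.request('https://example.org/')
gaps = [later - earlier
        for earlier, later in zip(session.send_times, session.send_times[1:])]
print(succeeded, len(session.send_times), [round(gap, 3) for gap in gaps])
```

Had the limit been applied once around `request()` instead, the two retries in this sketch would have gone out back to back, which is the kind of under-counting the commit message describes.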
Mr0grog authored Dec 13, 2023
1 parent 019a0ab commit 2220fc2
Showing 8 changed files with 346 additions and 240 deletions.
10 changes: 5 additions & 5 deletions .circleci/config.yml
@@ -20,10 +20,10 @@ commands:
     steps:
       - restore_cache:
           keys:
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-
-            - cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-
+            - cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-

       - run:
           name: Install Dependencies
@@ -56,7 +56,7 @@ commands:
             pip install .[docs]
       - save_cache:
-          key: cache_v7-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
+          key: cache_v9-wayback-<< parameters.python-version >>-{{ arch }}-{{ checksum "requirements.txt" }}-{{ checksum "requirements-dev.txt" }}-{{ checksum "requirements-docs.txt" }}
           paths:
             - ~/venv

24 changes: 22 additions & 2 deletions docs/source/release-history.rst
@@ -16,13 +16,33 @@ Breaking Changes
 Features
 ^^^^^^^^

-- N/A
+- Rate limiting has been significantly updated under the hood. Separate sessions or clients with customized limits now operate completely independently (their limits previously overlapped in a hard-to-predict way).
+
+  If you need to make multiple :class:`wayback.WaybackSession` instances that share limits, you can create an instance of :class:`wayback.RateLimit` and pass it to each session via the ``*_calls_per_second`` arguments instead of passing an integer or float. For example:
+
+  .. code-block:: python
+
+     # Each of these sessions will be limited to 1 search request every 2
+     # seconds. During a period where only one of these sessions is in use, it
+     # will continue to operate at 1 request every 2 seconds:
+     session1 = WaybackSession(search_calls_per_second=0.5)
+     session2 = WaybackSession(search_calls_per_second=0.5)
+
+     # These sessions will be limited to 1 search request each second as a
+     # combined total. During a period where only one of these sessions is in
+     # use, it will make 1 request per second, but if both are actively in use,
+     # each will make 1 request every 2 seconds:
+     rate = RateLimit(per_second=1)
+     session1 = WaybackSession(search_calls_per_second=rate)
+     session2 = WaybackSession(search_calls_per_second=rate)
+
+  Sessions using the default limits share them via this same mechanism.


 Fixes & Maintenance
 ^^^^^^^^^^^^^^^^^^^

-- N/A
+- The default rate limits have been further tweaked since v0.4.4 based on closer collaboration with Internet Archive staff. Rate limits are also now more accurately applied to each individual request (they were previously applied more roughly, without respect to retries and redirects).


 v0.4.4 (2023-11-27)
2 changes: 2 additions & 0 deletions docs/source/usage.rst
@@ -129,6 +129,8 @@ Utilities

 .. autoclass:: wayback.Mode

+.. autoclass:: wayback.RateLimit
+
 Exception Classes
 -----------------

2 changes: 1 addition & 1 deletion wayback/__init__.py
@@ -2,7 +2,7 @@
 __version__ = get_versions()['version']
 del get_versions

-from ._utils import memento_url_data  # noqa
+from ._utils import memento_url_data, RateLimit  # noqa

 from ._models import (  # noqa
     CdxRecord,
429 changes: 236 additions & 193 deletions wayback/_client.py

Large diffs are not rendered by default.

89 changes: 61 additions & 28 deletions wayback/_utils.py
@@ -1,6 +1,5 @@
-from collections import defaultdict, OrderedDict
+from collections import OrderedDict
 from collections.abc import Mapping, MutableMapping
-from contextlib import contextmanager
 from datetime import date, datetime, timezone
 import email.utils
 import logging
@@ -9,6 +8,7 @@
 import requests.adapters
 import threading
 import time
+from typing import Union
 import urllib.parse
 from .exceptions import SessionClosedError

@@ -221,38 +221,71 @@ def set_memento_url_mode(url, mode):
     return format_memento_url(captured_url, timestamp, mode)


-_last_call_by_group = defaultdict(int)
-_rate_limit_lock = threading.Lock()
-
-
-@contextmanager
-def rate_limited(calls_per_second=1, group='default'):
+class RateLimit:
     """
-    A context manager that restricts entries to its body to occur only N times
-    per second (N can be a float). The current thread will be put to sleep in
-    order to delay calls.
+    ``RateLimit`` is a simple locking mechanism that can be used to enforce
+    rate limits and is safe to use across multiple threads. It can also be used
+    as a context manager.
+
+    Calling `rate_limit_instance.wait()` blocks until a minimum time has passed
+    since the last call. Using `with rate_limit_instance:` blocks entries to
+    the context until a minimum time since the last context entry.

     Parameters
     ----------
-    calls_per_second : float or int, default: 2
-        Maximum number of calls into this context allowed per second
-    group : string, default: 'default'
-        Unique name to scope rate limiting. If two contexts have different
-        `group` values, their timings will be tracked separately.
+    per_second : int or float
+        The maximum number of calls per second that are allowed. If 0, a call
+        to `wait()` will never block.
+
+    Examples
+    --------
+    Slow down a tight loop to only occur twice per second:
+
+    >>> limit = RateLimit(per_second=2)
+    >>> for x in range(10):
+    >>>     with limit:
+    >>>         print(x)
     """
-    if calls_per_second <= 0:
-        yield
-    else:
-        with _rate_limit_lock:
-            last_call = _last_call_by_group[group]
-            minimum_wait = 1.0 / calls_per_second
+    def __init__(self, per_second: Union[int, float]):
+        if not isinstance(per_second, (int, float)):
+            raise TypeError('The RateLimit per_second argument must be an int '
+                            f'or float, not {type(per_second).__name__}')
+
+        self._lock = threading.RLock()
+        self._last_call_time = 0
+        if per_second <= 0:
+            self._minimum_wait = 0
+        else:
+            self._minimum_wait = 1.0 / per_second
+
+    def wait(self) -> None:
+        if self._minimum_wait == 0:
+            return
+
+        with self._lock:
             current_time = time.time()
-            if current_time - last_call < minimum_wait:
-                seconds = minimum_wait - (current_time - last_call)
-                logger.debug('Hit %s rate limit, sleeping for %s seconds', group, seconds)
-                time.sleep(seconds)
-            _last_call_by_group[group] = time.time()
-        yield
+            idle_time = current_time - self._last_call_time
+            if idle_time < self._minimum_wait:
+                time.sleep(self._minimum_wait - idle_time)
+
+            self._last_call_time = time.time()
+
+    def __enter__(self) -> None:
+        self.wait()
+
+    def __exit__(self, type, value, traceback) -> None:
+        pass
+
+    @classmethod
+    def make_limit(cls, per_second: Union['RateLimit', int, float]) -> 'RateLimit':
+        """
+        If the given rate is a ``RateLimit`` object, return it unchanged.
+        Otherwise, create a new ``RateLimit`` with the given rate.
+        """
+        if isinstance(per_second, cls):
+            return per_second
+        else:
+            return cls(per_second)


 class DepthCountedContext:
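
As a rough illustration of how `make_limit()` lets an argument accept either a number or a shared `RateLimit`, here is a minimal sketch. The `RateLimit` below is a trimmed copy of the class in the diff above (no locking or waiting), and `SketchSession` is a hypothetical consumer; the real `WaybackSession` constructor lives in `wayback/_client.py`, whose diff is not rendered on this page, so its exact use of `make_limit()` is an assumption.

```python
from typing import Union


class RateLimit:
    """Trimmed copy of the class in this commit: construction and coercion only."""

    def __init__(self, per_second: Union[int, float]):
        if not isinstance(per_second, (int, float)):
            raise TypeError('The RateLimit per_second argument must be an int '
                            f'or float, not {type(per_second).__name__}')
        self._minimum_wait = 1.0 / per_second if per_second > 0 else 0

    @classmethod
    def make_limit(cls, per_second: Union['RateLimit', int, float]) -> 'RateLimit':
        # An existing limit is passed through unchanged so it can be shared.
        if isinstance(per_second, cls):
            return per_second
        return cls(per_second)


class SketchSession:
    """Hypothetical consumer, not the real WaybackSession."""

    def __init__(self, search_calls_per_second: Union[RateLimit, int, float] = 1):
        self.search_limit = RateLimit.make_limit(search_calls_per_second)


# Plain numbers produce a separate limit object per session.
independent_a = SketchSession(search_calls_per_second=0.5)
independent_b = SketchSession(search_calls_per_second=0.5)
print(independent_a.search_limit is independent_b.search_limit)  # False

# Passing one RateLimit object makes both sessions wait on the same limit.
shared = RateLimit(per_second=1)
shared_a = SketchSession(search_calls_per_second=shared)
shared_b = SketchSession(search_calls_per_second=shared)
print(shared_a.search_limit is shared_b.search_limit)  # True
```

Because each limit object carries its own state, two sessions interact only when they hold the same object, which matches the independence described in the release notes.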
4 changes: 2 additions & 2 deletions wayback/tests/test_client.py
@@ -792,7 +792,7 @@ def test_search_rate_limits(self):
             next(client.search('zew.de'))
             duration_with_limits_custom = time.time() - start_time

-        assert 1.9 <= duration_with_limits <= 2.1
+        assert 2.4 <= duration_with_limits <= 2.6
         assert 0.0 <= duration_without_limits <= 0.05
         assert 0.0 <= duration_with_limits_custom <= 1.05

@@ -828,6 +828,6 @@ def test_memento_rate_limits(self):
             client.get_memento(cdx)
             duration_with_limits_custom = time.time() - start_time

-        assert 0.33 <= duration_with_limits <= 0.43
+        assert 1.15 <= duration_with_limits <= 1.35
         assert 0.0 <= duration_without_limits <= 0.05
         assert 0.5 <= duration_with_limits_custom <= 0.55
26 changes: 17 additions & 9 deletions wayback/tests/test_utils.py
@@ -2,7 +2,7 @@
 import email.utils
 import pytest
 import time
-from .._utils import memento_url_data, rate_limited, parse_retry_after
+from .._utils import memento_url_data, RateLimit, parse_retry_after


 class TestMementoUrlData:
@@ -42,38 +42,46 @@ def test_raises_for_non_string_input(self):
             memento_url_data(None)


-class TestRateLimited:
+class TestRateLimit:

     def test_call_per_seconds(self):
-        """Test that the rate limit is accurately applied.
-        It also checks that two rate limits applied sequentially do not interfere with another."""
+        """
+        Test that the rate limit is accurately applied. It also checks that two
+        rate limits applied sequentially do not interfere with another.
+        """
+        limit3 = RateLimit(per_second=3)
+        limit1 = RateLimit(per_second=1)
+        limit0 = RateLimit(per_second=0)

         start_time = time.time()
         for i in range(4):
-            with rate_limited(calls_per_second=3, group='cps1'):
+            with limit3:
                 pass
         assert 1.0 <= time.time() - start_time <= 1.1

         start_time = time.time()
         for i in range(3):
-            with rate_limited(calls_per_second=1, group='cps2'):
+            with limit1:
                 pass
         assert 2.0 <= time.time() - start_time <= 2.1

         start_time = time.time()
         for i in range(3):
-            with rate_limited(calls_per_second=0, group='cps3'):
+            with limit0:
                 pass
         assert 0 <= time.time() - start_time <= 0.1

     def test_simultaneous_ratelimits(self):
         """Check that multiple rate limits do not interfere with another."""
+        limit1 = RateLimit(per_second=1)
+        limit3 = RateLimit(per_second=3)
         start_time = time.time()
         # The first loop should take 1 second, as it waits on the sim1 lock,
         # the second loop 0.66 seconds, since it waits twice on sim2.
         for i in range(2):
-            with rate_limited(calls_per_second=1, group='sim1'):
+            with limit1:
                 for j in range(3):
-                    with rate_limited(calls_per_second=3, group='sim2'):
+                    with limit3:
                         pass
         assert 1.66 <= time.time() - start_time <= 1.7
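
The bounds asserted in these unit tests follow from simple arithmetic: the first entry into a fresh limit never waits, so N sequential entries take roughly (N - 1) / per_second seconds. The sketch below reproduces that reasoning with a virtual clock so it runs instantly. `FakeClock` and `ModelRateLimit` are hypothetical names; the model copies only the waiting logic of `RateLimit` from the diff above and leaves out the lock.

```python
class FakeClock:
    """Hypothetical virtual clock: sleeping just advances the current time."""

    def __init__(self):
        # Start well past zero, like a real timestamp, so first calls never wait.
        self.now = 1000.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


class ModelRateLimit:
    """Single-threaded model of RateLimit's waiting logic (no lock)."""

    def __init__(self, per_second, clock):
        self._clock = clock
        self._last_call_time = 0
        self._minimum_wait = 1.0 / per_second if per_second > 0 else 0

    def __enter__(self):
        if self._minimum_wait == 0:
            return
        idle_time = self._clock.time() - self._last_call_time
        if idle_time < self._minimum_wait:
            self._clock.sleep(self._minimum_wait - idle_time)
        self._last_call_time = self._clock.time()

    def __exit__(self, exc_type, exc_value, traceback):
        pass


def sequential_duration(per_second, calls):
    """Virtual seconds needed for `calls` back-to-back entries."""
    clock = FakeClock()
    limit = ModelRateLimit(per_second, clock)
    start = clock.time()
    for _ in range(calls):
        with limit:
            pass
    return clock.time() - start


def nested_duration():
    """Mirrors test_simultaneous_ratelimits: 2 outer entries, 3 inner each."""
    clock = FakeClock()
    limit1 = ModelRateLimit(1, clock)
    limit3 = ModelRateLimit(3, clock)
    start = clock.time()
    for _ in range(2):
        with limit1:
            for _ in range(3):
                with limit3:
                    pass
    return clock.time() - start


print(round(sequential_duration(3, 4), 2))  # 1.0
print(round(sequential_duration(1, 3), 2))  # 2.0
print(round(sequential_duration(0, 3), 2))  # 0.0
print(round(nested_duration(), 2))          # 1.67
```

In the nested case the outer limit contributes one 1-second wait and the final pass through the inner loop adds two waits of 1/3 second, which is where the 1.66 lower bound comes from.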
