Performance

Environment: Python 3.13.2, CPython, Apple M-series (arm64), Clang 20.1.0

Architecture

The entire cache lookup happens in a single Rust __call__:

Python: fn(42)
  └─ tp_call (PyO3) ─────────────────────────────── one FFI crossing
       ├─ hash(args)           via ffi::PyObject_Hash (raw FFI, no PyO3 wrapper)
       ├─ shard select         hash & shard_mask (power-of-2 bitmask)
       ├─ GilCell::read()      zero-cost under GIL (UnsafeCell)
       ├─ HashMap lookup       hashbrown + passthrough hasher (no re-hash)
       ├─ equality check       via ffi::PyObject_RichCompareBool (borrowed pointer)
       ├─ SIEVE visited=1      AtomicBool store, lock-free
       └─ return cached value

No Python wrapper function, no serialization, no key allocation on hits. The lookup uses BorrowedArgs (a raw pointer + precomputed hash) via hashbrown's Equivalent trait. A CacheKey is only created on cache miss when the entry needs to be stored.

SIEVE eviction

Both the in-process and shared memory backends use SIEVE - an eviction algorithm that gets better hit rates than LRU with O(1) cost per operation.

How it works:

On hit: set visited = 1 - one store, no linked-list reordering, no lock needed.
On evict: a "hand" scans the entry list. Visited entries get a second chance (bit cleared to 0); unvisited entries are evicted.

This is simpler than LRU (no list reordering on every hit) and gets better hit rates on skewed workloads. The key thing: cache hits never need a write lock.

Single-threaded throughput vs cache size

Cache Size	warp_cache	lru_cache	cachetools	cachebox	moka_py	wc/ct	wc/lru
32	16.5M	21.6M	611K	1.1M	3.2M	27.0x	0.76x
64	17.3M	23.0M	650K	1.2M	3.4M	26.6x	0.75x
128	19.1M	26.4M	710K	1.3M	3.6M	26.9x	0.72x
256	20.4M	31.0M	826K	1.5M	3.9M	24.7x	0.66x
512	21.8M	36.0M	946K	1.8M	4.2M	23.0x	0.61x
1024	22.9M	40.1M	1.2M	2.4M	4.6M	19.1x	0.57x

TTL throughput (cache size = 256)

TTL	warp_cache	cachetools	moka_py	wc/ct
1ms	7.3M	604K	2.6M	12.1x
10ms	7.3M	527K	2.7M	13.9x
100ms	7.4M	534K	2.7M	13.8x
1s	7.2M	523K	2.7M	13.8x
None	7.3M	523K	2.7M	13.9x

TTL adds minimal overhead - the expiry timestamp is checked during the normal read path, with no background eviction thread.

Multi-threaded throughput (cache size = 256)

Threads	warp_cache	lru_cache + Lock	cachetools + Lock	cachebox	moka_py	wc/lru
1	20.7M	12.6M	767K	1.5M	3.7M	1.64x
2	20.7M	12.3M	800K	1.6M	3.8M	1.68x
4	20.8M	12.5M	788K	1.5M	3.7M	1.66x
8	20.4M	12.6M	793K	1.5M	3.7M	1.62x
16	19.5M	11.9M	795K	1.5M	3.7M	1.64x
32	17.8M	11.5M	784K	1.4M	3.8M	1.55x

warp_cache stays at ~18-21M ops/s regardless of thread count, about 1.6-1.7x faster than lru_cache + Lock under the GIL. The reason: GilCell has no locking overhead (the GIL itself serializes access), while lru_cache + Lock pays for a threading.Lock() on every call. Under free-threaded Python (no GIL), warp_cache switches to per-shard RwLock so different threads can read different shards in parallel.

Shared memory backend

The shared memory backend uses SIEVE with lock-free reads - no Mutex, no RwLock. ShmCache uses interior mutability: reads go through the seqlock's optimistic path, and the visited bit is set via a direct store. Writes acquire the seqlock's TTAS spinlock internally. Size limit checks use cached struct fields so we don't read the shared memory header on every call.

Memory vs shared throughput

Backend	Throughput	Hit Rate	Notes
Memory (in-process)	20.0M ops/s	71.2%	Sharded hashbrown HashMap + GilCell + passthrough hasher + SIEVE
Shared (mmap, single process)	9.7M ops/s	73.0%	Seqlock + lock-free reads, no Mutex

The shared backend reaches about 49% of in-process speed. The gap is serialization (serde fast-path for primitives, pickle fallback), ahash of key bytes, seqlock overhead, and mmap copy - unavoidable costs when you work across processes.

Multi-process scaling

Processes	Total Throughput	Per-Process Avg
1	5.0M ops/s	5.0M ops/s
2	7.8M ops/s	3.4M ops/s
4	8.1M ops/s	2.0M ops/s
8	6.5M ops/s	1.0M ops/s

Because reads don't take locks, multi-process scaling works well. Total throughput peaks at 4 processes (8.1M ops/s) - 1.6x the single-process rate. Even with 8 processes on the same mmap'd file, throughput stays at 6.5M ops/s.

For comparison, the previous LRU-based shared backend did 3.1M ops/s at 4 processes and 1.9M ops/s at 8 processes. SIEVE is 2.5x and 3.4x faster respectively, because reads no longer need a write lock.

SIEVE eviction quality

SIEVE also gets better hit rates than LRU across all workload patterns we tested. These benchmarks compare warp_cache (SIEVE) against functools.lru_cache (LRU) using 1M requests with Zipf-distributed keys.

Run them yourself: make bench-sieve or python benchmarks/bench_sieve.py --quick.

Hit ratio vs cache size

With Zipf alpha=1.0 over 10K unique keys, SIEVE beats LRU at every cache size. The biggest difference is at 10% cache ratio (21.6% fewer misses):

Cache %	SIEVE	LRU	Miss Reduction
0.1%	26.2%	13.1%	+15.1%
1%	49.6%	39.0%	+17.4%
5%	67.2%	58.6%	+20.7%
10%	74.5%	67.5%	+21.6%
25%	84.0%	79.9%	+20.6%
50%	91.3%	89.7%	+15.8%

Scan resistance

This is SIEVE's key advantage. Hot working set (100 keys, Zipf) mixed with sequential scans (10K unique keys, each accessed once). Cache size = 200 (fits the entire hot set). SIEVE protects hot items via the visited bit; LRU pushes them out during scans:

Hot Fraction	SIEVE	LRU	Miss Reduction
100%	99.99%	99.99%	+0.0%
90%	90.0%	88.9%	+10.0%
80%	80.0%	76.0%	+16.6%
70%	69.9%	63.5%	+17.6%
50%	50.0%	40.9%	+15.3%
30%	30.0%	21.1%	+11.3%

At 70% hot fraction, SIEVE retains almost all hot items (69.9% hit rate ≈ hot fraction) while LRU drops to 63.5% as scans push hot entries out of the cache.

One-hit-wonder filtering

Mix of Zipf-distributed reused keys with unique one-time keys (one-hit wonders). SIEVE inserts with visited=0 and evicts on the first hand scan; LRU gives every entry a full tenure through the cache:

OHW Ratio	SIEVE	LRU	Miss Reduction
0%	72.4%	65.0%	+21.2%
25%	53.9%	43.7%	+18.1%
50%	35.6%	25.8%	+13.2%
75%	17.2%	10.6%	+7.4%

Working set shift

Three phases: Zipf over keys 0-999, then keys 1000-1999 (completely new set), then back to 0-999. Cache size = 200. Both algorithms adapt, but SIEVE maintains a consistent advantage:

Phase	SIEVE	LRU
Phase 1 (keys 0-999)	75.5%	69.7%
Phase 2 (keys 1000-1999)	75.6%	69.9%
Phase 3 (return to 0-999)	75.5%	69.6%

Benchmarks: benchmarks/bench_sieve.py, 1M ops, Zipf-distributed keys, seed=42.

Where the remaining gap lives

Memory backend vs lru_cache

At cache size 256, a cache hit takes ~49ns vs lru_cache's ~32ns:

Operation	lru_cache (C)	warp_cache (Rust)	Delta
Call dispatch (`tp_call`)	~5ns	~10ns	+5ns
Hash args (`PyObject_Hash`)	~15ns	~15ns	0
Shard select + lock	~0ns	~0ns	0
Table lookup + key equality	~5ns	~5ns	0
SIEVE visited store	~0ns	~1ns	+1ns
Refcount management	~2ns	~5ns	+3ns
Return value	~2ns	~2ns	0
Total	~34ns	~51ns	+17ns

Most of the gap is PyO3 call dispatch (~5ns) - PyO3's tp_call shim extracts GIL tokens, validates and converts argument pointers. lru_cache receives raw PyObject* directly. This is just the cost of using a safe FFI layer.

Three optimizations that made the biggest difference:

GilCell - Under GIL-enabled Python, GilCell (an UnsafeCell wrapper) replaces parking_lot::RwLock, saving ~8ns per hit. Under free-threaded Python (#[cfg(Py_GIL_DISABLED)]), real RwLock is used instead.
Borrowed key lookup - The hit path uses BorrowedArgs (raw pointer + precomputed hash) via hashbrown's Equivalent trait. No CacheKey allocated, no args.clone(). CacheKey is only created on cache miss.
Passthrough hasher - PassthroughHasher feeds Python's precomputed hash directly to hashbrown, skipping foldhash re-hashing (~1-2ns saved).

On cache hit, we only acquire a GilCell read (no overhead under GIL) or a per-shard read lock (under free-threading). The SIEVE visited bit uses AtomicBool::store(Relaxed) which needs no lock upgrade. Write lock is only taken on cache misses for SIEVE eviction.

Shared backend vs memory backend

The shared backend hit path takes ~103ns vs the memory backend's ~49ns. The ~54ns difference is unavoidable cross-process overhead:

Operation	Cost	Notes
Key serialization (serde fast-path)	~10ns	Unavoidable for cross-process
ahash of key bytes	~4ns	Deterministic hash (Python's is randomized per-process)
Seqlock (read_begin + validate)	~15ns	Optimistic lock-free read
HT lookup in mmap	~10ns	Slightly slower than hashbrown in heap
Value `.to_vec()` copy	~8ns	Must copy from mmap before seqlock validate
Value deserialization (serde)	~8ns	Unavoidable for cross-process
Total delta	~55ns

No Mutex or RwLock on the shared backend's read path - ShmCache uses interior mutability, the seqlock handles all the synchronization.

Python 3.13+/3.14 free-threading

Under free-threaded Python (no GIL):

warp_cache: #[cfg(Py_GIL_DISABLED)] switches from GilCell to real per-shard RwLock, so threads can read different shards in parallel
lru_cache: needs a real lock without the GIL's implicit protection, gets slower
Trade-off: atomic refcounting adds ~2-5ns to single-threaded cost

The benchmark runner (benchmarks/_bench_runner.py) automatically creates temporary uv venvs for each Python version, builds warp_cache via maturin, and runs all benchmarks.

Optimization Journey

Phase	Change	Throughput	Ratio vs lru_cache
1. Serialization + Python wrapper	pickle.dumps, functools.wraps, 2 FFI crossings	~500K ops/s	0.02x
2. PyObject keys + Rust `__call__`	Precomputed hash, single FFI crossing	~13-18M ops/s	0.56-0.68x
3. Compiler: fat LTO + codegen-units=1	Cross-crate inlining of PyO3 wrappers	+10-15%	0.66-0.74x
4. Raw FFI for key equality	`ffi::PyObject_RichCompareBool` instead of `Python::with_gil`	+multi-thread	-
5. SIEVE eviction	Unified eviction for both backends, lock-free reads	+12% hit rate	-
6. Sharded RwLock (hashbrown)	Per-shard read locks, parallel reads across shards	~18M ops/s	0.56x
7. Shared backend: remove Mutex	`ShmCache` interior mutability, cached hash state + size limits	cleaner arch	-
8. Passthrough hasher + borrowed keys + GilCell	No re-hash, no alloc on hit, no lock under GIL	~20M ops/s	0.66x

Benchmarks: 100K ops per config, median of 3 rounds, Zipf-distributed keys (2000 unique), time.perf_counter(). Python 3.13.2, cachetools 7.0.1, moka_py 0.3.0, cachebox 5.2.2. Source: benchmarks/

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance

Architecture

SIEVE eviction

Single-threaded throughput vs cache size

TTL throughput (cache size = 256)

Multi-threaded throughput (cache size = 256)

Shared memory backend

Memory vs shared throughput

Multi-process scaling

SIEVE eviction quality

Hit ratio vs cache size

Scan resistance

One-hit-wonder filtering

Working set shift

Where the remaining gap lives

Memory backend vs lru_cache

Shared backend vs memory backend

Python 3.13+/3.14 free-threading

Optimization Journey

FilesExpand file tree

performance.md

Latest commit

History

performance.md

File metadata and controls

Performance

Architecture

SIEVE eviction

Single-threaded throughput vs cache size

TTL throughput (cache size = 256)

Multi-threaded throughput (cache size = 256)

Shared memory backend

Memory vs shared throughput

Multi-process scaling

SIEVE eviction quality

Hit ratio vs cache size

Scan resistance

One-hit-wonder filtering

Working set shift

Where the remaining gap lives

Memory backend vs lru_cache

Shared backend vs memory backend

Python 3.13+/3.14 free-threading

Optimization Journey