Sparse 2D convolution in Python via Toeplitz matrix methods.
Fast when the kernel is small, the input is sparse, and/or many arrays share the same kernel.
pip install sparse_convolution
Or from source:
git clone https://github.com/RichieHakim/sparse_convolution
cd sparse_convolution
pip install -e .
import numpy as np
import scipy.sparse
import sparse_convolution as sc
x = scipy.sparse.random(100, 100, density=0.01)
k = np.random.rand(5, 5)
conv = sc.Toeplitz_convolution2d(x_shape=x.shape, k=k, mode='same')
out = conv(x=x, batching=False).toarray()Input: (n_images, H * W) sparse matrix. Output: (n_images, H_out * W_out).
x_batch = scipy.sparse.vstack([
scipy.sparse.random(100, 100, density=0.01).reshape(1, -1)
for _ in range(50)
]).tocsr()
conv = sc.Toeplitz_convolution2d(x_shape=(100, 100), k=k, mode='same')
out = conv(x=x_batch, batching=True)Four methods, each with selectable backends:
| Method | numpy | numba | torch |
|---|---|---|---|
direct |
n/a | yes | n/a |
precomputed |
yes | yes | yes |
lazy |
yes | n/a | yes |
gather_scatter |
yes | yes | yes |
direct: Batch-parallel scatter convolution with thread-local dense buffers (numba only). For each image in parallel, scatters kernel-weighted input values into an L2-cache-sized accumulator buffer, then extracts nonzeros into CSR format. Uses a two-phase approach: a lightweight boolean counting pass (1-byte flags, no float arithmetic) determines exact output sizes, then the scatter pass writes directly to right-sized arrays with zero waste. Interior pixels (~92-100%) skip bounds checking entirely via precomputed safe regions. O(nnz × K) per image with no init overhead. Fastest method across nearly all configurations. Requiresnumba.precomputed: Builds a sparse Toeplitz matrix at init; fast batched matmul. Best for large batches with the same kernel when numba is not available.lazy: COO broadcasting, no init cost. Best for very sparse inputs with small batches.gather_scatter: Per-kernel-position scatter into a dense accumulator. General-purpose method for sparse batched inputs. Usesnumbaautomatically when available, and falls back tonumpyotherwise.
Backend selection:
numpy: scipy/numpy ops. Always available.numba: JIT-compiled parallel loops. Fastest on CPU for batched inputs. Requiresnumba.torch: PyTorch ops with optional GPU. Requirestorch.
conv = sc.Toeplitz_convolution2d(
x_shape=(100, 100),
k=k,
mode='same',
method='direct', # default
backend=None, # numba for the default direct method
)If backend=None (default), direct uses numba. For environments without numba, choose a numpy-capable method explicitly, such as method='gather_scatter', backend='numpy' or method='precomputed', backend='numpy'.
- Toeplitz convolution: stackoverflow.com/a/51865516, alisaaalehi/convolution_as_multiplication
- 1D convolution matrix: scipy.linalg.convolution_matrix
All benchmarks run on CPU with 1s minimum measurement time per configuration (median reported). Nine method+backend combinations compared across six scaling sweeps.
Six scaling sweeps varying batch size, density, image size, and kernel size. direct+numba (brown stars) is the fastest method in nearly all regimes.
Each cell shows the winning method and total time (init + call) for that batch size × density combination. direct+numba wins 28 of 36 configurations with an average 4.75× speedup over the second-fastest method.







