Non-local means takes a long time to compute as it is quadratic in the number of pixels in an image. For small images this is OK; for larger or three-dimensional images (common in medical imaging), it becomes impractical. Here, I introduce a PyTorch-based solution which uses convolutions to extract neighbours (non-local means here does not use the complete image but rather a neighbourhood with $n \times n$ pixels or $n \times n \times n$ voxels) and calculates the non-local means average. By porting this to PyTorch we can easily make use of very efficient GPU parallelization and speed up what is oftentimes a very time-consuming algorithm.
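To illustrate the neighbour-extraction idea (a minimal sketch, not this package's internals), `torch.nn.functional.unfold` can gather every pixel's $n \times n$ neighbourhood into a single tensor:

```python
import torch
import torch.nn.functional as F

# Minimal sketch of neighbour extraction (not the package's internals):
# unfold gathers every pixel's n x n neighbourhood into one tensor.
n = 11
image = torch.rand(1, 1, 256, 256)  # (batch, channel, height, width)
neighbours = F.unfold(image, kernel_size=n, padding=n // 2)
print(neighbours.shape)  # (1, n*n, 256*256): n*n neighbours per pixel
```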
- You want to run NLM for small images: just use `scikit-image`
- You want to run NLM for bigger images AND you have a GPU: use this
- You want to run NLM for relatively big images and you DO NOT have a GPU: good luck
I only benchmarked `torch_nlm` against `scikit-image` in 2D because the latter is prohibitively slow in 3D. Results below.
For an image $I$ with height $h$ and width $w$, consider a pixel $I_{i,j}$. To obtain the non-local mean of this pixel:

$$\frac{1}{W}\sum_{a,b=1}^{h,w} w(I_{i,j},I_{a,b}) \cdot I_{a,b}$$

where $w(I_{i,j},I_{a,b})$ is a weight quantifying the similarity between $I_{i,j}$ and $I_{a,b}$, and $W = \sum_{a,b=1}^{h,w} w(I_{i,j},I_{a,b})$ is the normalising constant. In other words, the non-local mean of a given pixel is the weighted average of all pixels. Weights, here, are calculated as a Gaussian of the squared intensity difference, $w(I_{i,j},I_{a,b}) = \exp\left(-\frac{(I_{i,j} - I_{a,b})^2}{\sigma^2}\right)$ (in practice computed on locally averaged intensities; see `kernel_size_mean` below).
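A naive, unoptimised reading of this formula for a single pixel might look as follows (assuming the simple intensity-based Gaussian weight above):

```python
import torch

# Naive reading of the formula above for a single pixel (i, j); the weight
# is assumed to be a Gaussian of the squared intensity difference.
torch.manual_seed(0)
I = torch.rand(8, 8)  # a tiny image
sigma = 1.0
i, j = 4, 4
w = torch.exp(-((I - I[i, j]) ** 2) / sigma ** 2)  # w(I[i,j], I[a,b]) for all (a, b)
W = w.sum()                                        # normalising constant
nlm_ij = (w * I).sum() / W                         # non-local mean of pixel (i, j)
```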
To use this package all you have to do is clone and install it (a `pyproject.toml` is provided so that you can easily install it with `poetry`). Alternatively, use `requirements.txt` with `pip` (i.e. `pip install -r requirements.txt`).
Installation with `pip`: this is probably the least painful version to use:

```bash
pip install torch_nlm
```
Or, if you already have all the dependencies:

```bash
pip install torch_nlm --no-deps
```
Installation with `setup.py`: also easy to use:

```bash
python setup.py install
```
Two main functions are exported: `nlm2d` and `nlm3d`, which are aliases for the most efficient `torch`-based NLM versions (`apply_nonlocal_means_2d_mem_efficient` and `apply_nonlocal_means_3d_mem_efficient`, respectively). So if you want to apply it to your favourite image and have a CUDA-compatible GPU:
```python
import torch  # necessary for obvious reasons
from torch_nlm import nlm2d

image = ...  # here you define your image

# allocate the image to your favourite device
image_torch = torch.as_tensor(image).to("cuda")
image_nlm = nlm2d(
    image_torch,         # the image
    kernel_size=11,      # neighbourhood size
    std=1.0,             # the sigma
    kernel_size_mean=3,  # the kernel used to compute the average pixel intensity
    sub_filter_size=32,  # how many neighbourhoods are computed per iteration
)
```
`sub_filter_size` is what allows large neighbourhoods: users with relatively small GPU cards may opt for a smaller `sub_filter_size`, which loads much smaller sets of neighbourhoods for distance/weight calculations. You may want to run a few tests to figure out the best `sub_filter_size` before deploying this en masse, as shown in the sketch below.
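A hypothetical tuning loop along these lines can help; the image here is a random stand-in:

```python
import time
import torch
from torch_nlm import nlm2d

# Hypothetical tuning loop: time nlm2d for a few sub_filter_size values
# and keep the fastest one that does not run out of memory.
image_torch = torch.rand(1024, 1024, device="cuda")
for sub_filter_size in (8, 16, 32, 64):
    try:
        torch.cuda.synchronize()
        start = time.time()
        nlm2d(image_torch, kernel_size=11, std=1.0,
              kernel_size_mean=3, sub_filter_size=sub_filter_size)
        torch.cuda.synchronize()
        print(sub_filter_size, f"{time.time() - start:.2f}s")
    except RuntimeError:  # typically an out-of-memory error
        print(sub_filter_size, "OOM")
```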
Since GPU allocation can be time consuming and the user may have a lot of images to process, it might not be a terrible idea to process images in batches rather than in separate scripts.
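A minimal sketch of such a batch script, assuming `skimage.io` for image I/O and hypothetical file paths:

```python
import torch
from skimage import io  # assumed here purely for image I/O
from torch_nlm import nlm2d

# Hypothetical batch script: pay the CUDA start-up cost once, then loop.
paths = ["image_000.jpg", "image_001.jpg"]  # your image paths
for path in paths:
    image = io.imread(path, as_gray=True)  # grayscale floats in [0, 1]
    image_torch = torch.as_tensor(image).float().to("cuda")
    denoised = nlm2d(image_torch, kernel_size=11, std=1.0,
                     kernel_size_mean=3, sub_filter_size=32)
    # rescale to 8-bit before saving
    out = (denoised.clamp(0, 1) * 255).byte().cpu().numpy()
    io.imsave(path.replace(".jpg", "_nlm.jpg"), out)
```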
This code was optimized for speed. Three main functions are provided here: `apply_nonlocal_means_2d`, `apply_windowed_nonlocal_means_2d` and `apply_nonlocal_means_2d_mem_efficient`. The first two are development versions; the latter is the one you should use (exposed as `nlm2d`).
- `apply_nonlocal_means_2d`: retrieves all neighbours as a large tensor and calculates the NLM of the image. Problems: large neighbourhoods will lead to OOM errors.
- `apply_windowed_nonlocal_means_2d`: does the same as `apply_nonlocal_means_2d` but uses strided patches to do this, thus reducing memory requirements. Problems: leads to visible striding-patch artifacts.
- `apply_nonlocal_means_2d_mem_efficient`: does the same as `apply_nonlocal_means_2d` but loops over sets of neighbourhoods to calculate weights (see the sketch below). Problems: none for now! But time is a teacher to us all.
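My reading of that looping idea, as a sketch rather than the package's actual code (one neighbour offset at a time, zero padding at the borders):

```python
import torch
import torch.nn.functional as F

def nlm2d_looped(image: torch.Tensor, kernel_size: int = 11, sigma: float = 1.0):
    # Sketch of the memory-efficient idea: instead of materialising all
    # kernel_size**2 neighbours at once, accumulate weighted sums one
    # neighbour offset at a time, keeping memory O(H * W).
    pad = kernel_size // 2
    padded = F.pad(image[None, None], (pad, pad, pad, pad))[0, 0]
    num = torch.zeros_like(image)
    den = torch.zeros_like(image)
    H, W = image.shape
    for di in range(kernel_size):
        for dj in range(kernel_size):
            neighbour = padded[di:di + H, dj:dj + W]  # offset (di - pad, dj - pad)
            w = torch.exp(-((neighbour - image) ** 2) / sigma ** 2)
            num += w * neighbour
            den += w
    return num / den
```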
The good aspect of this is that it requires very little effort to generalise these functions to 3D. These are made available with the same names as above, replacing `2d` with `3d`. The version you want to use is `nlm3d`.
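Usage mirrors the 2D example above; the sketch below assumes `nlm3d` takes the same arguments as `nlm2d`:

```python
import torch
from torch_nlm import nlm3d

volume = ...  # here you define your 3D array, e.g. a CT or MRI volume
volume_torch = torch.as_tensor(volume).to("cuda")
volume_nlm = nlm3d(
    volume_torch,
    kernel_size=11,      # neighbourhood size (now 11 x 11 x 11 voxels)
    std=1.0,
    kernel_size_mean=3,
    sub_filter_size=8,   # neighbourhood sets grow cubically, so smaller values help
)
```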
For a large image such as `assets/threegorges-1024x1127.jpg` (source; size: 1024x1127), `apply_nonlocal_means_2d_mem_efficient` takes ~3-5 seconds with a 51x51-pixel neighbourhood when running on a GPU. Example below (obtained by running `python test.py assets/threegorges-1024x1127.jpg`):
- First panel: original image
- Second panel: original image + noise
- Third panel: original image + noise + NLM
- Fourth panel: difference between original image and original image + noise + NLM
Note on benchmarking: while 2D benchmarks are reasonable, 3D benchmarks will take a lot of time because of `scikit-image`'s implementation. Expect times of ~4,000 seconds for a volume that `torch_nlm` processes in ~70-80 seconds 😊. You will need `scikit-image` for benchmarking.
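For 2D, a rough wall-time comparison along these lines is possible; the parameters of the two implementations are not one-to-one, so treat this as a sketch rather than an exact head-to-head:

```python
import time
import numpy as np
import torch
from skimage.restoration import denoise_nl_means
from torch_nlm import nlm2d

# Rough 2D wall-time sketch; parameters are not equivalent across libraries.
image = np.random.rand(1024, 1024).astype(np.float32)

start = time.time()
denoise_nl_means(image, patch_size=3, patch_distance=25, h=0.1)
print("scikit-image:", f"{time.time() - start:.2f}s")

image_torch = torch.as_tensor(image).to("cuda")
torch.cuda.synchronize()
start = time.time()
nlm2d(image_torch, kernel_size=51, std=1.0, kernel_size_mean=3, sub_filter_size=32)
torch.cuda.synchronize()
print("torch_nlm:", f"{time.time() - start:.2f}s")
```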