A proof-of-concept deformer node implemented as an Autodesk® Maya® plug-in.
The plug-in is not intended for production use. It is highly unstable: it crashes on heavy geometry, does not handle exceptions, uses blocked ranges, etc.
The CPU part is multithreaded with Intel TBB. The GPU part is implemented mostly with custom kernels and uses the cuBLAS batched subroutine for matrix inversion. There are several ways to make it faster, e.g., using cuBLAS more extensively to fully utilise the hardware, or using smarter parallel reduction functions.
The deformer implements the basic algorithm described in the paper
Alias|wavefront "Skinning Characters using Surface-Oriented Free-Form Deformations"
The algorithm for finding the distance from a point to a triangle is taken from the Geometric Tools, LLC document:
https://www.geometrictools.com/Documentation/DistancePoint3Triangle3.pdf
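For readers who want a feel for the point-to-triangle query, here is a minimal C++ sketch. Note that it is not the Geometric Tools formulation and not the plug-in's code: it uses the Voronoi-region test described in Ericson's "Real-Time Collision Detection", which answers the same query. The names Vec3, closestPointOnTriangle and distancePointTriangle are made up for this illustration, and a non-degenerate triangle is assumed.

```cpp
#include <cmath>

// Hypothetical minimal vector type, for illustration only.
struct Vec3 { double x, y, z; };

inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator*(Vec3 a, double s) { return {a.x * s, a.y * s, a.z * s}; }
inline double dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Closest point on triangle (a, b, c) to point p.
// Classifies p into one of the seven Voronoi regions of the triangle
// (three vertices, three edges, the face). Assumes a non-degenerate triangle.
Vec3 closestPointOnTriangle(Vec3 p, Vec3 a, Vec3 b, Vec3 c) {
    const Vec3 ab = b - a, ac = c - a, ap = p - a;
    const double d1 = dot(ab, ap), d2 = dot(ac, ap);
    if (d1 <= 0.0 && d2 <= 0.0) return a;                 // vertex region A

    const Vec3 bp = p - b;
    const double d3 = dot(ab, bp), d4 = dot(ac, bp);
    if (d3 >= 0.0 && d4 <= d3) return b;                  // vertex region B

    const double vc = d1 * d4 - d3 * d2;
    if (vc <= 0.0 && d1 >= 0.0 && d3 <= 0.0)              // edge region AB
        return a + ab * (d1 / (d1 - d3));

    const Vec3 cp = p - c;
    const double d5 = dot(ab, cp), d6 = dot(ac, cp);
    if (d6 >= 0.0 && d5 <= d6) return c;                  // vertex region C

    const double vb = d5 * d2 - d1 * d6;
    if (vb <= 0.0 && d2 >= 0.0 && d6 <= 0.0)              // edge region AC
        return a + ac * (d2 / (d2 - d6));

    const double va = d3 * d6 - d5 * d4;
    if (va <= 0.0 && (d4 - d3) >= 0.0 && (d5 - d6) >= 0.0) {  // edge region BC
        const double w = (d4 - d3) / ((d4 - d3) + (d5 - d6));
        return b + (c - b) * w;
    }

    // Face region: combine the barycentric weights.
    const double denom = 1.0 / (va + vb + vc);
    return a + ab * (vb * denom) + ac * (vc * denom);
}

// Euclidean distance from p to the triangle (a, b, c).
double distancePointTriangle(Vec3 p, Vec3 a, Vec3 b, Vec3 c) {
    const Vec3 d = p - closestPointOnTriangle(p, a, b, c);
    return std::sqrt(dot(d, d));
}
```

The function is branch-heavy but allocation-free, so the same logic could presumably be marked as a device function and reused in a CUDA kernel; that has not been verified against this plug-in.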
The test data set was quite small, but it still gave some insight. At first glance, the GPU mode does not perform well compared with its CPU counterpart.
However, the bottleneck appears to be memory allocation on the GPU and data transfer between the GPU and the CPU. The time spent on the actual computations grows at a noticeably lower rate on the GPU than on the CPU. In the current implementation the CUDA part allocates, transfers and frees all necessary data on every call during the deformation. Keeping pointers to device memory in static storage, or using Unified Memory, could solve the problem.
The repository contains a CSV file with the benchmarks; either the R (ggplot) or the Python (matplotlib) script can be used to plot the data.
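The idea of keeping device pointers alive between calls can be sketched as follows. This is an illustration under stated assumptions, not the plug-in's code: PersistentBuffer is a made-up helper, and the allocation and release functions are passed in as callbacks so the sketch compiles without the CUDA toolkit. In the plug-in they would wrap cudaMalloc and cudaFree.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper: owns one buffer and reallocates only when it has to grow.
// The callbacks stand in for device allocation and release.
class PersistentBuffer {
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn  = std::function<void(void*)>;

    PersistentBuffer(AllocFn alloc, FreeFn release)
        : alloc_(std::move(alloc)), free_(std::move(release)) {}

    ~PersistentBuffer() { if (ptr_) free_(ptr_); }

    // Non-copyable: exactly one owner per allocation.
    PersistentBuffer(const PersistentBuffer&) = delete;
    PersistentBuffer& operator=(const PersistentBuffer&) = delete;

    // Returns a buffer of at least `bytes` bytes.
    // The previous allocation is reused whenever it is already large enough.
    void* reserve(std::size_t bytes) {
        if (bytes > capacity_) {
            if (ptr_) free_(ptr_);
            ptr_ = nullptr;        // stay consistent if alloc_ throws
            capacity_ = 0;
            ptr_ = alloc_(bytes);
            capacity_ = bytes;
        }
        return ptr_;
    }

    std::size_t capacity() const { return capacity_; }

private:
    AllocFn alloc_;
    FreeFn free_;
    void* ptr_ = nullptr;
    std::size_t capacity_ = 0;
};
```

A deformer could hold one such object per device array as a member or a static, call reserve() at the start of each deformation, and pay the allocation cost only when the mesh grows. The host-to-device transfers would still occur on every call unless the data is known to be unchanged.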