# Development Notes

This Python package provides a PyTorch extension.
## Organisation

### Build

The `setup.py` script uses the standard PyTorch extension mechanism to build the package:

```python
from torch.utils.cpp_extension import BuildExtension, CUDAExtension
...
    ext_modules=[
        CUDAExtension('block_sparse_native',
            ['pytorch_block_sparse/native/block_sparse_native.cpp',
             'pytorch_block_sparse/native/block_sparse_cutlass_kernel_back.cu',
             'pytorch_block_sparse/native/block_sparse_cutlass_kernel.cu'],
            extra_compile_args=['-I', '%s/pytorch_block_sparse' % rootdir]
        ),
    ],
    cmdclass={
        'build_ext': BuildExtension
    }
```

### Native functions Python interface
A single C++ file, `block_sparse_native.cpp`, provides the native functions visible from Python.
These functions give access to CUDA kernels that compute:
- dense x sparse -> dense
- dense x dense on sparse support -> sparse
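
The two products can be illustrated with a pure-PyTorch reference (the block size, mask, and variable names below are illustrative, not the library's API):

```python
import torch

# Hypothetical 2x2 block pattern with 4x4 blocks, expanded to an 8x8 element mask
block = 4
block_mask = torch.tensor([[1., 0.], [0., 1.]])
mask = block_mask.repeat_interleave(block, 0).repeat_interleave(block, 1)

a = torch.randn(8, 8)
b = torch.randn(8, 8)
sparse_b = b * mask            # a "sparse" matrix, stored densely for reference

# dense x sparse -> dense
dense_out = a @ sparse_b

# dense x dense on sparse support -> sparse:
# full dense product, then keep only the entries inside the sparse support
sparse_out = (a @ b) * mask
```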

### CUDA/Cutlass kernels
The `*.cu` files in the `native` directory provide the kernels themselves.
They use the Cutlass primitives available in the `cutlass` subdirectory.

Multiple levels of C++ templating provide dispatch and code generation for the kernels.

The main files in the `cutlass/gemm` directory are `block_task.h` and `block_task_back.h`.
They express the final CUDA kernel that will be executed, using:
- `block_loader_.*` to load A and B matrix tiles in an efficient way
- `thread_accumulator.h` to store the result tiles `R`
- `epilogue_function` to combine `R` with `C`: `C' = alpha * R + beta * C`
- `grid_raster_.*` to list the output tiles that must be computed
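
The epilogue step can be written out as a small PyTorch sketch (the `alpha`, `beta` values and tile shapes here are illustrative):

```python
import torch

alpha, beta = 1.0, 0.5
A = torch.randn(4, 6)          # tile of the A matrix
B = torch.randn(6, 4)          # tile of the B matrix
C = torch.randn(4, 4)          # existing output tile

R = A @ B                      # accumulator tile produced by the GEMM main loop
C_new = alpha * R + beta * C   # what epilogue_function computes
```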

### block_sparse python module
This library includes as little native code as possible, because native code is hard to write, debug, and understand.

The native functions perform the performance-critical tasks, while the Python code in `block_sparse.py` does
all the preparatory work, which is executed only once, or infrequently.

The main job of `block_sparse.py` is to build indexes into the sparse matrices.
Three sets of sparse indices are built:
- row-wise index of non-zero entries (for dense x sparse)
- column-wise index of non-zero entries (for dense x sparse with transposition)
- linear list of 2D coordinates of non-zero entries (for dense x dense on sparse support)

These structures are created using standard PyTorch primitives, so they are easy to debug, understand,
or reimplement in other languages.
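
A hedged sketch of how such indices can be built from a block mask with standard PyTorch primitives (the mask and variable names are illustrative, not the actual code of `block_sparse.py`):

```python
import torch

# non-zero block pattern: True where a block is present
mask = torch.tensor([[True, False, True],
                     [False, True, False]])

# linear list of 2D coordinates of non-zero blocks
# (used for dense x dense on sparse support)
coords = mask.nonzero()        # shape (nnz, 2), row-major order

# row-wise index: CSR-like row pointers plus the column of each non-zero block
row_counts = mask.sum(dim=1)
row_ptr = torch.cat([torch.zeros(1, dtype=torch.long), row_counts.cumsum(0)])
row_cols = coords[:, 1]

# column-wise index: the same construction on the transposed mask
# (used for dense x sparse with transposition)
coords_t = mask.t().nonzero()
col_rows = coords_t[:, 1]
```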

### block_sparse_linear python module
`block_sparse_linear` is a thin layer on top of `block_sparse`.
It uses the linear algebra primitives of `block_sparse` to create a drop-in replacement for `torch.nn.Linear`,
with the proper back-propagation primitives, implemented using a `torch.autograd.Function` subclass.
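
The wiring can be sketched with a dense weight for clarity (in the real module the backward products go through the block-sparse kernels; the class and variable names here are illustrative):

```python
import torch

class SparseLinearFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight):
        ctx.save_for_backward(x, weight)
        return x @ weight.t()            # forward: dense x (block-)sparse

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_out @ weight       # dense x sparse with transposition
        grad_w = grad_out.t() @ x        # dense x dense on sparse support
        return grad_x, grad_w

x = torch.randn(3, 4, requires_grad=True)
w = torch.randn(5, 4, requires_grad=True)
y = SparseLinearFunction.apply(x, w)
y.sum().backward()
```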

## Testing
Debugging CUDA kernels is hard. Fortunately, it is easy to compare the kernel results with
a reference PyTorch implementation.
The `tests` directory provides code to test the library and measure its performance.
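
The pattern is simply: run the kernel under test and a pure-PyTorch reference on the same inputs, then compare (`native_matmul` below is a placeholder standing in for a real native function):

```python
import torch

def native_matmul(a, b):
    # placeholder for the CUDA kernel under test;
    # the real tests would call into block_sparse_native here
    return a @ b

torch.manual_seed(0)
a = torch.randn(16, 32)
b = torch.randn(32, 8)

result = native_matmul(a, b)
reference = a @ b                       # reference PyTorch implementation
max_err = (result - reference).abs().max().item()
```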

## TODO

block_sparse
- add input parameter sanity checks
- add dispatch for:
  - different matrix sizes -> different dispatch strategies (tile sizes in the k-dimension)
  - different block sizes

tests
- refactor/clean up tests

doc
- schema of the sparse index structures

cutlass
- move to the 2.x version

cleanup algorithms
- add algorithms to measure weight importance and optimize the sparsity pattern