
Commit 253e0f8

Development notes, license information in setup.py, README update.

1 parent: fb7c215

4 files changed: +116 -9 lines changed


Diff for: README.md

+17 -2

@@ -7,7 +7,7 @@ This allows very easy experimentation, as you just have to replace the Linear la
 The incentive to create this library is to let people test the idea that **sparse matrices can be used in neural networks**, instead of dense ones, without significantly altering the precision.
 
 This would be great news as sparse matrices allow savings in both space and compute: a **50% sparse matrix** will use **only 50% memory**, and theoretically will use only 50% of computation.
-However, due to the very optimized nature of cuBLAS based torch.nn.Linear, this lib is slower, by roughly a factor of 2 (this may be improved in the future).
+However, due to the very optimized nature of cuBLAS based torch.nn.Linear, this library is slower, by roughly a factor of 2 (this may be improved in the future).
 But the performance gain of using sparse matrices grows with the sparsity, so a **75% sparse matrix** is roughly **2x** faster than the dense equivalent.
 
 This could prove useful, and could be combined with other methods like distillation and quantization to further reduce the networks.
@@ -29,7 +29,7 @@ You can use the BlockSparseLinear drop in replacement for torch.nn.Linear in you
 
 Or you can use a utility called BlockSparseModelPatcher to easily modify an existing model before training it (you cannot magically sparsify an existing trained model; you will need to train it from scratch).
 
-Here is an example with a Roberta Model from Hugging Face ([full example](./doc/notebooks/ModelSparsification.ipynb))
+Here is an example with a Roberta Model from Hugging Face ([full example](doc/notebooks/ModelSparsification.ipynb))
 
 ```python
 from pytorch_block_sparse import BlockSparseModelPatcher
@@ -52,6 +52,18 @@ print(f"Final model parameters count={model.num_parameters()}")
 # => 68 million parameters instead of 84 million parameters (embeddings are taking a lot of space in Roberta)
 ```
 
+## Performance
+It's notoriously hard to approach cuBLAS performance with custom CUDA kernels.
+The OpenAI kernels, for example, make ample use of assembly language to achieve good performance.
+
+The promise of Cutlass was to provide tools that abstract the different parts of CUDA kernels using smart C++ templates.
+
+This allows the `pytorch_block_sparse` library to reach roughly 50% of cuBLAS performance:
+depending on the exact matrix computation, it achieves 40% to 55% of the cuBLAS performance on large matrices
+(which is the case when using large batch x sequence sizes in Transformers, for example).
+Practically, this means that a Transformer with BlockSparseLinear layers at 50% sparsity is as fast as the dense version.
+This may be improved in future releases, especially when newer versions of Cutlass are used.
+
 ## Future work
 - Implement some paper methods (and provide new ones) to optimize the sparse pattern during training, while doing the classic parameter optimization using backprop. The basic idea is to remove some smaller-magnitude weights (or blocks of weights) at some positions and try other ones.
 - [Movement Pruning: Adaptive Sparsity by Fine-Tuning](https://arxiv.org/abs/2005.07683)
@@ -66,3 +78,6 @@ In the root directory just execute:
 ```
 python setup.py install
 ```
+
+# Development Notes
+You will find them [here](doc/DevNotes.md).
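A note on the patching example referenced in the README hunk above: the diff only shows the import and the final parameter count, not the patching calls themselves. The snippet below is a purely illustrative, self-contained sketch using only `torch`; `emulate_block_sparsity`, the 32x32 block size and the toy model are made-up for illustration and are not the library's API. It only shows what 50% block sparsity means on the Linear layers of a model, whereas the real BlockSparseModelPatcher swaps each targeted layer for a true block-sparse one, so the zero blocks are neither stored nor multiplied.

```python
# Illustrative only: emulate block sparsity by zeroing whole 32x32 blocks of each
# nn.Linear weight. The real library replaces the layer with a BlockSparseLinear
# that stores and multiplies only the non-zero blocks.
import torch
import torch.nn as nn

def emulate_block_sparsity(model: nn.Module, density: float = 0.5, block: int = 32) -> None:
    for module in model.modules():
        if isinstance(module, nn.Linear):
            out_b = module.out_features // block
            in_b = module.in_features // block
            # One keep/drop decision per (block_row, block_col) pair.
            keep = (torch.rand(out_b, in_b) < density).float()
            mask = keep.repeat_interleave(block, 0).repeat_interleave(block, 1)
            with torch.no_grad():
                module.weight.mul_(mask)

model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 256))
dense_params = sum(p.numel() for p in model.parameters())
emulate_block_sparsity(model, density=0.5)
kept = sum((p != 0).sum().item() for p in model.parameters())
print(f"dense parameters: {dense_params}, parameters still non-zero: {kept}")
```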

Diff for: doc/DevNotes.md

+95

@@ -0,0 +1,95 @@
+# Development Notes
+
+
+This Python package provides a PyTorch extension.
+
+
+## Organisation
+### Build
+
+The setup.py script uses the standard PyTorch extension mechanism to build the package:
+
+```
+from torch.utils.cpp_extension import BuildExtension, CUDAExtension
+...
+ext_modules=[
+    CUDAExtension('block_sparse_native',
+                  ['pytorch_block_sparse/native/block_sparse_native.cpp',
+                   'pytorch_block_sparse/native/block_sparse_cutlass_kernel_back.cu',
+                   'pytorch_block_sparse/native/block_sparse_cutlass_kernel.cu'],
+                  extra_compile_args=['-I', '%s/pytorch_block_sparse' % rootdir]
+                  ),
+],
+cmdclass={
+    'build_ext': BuildExtension
+}
+```
+
+### Native functions python interface
+A single C++ file, `block_sparse_native.cpp`, provides the native functions visible from Python.
+These functions give access to CUDA kernels which compute:
+- dense x sparse -> dense
+- dense x dense on sparse support -> sparse
+
+### CUDA/Cutlass kernels
+The `*.cu` files in the `native` directory provide the kernels themselves.
+They use the Cutlass primitives available in the `cutlass` subdirectory.
+
+Multiple levels of C++ templating provide dispatch/code generation for the kernels.
+
+The main files in the `cutlass/gemm` directory are `block_task.h` and `block_task_back.h`.
+They express the final CUDA kernel that will be executed, using:
+- `block_loader_.*` to load A and B matrix tiles in an efficient way
+- `thread_accumulator.h` to store the result tiles `R`
+- `epilogue_function` to combine R with C: `C' = alpha * R + beta * C`
+- `grid_raster_.*` to list the output tiles that must be computed
+
+### block_sparse python module
+This library includes as little native code as possible, because native code is hard to write/debug/understand.
+
+The native functions perform the performance-critical tasks, and the Python code in `block_sparse.py` does
+all the preparatory work, which is executed only once, or infrequently.
+
+The main job of `block_sparse.py` is to build indices into the sparse matrices.
+Three sets of sparse indices are built:
+- row-wise index of non-zero entries (for dense x sparse)
+- column-wise index of non-zero entries (for dense x sparse with transposition)
+- linear list of 2D coordinates of non-zero entries (for dense x dense on sparse support)
+
+These structures are created using standard PyTorch primitives, and so are easy to debug, understand,
+or reimplement in other languages.
+
+### block_sparse_linear python module
+`block_sparse_linear` is a thin layer on top of `block_sparse`.
+It uses the linear algebra primitives of `block_sparse` to create a drop-in replacement for `torch.nn.Linear`,
+with the proper back-propagation primitives, implemented using a `torch.autograd.Function` subclass.
+
+## Testing
+Debugging CUDA kernels is hard. Fortunately, it's easy to compare the kernel results with
+a reference PyTorch implementation.
+The `tests` directory provides some code to test and measure the performance of the library.
+
+## TODO
+
+block_sparse
+- add input parameter sanity checks
+- add dispatch for:
+  - different matrix sizes -> different dispatch strategies (tile sizes in the k-dimension)
+  - different block sizes
+
+tests
+- refactor/clean up tests
+
+doc
+- schema of the sparse index structures
+
+cutlass
+- move to the 2.x version
+
+cleanup algorithms
+- add algorithms to measure weight importance and optimize the sparsity pattern
+
+
+
+
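The DevNotes added above state that `block_sparse.py` builds three index structures from the sparsity pattern using standard PyTorch primitives. The sketch below is a hedged illustration of that idea, not the library's actual data layout: `build_block_indices` is a hypothetical helper that derives a coordinate list plus row-wise and column-wise groupings from a boolean block mask.

```python
# Illustrative sketch: derive simple block indices from a boolean block mask
# with plain PyTorch ops. The exact layout used by pytorch_block_sparse may differ.
import torch

def build_block_indices(block_mask: torch.Tensor):
    # block_mask: (n_block_rows, n_block_cols) boolean tensor, True = non-zero block.
    rows, cols = block_mask.nonzero(as_tuple=True)

    # Linear list of 2D block coordinates (for "dense x dense on sparse support").
    coords = torch.stack([rows, cols], dim=1)

    # Row-wise view: non-zero blocks grouped by block row, plus per-row counts
    # (for the dense x sparse product).
    row_counts = torch.bincount(rows, minlength=block_mask.shape[0])
    row_index = coords[torch.argsort(rows)]

    # Column-wise view: the same blocks grouped by block column
    # (for the transposed dense x sparse product).
    col_counts = torch.bincount(cols, minlength=block_mask.shape[1])
    col_index = coords[torch.argsort(cols)]

    return coords, (row_index, row_counts), (col_index, col_counts)

mask = torch.rand(4, 8) < 0.5  # a ~50%-dense pattern on a 4x8 grid of blocks
coords, (row_index, row_counts), (col_index, col_counts) = build_block_indices(mask)
print(coords.shape, row_counts.tolist(), col_counts.tolist())
```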
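The DevNotes also describe `block_sparse_linear` as a thin wrapper that ties the forward product to hand-written backward products through a `torch.autograd.Function` subclass. The snippet below sketches that generic pattern with dense matmuls standing in for the sparse kernels; `SketchLinearFunction` is illustrative and not the library's class.

```python
# Generic sketch of the torch.autograd.Function pattern mentioned in the DevNotes.
# Dense matmuls stand in for the three products the sparse kernels compute.
import torch

class SketchLinearFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, weight, bias):
        ctx.save_for_backward(x, weight)
        # Forward product: dense input x (block-sparse) weight^T -> dense output.
        return x @ weight.t() + bias

    @staticmethod
    def backward(ctx, grad_out):
        x, weight = ctx.saved_tensors
        grad_x = grad_out @ weight        # dense x sparse (with transposition)
        grad_weight = grad_out.t() @ x    # dense x dense, only needed on the sparse support
        grad_bias = grad_out.sum(dim=0)
        return grad_x, grad_weight, grad_bias

x = torch.randn(16, 64, requires_grad=True)
w = torch.randn(32, 64, requires_grad=True)
b = torch.zeros(32, requires_grad=True)
out = SketchLinearFunction.apply(x, w, b)
out.sum().backward()
print(x.grad.shape, w.grad.shape, b.grad.shape)
```

The comments mark which sparse product each gradient corresponds to: the input gradient is again a dense x sparse product, while the weight gradient only has to be evaluated on the sparse support, which is exactly the third kernel exposed by `block_sparse_native.cpp`.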
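Finally, the Testing section above recommends checking kernel results against a reference PyTorch implementation. The self-contained sketch below applies the same idea: a block-by-block loop stands in for the CUDA kernel and is compared against a plain dense matmul with `torch.allclose`. None of this is the library's actual test code.

```python
# Sketch of the testing strategy described in the DevNotes: compute the same product
# with a block-by-block loop (standing in for the CUDA kernel) and with a plain
# dense matmul, then compare the results.
import torch

block = 32
n_block_rows, n_block_cols = 4, 8                    # a (128, 256) weight in 32x32 blocks
mask = torch.rand(n_block_rows, n_block_cols) < 0.5  # ~50% non-zero blocks
weight = torch.randn(n_block_rows * block, n_block_cols * block)
weight *= mask.float().repeat_interleave(block, 0).repeat_interleave(block, 1)

x = torch.randn(16, n_block_cols * block)

# Reference: ordinary dense matmul with the masked weight.
reference = x @ weight.t()

# "Kernel" stand-in: accumulate only the non-zero blocks, one block product at a time.
out = torch.zeros(16, n_block_rows * block)
for r, c in mask.nonzero().tolist():
    w_block = weight[r * block:(r + 1) * block, c * block:(c + 1) * block]
    out[:, r * block:(r + 1) * block] += x[:, c * block:(c + 1) * block] @ w_block.t()

assert torch.allclose(reference, out, atol=1e-4)
print("block-by-block result matches the dense reference")
```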

Diff for: pytorch_block_sparse/block_sparse.py

+1 -1

@@ -371,7 +371,7 @@ def matmul_with_output_sparse_support_(self, dense_a, dense_b, overwrite_data =
         assert(shape_c[0] == shape_a[1])
         assert(shape_c[1] == shape_b[1])
 
-        blocks_len = len(self.blocks) // 2
+        blocks_len = self.blocks.shape[0] // 2
         block_shape = self.block_shape
 
         assert ((shape_a[1] % block_shape[1]) == 0)
Diff for: setup.py

+3 -6

@@ -13,20 +13,17 @@ def readme():
       description='pytorch_block_sparse is a python package for fast block sparse matrices computation.',
       long_description=readme(),
       classifiers=[
-          'Development Status :: 3 - Alpha',
-          'License :: OSI Approved :: MIT License',
+          'Development Status :: 4 - Beta',
+          'License :: OSI Approved :: BSD 3-Clause "New" or "Revised" License',
           'Programming Language :: Python :: 3.0',
-          'Topic :: Text Processing',
       ],
       keywords='',
       url='',
       author='',
       author_email='',
       license='MIT',
       packages=['pytorch_block_sparse'],
-      install_requires=['click'],
-      test_suite='nose.collector',
-      tests_require=['nose', 'nose-cover3'],
+      install_requires=[],
       include_package_data=True,
       zip_safe=False,
       ext_modules=[
