A deep learning library implemented from scratch in C99, comprising a tensor manipulation library and a library for composing neural nets on top of it. Also contains an application of this library that approximately reproduces a paper (verified against an existing Pytorch repro of the same paper) and conducts some further experiments in the process.
This comprises the functions included in the `tensor.h` file (backed by the `tensor.c` implementation). A summary of the library (also documented in the header file) is as follows:
This tensor library supports a strided implementation of tensors; this lets us obtain many types of views into a tensor cheaply and perform operations like transposing, permuting axes, reshaping, broadcasting, etc. without allocating new memory (a sketch after the list below illustrates the idea). In general, this library also skews towards catching buggy code easily, at the cost of some user convenience. A couple of examples are:
- We don't support implicit broadcasting for operations: broadcasting has to be requested explicitly, through the dedicated function for it. This helps catch unintended bugs, albeit at the cost of some convenience to the user.
- Convolutions are not implemented with optimized methods like im2col; they are implemented using naive loops. Convolutions with different strides, dilations, and uneven padding are all supported. (Uneven padding is not strictly necessary - e.g. in Pytorch, you can add extra padding that still leads to the same output size through rounding - but requiring it ensures that the programmer is aware of, and explicitly specifies, the exact padding dimensions needed for a valid convolution, which can prevent some errors.)
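To make the strided representation concrete, here is a minimal sketch (with hypothetical struct and function names - the actual API in `tensor.h` differs) of how strides make transposes and explicit broadcasts cheap:

```c
#include <stddef.h>

/* Hypothetical strided tensor view: `data` is shared, not owned. */
typedef struct {
    float *data;
    size_t ndim;
    size_t shape[4];
    size_t strides[4]; /* elements (not bytes) to skip per step along each axis */
} View;

/* Element lookup: offset = sum over i of index[i] * strides[i]. */
float view_get(const View *v, const size_t *index) {
    size_t offset = 0;
    for (size_t i = 0; i < v->ndim; i++)
        offset += index[i] * v->strides[i];
    return v->data[offset];
}

/* Transpose a rank-2 view by swapping shape and strides: no data is copied. */
View transpose2d(View v) {
    View t = v;
    t.shape[0] = v.shape[1];     t.shape[1] = v.shape[0];
    t.strides[0] = v.strides[1]; t.strides[1] = v.strides[0];
    return t;
}

/* Explicit broadcast of a rank-1 view of length n to shape (rows, n):
 * the new axis gets stride 0, so every row aliases the same data. */
View broadcast_rows(View v, size_t rows) {
    View b;
    b.data = v.data;
    b.ndim = 2;
    b.shape[0] = rows;  b.shape[1] = v.shape[0];
    b.strides[0] = 0;   b.strides[1] = v.strides[0];
    return b;
}
```

A transpose just swaps shape/stride entries, and a broadcast view uses a stride of 0 along the new axis, so no element data is ever copied.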
The tests for this library are in `tests/tensor_test.c`. Compile tests from the `tests` directory (using gcc, this can be done with `gcc -Wall -Wconversion -pedantic -std=c99 -I.. ../layer.c ../tensor.c tensor_test.c -o tensor_test.out`). Check for memory leaks with `leaks -atExit -- ./tensor_test.out`.
This comprises the functions included in the `layer.h` file (backed by the `layer.c` implementation). A summary of the library (also documented in the header file) is as follows:
This library provides a set of layers that implement abstractions for common operations in neural networks, along with their associated forward and backward pass computations (a sketch of the layer interface follows the list below). Some notes:
- Memory Ownership: Each layer owns the memory it produces through the forward or backward pass, as well as the memory of its parameters - i.e. this memory is created when the layer's allocate function is called and freed when its deallocate function is called. Specifically, this means that outputs, gradients, and parameters are owned by the layer; inputs are not.
- Reshaping: Some layers might seem restrictive because the input must have a certain rank (e.g. tanh: rank 2). This can be worked around by reshaping as necessary before using these layers. A potential improvement to this library is a "ReshapeLayer" that abstracts away this operation, since we do it pretty often (e.g. see the demo nets).
- Autodiff: We don't use an autograd-like system because we don't need one at this point. An autodiff implementation would yield benefits such as easy second-derivative computations (making optimization using second-order methods easier), but the current uses of this framework don't require such capabilities. If we wanted to add this in the future, the "layer" abstraction here is fairly similar to a node within a computation graph. The major change from the current implementation would be for a layer's backward pass to itself be another layer (representing its gradient), and we would need to support enough operations (layers) that the set of supported operations is closed under gradients (e.g. to support the TanhLayer, we would also want to implement an ElementwisePolynomialLayer, since the derivative of tanh(x) is 1 - tanh^2(x)).
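As a concrete illustration of the ownership and forward/backward conventions above, here is a minimal sketch of a layer, using tanh as the example (hypothetical names and signatures - the real interface in `layer.h` differs):

```c
#include <math.h>
#include <stdlib.h>

/* Hypothetical rank-2 tanh layer. The layer owns `output` and
 * `input_gradient`; the `input` passed to forward is NOT owned. */
typedef struct {
    size_t rows, cols;
    float *output;         /* owned: written by the forward pass */
    float *input_gradient; /* owned: written by the backward pass */
} TanhLayer;

/* Allocate all layer-owned buffers. */
void tanh_layer_allocate(TanhLayer *l, size_t rows, size_t cols) {
    l->rows = rows;
    l->cols = cols;
    l->output = malloc(rows * cols * sizeof(float));
    l->input_gradient = malloc(rows * cols * sizeof(float));
}

/* Forward pass: output = tanh(input), elementwise. */
void tanh_layer_forward(TanhLayer *l, const float *input) {
    for (size_t i = 0; i < l->rows * l->cols; i++)
        l->output[i] = tanhf(input[i]);
}

/* Backward pass: d(tanh)/dx = 1 - tanh^2(x), so we can reuse the
 * forward output and chain with the upstream gradient. */
void tanh_layer_backward(TanhLayer *l, const float *output_gradient) {
    for (size_t i = 0; i < l->rows * l->cols; i++)
        l->input_gradient[i] =
            (1.0f - l->output[i] * l->output[i]) * output_gradient[i];
}

/* Free everything the layer owns. */
void tanh_layer_deallocate(TanhLayer *l) {
    free(l->output);
    free(l->input_gradient);
}
```

Note how the backward pass reuses the forward output, exploiting the fact that the derivative of tanh(x) is 1 - tanh^2(x), as mentioned above.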
Similar to the directions for testing the tensor library, compile tests from the `tests` directory with `gcc -Wall -Wconversion -pedantic -std=c99 -I.. ../layer.c ../tensor.c layer_test.c -o layer_test.out` and check for memory leaks with `leaks -atExit -- ./layer_test.out`.
For the convolution layer, it is necessary to try a variety of large and small inputs to cover different cases, and it can get tedious to verify larger test cases by hand. So instead, we use a small script that lets us compare outputs with equivalent commands in Pytorch. The workflow is as follows. First, compile the test (after making any changes if necessary) with `gcc -Wall -pedantic -std=c99 -I.. ../layer.c ../tensor.c conv_layer_pytorch_comparison_test.c -o conv_layer_pytorch_comparison_test.out`. Then run the test and store the output in a file: `./conv_layer_pytorch_comparison_test.out > conv_layer_pytorch_comparison_test_output.txt`. Finally, run the Python script to compare the output of the test with the equivalent Pytorch commands (e.g. `python3 conv_layer_pytorch_comparison_test.py`).
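For reference, the computation being verified is the direct (naive-loop) convolution described earlier. Here is a minimal sketch of that idea for a single input and output channel, with stride and dilation, assuming any padding has already been applied to the input (hypothetical function name - the library's actual signatures differ):

```c
#include <stddef.h>

/* Naive single-channel 2D convolution (cross-correlation, as in Pytorch).
 * Output size along each axis: (in - dilation*(k-1) - 1)/stride + 1. */
void conv2d_naive(const float *in, size_t in_h, size_t in_w,
                  const float *kernel, size_t k_h, size_t k_w,
                  size_t stride, size_t dilation, float *out) {
    size_t out_h = (in_h - dilation * (k_h - 1) - 1) / stride + 1;
    size_t out_w = (in_w - dilation * (k_w - 1) - 1) / stride + 1;
    for (size_t oy = 0; oy < out_h; oy++) {
        for (size_t ox = 0; ox < out_w; ox++) {
            float acc = 0.0f;
            /* Accumulate the kernel window anchored at (oy, ox). */
            for (size_t ky = 0; ky < k_h; ky++) {
                for (size_t kx = 0; kx < k_w; kx++) {
                    size_t iy = oy * stride + ky * dilation;
                    size_t ix = ox * stride + kx * dilation;
                    acc += in[iy * in_w + ix] * kernel[ky * k_w + kx];
                }
            }
            out[oy * out_w + ox] = acc;
        }
    }
}
```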
As a demo/test of the NN library, we can approximately reproduce the Lecun 1989 paper with this library. Note that this paper was already reproduced in Pytorch once (the original lecun1989-repro repo; see the references), and we use that as the basis for this work. It is an "approximate" repro for three reasons:
- We deviate in the architecture of the second convolutional layer used in the paper, which is somewhat non-standard. As explained in the repro above, the 12 output channels of that layer comprise three sets of 4, each of which is only connected to 8 of the 12 input channels to the layer. We do not support this; instead we do a standard convolution from 12 input channels to 8 output channels. Note that this reduces the total number of multiply-accumulates (MACs), the total number of activations (ACTs), and the number of parameters in the model, so we are strictly obtaining a model that uses less computation than the original, rather than more. We get fairly similar (but slightly worse, as expected given the reduction in parameters) performance due to this change.
- We also don't have biases in either of our convolutional layers, unlike the paper, which does include them. The reason for not supporting this is that it doesn't seem to make much difference to the final model quality. We can see this by experimenting with the parameters in a forked version of the original repro repo (link here, more details given below).
- One more small difference is the weight init: instead of initializing from a uniform distribution, we initialize from a normal distribution, N(0, 1/fan_in), which does not include the recommended gain factor from Kaiming normal init or the approximate gain used in the paper. It turns out that adding this factor doesn't make too much of a difference (see the experiment below), so I left this as is; if we wanted an exact repro, we could make a slight modification to the code to add this factor in and change from normal to uniform. (A sketch of this initialization appears after this list.)
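Concretely, drawing a weight from N(0, 1/fan_in) amounts to scaling a standard normal sample by sqrt(1/fan_in). A minimal sketch using the Box-Muller transform (hypothetical helpers - not necessarily how repro.c implements it):

```c
#include <math.h>
#include <stdlib.h>

/* Sample from N(0, 1) via the Box-Muller transform. */
float sample_standard_normal(void) {
    const float two_pi = 6.2831853f;
    float u1 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f); /* in (0, 1] */
    float u2 = ((float)rand() + 1.0f) / ((float)RAND_MAX + 1.0f);
    return sqrtf(-2.0f * logf(u1)) * cosf(two_pi * u2);
}

/* Initialize weights from N(0, 1/fan_in), i.e. std = sqrt(1/fan_in).
 * For a conv layer, fan_in = in_channels * kernel_h * kernel_w. Kaiming
 * normal init for tanh would additionally scale std by a gain factor. */
void init_weights(float *w, size_t n, size_t fan_in) {
    float std = sqrtf(1.0f / (float)fan_in);
    for (size_t i = 0; i < n; i++)
        w[i] = std * sample_standard_normal();
}
```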
- Generate the train and test files in plain-text by running `prepro.py` from the forked Pytorch repro repo.
- Compile `repro.c` with `gcc -Wall -Wconversion -pedantic -std=c99 -I.. ../tensor.c ../layer.c repro.c -o repro.out`.
- Run `./repro.out --repro_conv_net` to train the model and output model info + train/eval info.
- Run `./repro.out --repro_conv_net --epochs 2 --print_variables > conv_comparison_test_output_2.txt` to store a copy of the model state after 2 epochs in `conv_comparison_test_output_2.txt`.
- Then run `python3 repro.py --file-to-compare ../c-deep-net/repro-of-lecun1989-repro/conv_comparison_test_output_2.txt --compare-at-epoch 2` to run the Pytorch equivalent of our approximate repro from the forked Pytorch repro repo. This compares the model state in Pytorch after 2 epochs with our model state in C - we should see the line `Passed comparison test!` in our output, which verifies that the model weights in our C version and the Pytorch version are close to each other after two epochs when started from the same inputs. We will also see model train/eval stats that are very similar to what we saw in step 3, as we would expect. This lets us verify that our implementation is accurate.
- We also have demos of two other models. The first is a very simple classification model (no hidden layers - just one linear layer followed by a tanh layer), which we can run with `./repro.out --simple_tanh_regression`. The second is also a convolutional network but simpler - it contains one convolutional layer with tanh activation, followed by one linear layer with tanh activation; we can run it with `./repro.out --simple_conv_net`. The results for these two models are described in the sub-section on experiments below (a sketch of one training step for the simple model follows this list).
- Note that we can also change the learning rate and number of epochs using the command-line arguments `--learning_rate` and `--epochs` respectively. By default, the learning rate is set to 0.03 for all models and the number of epochs is set to 23 (same as in the original paper).
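To give a feel for what these demo models compute, here is a hypothetical sketch of one SGD training step for the simple tanh regression model (one linear layer followed by tanh) on a single example - illustrative only, since repro.c actually composes the dedicated layer structs from the NN library:

```c
#include <math.h>
#include <stdlib.h>

/* One SGD step for a hypothetical "linear -> tanh" model on one example:
 * y = tanh(W x), squared-error loss against target t. */
void train_step(float *W, const float *x, const float *t,
                size_t in_dim, size_t out_dim, float lr) {
    float *y = malloc(out_dim * sizeof(float));
    /* Forward: linear layer, then tanh activation. */
    for (size_t o = 0; o < out_dim; o++) {
        float h = 0.0f;
        for (size_t i = 0; i < in_dim; i++)
            h += W[o * in_dim + i] * x[i];
        y[o] = tanhf(h);
    }
    /* Backward: chain the loss gradient through tanh into the weights,
     * then apply the SGD update in place. */
    for (size_t o = 0; o < out_dim; o++) {
        float dL_dy = 2.0f * (y[o] - t[o]);         /* squared-error grad */
        float dL_dh = (1.0f - y[o] * y[o]) * dL_dy; /* tanh backward */
        for (size_t i = 0; i < in_dim; i++)
            W[o * in_dim + i] -= lr * dL_dh * x[i];
    }
    free(y);
}
```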
First, we compare the three models that we have, and also include the original Pytorch repro for reference (reporting results at the end of 23 epochs for all of them).
| Model | Training Set Loss | Test Set Loss | Training Set Classification Rate | Test Set Classification Rate | MACs | ACTs |
|---|---|---|---|---|---|---|
| Simple Tanh Regression | 0.074687 | 0.079232 | 0.900699 | 0.901844 | 2560 | 10 |
| Simple Conv Net | 0.020208 | 0.041772 | 0.974489 | 0.950174 | 26880 | 778 |
| Repro Conv Net | 0.010147 | 0.031500 | 0.985599 | 0.954659 | 61740 | 936 |
| Original Pytorch Repro | 0.004073383 | 0.02838382 | 0.9932 | 0.9591 | 63660 | 1000 |
In general, the results align with what we might expect: more complex models perform better, and our repro conv net is fairly close to the original Pytorch repro but, as expected, slightly worse, since it includes fewer connections (as mentioned above, our intermediate convolution is a direct 12-channel to 8-channel convolution, instead of the paper's pattern, which splits the 12 input channels into 3 overlapping sets of 8 that connect to 4 output channels each).
We also note two things we learned from simulating the model in Pytorch:
- We verified that the absence of biases in the convolutional layer doesn't make a noticeable difference in eval accuracies for the original architecture as well as our modified one.
- Not including the additional factor in Kaiming init also doesn't make a noticeable difference in eval accuracies for our modified architecture. Uniform vs normal initialization also does not yield a noticeable difference.
- Convolutions: The implementation of convolutions is naive, which makes training convolutional nets slower than necessary. We are also missing a small amount of additional NN library functionality that, if added, would let us support forward and backward passes for convolutions in full generality (currently backward passes are not supported for dilation != 1 - see comments in the code for more details). Additionally, we don't currently support biases in our convolutional layer: having tested (through equivalent Pytorch models) and found that adding biases doesn't really make a difference in the demo model, it wasn't worth including at this point.
- Tooling for graphing training curves, etc. Currently we can only evaluate training by printing values out, which is not ideal.
- Support uniform weight initialization, and add functions that perform initializations like Kaiming init automatically instead of requiring them to be computed manually.
- The main reference for the demo is of course the original lecun1989-repro repo and the paper it is reproducing.
- The main reference for my implementation of the tensor library was this really nice blogpost on Pytorch internals - this was particularly useful in understanding strided representations and how they give us a way to efficiently take many useful types of views on a tensor.
- These slides on convolutional nets (and a linked IPython notebook) as well as these lecture notes were very helpful to refresh the concept of convolutional nets.
- The way in which the tinygrad library implements matmul using other basic Pytorch operations was very nice (it isn't really related to our implementation, but was interesting to read through).