Today is May 16th, 2025. I'll TRY to build a multilayer perceptron from the ground up in Rust. This repository will only go public if I succeed :)
This is supposed to be a development diary, to keep me inspired. Nothing else. Lower any expectations.
I will use the MNIST dataset to train my perceptron to be able to recognize handwritten digits.
What can go wrong? The general MLP architecture is fairly simple, and this specific problem has already been tackled many times, so there will be lots of reference material available.
This entire project will be based mainly on 3Blue1Brown's video series on neural networks. So every time I refer to our "research source", I'll be talking about that series.
To my surprise, the MNIST dataset comes in an alien format. It appears to be provided as raw binary big-endian integers (?). I was expecting a folder with tons of 28x28 JPEGs, and this starts to bother me a little bit. How am I supposed to read this?
After being in shock for a while, I did some research, and it turns out the binary format of the MNIST dataset will make my job A LOT easier.
¯\(ツ)/¯
After reading a Stack Overflow post and asking gepetto (ChatGPT's pet name) some questions, I've kinda figured out how to read the images.
These are the first bytes. They're headers giving general information about the following dataset.
| Offset* | Meaning |
|---|---|
| 0–3 | Magic number (2051) |
| 4–7 | Number of images (60000) |
| 8–11 | Number of rows (28) |
| 12–15 | Number of columns (28) |
*the offset is in bytes
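Just to make the layout concrete, here's a minimal sketch of how those four header fields could be read in one go (the path argument and function name are placeholders, not what I actually ended up writing):

```rust
use std::fs::File;
use std::io::{BufReader, Read};

// Hypothetical sketch: read the 16-byte header of the images file up front.
fn read_image_header(path: &str) -> std::io::Result<(u32, u32, u32, u32)> {
    let mut reader = BufReader::new(File::open(path)?);
    let mut header = [0u8; 16];
    reader.read_exact(&mut header)?;

    // Each field is a big-endian u32 at byte offsets 0, 4, 8 and 12.
    let field = |i: usize| u32::from_be_bytes([header[i], header[i + 1], header[i + 2], header[i + 3]]);
    Ok((field(0), field(4), field(8), field(12))) // (magic, image count, rows, cols)
}
```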
After that, we have all the 28x28 bitmap images, where each pixel is a single-byte integer (0-255) representing a grayscale value. That's why I said the binary format of the dataset, with its bitmap grayscale nature, would come in handy: JPEG images would need to be pre-processed to turn each pixel into an integer, but now we can skip that part. It's as if the dataset was specially designed for this algorithm.
Since Rust already provides the functionality to read binary data, I just had to write a simple wrapper function that takes a reader and reads 32-bit big-endian integers from it:
```rust
use std::io::{self, Read};

fn read_u32_big_endian(reader: &mut impl Read) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_be_bytes(buf))
}
```

After opening the file, moving the buffer cursor past the headers, and reading the images, it's time to render them. For that, we take each pixel's integer value and print one of five ASCII characters representing different grayscale levels:
```rust
fn render_image(image: &[u8], mode: char) {
    const WIDTH: usize = 28;
    const HEIGHT: usize = 28;
    const GREYSCALE: [char; 5] = ['█', '▓', '▒', '░', ' ']; // dark to light
    const NEGATIVE_GREYSCALE: [char; 5] = [' ', '░', '▒', '▓', '█']; // light to dark

    for y in 0..HEIGHT {
        for x in 0..WIDTH {
            let pixel = image[y * WIDTH + x];
            // Map 0..=255 down to an index into the five-character palette.
            let idx = (pixel as usize * (GREYSCALE.len() - 1)) / 255;
            let block = if mode == 'n' {
                NEGATIVE_GREYSCALE[NEGATIVE_GREYSCALE.len() - 1 - idx]
            } else {
                GREYSCALE[GREYSCALE.len() - 1 - idx]
            };
            print!("{}", block);
        }
        println!();
    }
}
```

Now we are able to see them in the terminal (the green background is my terminal's):
And that is clearly a two, ladies and gentlemen.
As the name suggests, the Multilayer Perceptron has multiple layers (🤯). Ours will have 4 layers, which should be enough for the task (we can experiment with variants later on). The first layer will be composed of 784 neurons (one per pixel of an image), followed by two 16-neuron layers, and a final 10-neuron layer corresponding to the possible labels (0 to 9). We will not focus on model scalability, although it would be nice to have a flexible MLP with a customizable number and size of layers and so on. Our first goal is simply to build the perceptron and correctly classify the dataset images (with good accuracy, I mean), so here goes our base struct:
```rust
pub struct MLP {
    input_layer: [f64; 784],
    hidden_layer_0: [f64; 16],
    hidden_layer_1: [f64; 16],
    output_layer: [f64; 10],
    weights_matrix: [[f64; 784]; 16],
}

impl MLP {
    pub fn new() -> Self {
        MLP {
            input_layer: [0.0; 784],
            hidden_layer_0: [0.0; 16],
            hidden_layer_1: [0.0; 16],
            output_layer: [0.0; 10],
            weights_matrix: [[0.0; 784]; 16],
        }
    }
}
```

I chose f64 numbers to get nice precision, though I have a feeling this is overkill. I'm also concerned about potential performance issues with those 64-bit numbers all over the neurons, weights and biases, but we'll see how it goes...
According to my source, the connections between neurons are, internally, simply numbers that define how strongly related the neurons are. Since each neuron in a layer connects to every neuron in the next, we are going to have a lot of those f64 values flying around: (784 x 16) + (16 x 16) + (16 x 10) = 12,960 weights in total across the three layer pairs (and that's before counting the neurons' own biases).
As recommended, we are going to represent the weights between each pair of adjacent layers as a single matrix, e.g. `weights_matrix: [[f64; 784]; 16]` for the first pair.
The first thing we need to do is randomize the weights matrix. Currently every weight is set to zero, which results in dead activations across all connections. A dead brain, so to speak, since all the operations would produce zeros. Unfortunately, Rust's standard library does not provide a random number generator, so we're going to use an external crate called rand.
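A minimal sketch of what that randomization could look like with rand 0.8 (the exact range and loop are just an illustration of the idea, not necessarily the final code):

```rust
use rand::Rng;

impl MLP {
    pub fn randomize_weights(&mut self) {
        let mut rng = rand::thread_rng();
        // Fill every connection with a random value in [-1, 1].
        for row in self.weights_matrix.iter_mut() {
            for weight in row.iter_mut() {
                *weight = rng.gen_range(-1.0..=1.0);
            }
        }
    }
}
```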
We also create a print function, so here is one of our weights matrices:
```
Weights Between Layers 1 & 2:
0.4 0.6 -0.9 -0.9 0.5 -0.7 0.1 -0.1 -0.8 0.7 -0.1 0.0 0.7 -0.0 0.1 -0.4
-0.3 0.4 -0.5 0.9 0.1 0.3 -0.6 -0.9 0.8 0.8 0.7 0.7 -0.4 -1.0 0.2 0.6
0.5 -0.8 0.0 0.6 0.6 -0.9 -0.7 -0.9 0.4 0.2 0.7 -1.0 -0.4 -0.3 0.8 0.5
0.2 0.8 -0.7 -0.0 -0.5 0.7 -0.2 0.3 0.6 -0.6 -0.9 -0.3 -0.2 -0.1 -0.7 -0.1
-0.6 0.5 -0.7 -0.6 0.6 0.6 0.4 0.6 -0.1 0.5 -0.2 0.8 0.6 -0.1 -0.4 0.1
0.3 0.9 -0.2 0.0 -0.0 0.5 0.7 -0.5 0.2 -0.9 0.9 -0.0 -1.0 0.7 0.5 -0.3
-0.0 0.4 -0.8 0.4 -0.1 0.8 -0.4 0.1 0.6 0.3 -0.0 -0.1 -0.5 -0.0 0.7 0.2
-0.7 0.6 1.0 0.9 0.8 0.0 -0.6 0.1 -0.9 -0.8 0.6 -0.8 -0.4 -0.7 -0.7 -0.4
0.7 0.3 0.6 0.7 -0.8 -0.7 -0.8 0.1 0.8 -0.8 -0.9 -0.6 0.4 -0.6 -0.8 0.4
0.3 -0.8 -0.1 -0.6 1.0 -0.8 -0.2 -0.6 0.3 -0.3 -0.5 -0.9 -0.6 0.5 -0.5 -0.2
0.6 -0.6 0.1 0.8 0.1 -0.6 0.2 -0.5 0.8 -0.8 -0.9 0.9 -0.4 0.8 -0.0 0.2
0.8 0.9 -0.1 0.8 0.7 0.2 0.7 0.7 -0.3 -0.7 -0.7 0.1 -0.6 0.1 0.3 0.5
-0.6 0.1 -0.9 -0.3 0.1 1.0 0.3 -0.8 0.1 -0.0 -0.5 0.5 -1.0 0.2 0.9 0.3
-0.6 0.5 0.3 -0.3 0.8 0.4 0.2 0.9 0.3 0.5 0.2 -0.4 0.4 0.7 -0.6 -0.8
-0.7 1.0 -0.8 0.7 0.1 0.2 -0.4 0.9 0.8 -0.4 -0.6 -0.3 -0.8 0.1 0.3 -0.4
-0.3 0.3 0.5 -0.1 -0.8 0.8 -0.3 -0.1 -1.0 -0.6 0.8 0.0 -1.0 -0.1 0.3 1.0
```

As of now, our baby perceptron is, at a structural level, entirely built! We already have all the neurons and connections set up. The next step is to create a couple of things to make it functional, e.g. the activation function, the feed-forward pass, etc. Before that, we need to go back to our io.rs module.
The entire purpose of our first perceptron layer is to represent the image currently being analyzed (or looked at), hence the layer size being tightly linked to the image's. Here we have a mismatch: each neuron receives data from a pixel, which is encoded as an 8-bit number ranging from 0 to 255, but a neuron will only hold values between 0 and 1. We need a way of mapping those pixel values to something that can be represented inside our neurons.
Since values between 0 and 1 can be understood as a percentage, and a pixel is always somewhere between entirely lit (at its brightest, 100%) and entirely dimmed (at its darkest, 0%), we can obtain how bright each pixel is by simply dividing its value by 255. Now we have a normalizing function:
```rust
pub fn normalize_image(image: [u8; 784]) -> [f64; 784] {
    let mut normalized_image: [f64; 784] = [0.0; 784];
    for (i, &pixel_byte) in image.iter().enumerate() {
        normalized_image[i] = pixel_byte as f64 / 255.0;
    }
    normalized_image
}
```

I'm not sure this perceptron will be that hard to process, since it is very simple. But even this simple perceptron will require millions, maybe billions, of matrix operations. I could implement those operations myself, but I'll use a library instead (nalgebra), which implements performance optimizations for exactly this kind of work, potentially reducing training time. After a couple of minutes I had adapted a few files to use the new library instead of plain arrays to store matrices.
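To give an idea of what the library buys us, here's a tiny self-contained sketch of a layer-style matrix-vector product with nalgebra's statically sized types (names and sizes are illustrative):

```rust
use nalgebra::{SMatrix, SVector};

fn forward_example() {
    // A 16 x 784 weights matrix applied to a 784 x 1 input yields a 16 x 1 activation vector.
    let weights: SMatrix<f64, 16, 784> = SMatrix::zeros();
    let input: SVector<f64, 784> = SVector::zeros();
    let hidden: SVector<f64, 16> = weights * input;
    println!("first hidden activation: {}", hidden[0]);
}
```

The nice side effect is that dimension mismatches become compile-time errors instead of runtime surprises.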
With all of our neurons and connections set up, functions to read, normalize and feed the dataset to the input layer, and all the matrix operations in place, our perceptron is finally able to see and (try to) classify images, despite being really dumb right now. Also, a few normalizing operations are still missing, so we can't quite extract any meaning from its outputs (yet).
In addition to that, we are still missing a lot of training functions and related features to support that process. I'll be handling this very soon.
As discussed before, while creating the function to initialize the weight matrices with random numbers: if our perceptron starts with all weights at zero, the output will always be a vector of zeros:
```
// Output Neurons
[[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]
```

After randomizing our weights:

```
[[2.832064153297856, -7.189384950817572, 29.26444281394776, 38.91743873415687, -15.684375189411712, -11.421555730798271, -3.4437869072569565, -20.747108787807665, 5.599649623093166, -9.700232613382788]]
```

We had to create a softmax function to turn the output values into a probability distribution and a ReLU to normalize neuron activations, which finishes our model structure-wise. It worked pretty well. We're now at a very important step, because the only thing separating us from absolute success is figuring out how to train it. Right now it is only randomly guessing the output.
Have a look at our prediction function:
```rust
fn apply_relu<const N: usize>(layer: &mut SVector<f64, N>) {
    for val in layer.iter_mut() {
        *val = val.max(0.0);
    }
}

fn apply_softmax<const N: usize>(layer: &SVector<f64, N>) -> SVector<f64, N> {
    let max_val = layer.max(); // subtracted below for numerical stability
    let exps = layer.map(|x| (x - max_val).exp());
    let sum: f64 = exps.iter().sum();
    exps / sum
}

// It's fine that the image is taken by value: after prediction it can be dropped
pub fn predict(&mut self, image: [f64; INPUT_SIZE]) -> SVector<f64, OUTPUT_SIZE> {
    self.input_layer = SVector::<f64, INPUT_SIZE>::from_row_slice(&image);

    self.hidden_layer_0 = self.weights_matrix_01 * self.input_layer;
    // ReLU is applied to each hidden layer right after the matrix multiplication
    MLP::apply_relu(&mut self.hidden_layer_0);

    self.hidden_layer_1 = self.weights_matrix_12 * self.hidden_layer_0;
    MLP::apply_relu(&mut self.hidden_layer_1);

    self.output_layer = self.weights_matrix_23 * self.hidden_layer_1;
    MLP::apply_softmax(&self.output_layer)
}
```

So after running our current main.rs:
```rust
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut mlp = MLP::new();

    mlp.show_weights();
    mlp.randomize_weights();
    mlp.show_weights();

    let image: [u8; 784] = io::read_image()?;
    let normalized_image: [f64; 784] = parsers::normalize_image(image);

    let prediction = mlp.predict(normalized_image);
    println!("{:?}", prediction);
    io::render_output(&prediction);

    Ok(())
}
```

We finally get our first guess, in the form of a probability distribution: it guessed 5. The correct answer was 8 :(
```
[[2.539143790771743e-15, 2.412713529433148e-9, 2.422030173962812e-7, 0.0018935008110995345, 3.404185029212877e-8, 0.9161096634723647, 6.989882245023235e-5, 0.06949902885174457, 3.39904005601683e-11, 0.012427629350766809]]
0 |
1 |
2 |
3 |
4 |
5 | ████████████████████████████████████████████████████████████████████████████████████████████
6 |
7 | ███████
8 |
9 | █
```

I was thinking that biases were useless. Or at least, I wasn't expecting how hugely important they are. Fortunately, I got curious and started studying the theory behind them, and it's absolute cinema, really.
I recommend reading this Medium article as a starting point. Maybe it'll spark your curiosity like it did mine.
Since the goal here was never to go deep into the theory (this is a development diary), I won't discuss the ins and outs of biases' effects on MLPs, especially because I can barely glimpse them myself. In practical terms, we'll need to add one more vector per layer, to be summed with the weights * input activation value.
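A hedged sketch of what one layer's computation becomes once the bias is in (the helper name, concrete sizes and the ReLU-via-map are my own illustration, not the exact code):

```rust
use nalgebra::{SMatrix, SVector};

// Hypothetical helper showing the new shape of a single layer step:
// activation = ReLU(weights * input + bias)
fn first_hidden_layer(
    weights: &SMatrix<f64, 16, 784>,
    bias: &SVector<f64, 16>,
    input: &SVector<f64, 784>,
) -> SVector<f64, 16> {
    (weights * input + bias).map(|v| v.max(0.0))
}
```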
Our struct has changed a lot since last time, so it's good to have a look again:
```rust
pub struct MLP {
    input_layer: SVector<f64, INPUT_SIZE>,                     // 784 x 1
    hidden_layer_0: SVector<f64, HIDDEN_SIZE>,                 // 16 x 1
    hidden_layer_1: SVector<f64, HIDDEN_SIZE>,                 // 16 x 1
    output_layer: SVector<f64, OUTPUT_SIZE>,                   // 10 x 1
    weights_matrix_01: SMatrix<f64, HIDDEN_SIZE, INPUT_SIZE>,  // 16 x 784
    weights_matrix_12: SMatrix<f64, HIDDEN_SIZE, HIDDEN_SIZE>, // 16 x 16
    weights_matrix_23: SMatrix<f64, OUTPUT_SIZE, HIDDEN_SIZE>, // 10 x 16
    hidden_bias_0: SVector<f64, HIDDEN_SIZE>,
    hidden_bias_1: SVector<f64, HIDDEN_SIZE>,
    output_bias: SVector<f64, OUTPUT_SIZE>,
}
```

We also need to better structure the function responsible for reading the images. Right now, it just iterates over the entire dataset, printing the images to stdout and returning the last one read. This function will have to provide one image at a time. We are going to create a struct implementing the Iterator trait, so we can loop over images and labels at once:
```rust
pub struct MNISTReader {
    image_reader: BufReader<File>,
    label_reader: BufReader<File>,
    index: usize,
    total: usize,
}

impl MNISTReader {
    pub fn new(image_path: &Path, label_path: &Path) -> io::Result<Self> {
        let mut image_file = BufReader::new(File::open(image_path)?);
        let mut label_file = BufReader::new(File::open(label_path)?);

        // Skipping the headers
        let mut image_header = [0u8; 16];
        let mut label_header = [0u8; 8];
        image_file.read_exact(&mut image_header)?;
        label_file.read_exact(&mut label_header)?;

        let total = u32::from_be_bytes([image_header[4], image_header[5], image_header[6], image_header[7]]) as usize;

        Ok(Self {
            image_reader: image_file,
            label_reader: label_file,
            index: 0,
            total,
        })
    }
}

impl Iterator for MNISTReader {
    type Item = io::Result<([u8; 784], u8)>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.index >= self.total {
            return None;
        }

        let mut image = [0u8; 784];
        let mut label = [0u8; 1];

        if let Err(e) = self.image_reader.read_exact(&mut image) {
            return Some(Err(e));
        }
        if let Err(e) = self.label_reader.read_exact(&mut label) {
            return Some(Err(e));
        }

        self.index += 1;
        Some(Ok((image, label[0])))
    }
}
```

Confession time: the training function is entirely LLM-made. I understood the training strategy in theory, as explained in our source video, but I wasn't very sure about how to implement it correctly, and to save time I asked my friend gepetto.
Surprisingly, everything went awful. After iterating over the entire 60,000 images, our perceptron could barely make a single correct prediction. Then, doing some research, I came across the fact that a single pass through the data is usually not enough (that's the concept of epochs), so I tried running over the dataset multiple times and:
```
Epoch 01 Accuracy: 14.01%
Epoch 02 Accuracy: 9.24%
Epoch 03 Accuracy: 9.24%
Epoch 04 Accuracy: 9.24%
Epoch 05 Accuracy: 9.24%
Epoch 06 Accuracy: 9.24%
Epoch 07 Accuracy: 9.24%
Epoch 08 Accuracy: 9.24%
Epoch 09 Accuracy: 9.24%
Epoch 10 Accuracy: 9.24%
```

At least I got to see that the predictions were entirely random. The MLP has to choose one of ten labels, so random guessing sits at around 10%, which is what we see over and over throughout the epochs. In our source video, the narrator mentions that training functions ideally update the weights only after iterating over the entire dataset. This is problematic performance-wise, so it is recommended to update the weights at smaller intervals. Following this guideline, I had the brilliant idea of executing the update function once per image. In (yet) another round of research, I found out that this strategy introduces a kind of noisy behavior, where the direction of improvement can't be found as easily by the perceptron.
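For reference, this is roughly the shape the epoch/mini-batch loop takes. The gradient math itself is hidden behind hypothetical helpers (backprop_accumulate and apply_gradients are made-up names standing in for the gepetto-generated part):

```rust
use nalgebra::SVector;

const EPOCHS: usize = 10;
const BATCH_SIZE: usize = 32;
const LEARNING_RATE: f64 = 0.01;

// Index of the largest output activation, i.e. the predicted digit.
fn argmax(v: &SVector<f64, 10>) -> usize {
    let mut best = 0;
    for i in 1..v.len() {
        if v[i] > v[best] {
            best = i;
        }
    }
    best
}

// Hypothetical sketch: `dataset` holds already-normalized images and their labels.
fn train(mlp: &mut MLP, dataset: &[([f64; 784], u8)]) {
    for epoch in 0..EPOCHS {
        let mut correct = 0usize;
        for batch in dataset.chunks(BATCH_SIZE) {
            for (image, label) in batch {
                let prediction = mlp.predict(*image);
                if argmax(&prediction) == *label as usize {
                    correct += 1;
                }
                mlp.backprop_accumulate(*image, *label); // hypothetical helper
            }
            // The weights only move here, once per batch.
            mlp.apply_gradients(LEARNING_RATE / batch.len() as f64); // hypothetical helper
        }
        println!(
            "Epoch {:02} Accuracy: {:.2}%",
            epoch + 1,
            100.0 * correct as f64 / dataset.len() as f64
        );
    }
}
```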
Another issue was my randomize_weights function. The initial idea was to set all the weights to random values between -1 and 1, which was too wide a range. After implementing the concept of batches (a group of images after which the weights are updated) and tightening the random weights interval, we got a massive improvement in our perceptron:
```
Epoch 01 Accuracy: 74.21%
Epoch 02 Accuracy: 89.10%
Epoch 03 Accuracy: 90.84%
Epoch 04 Accuracy: 91.93%
Epoch 05 Accuracy: 92.60%
Epoch 06 Accuracy: 93.09%
Epoch 07 Accuracy: 93.42%
Epoch 08 Accuracy: 93.66%
Epoch 09 Accuracy: 93.89%
Epoch 10 Accuracy: 94.05%
```

The way we chose to build our perceptron, we cannot alter the number of layers, but we managed to keep it flexible with regard to the number of neurons in each layer. We can simply change the constant HIDDEN_SIZE from 16 to 32 in our constants.rs:
```rust
pub const INPUT_SIZE: usize = 784;
pub const HIDDEN_SIZE: usize = 32;
pub const OUTPUT_SIZE: usize = 10;
pub const LABEL_SIZE: usize = 1;
```

We can add another constant in the future so we can have any given number of neurons for each of the two hidden layers, but let's check the impact on our perceptron's training statistics:
```
Epoch 01 Accuracy: 82.22%
Epoch 02 Accuracy: 91.50%
Epoch 03 Accuracy: 92.84%
Epoch 04 Accuracy: 93.64%
Epoch 05 Accuracy: 94.23%
Epoch 06 Accuracy: 94.71%
Epoch 07 Accuracy: 95.05%
Epoch 08 Accuracy: 95.37%
Epoch 09 Accuracy: 95.63%
Epoch 10 Accuracy: 95.90%
```

It took a lot more time to train, but we can see the perceptron learns more assertively when the number of neurons in the hidden layers is increased. This makes a lot of sense, since the hidden layers' neurons are responsible for capturing the image's features: the larger the number of neurons, the more nuances they can be expected to represent, and therefore the better the results.
We still have a few problems. First, we need to add functions to store the weights in some kind of .txt file and to load them back. We'll also need to come up with a way of creating new images equivalent to the MNIST dataset's, so I can show off to some of my friends by asking them to write a digit and feeding it to our perceptron.
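For the weights file, the plan is something as dumb as one number per line. A rough sketch of what such save/load helpers could look like (names and format are tentative):

```rust
use nalgebra::SMatrix;
use std::io::{self, Write};

// Hypothetical sketch: one weight per line, matrices always written and read in the same order.
fn save_matrix<const R: usize, const C: usize>(
    matrix: &SMatrix<f64, R, C>,
    writer: &mut impl Write,
) -> io::Result<()> {
    for value in matrix.iter() {
        writeln!(writer, "{}", value)?;
    }
    Ok(())
}

fn load_matrix<const R: usize, const C: usize>(
    matrix: &mut SMatrix<f64, R, C>,
    lines: &mut impl Iterator<Item = io::Result<String>>,
) -> io::Result<()> {
    for value in matrix.iter_mut() {
        let line = lines.next().expect("weights file ended too early")?;
        *value = line.trim().parse().expect("not a valid f64 weight");
    }
    Ok(())
}
```

A save_weights() would then just open a BufWriter and call this for each matrix (and bias vector), and load_weights() would replay the exact same order from a BufReader's lines().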
I've built a dynamic version of our MLP. It makes use of nalgebra's DVector and DMatrix types, which use heap-allocated storage instead of stack-allocated arrays. The result was expected but still fun to see: the static MLP, with an identical number of layers and neurons to the dynamic one, has around 15% better performance. Just by using the stack instead of the heap!!!
Of course there's a tradeoff: versatility vs. performance.
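For context, the dynamic version stores each layer's weights and biases in heap-allocated nalgebra types, roughly like this (a simplified sketch, not the real DMLP code):

```rust
use nalgebra::{DMatrix, DVector};

// Simplified sketch of the dynamic MLP's storage: one weights matrix and one bias vector per layer pair.
struct DynamicMlp {
    weights: Vec<DMatrix<f64>>,
    biases: Vec<DVector<f64>>,
}

impl DynamicMlp {
    fn new(layer_sizes: &[usize]) -> Self {
        let weights: Vec<DMatrix<f64>> = layer_sizes
            .windows(2)
            .map(|pair| DMatrix::zeros(pair[1], pair[0]))
            .collect();
        let biases: Vec<DVector<f64>> = layer_sizes[1..]
            .iter()
            .map(|&n| DVector::zeros(n))
            .collect();
        Self { weights, biases }
    }
}
```

Something like DynamicMlp::new(&[784, 196, 49, 16, 10]) is all it takes to get a five-layer variant, but all of that data lives on the heap, which is where the gap against the stack-allocated SMatrix version comes from.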
So, I've built a function to convert .png files into the binary format our MLPs can understand, plus load_weights() and save_weights() functions, so we don't have to train the perceptron every time we run the program. Unfortunately, even with our 95% accuracy on the training set, we are still not able to classify user input reliably. Therefore we will try to tweak our perceptron to achieve greater performance.
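The conversion itself is the kind of thing the image crate can handle in a few lines. A hedged sketch (crate choice, function name and resize filter are assumptions, not necessarily what my converter does):

```rust
use image::imageops::FilterType;

// Hypothetical sketch: load a PNG, force it to 28x28 grayscale, return MNIST-style bytes.
fn png_to_mnist_bytes(path: &str) -> Result<[u8; 784], image::ImageError> {
    let gray = image::open(path)?
        .resize_exact(28, 28, FilterType::Lanczos3)
        .to_luma8();

    let mut pixels = [0u8; 784];
    pixels.copy_from_slice(gray.as_raw());
    Ok(pixels)
}
```

Depending on how the digit is drawn, the pixel values may also need to be inverted, since MNIST stores bright digits on a dark background.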
Note: we're using the SMLP (the static version), because it is faster and more than enough to reach our goal :)
To do so, we are going to increase the number of neurons in the hidden layers. Now we'll have [784, 64, 32, 10] instead of the classic [784, 16, 16, 10].
Also, have a look at our new training constants:
| Parameter | Old Configuration | New Configuration |
|---|---|---|
| `BATCH_SIZE` | 32 | 64 |
| `EPOCHS` | 10 | 20 |
| `LEARNING_RATE` | 0.01 | 0.01 |
This did not have the expected outcome. In our second try, we shuffled the training data to prevent biases and changed the number of neurons in each layer. The shuffling didn't seem to have any impact on performance, but improving performance was never the point of it; we were focused on reducing the probability of training bias.
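The shuffling itself is a one-liner once the (image, label) pairs are in memory; a sketch assuming rand 0.8's SliceRandom and that we collect the dataset into a Vec first:

```rust
use rand::seq::SliceRandom;

// Hypothetical sketch: shuffle the in-memory (image, label) pairs before each epoch.
fn shuffle_dataset(dataset: &mut [([u8; 784], u8)]) {
    let mut rng = rand::thread_rng();
    dataset.shuffle(&mut rng);
}
```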
Changing the structure of our MLP actually gave us a little improvement. We also tried the dynamic version of our network (with a dynamic number of layers, remember? the DMLP) with one more layer, so we could shrink the layers by roughly a factor of four at each step, instead of the factor of 7 chosen for our static MLP (SMLP):
Training Accuracy Comparison
Static MLP Layers: [784, 112, 16, 10]
Dynamic MLP Layers: [784, 196, 49, 16, 10]
| Epoch | Dynamic MLP | Static MLP |
|---|---|---|
| 01 | 81.67% | 78.71% |
| 02 | 92.77% | 91.22% |
| 03 | 94.38% | 93.04% |
| 04 | 95.41% | 94.13% |
| 05 | 96.16% | 94.84% |
| 06 | 96.66% | 95.44% |
| 07 | 97.08% | 95.82% |
| 08 | 97.50% | 96.27% |
| 09 | 97.71% | 96.59% |
| 10 | 97.92% | 96.83% |
* we chose to reduce the number of epochs back to ten, since the performance gain from running twenty wasn't significant
