SimplEsT-ViT (E-SPA + TAT) - a vanilla transformer (no normalization layers, no skip connections) trained with E-SPA (gamma = 0.005) + TAT (eta = 0.9).
- PyTorch 2.0
- wandb (optional)
The dataset contains 110 000 images from 200 classes, downsized to 64x64 color images. Each class has 500 training images (100 000 total) and 50 validation images (10 000 total).
- Run downloand_tiny_imagenet.py
python3 downloand_tiny_imagenet.py -r data
- Go to https://www.image-net.org/download.php
- Request to download ImageNet
- Create the data folder if it does not already exist
mkdir -p data
- Move to the data folder
cd data
- Download the images from the ILSVRC2012 page
- Training images (Task 1 & 2) 138 GB
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
- Validation images (all tasks) 6.3 GB
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
- Run the extract_ILSVRC.sh script from the PyTorch examples repository (extraction temporarily requires roughly double the disk space)
wget -qO- https://raw.githubusercontent.com/pytorch/examples/main/imagenet/extract_ILSVRC.sh | bash
| | | Cifar10 (/4) | Cifar100 (/4) | TinyImageNet200 (/8) |
|---|---|---|---|---|
| SimpleViT-S | Adam | 0.8334 | 0.5880 | 0.4529 |
| SimplEsT-ViT-S | Adam | 0.7936 | 0.4687 | 0.3847 |
| SimplEsT-ViT-S | Shampoo@25 | 0.8243 | 0.5506 | 0.4208 |
- TAT setup: label smoothing + dropout + weight decay.
| | | Cifar10 (/4) | Cifar100 (/4) | TinyImageNet200 (/8) |
|---|---|---|---|---|
| SimpleViT-S | Adam | 0.8733 | 0.6439 | 0.5152 |
| SimplEsT-ViT-S | Adam | 0.7894 | 0.4776 | 0.3966 |
| SimplEsT-ViT-S | Shampoo@25 | 0.8496 | 0.5899 | 0.4490 |
- SimpleViT setup: randaugment + mixup + weight decay.
Training three times longer with Adam matches the SimpleViT-S training loss. The E-SPA paper reports results for training five times longer, but those come from large-scale experiments. However, achieving high validation accuracy is a different story ...
As mentioned in the TAT and DKS papers, "second-order methods" can significantly boost performance. However, it has not been validated for the Transformer architecture (E-SPA).
- Shampoo@25 was ~1.25x slower than Adam.
- The ViT (-S) architecture naming convention can be found here.
- /4 means patch size 4x4, /8 means patch size 8x8.
One block of SimplEsT-ViT consists of one attention layer (without projection) and 2 linear layers in the MLP block. Thus, the "effective depth" is 64 * 3 + 2 = 194 (2 = patch embedding + classification head). It is impressive to train such a deep vanilla transformer only with proper initialization.
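The block structure described above can be sketched in PyTorch. This is a hypothetical minimal version for illustration only: the class and argument names are assumptions, the E-SPA shaping of the attention matrix and the TAT-tailored activation are omitted, and initialization details differ from the actual repo code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplestBlock(nn.Module):
    """One vanilla transformer block: attention without an output
    projection, followed by a 2-layer MLP -- no LayerNorm, no skips.
    Sketch only; E-SPA attention shaping and TAT activation omitted."""

    def __init__(self, dim: int, heads: int, mlp_dim: int):
        super().__init__()
        self.heads = heads
        # the single attention layer (QKV projection, no output projection)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        # the two linear layers of the MLP block
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.ReLU(),  # the real model would use a TAT-tailored rectifier
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        x = F.scaled_dot_product_attention(q, k, v)
        x = x.transpose(1, 2).reshape(b, n, d)  # no output projection
        return self.mlp(x)  # no residual connection
```

With 64 such blocks, each contributing 3 weight layers, plus the patch embedding and the classification head, this gives the effective depth of 64 * 3 + 2 = 194 mentioned above.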
- Epochs: 90
- Warmup: 75 steps
- Batch size: 2048
- Gradient clipping: 1
- Learning rate scheduler: cosine with linear warmup
- Dropout: {0, 0.2}
- Weight decay: {0, 0.00005}
- Optimizer:
  - Adam, learning rate:
    - SimplEsT-ViT - {0.0005, 0.0003}
    - SimpleViT - 0.001
  - Shampoo, learning rate:
    - SimplEsT-ViT - {0.0007, 0.0005}
- TAT setup:
  - Label smoothing: 0.1
  - Dropout: {0, 0.2}
- SimpleViT setup:
  - RandAugment: level 10
  - Mixup: probability 0.2
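The cosine-with-linear-warmup schedule listed above can be sketched as a plain function of the step index. This is an assumed formulation (the repo's exact implementation may differ, e.g. in how the final learning rate floor is handled):

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup to base_lr, then cosine decay to 0 (sketch)."""
    if step < warmup_steps:
        # linear warmup: ramp from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this shape of schedule is typically wired in via `torch.optim.lr_scheduler.LambdaLR`, passing a lambda that returns `lr_at_step(step, ...) / base_lr`.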
It would be beneficial to perform a wider range of experiments to determine the optimal learning rate and weight decay values, particularly for weight decay. This is especially relevant because LayerNorm makes a network scale-invariant, so weight decay behaves differently in SimplEsT-ViT (which has no normalization) than in SimpleViT. We hypothesize that the optimal weight decay for SimplEsT-ViT should be considerably lower than for SimpleViT.
We use the same Shampoo implementation (except for one small change) as the Cramming paper, where it showed no benefit. The authors hypothesize that this may be due to an improper implementation. However, to our understanding, that discussion concerns the Newton iteration method, which is not the default in the Shampoo implementation we use (the default is eigendecomposition).
We also tried Newton's method with the tricks mentioned here. Nevertheless, most of the time Newton's method did not converge. Performance was roughly the same as (or worse than) eigendecomposition, and it was slower too. That is why we stick with eigendecomposition.
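For context, Shampoo's preconditioners require inverse p-th roots of its symmetric statistics matrices; the eigendecomposition route mentioned above can be sketched as follows (an illustrative sketch, not the repo's code — real implementations add damping and more careful numerical safeguards):

```python
import torch

def inverse_pth_root(mat: torch.Tensor, p: int, eps: float = 1e-6) -> torch.Tensor:
    """Inverse p-th root of a symmetric PSD matrix via eigendecomposition:
    mat = V diag(w) V^T  =>  mat^(-1/p) = V diag(w^(-1/p)) V^T."""
    w, v = torch.linalg.eigh(mat)
    w = torch.clamp(w, min=eps)  # guard against tiny/negative eigenvalues
    return v @ torch.diag(w ** (-1.0 / p)) @ v.T
```

Newton iteration computes the same quantity through repeated matrix multiplications, which can be faster on GPUs but, as noted above, often failed to converge in our runs.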
I want to thank KInIT for supporting the training costs of experiments. All experiments were done on RTX 3090.
| | | 90 epochs |
|---|---|---|
| SimplEsT-ViT-S/16[^1] | TAT setup | 0.7053 |
| SimplEsT-ViT-S/16[^1] | SimpleViT setup | 0.7071 |
It is important to mention that we were unable to conduct a parameter sweep due to computational limitations. Instead, we tested three different learning rates - specifically, 0.0007, 0.0005, and 0.0003. However, we found that the first two led to divergence. Therefore, we believe our present results have room for further improvement.
Hyperparameters:
- Epochs: 90
- Warmup: 10 000 steps
- Batch size: 1024
- Gradient clipping: 1
- Learning rate scheduler: cosine with linear warmup
- Optimizer: Shampoo@25
- Learning rate: 0.0003
- Weight decay: 0.00001
- TAT setup:
  - Label smoothing: 0.1
  - Dropout: 0.2
- SimpleViT setup:
  - RandAugment: level 10
  - Mixup: probability 0.2
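The mixup component of the SimpleViT setup (applied with probability 0.2) can be sketched as below. This is an assumed implementation: the Beta-distribution parameter `alpha` and the function shape are illustrative, not taken from the repo (RandAugment itself is available off the shelf as `torchvision.transforms.RandAugment`).

```python
import torch
import torch.nn.functional as F

def mixup(images: torch.Tensor, labels: torch.Tensor, num_classes: int,
          alpha: float = 0.2, prob: float = 0.2):
    """Mixup applied with probability `prob`: convexly combine each image
    (and its one-hot label) with a randomly paired example from the batch."""
    one_hot = F.one_hot(labels, num_classes).float()
    if torch.rand(()) >= prob:
        return images, one_hot  # skip mixup for this batch
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, lam * one_hot + (1 - lam) * one_hot[perm]
```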
- E-SPA - Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
- TAT - Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers
- DKS - Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping
- SimpleViT - Better plain ViT baselines for ImageNet-1k
- Cramming - Cramming: Training a Language Model on a Single GPU in One Day
- Shampoo - Shampoo: Preconditioned Stochastic Tensor Optimization, Scalable Second Order Optimization for Deep Learning
- Adam - Adam: A Method for Stochastic Optimization
[^1]: ~20 M parameters.
