SimplEsT-ViT (E-SPA + TAT) - a vanilla transformer (no normalization layers, no skip connections) trained with E-SPA (gamma = 0.005) + TAT (eta = 0.9).
- PyTorch 2.0
- wandb (optional)
The dataset contains 110 000 images from 200 classes, downsized to 64x64 color images. Each class has 500 training images (100 000 total) and 50 validation images (10 000 total).
- Run downloand_tiny_imagenet.py
python3 downloand_tiny_imagenet.py -r data
- Go to https://www.image-net.org/download.php
- Request to download ImageNet
- Create the data folder if it does not already exist
mkdir -p data
- Move to the data folder
cd data
- Download the images from the ILSVRC2012 page
- Training images (Task 1 & 2) 138 GB
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_train.tar
- Validation images (all tasks) 6.3 GB
wget https://image-net.org/data/ILSVRC/2012/ILSVRC2012_img_val.tar
- Run the extract_ILSVRC.sh script from the PyTorch examples repository (extraction temporarily requires roughly double the disk space)
wget -qO- https://raw.githubusercontent.com/pytorch/examples/main/imagenet/extract_ILSVRC.sh | bash
| | | Cifar10 (/4) | Cifar100 (/4) | TinyImageNet200 (/8) |
|---|---|---|---|---|
| SimpleViT-S | Adam | 0.8334 | 0.5880 | 0.4529 |
| SimplEsT-ViT-S | Adam | 0.7936 | 0.4687 | 0.3847 |
| SimplEsT-ViT-S | Shampoo@25 | 0.8243 | 0.5506 | 0.4208 |
- TAT setup: label smoothing + dropout + weight decay.
| | | Cifar10 (/4) | Cifar100 (/4) | TinyImageNet200 (/8) |
|---|---|---|---|---|
| SimpleViT-S | Adam | 0.8733 | 0.6439 | 0.5152 |
| SimplEsT-ViT-S | Adam | 0.7894 | 0.4776 | 0.3966 |
| SimplEsT-ViT-S | Shampoo@25 | 0.8496 | 0.5899 | 0.4490 |
- SimpleViT setup: randaugment + mixup + weight decay.
Training three times longer with Adam matches the SimpleViT-S training loss. The E-SPA paper reports results for training five times longer, but those come from large-scale experiments. However, achieving high validation accuracy is a different story ...
As mentioned in the TAT and DKS papers, "second-order methods" can significantly boost performance. However, it has not been validated for the Transformer architecture (E-SPA).
- Shampoo@25 was ~1.25x slower than Adam.
- The ViT (-S) architecture naming convention can be found here.
- /4 means patch size 4x4, /8 means patch size 8x8.
One block of SimplEsT-ViT consists of one attention layer (without projection) and 2 linear layers in the MLP block. Thus, the "effective depth" is 64 * 3 + 2 = 194 (2 = patch embedding + classification head). It is impressive to train such a deep vanilla transformer only with proper initialization.
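The block structure described above can be sketched in PyTorch. This is a hypothetical minimal version for illustration only: the class and argument names are assumptions, the E-SPA shaping of the attention matrix and the TAT-tailored activation are omitted, and initialization details differ from the actual repo code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplestBlock(nn.Module):
    """One vanilla transformer block: attention without an output
    projection, followed by a 2-layer MLP -- no LayerNorm, no skips.
    Sketch only; E-SPA attention shaping and TAT activation omitted."""

    def __init__(self, dim: int, heads: int, mlp_dim: int):
        super().__init__()
        self.heads = heads
        # the single attention layer (QKV projection, no output projection)
        self.qkv = nn.Linear(dim, 3 * dim, bias=False)
        # the two linear layers of the MLP block
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim),
            nn.ReLU(),  # the real model would use a TAT-tailored rectifier
            nn.Linear(mlp_dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        x = F.scaled_dot_product_attention(q, k, v)
        x = x.transpose(1, 2).reshape(b, n, d)  # no output projection
        return self.mlp(x)  # no residual connection
```

With 64 such blocks, each contributing 3 weight layers, plus the patch embedding and the classification head, this gives the effective depth of 64 * 3 + 2 = 194 mentioned above.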
- Epochs: 90
- Warmup: 75 steps
- Batch size: 2048
- Gradient clipping: 1
- Learning rate scheduler: cosine with linear warmup
- Dropout: {0, 0.2}
- Weight decay: {0, 0.00005}
- Optimizer:
  - Adam, learning rate:
    - SimplEsT-ViT - {0.0005, 0.0003}
    - SimpleViT - 0.001
  - Shampoo, learning rate:
    - SimplEsT-ViT - {0.0007, 0.0005}
- TAT setup:
  - Label smoothing: 0.1
  - Dropout: {0, 0.2}
- SimpleViT setup:
  - RandAugment: level 10
  - Mixup: probability 0.2
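The cosine-with-linear-warmup schedule listed above can be sketched as a plain function of the step index. This is an assumed formulation (the repo's exact implementation may differ, e.g. in how the final learning rate floor is handled):

```python
import math

def lr_at_step(step: int, total_steps: int, warmup_steps: int, base_lr: float) -> float:
    """Linear warmup to base_lr, then cosine decay to 0 (sketch)."""
    if step < warmup_steps:
        # linear warmup: ramp from base_lr / warmup_steps up to base_lr
        return base_lr * (step + 1) / warmup_steps
    # cosine decay over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In PyTorch this shape of schedule is typically wired in via `torch.optim.lr_scheduler.LambdaLR`, passing a lambda that returns `lr_at_step(step, ...) / base_lr`.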
It would be beneficial to perform a wider range of experiments to determine the optimal learning rate and weight decay values, particularly for weight decay. This is especially relevant because LayerNorm makes a network scale-invariant, so weight decay behaves differently in SimplEsT-ViT (which has no normalization) than in SimpleViT. We hypothesize that the optimal weight decay for SimplEsT-ViT should be considerably lower than for SimpleViT.
We use the same Shampoo implementation (except for one small change) as the Cramming paper, where it showed no benefit. The authors hypothesize that this may be due to an improper implementation. However, to our understanding, that discussion concerns the Newton iteration method, which is not the default in the Shampoo implementation we use (the default is eigendecomposition).
We also tried Newton's method with the tricks mentioned here. Nevertheless, most of the time Newton's method did not converge. Performance was roughly the same as (or worse than) eigendecomposition, and it was slower too. That is why we stick with eigendecomposition.
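For context, Shampoo's preconditioners require inverse p-th roots of its symmetric statistics matrices; the eigendecomposition route mentioned above can be sketched as follows (an illustrative sketch, not the repo's code — real implementations add damping and more careful numerical safeguards):

```python
import torch

def inverse_pth_root(mat: torch.Tensor, p: int, eps: float = 1e-6) -> torch.Tensor:
    """Inverse p-th root of a symmetric PSD matrix via eigendecomposition:
    mat = V diag(w) V^T  =>  mat^(-1/p) = V diag(w^(-1/p)) V^T."""
    w, v = torch.linalg.eigh(mat)
    w = torch.clamp(w, min=eps)  # guard against tiny/negative eigenvalues
    return v @ torch.diag(w ** (-1.0 / p)) @ v.T
```

Newton iteration computes the same quantity through repeated matrix multiplications, which can be faster on GPUs but, as noted above, often failed to converge in our runs.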
I want to thank KInIT for supporting the training costs of experiments. All experiments were done on RTX 3090.
| | | 90 epochs |
|---|---|---|
| SimplEsT-ViT-S/16[^1] | TAT setup | 0.7053 |
| SimplEsT-ViT-S/16[^1] | SimpleViT setup | 0.7071 |
It is important to mention that we were unable to conduct a parameter sweep due to computational limitations. Instead, we tested three different learning rates - specifically, 0.0007, 0.0005, and 0.0003. However, we found that the first two led to divergence. Therefore, we believe our present results have room for further improvement.
Hyperparameters:
- Epochs: 90
- Warmup: 10 000 steps
- Batch size: 1024
- Gradient clipping: 1
- Learning rate scheduler: cosine with linear warmup
- Optimizer: Shampoo@25
- Learning rate: 0.0003
- Weight decay: 0.00001
- TAT setup:
  - Label smoothing: 0.1
  - Dropout: 0.2
- SimpleViT setup:
  - RandAugment: level 10
  - Mixup: probability 0.2
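The mixup component of the SimpleViT setup (applied with probability 0.2) can be sketched as below. This is an assumed implementation: the Beta-distribution parameter `alpha` and the function shape are illustrative, not taken from the repo (RandAugment itself is available off the shelf as `torchvision.transforms.RandAugment`).

```python
import torch
import torch.nn.functional as F

def mixup(images: torch.Tensor, labels: torch.Tensor, num_classes: int,
          alpha: float = 0.2, prob: float = 0.2):
    """Mixup applied with probability `prob`: convexly combine each image
    (and its one-hot label) with a randomly paired example from the batch."""
    one_hot = F.one_hot(labels, num_classes).float()
    if torch.rand(()) >= prob:
        return images, one_hot  # skip mixup for this batch
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(images.size(0))
    mixed = lam * images + (1 - lam) * images[perm]
    return mixed, lam * one_hot + (1 - lam) * one_hot[perm]
```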
- E-SPA - Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
- TAT - Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers
- DKS - Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping
- SimpleViT - Better plain ViT baselines for ImageNet-1k
- Cramming - Cramming: Training a Language Model on a Single GPU in One Day
- Shampoo - Shampoo: Preconditioned Stochastic Tensor Optimization, Scalable Second Order Optimization for Deep Learning
- Adam - Adam: A Method for Stochastic Optimization
[^1]: ~20 M parameters.
