Visual Attention Network (VAN)
This is a PyTorch implementation of VAN proposed in our paper "Visual Attention Network".
Figure 1: Comparison with different vision backbones on the ImageNet-1K validation set.
@article{guo2022visual,
title={Visual Attention Network},
author={Guo, Meng-Hao and Lu, Cheng-Ze and Liu, Zheng-Ning and Cheng, Ming-Ming and Hu, Shi-Min},
journal={arXiv preprint arXiv:2202.09741},
year={2022}
}
2022.03.15 Supported by Hugging Face.
2022.05.01 Supported by OpenMMLab.
While originally designed for natural language processing (NLP) tasks, the self-attention mechanism has recently taken various computer vision areas by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision. (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability. In this paper, we propose a novel large kernel attention (LKA) module to enable self-adaptive and long-range correlations in self-attention while avoiding the above issues. We further introduce a novel neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple and efficient, VAN outperforms the state-of-the-art vision transformers (ViTs) and convolutional neural networks (CNNs) by a large margin in extensive experiments, including image classification, object detection, semantic segmentation, instance segmentation, etc.
Figure 2: Decomposition diagram of large-kernel convolution. A standard convolution can be decomposed into three parts: a depth-wise convolution (DW-Conv), a depth-wise dilation convolution (DW-D-Conv) and a 1×1 convolution (1×1 Conv).
Figure 3: The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) a non-attention module; (c) the self-attention module; (d) a stage of our Visual Attention Network (VAN). CFF means convolutional feed-forward network. The difference between (a) and (b) is the element-wise multiplication. It is worth noting that (c) is designed for 1D sequences.
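For reference, here is a minimal PyTorch sketch of an LKA block following the decomposition in Figure 2, assuming a roughly 21×21 effective kernel split into a 5×5 depth-wise convolution, a 7×7 depth-wise convolution with dilation 3, and a 1×1 convolution (kernel sizes and the module name are illustrative; see the released code for the exact implementation):

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large Kernel Attention sketch: DW-Conv -> DW-D-Conv -> 1x1 Conv, then element-wise gating."""
    def __init__(self, dim):
        super().__init__()
        # 5x5 depth-wise convolution (local context)
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # 7x7 depth-wise convolution with dilation 3 (long-range context)
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9, groups=dim, dilation=3)
        # 1x1 convolution (channel mixing, channel adaptability)
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        attn = self.pw_conv(self.dw_d_conv(self.dw_conv(x)))
        # The attention map modulates the input by element-wise multiplication (Figure 3a)
        return x * attn

# Example: a 64-channel feature map of size 56x56
x = torch.randn(1, 64, 56, 56)
print(LKA(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```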
Data preparation: ImageNet with the following folder structure (a quick loading sanity check is sketched after the tree).
│imagenet/
├──train/
│ ├── n01440764
│ │ ├── n01440764_10026.JPEG
│ │ ├── n01440764_10027.JPEG
│ │ ├── ......
│ ├── ......
├──val/
│ ├── n01440764
│ │ ├── ILSVRC2012_val_00000293.JPEG
│ │ ├── ILSVRC2012_val_00002138.JPEG
│ │ ├── ......
│ ├── ......
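As a quick sanity check of the layout above, both splits should load directly with torchvision's ImageFolder (the paths below are placeholders for your local ImageNet root):

```python
from torchvision import datasets, transforms

# Placeholder path; point it at the imagenet/ root shown above
train_set = datasets.ImageFolder(
    "/path/to/imagenet/train",
    transform=transforms.Compose([transforms.Resize(256),
                                  transforms.CenterCrop(224),
                                  transforms.ToTensor()]),
)
val_set = datasets.ImageFolder("/path/to/imagenet/val")
print(len(train_set.classes), len(val_set.classes))  # both should report 1000 classes
```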
Model | #Params (M) | GFLOPs | Top-1 Acc (%) | Download |
---|---|---|---|---|
VAN-Tiny | 4.1 | 0.9 | 75.4 | Google Drive, Tsinghua Cloud, Hugging Face 🤗 |
VAN-Small | 13.9 | 2.5 | 81.1 | Google Drive, Tsinghua Cloud, Hugging Face 🤗 |
VAN-Base | 26.6 | 5.0 | 82.8 | Google Drive, Tsinghua Cloud, Hugging Face 🤗 |
VAN-Large | 44.8 | 9.0 | 83.9 | Google Drive, Tsinghua Cloud, Hugging Face 🤗 |
VAN-Huge | TODO | TODO | TODO | TODO |
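A hedged sketch of loading one of the downloaded checkpoints into the model defined in this repo (the import path `van`, the constructor name `van_tiny`, and the checkpoint key `state_dict` are assumptions; adjust them to the released code and checkpoint format):

```python
import torch
from van import van_tiny  # assumed import path; adjust to this repo's layout

model = van_tiny()
ckpt = torch.load("/path/to/van_tiny.pth", map_location="cpu")
# Some checkpoints store weights under a "state_dict" key; fall back to the raw dict otherwise
state_dict = ckpt.get("state_dict", ckpt)
model.load_state_dict(state_dict)
model.eval()
```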
Unofficial Keras (TensorFlow) version.
1. PyTorch >= 1.7
2. timm == 0.4.12
We use 8 GPUs for training by default. Run the following command (it is also provided in train.sh):
MODEL=van_tiny # van_{tiny, small, base, large}
DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.1, 0.2] for [tiny, small, base, large]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash distributed_train.sh 8 /path/to/imagenet \
--model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH
Run the following command (it is also provided in eval.sh):
MODEL=van_tiny # van_{tiny, small, base, large}
python3 validate.py /path/to/imagenet --model $MODEL \
--checkpoint /path/to/model -b 128
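Beyond the full validation run above, here is a minimal single-image inference sketch. It assumes this repo's model definitions are imported so the `van_*` architectures are registered with timm; the import name `van`, the image file, and the checkpoint path are placeholders:

```python
import timm
import torch
from PIL import Image
from timm.data import resolve_data_config, create_transform

import van  # assumed import so the van_* models are registered with timm

model = timm.create_model("van_tiny", checkpoint_path="/path/to/model")
model.eval()

# Build the preprocessing pipeline matching the model's default config
config = resolve_data_config({}, model=model)
transform = create_transform(**config)

img = Image.open("example.jpg").convert("RGB")
with torch.no_grad():
    logits = model(transform(img).unsqueeze(0))
print(logits.argmax(dim=1).item())  # predicted ImageNet class index
```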
Our implementation is mainly based on pytorch-image-models and PoolFormer. Thanks to their authors.
This repo is under the Apache-2.0 license. For commercial use, please contact the authors.