This paper explores large-scale models in computer vision. The authors tackle three major issues in training and applying large vision models: training instability, the resolution gap between pre-training and fine-tuning, and the heavy demand for labeled data. Three main techniques are proposed: 1) a residual-post-norm method combined with cosine attention to improve training stability; 2) a log-spaced continuous position bias method to effectively transfer models pre-trained on low-resolution images to downstream tasks with high-resolution inputs; 3) a self-supervised pre-training method, SimMIM, to reduce the need for vast amounts of labeled images. The resulting model set new performance records on 4 representative vision tasks: ImageNet-V2 image classification, COCO object detection, ADE20K semantic segmentation, and Kinetics-400 video action classification [1].
Figure 1. Architecture of Swin Transformer V2 [1]
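The first two techniques can be illustrated with a small sketch. Following the formulation in [1], scaled cosine attention computes its logits as cos(q_i, k_j) / τ + B_ij, and the log-spaced coordinates map a relative offset Δx to sign(Δx) · log(1 + |Δx|). The code below is a minimal pure-Python illustration of those two formulas, not the MindCV implementation: the function names are hypothetical, and `tau` is a fixed value chosen for the example (in the paper τ is a learnable scalar kept above 0.01).

```python
import math


def log_spaced_offset(delta):
    """Map a relative offset to log space: sign(delta) * log(1 + |delta|)."""
    return math.copysign(math.log1p(abs(delta)), delta)


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def scaled_cosine_attention_weights(queries, keys, tau=0.1, bias=None):
    """Softmax over logits cos(q_i, k_j) / tau + bias[i][j] for each query."""
    tau = max(tau, 0.01)  # assumption for this sketch: clamp tau at 0.01
    weights = []
    for i, q in enumerate(queries):
        logits = [cosine_similarity(q, k) / tau for k in keys]
        if bias is not None:
            logits = [logit + bias[i][j] for j, logit in enumerate(logits)]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(logit - peak) for logit in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Toy inputs (hypothetical values, for illustration only).
queries = [[1.0, 0.0], [0.5, 0.5]]
keys = [[1.0, 0.0], [0.0, 2.0]]
base = scaled_cosine_attention_weights(queries, keys)
rescaled = scaled_cosine_attention_weights(
    [[100.0 * x for x in q] for q in queries], keys
)

# Offsets for an 8x8 window along one axis range from -7 to 7.
coords = [log_spaced_offset(d) for d in range(-7, 8)]
```

In this sketch `base` and `rescaled` agree up to floating-point error, because the cosine normalizes away the magnitude of the queries; this is the property that makes the logits insensitive to growing activation amplitudes. The log-spaced coordinates compress large offsets, so moving to a larger window extends the coordinate range far less than it would with linear spacing.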
Our reproduced model performance on ImageNet-1K is reported as follows.
- Ascend 910* with graph mode
model | top-1 (%) | top-5 (%) | params (M) | batch size | cards | ms/step | jit_level | recipe | download |
---|---|---|---|---|---|---|---|---|---|
swinv2_tiny_window8 | 81.38 | 95.46 | 28.78 | 128 | 8 | 335.18 | O2 | yaml | weights |
- Ascend 910 with graph mode
model | top-1 (%) | top-5 (%) | params (M) | batch size | cards | ms/step | jit_level | recipe | download |
---|---|---|---|---|---|---|---|---|---|
swinv2_tiny_window8 | 81.42 | 95.43 | 28.78 | 128 | 8 | 317.19 | O2 | yaml | weights |
- Top-1 and Top-5: Accuracy reported on the validation set of ImageNet-1K.
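As a reminder of what these metrics measure, a prediction counts as a top-k hit when the true label is among the k highest-scoring classes. The following is a minimal sketch of that computation; the function name and the toy scores are hypothetical and are not taken from `validate.py`.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k top-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the best k.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label in top_k:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy example: 3 samples, 4 classes (hypothetical scores).
scores = [
    [0.10, 0.70, 0.20, 0.00],
    [0.50, 0.30, 0.15, 0.05],
    [0.30, 0.15, 0.50, 0.05],
]
labels = [1, 1, 3]
top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
```

On ImageNet-1K the same definition is applied with k = 1 and k = 5 over the 1000 classes of the validation set.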
Please refer to the installation instructions in MindCV.
Please download the ImageNet-1K dataset for model training and validation.
- Distributed Training
It is easy to reproduce the reported results with the pre-defined training recipe. For distributed training on multiple Ascend 910 devices, please run:
```shell
# distributed training on multiple NPU devices
msrun --bind_core=True --worker_num 8 python train.py --config configs/swintransformerv2/swinv2_tiny_window8_ascend.yaml --data_dir /path/to/imagenet
```
For a detailed description of all hyper-parameters, please refer to config.py.
Note: The global batch size (batch_size x num_devices) is an important hyper-parameter. To reproduce the reported results, either keep the global batch size unchanged or scale the learning rate linearly with the new global batch size.
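The linear adjustment mentioned in the note can be sketched as follows. This is a small illustrative helper, not part of MindCV; the function name is hypothetical and the base learning rate of 0.001 is a placeholder, so read the actual value from the YAML recipe you are using.

```python
def scale_learning_rate(base_lr, base_batch_size, base_num_devices,
                        new_batch_size, new_num_devices):
    """Scale the learning rate in proportion to the global batch size."""
    base_global = base_batch_size * base_num_devices
    new_global = new_batch_size * new_num_devices
    return base_lr * new_global / base_global


# Reference setup from the table: batch size 128 on 8 cards (global batch 1024).
# Moving to 4 cards at the same per-card batch size halves the global batch,
# so the placeholder learning rate is halved as well.
new_lr = scale_learning_rate(
    base_lr=0.001,  # placeholder, not taken from the recipe
    base_batch_size=128, base_num_devices=8,
    new_batch_size=128, new_num_devices=4,
)
```

Keeping the global batch size at 1024 (for example, 256 per card on 4 cards, memory permitting) leaves the learning rate unchanged under this rule.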
- Standalone Training
If you want to train or fine-tune the model on a smaller dataset without distributed training, please run:
```shell
# standalone training on single NPU device
python train.py --config configs/swintransformerv2/swinv2_tiny_window8_ascend.yaml --data_dir /path/to/dataset --distribute False
```
To validate the accuracy of the trained model, you can use `validate.py` and pass the checkpoint path with `--ckpt_path`.
```shell
python validate.py -c configs/swintransformerv2/swinv2_tiny_window8_ascend.yaml --data_dir /path/to/imagenet --ckpt_path /path/to/ckpt
```
[1] Liu Z, Hu H, Lin Y, et al. Swin transformer v2: Scaling up capacity and resolution[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 12009-12019.