This repository contains experiments and results comparing image feature extractors obtained by classical supervised training on ImageNet and by the CLIP [paper][repo][blog] training procedure, evaluated on the domain-specific Fruits-360 dataset.
The procedure described in the CLIP paper makes it possible to classify images from a new dataset with an arbitrary set of labels without any additional training. (Figure: example zero-shot predictions on the Sports-72 dataset; caption format: Predicted (True).)
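As an illustration of how this works, here is a minimal zero-shot classification sketch using the official `clip` package; the backbone name, image path, label set, and prompt template are placeholders, not necessarily the ones used in this repository.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # any available CLIP backbone

labels = ["apple", "banana", "cherry"]                      # placeholder label set of the target dataset
text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # cosine similarity between the image and every label prompt
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(labels[probs.argmax().item()])                        # predicted label, no training involved
```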
We compared feature extractors with different architectures, training procedures, and image upsampling techniques. Unless an upsampling technique is mentioned explicitly, bicubic interpolation is used. We performed two main groups of experiments:
- Linear probing and fine-tuning of CLIP with ResNet and ViT backbones and ImageNet-pretrained ResNet and EfficientNet
- Zero-shot and K-shot classification of CLIP with ViT and ResNet backbones
We also compared 2 image upsampling options:
- Bicubic interpolation
- SRGAN upsampling [weights]
This comparison was run on the following training setups: linear probing and contrastive fine-tuning of CLIP with ResNet and ViT backbones.
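For illustration, a sketch of the two upsampling options: the bicubic variant uses Pillow directly, while the SRGAN variant assumes a hypothetical `Generator` class and weights file standing in for the linked pretrained SRGAN.

```python
from PIL import Image

img = Image.open("fruit.jpg")            # low-resolution Fruits-360 image
target_size = (224, 224)                 # input size expected by the evaluated models

# Option 1: bicubic interpolation
bicubic = img.resize(target_size, Image.BICUBIC)
bicubic.save("fruit_bicubic.jpg")

# Option 2: SRGAN upsampling (hypothetical Generator class loaded from the linked weights)
# import torch
# from torchvision import transforms
# generator = Generator()                                            # SRGAN generator architecture
# generator.load_state_dict(torch.load("srgan_weights.pth", map_location="cpu"))
# generator.eval()
# with torch.no_grad():
#     sr = generator(transforms.ToTensor()(img).unsqueeze(0)).squeeze(0).clamp(0, 1)
# transforms.ToPILImage()(sr).resize(target_size, Image.BICUBIC).save("fruit_srgan.jpg")
```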
Main plots can be found in the results section. Full experiment descriptions can be found in supplementary/report.pdf.
```
notebooks/ — experiments in the form of Jupyter notebooks
├── few_shot_learning.ipynb — k-shot learning procedure
├── image_upsampling.ipynb — two ways to upsample images with subsequent saving
├── prompts_validation.ipynb — finding the best prompt for a given dataset
├── train_ImageNet_models.ipynb — fine-tuning of models pretrained on ImageNet in different settings
└── train_CLIP.ipynb — fine-tuning CLIP models in different settings
data_prepare/ — dataset upsampling auxiliary source code
src/ — training-related auxiliary source code
pics/ — pictures for the results section
supplementary/ — report and presentation in .pdf format
```
We tested the zero-shot prediction performance of CLIP on a number of domain-specific datasets: Birds-270, Simpsons characters, Sports-72, and Fruits-360. Here are some examples of the predictions:
Simpsons characters [link] ~ 0.51 accuracy
Birds-270 [link] ~ 0.52 accuracy
Fruits-360 [link] ~ 0.24 accuracy
Sports-72 [link] ~ 0.79 accuracy
A pretrained CLIP model with a ResNet-101 backbone plus a new fully-connected layer, which is trained on only k examples of each class.
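A sketch of this k-shot setup, assuming the openai `clip` package; `train_paths`/`train_labels` are placeholders for the dataset split, and a scikit-learn logistic regression stands in for the fully-connected layer.

```python
import clip
import torch
import numpy as np
from collections import defaultdict
from sklearn.linear_model import LogisticRegression
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN101", device=device)     # visual backbone stays frozen

def encode(paths):
    """Encode a list of image paths with the frozen CLIP image encoder."""
    batch = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    with torch.no_grad():
        feats = model.encode_image(batch)
    return feats.cpu().numpy()

# keep only k examples per class (train_paths/train_labels are placeholders)
k = 5
subset = defaultdict(list)
for path, label in zip(train_paths, train_labels):
    if len(subset[label]) < k:
        subset[label].append(path)

X = encode([p for paths in subset.values() for p in paths])
y = np.array([label for label, paths in subset.items() for _ in paths])

# the linear head is the only trained component
classifier = LogisticRegression(max_iter=1000).fit(X, y)
```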
Fine-tuning of the visual part of CLIP models with a linear classifier on top, with frozen or trainable backbones.
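Concretely, the frozen vs. trainable distinction only changes which parameters receive gradients; a minimal sketch, where the backbone name and `num_classes` are placeholders:

```python
import clip
import torch
import torch.nn as nn

model, _ = clip.load("RN101", device="cpu")   # loaded on CPU for clarity; move to GPU for real training
backbone = model.visual                       # CLIP visual encoder

freeze_backbone = True                        # True -> linear probing, False -> full fine-tuning
for p in backbone.parameters():
    p.requires_grad = not freeze_backbone

with torch.no_grad():                         # infer the feature dimension from a dummy image
    feat_dim = backbone(torch.zeros(1, 3, 224, 224)).shape[-1]

num_classes = 10                              # placeholder: number of classes in the target dataset
head = nn.Linear(feat_dim, num_classes)

trainable = [p for p in list(backbone.parameters()) + list(head.parameters()) if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```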
Fine-tuning CLIP visual models using different methods and upsampling techniques:
- Maximizing likelihood (ML), i.e. training CLIP visual model + a linear layer on top
- Cosine similarity maximization (CS): fine-tuning the CLIP visual model to maximize cosine similarity between images of the same class.
Each method was tested with ResNet-101/ViT backbones and bicubic/GAN upsampling
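The ML setting is ordinary cross-entropy training of the backbone with a linear head on top. The CS setting could look like the following sketch; the exact loss used in the report may differ, and this version simply pushes embeddings of same-class images in a batch together.

```python
import torch
import torch.nn.functional as F

def cosine_similarity_loss(embeddings, labels):
    """Encourage high cosine similarity between embeddings of same-class images in a batch."""
    z = F.normalize(embeddings, dim=-1)                 # unit-norm image embeddings
    sim = z @ z.T                                       # pairwise cosine similarities
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # mask of same-class pairs
    same.fill_diagonal_(False)                          # ignore self-similarity
    if same.sum() == 0:
        return sim.new_zeros(())
    return 1.0 - sim[same].mean()                       # lower loss <=> higher same-class similarity

# usage inside a training step (backbone is the CLIP visual encoder being fine-tuned):
# loss = cosine_similarity_loss(backbone(images), labels)
```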