I ran eval_flickr.sh with the following command:
    torchrun --nproc_per_node 1 -m \
        --master_addr=127.0.0.3 --master_port=29533 \
        training.main \
        --save-frequency=1 \
        --zeroshot-frequency=1 \
        --report-to=tensorboard \
        --val-data=/home/ubuntu/Nishanth/experiments/retrieval/data/flickr8k/captions.txt \
        --csv-separator="," \
        --csv-img-key=image \
        --csv-caption-key=caption \
        --val-data-root=/home/ubuntu/Nishanth/experiments/retrieval/data/flickr8k/images \
        --dataset-type=csv \
        --warmup=10000 \
        --batch-size=128 \
        --lr=1e-3 \
        --wd=0.1 \
        --epochs=32 \
        --workers=16 \
        --model=ViT-T-16 \
        --resume=/home/ubuntu/Nishanth/experiments/retrieval/models/ViT-B-16_cc3m_12m_kd_ViT-T-16_cc3m_12m_ep32.pt \
        --eval \
        --tag=eval_flickr
These are my observations:
ViT-B-16_cc3m_12m_kd_ViT-T-16_cc3m_12m_ep32.pt
Eval Epoch: 32 image_to_text_mean_rank: 42.6657 image_to_text_median_rank: 5.0000 image_to_text_R@1: 0.2584 image_to_text_R@5: 0.5004 image_to_text_R@10: 0.6085 text_to_image_mean_rank: 61.9712 text_to_image_median_rank: 7.0000 text_to_image_R@1: 0.2460 text_to_image_R@5: 0.4622 text_to_image_R@10: 0.5621 val_loss: 0.9544 epoch: 32.0000 num_samples: 8091.0000
ViT-L-14_laion400m_kd_ViT-T-16_cc3m_12m_ep32.pt
Eval Epoch: 32 image_to_text_mean_rank: 44.5211 image_to_text_median_rank: 6.0000 image_to_text_R@1: 0.2597 image_to_text_R@5: 0.4860 image_to_text_R@10: 0.5955 text_to_image_mean_rank: 62.1973 text_to_image_median_rank: 7.0000 text_to_image_R@1: 0.2406 text_to_image_R@5: 0.4557 text_to_image_R@10: 0.5598 val_loss: 0.9718 epoch: 32.0000 num_samples: 8091.0000
ViT-B-16_cc3m_12m_kd_ViT-T-16_cc3m_12m_ep32.pt
Eval Epoch: 32 image_to_text_mean_rank: 58.4999 image_to_text_median_rank: 8.0000 image_to_text_R@1: 0.2099 image_to_text_R@5: 0.4365 image_to_text_R@10: 0.5460 text_to_image_mean_rank: 72.9394 text_to_image_median_rank: 9.0000 text_to_image_R@1: 0.2081 text_to_image_R@5: 0.4138 text_to_image_R@10: 0.5244 val_loss: 1.2691 epoch: 32.0000 num_samples: 8091.0000
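To compare the runs side by side, the log lines above can be parsed into dictionaries. This is only a minimal sketch: `parse_eval_line`, the regex, and the `LOG_B16` / `LOG_L14` names are my own and not part of the training code, and it assumes every metric is logged as a `name: number` pair exactly as shown above.

```python
import re
from typing import Dict

# Assumption: each metric appears as "name: number", e.g. "image_to_text_R@1: 0.2584".
_METRIC_RE = re.compile(r"([A-Za-z0-9_@]+):\s*(-?\d+(?:\.\d+)?)")


def parse_eval_line(line: str) -> Dict[str, float]:
    """Turn one 'Eval Epoch: ...' log line into a {metric_name: value} dict.

    The leading 'Eval Epoch: 32' is captured under the key 'Epoch'.
    """
    return {name: float(value) for name, value in _METRIC_RE.findall(line)}


# The first two log lines from this post, copied verbatim.
LOG_B16 = (
    "Eval Epoch: 32 image_to_text_mean_rank: 42.6657 image_to_text_median_rank: 5.0000 "
    "image_to_text_R@1: 0.2584 image_to_text_R@5: 0.5004 image_to_text_R@10: 0.6085 "
    "text_to_image_mean_rank: 61.9712 text_to_image_median_rank: 7.0000 "
    "text_to_image_R@1: 0.2460 text_to_image_R@5: 0.4622 text_to_image_R@10: 0.5621 "
    "val_loss: 0.9544 epoch: 32.0000 num_samples: 8091.0000"
)
LOG_L14 = (
    "Eval Epoch: 32 image_to_text_mean_rank: 44.5211 image_to_text_median_rank: 6.0000 "
    "image_to_text_R@1: 0.2597 image_to_text_R@5: 0.4860 image_to_text_R@10: 0.5955 "
    "text_to_image_mean_rank: 62.1973 text_to_image_median_rank: 7.0000 "
    "text_to_image_R@1: 0.2406 text_to_image_R@5: 0.4557 text_to_image_R@10: 0.5598 "
    "val_loss: 0.9718 epoch: 32.0000 num_samples: 8091.0000"
)

runs = {
    "ViT-B-16_cc3m_12m_kd_ViT-T-16_cc3m_12m_ep32.pt": parse_eval_line(LOG_B16),
    "ViT-L-14_laion400m_kd_ViT-T-16_cc3m_12m_ep32.pt": parse_eval_line(LOG_L14),
}

# Print a few headline metrics per checkpoint.
for checkpoint, metrics in runs.items():
    print(checkpoint)
    for key in ("image_to_text_R@1", "text_to_image_R@1", "val_loss"):
        print(f"  {key}: {metrics[key]:.4f}")
```

Any other metric in the log line (for example `text_to_image_median_rank`) can be read from the same dictionaries.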