A PyTorch implementation of image captioning using CLIP ViT-B/16 as the visual encoder and a Transformer decoder for autoregressive caption generation.
Evaluated on the Flickr8k test set:
| Metric | Score |
|---|---|
| BLEU-4 | 0.251 |
| METEOR | 0.473 |
| ROUGE-L | 0.506 |
| CIDEr | 3.59 |
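BLEU-4 combines clipped 1–4-gram precision between a generated caption and its references with a brevity penalty. The notebook presumably computes it via `nltk` (listed in the requirements); purely as a reference for what the metric measures, here is a minimal self-contained sketch of the computation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(hypothesis, references):
    """Sentence-level BLEU-4: clipped n-gram precision x brevity penalty."""
    hyp = hypothesis.split()
    refs = [r.split() for r in references]
    precisions = []
    for n in range(1, 5):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram count by its maximum count in any reference
        max_ref = Counter()
        for ref in refs:
            for g, c in ngrams(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
        precisions.append(clipped / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed: any zero n-gram precision zeroes the score
    log_avg = sum(math.log(p) for p in precisions) / 4
    # Brevity penalty against the closest reference length
    ref_len = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= ref_len else math.exp(1 - ref_len / max(len(hyp), 1))
    return bp * math.exp(log_avg)
```

In practice, reported scores use corpus-level aggregation (e.g. `nltk.translate.bleu_score.corpus_bleu`) rather than averaging sentence scores.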
- Encoder: CLIP ViT-B/16 (frozen), outputs 196 visual tokens (14×14 patch grid at 224×224 input)
- Decoder: 4-layer Transformer with 8-head cross-attention
- Inference: Beam search (k=3)
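The architecture above can be sketched in PyTorch. The dimensions here (d_model=512, a linear projection from CLIP ViT-B/16's 768-dim patch features, max caption length 40) are illustrative assumptions, not the repo's exact hyperparameters:

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """4-layer Transformer decoder cross-attending to 196 frozen CLIP tokens.

    Hyperparameters are illustrative; the notebook's actual values may differ.
    """
    def __init__(self, vocab_size, d_model=512, n_layers=4, n_heads=8,
                 clip_dim=768, max_len=40):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # Project CLIP patch features (B, 196, 768) into the decoder width
        self.visual_proj = nn.Linear(clip_dim, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=2048,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, visual_tokens):
        # tokens: (B, T) caption ids; visual_tokens: (B, 196, clip_dim)
        T = tokens.size(1)
        pos = torch.arange(T, device=tokens.device)
        x = self.token_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        x = self.decoder(x, self.visual_proj(visual_tokens), tgt_mask=mask)
        return self.lm_head(x)  # (B, T, vocab_size) next-token logits
```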
Open `image_captioning_transformer.ipynb` in Google Colab with an A100 GPU and run all cells. Training completes in ~10 minutes.
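Decoding uses beam search with k=3: at each step, every surviving prefix is extended by its candidate next tokens, and only the k highest-scoring prefixes are kept. A decoder-agnostic sketch, where `step_fn` is a hypothetical stand-in for one decoder forward pass mapping a prefix to next-token log-probabilities:

```python
import math

def beam_search(step_fn, bos_id, eos_id, k=3, max_len=20):
    """Generic beam search. step_fn(prefix) -> {token_id: log_prob}."""
    beams = [([bos_id], 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:           # beam already ended: set aside
                finished.append((seq, score))
                continue
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + [tok], score + logp))
        if not candidates:
            break
        # Keep only the k best prefixes by cumulative log-prob
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    finished.extend(b for b in beams if b[0][-1] == eos_id)
    if not finished:
        finished = beams                    # fall back if nothing hit EOS
    return max(finished, key=lambda c: c[1])[0]
```

Production implementations usually add length normalization so longer captions are not penalized purely for accumulating more log-probability terms; whether the notebook does so is not stated here.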
```
torch>=2.0.0
transformers>=4.30.0
nltk
rouge-score
```

See requirements.txt for the full list.
```
├── image_captioning_transformer.ipynb   # Training and evaluation
├── requirements.txt
├── assets/                              # Sample outputs
└── weights/                             # Trained model (after training)
```
MIT