Image Captioning with CLIP and Transformer Decoder

A PyTorch implementation of image captioning using CLIP ViT-B/16 as the visual encoder and a Transformer decoder for autoregressive caption generation.

Results

Evaluated on the Flickr8k test set:

Metric    Score
BLEU-4    0.251
METEOR    0.473
ROUGE-L   0.506
CIDEr     3.59
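
The BLEU-4 and ROUGE-L numbers can be reproduced with the nltk and rouge-score packages listed under Requirements. Below is a minimal sketch; the toy captions and variable names are illustrative, and the notebook's exact aggregation may differ (METEOR is also available via nltk, while CIDEr is typically computed with pycocoevalcap, not shown here).

```python
from nltk.translate.bleu_score import corpus_bleu
from rouge_score import rouge_scorer

# Illustrative data only -- the real evaluation runs over the Flickr8k test split.
references = [[["a", "dog", "runs", "through", "grass"],
               ["the", "dog", "is", "running"]]]        # multiple references per image
hypotheses = [["a", "dog", "runs", "through", "the", "grass"]]

# Corpus-level BLEU-4 (uniform 4-gram weights)
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))

# ROUGE-L F-measure for one caption against its first reference
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(" ".join(references[0][0]),
                       " ".join(hypotheses[0]))["rougeL"].fmeasure

print(f"BLEU-4: {bleu4:.3f}  ROUGE-L: {rouge_l:.3f}")
```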

Architecture

  • Encoder: CLIP ViT-B/16 (frozen), outputs 196 visual tokens
  • Decoder: 4-layer Transformer with 8-head cross-attention
  • Inference: Beam search (k=3); see the sketches below
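
For concreteness, here is a minimal PyTorch sketch of the wiring described above. Layer widths, the vocabulary size, and all names (CaptionDecoder, max_len, etc.) are assumptions rather than the notebook's exact code; only the frozen CLIP ViT-B/16 encoder, the 4-layer/8-head decoder, and the 196-token memory follow the description.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, max_len=40):
        super().__init__()
        # Frozen CLIP ViT-B/16 visual encoder
        self.encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")
        self.encoder.requires_grad_(False)
        self.proj = nn.Linear(768, d_model)            # CLIP width -> decoder width
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, tokens):
        with torch.no_grad():
            feats = self.encoder(pixel_values=pixel_values).last_hidden_state
        memory = self.proj(feats[:, 1:, :])            # drop CLS -> 196 patch tokens
        pos = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(pos)
        causal = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1), device=tokens.device)      # autoregressive mask
        out = self.decoder(x, memory, tgt_mask=causal) # cross-attends to visual tokens
        return self.head(out)                          # (batch, seq, vocab) logits
```

And a compact beam-search loop for inference. The bos_id/eos_id arguments and the early-exit policy are assumptions about the notebook's tokenizer, not its actual decoding code:

```python
@torch.no_grad()
def beam_search(model, pixel_values, bos_id, eos_id, k=3, max_len=40):
    device = pixel_values.device
    beams = [(torch.full((1, 1), bos_id, device=device), 0.0)]  # (tokens, log-prob)
    for _ in range(max_len - 1):
        candidates = []
        for seq, score in beams:
            if seq[0, -1].item() == eos_id:            # finished beams carry over
                candidates.append((seq, score))
                continue
            logp = model(pixel_values, seq)[:, -1, :].log_softmax(-1)
            top = logp.topk(k, dim=-1)
            for lp, idx in zip(top.values[0], top.indices[0]):
                candidates.append(
                    (torch.cat([seq, idx.view(1, 1)], dim=1), score + lp.item()))
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]
        if all(b[0][0, -1].item() == eos_id for b in beams):
            break
    return beams[0][0]                                  # best-scoring token sequence
```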

Usage

Open image_captioning_transformer.ipynb in Google Colab with an A100 GPU and run all cells. Training completes in ~10 minutes.

Requirements

torch>=2.0.0
transformers>=4.30.0
nltk
rouge-score

See requirements.txt for the full list.
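
To set up the same environment locally instead of in Colab, run `pip install -r requirements.txt`.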

Project Structure

├── image_captioning_transformer.ipynb   # Training and evaluation
├── requirements.txt
├── assets/                              # Sample outputs
└── weights/                             # Trained model (after training)

License

MIT
