Recently, vision foundation models pre-trained on large-scale web image-text datasets (CLIP, ALIGN) have shown zero-shot capability and efficient transfer learning
However, these models are trained to map image and text representations to each other, learning a cross-modal shared representation
→ they only perform well on image-to-text mapping tasks such as classification, retrieval, and tagging
Seeing this limitation, the authors say they arrived at the following question
"What is the foundation model for computer vision?"
In short, the authors' answer: it should consist of a pre-trained model + adapters, so that it can perform diverse vision tasks
The paper's 3 main contributions are as follows
- categorizes diverse computer vision tasks along 3 axes
space, time, modality
- proposes Florence, which can perform vision tasks across the space-time-modality space
The 4 core components of Florence: data curation, model pre-training, task adaptation, training infrastructure
- Florence shows strong performance on diverse vision tasks
Below, a brief summary of only the parts I found important
1. Space-Time-Modality Space
Diverse computer vision tasks can be categorized along 3 axes
- space : from coarse (ex. scene-level classification) to fine-grained (ex. object detection)
- time : from static (ex. images) to dynamic (ex. videos)
- modality : from RGB only to multiple senses (ex. language, depth)
2. Approach
Introduces the 4 core components the authors focused on when building Florence
2.1. Dataset Curation
- collected a total of 3 billion images & raw descriptions from the internet
- data filtering
hash-based near-duplicate image removal, small-size image removal, image-text relevance
- data selection
to improve balance, informativeness, learnability
- post-filtering
for legal, ethical constraints
Through the above process, they built a dataset of 900 million image-text pairs
→ named FLD-900M (FLorenceDataset)
900M images with 900M free-form texts (ranging from one word or phrase to full sentences)
9.7M unique queries, 7.5B tokens
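The hash-based near-duplicate removal step can be sketched with a tiny "average hash"; the paper does not specify which hash function it uses, so the scheme below is only illustrative:

```python
import numpy as np

def average_hash(img, size=8):
    """Tiny perceptual 'average hash' sketch for near-duplicate removal:
    downsample a grayscale image to size x size (nearest-neighbor here for
    simplicity) and threshold each pixel at the mean. Near-duplicate images
    collide or differ in only a few bits."""
    h, w = img.shape
    ys = np.arange(size) * h // size          # nearest-neighbor row indices
    xs = np.arange(size) * w // size          # nearest-neighbor column indices
    small = img[np.ix_(ys, xs)]               # size x size thumbnail
    return (small > small.mean()).astype(np.uint8).ravel()  # 64-bit signature

def hamming(h1, h2):
    """Number of differing bits between two hash signatures."""
    return int((h1 != h2).sum())
```

Images whose signatures are within a small Hamming distance of each other would be treated as near-duplicates and filtered.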
2.2. Model Pre-training
Unified Image-Text Contrastive Learning
Since the data is web-scale, multiple images can carry identical captions
→ pre-train with the UniCL objective rather than CLIP's
Through experiments, the authors observed that long, content-rich language descriptions are more beneficial for image-text representation learning than short ones
So they built prompt templates to augment short descriptions
(ex. "A photo of the [WORD]", "A cropped photo of [WORD]")
During training, one of the templates is randomly sampled for augmentation
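The template augmentation above can be sketched as follows; the two template strings follow the paper's examples, but the full template set and exact sampling scheme are assumptions here:

```python
import random

# Hypothetical template set in the spirit of the paper's two examples;
# the actual set used by Florence is larger and not reproduced here.
TEMPLATES = [
    "A photo of the {}.",
    "A cropped photo of {}.",
]

def augment_description(short_text: str, rng: random.Random) -> str:
    """Expand a short description by sampling one template at random,
    as done during the first pre-training stage."""
    template = rng.choice(TEMPLATES)
    return template.format(short_text)
```

For example, `augment_description("dog", random.Random(0))` returns one of the two templated strings for "dog".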
However, a generated language prompt is not a precise description of the image
Through experiments, the authors found that generated language prompts hurt retrieval and vision-language task performance
To prevent this, the model is trained in 2 stages
stage 1 : train the model on all data, including the augmented texts
stage 2 : train the model with the augmented data excluded
image size : 224 x 224 → 384 x 384
language description max length : 76
batch size : 24,576
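A minimal sketch of a UniCL-style objective, assuming L2-normalized features and a target matrix in which any image/text pair sharing the same caption id counts as a positive (details of the actual Florence implementation, e.g. the learnable temperature, are omitted):

```python
import numpy as np

def unicl_loss(img_feat, txt_feat, caption_ids, tau=0.07):
    """Unified contrastive loss sketch: unlike CLIP's strict one-to-one
    targets, images sharing a caption id are all positives for that text.
    img_feat, txt_feat: (N, D) L2-normalized; caption_ids: (N,) ints."""
    logits = img_feat @ txt_feat.T / tau                    # (N, N) similarities
    pos = (caption_ids[:, None] == caption_ids[None, :]).astype(float)
    targets = pos / pos.sum(axis=1, keepdims=True)          # multi-positive rows

    def xent(lg, tg):
        lg = lg - lg.max(axis=1, keepdims=True)             # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -(tg * logp).sum(axis=1).mean()

    # bidirectional: image-to-text and text-to-image
    # (pos is symmetric, so the same targets apply in both directions)
    return 0.5 * (xent(logits, targets) + xent(logits.T, targets))
```

With unique caption ids the target matrix collapses to the identity and this reduces to the standard CLIP objective.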
Transformer-based Florence Pre-trained Models
image encoder : hierarchical Vision Transformer
why use a hierarchical architecture? → to model the scale-invariant nature of images
use a modified Swin Transformer → CoSwin Transformer
the modifications reportedly reference CvT
image features are extracted via global average pooling
language encoder : 12-layer Transformer
2.3. Task Adaptation
rather than keeping the pre-trained image encoder frozen, it is fine-tuned for most tasks
Object-level Visual Representation Learning
built a large-scale object detection dataset so Florence can learn fine-grained (object-level) representations
→ FLOD-9M (FLorence Object detection Dataset)
FLOD-9M = open-source object detection datasets (COCO, LVIS, OpenImages, Objects365) + a pseudo-labeled dataset (ImageNet-22K with pseudo bounding boxes)
freeze the image encoder and pre-train only the Dynamic Head adapter on FLOD-9M
Fine-Grained V+L Representation Learning
uses the METER adapter so fine-grained vision-language representations can be learned
trained with an image-text matching loss and a masked language modeling loss, then fine-tuned on downstream tasks like VQA
the image encoder is presumably not frozen here
Adaptation to Video Recognition
Video CoSwin adapter
image tokenization layer (2D) → video tokenization layer (3D)
2D patch merging operator → 3D convolution-based patch merging operator
2D shifted window design → 3D shifted local windows in self-attention layers
the Video CoSwin adapter lets Florence handle video tasks as well
to avoid memory issues, a dynamic window size strategy is applied
(relatively small window size in early stages of CoSwin, and large window sizes in its later stages)
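One common way to realize the 2D → 3D conversion above is weight inflation: replicate the pre-trained 2D kernel along time and rescale, so the 3D layer initially reproduces the 2D output on static clips. Whether Florence initializes exactly this way is an assumption; this sketch shows the standard trick:

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel of shape (out, in, kh, kw) into a 3D one
    of shape (out, in, t, kh, kw) by replicating along the new temporal
    axis and dividing by t, so summing over time recovers the 2D kernel."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t
```

Applied to the patch-merging convolution, a video of identical frames then yields the same merged features as the original 2D layer.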
2.4. Training Infrastructure
There are 2 challenges in training Florence
→ reducing the memory cost, increasing the throughput
reducing the memory cost → ZeRO, activation checkpointing, gradient cache
increasing the throughput → mixed-precision training
- Zero Redundancy Optimizer (ZeRO)
partitions the optimizer states, gradients and parameters across the GPUs and each partition is only updated locally
- Activation Checkpointing
for checkpointed model component (multi-head attention), reruns a forward pass during backward pass
internal gradients in the component do not need to be stored in the forward pass
- Gradient Cache
factor the contrastive loss by breaking the large batch gradient update into several sub-updates that can fit into GPU memory
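The gradient-cache idea can be demonstrated end to end with a linear encoder and a softmax loss that couples the whole batch; this is a toy numpy sketch of the technique, not Florence's actual training code:

```python
import numpy as np

def softmax_xent_grad(z, y):
    """Gradient of the mean cross-entropy loss wrt logits z (N, C),
    with integer labels y (N,)."""
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0
    return p / len(y)

def grad_cache_update(X, W, y, chunk=2):
    """Gradient-cache sketch for a linear encoder z = X @ W:
    1) forward in chunks, caching only the (small) features z;
    2) compute dL/dz once on the full batch, since the loss couples all rows;
    3) re-walk the chunks to accumulate dL/dW, never holding the whole
       batch's intermediate state at once."""
    chunks = [slice(i, i + chunk) for i in range(0, len(X), chunk)]
    z = np.concatenate([X[s] @ W for s in chunks])   # step 1: cached features
    dz = softmax_xent_grad(z, y)                     # step 2: full-batch loss grad
    dW = np.zeros_like(W)
    for s in chunks:                                 # step 3: per-chunk backward
        dW += X[s].T @ dz[s]
    return dW
```

The chunked update is mathematically identical to the full-batch gradient, which is why the technique allows contrastive training with batch sizes that would not otherwise fit in GPU memory.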
- Mixed-precision Training
different operations are trained with different numerical precisions
numerically less stable operations (layer normalization) → float-32
other operations → float-16
3. Experiments
It is not that Florence itself, as a foundation model, can do everything out of the box
Rather: take the UniCL-pre-trained Florence, attach an adapter for each downstream task, and fine-tune; that is what yields the good performance
Except for classification and retrieval, which CLIP can already do, an adapter must be trained for each task
and for every task except object detection, the image encoder is fine-tuned together with the adapter
(object detection can run zero-shot: freeze the image encoder and train only the adapter)
In other words, think of Florence as a good weight-initialization starting point
3.1. Classification & Retrieval
Zero-shot Transfer in Classification
Florence outperforms state-of-the-art methods on 9 of 12 tasks
Linear Probe in Classification
freeze image encoder and only fine-tune the linear layer on the downstream datasets
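A linear probe keeps the encoder frozen and fits only a linear head on its features; the closed-form ridge fit below is a stand-in for the actual softmax training (an assumption made here so the fit has a closed form):

```python
import numpy as np

def linear_probe(feats, labels, n_classes, lam=1e-2):
    """Fit a linear classifier on frozen encoder features via ridge
    regression onto one-hot labels (a simplification of softmax training).
    feats: (N, D); labels: (N,) ints. Returns W of shape (D + 1, C)."""
    F = np.hstack([feats, np.ones((len(feats), 1))])   # append a bias column
    Y = np.eye(n_classes)[labels]                      # one-hot targets
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ Y)

def probe_predict(feats, W):
    """Predict classes with the fitted linear head."""
    F = np.hstack([feats, np.ones((len(feats), 1))])
    return (F @ W).argmax(axis=1)
```

The encoder never receives gradients; only `W` is learned per downstream dataset.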
our results are consistently better than existing state-of-the-art results, except for two datasets: CIFAR10, CIFAR100
why? → the input image resolution is quite low
ImageNet-1K Fine-tune Evaluation
train on task-specific data using the same pre-training loss (UniCL)
our result is slightly worse than SOTA, but their model and data scale are both 3× larger
Few-shot Cross-domain Classification
append a single linear layer as an adapter head to our image encoder
previous work employs ensembles and transductive learning
we employ a single model and no transduction on the test data
yet we achieve higher results without any "bells and whistles"
Image-Text Retrieval
for fine-tuning retrieval, we continuously train our image and text encoders on the target image-text pair data
zero-shot Florence matches or outperforms all prior zero-shot results on these two datasets
our results are superior to all previous fine-tuning results on the two datasets
3.2. Space (Object Detection)
Fine-tuning
evaluate fine-tuning on three popular object detection datasets
→ COCO, Object365, Visual Genome
Florence establishes new results in these main benchmarks of object detection
Zero-shot Transfer
freeze the image encoder and pre-train the Dynamic Head on FLOD-9M
→ detection pre-training only updates the object adapter, and does not affect the fused feature representations learned from large-scale image-text pairs
inference : get object proposals from pre-trained image encoder + Dynamic Head and apply zero-shot classification for each object proposal
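The zero-shot classification step above can be sketched as matching each proposal's visual embedding against class-name text embeddings by cosine similarity (proposal and text feature extraction is assumed already done; the temperature value is illustrative):

```python
import numpy as np

def classify_proposals(prop_feat, class_text_feat, class_names, tau=0.07):
    """Assign a zero-shot label to each object proposal.
    prop_feat: (P, D) L2-normalized proposal embeddings;
    class_text_feat: (C, D) L2-normalized class-name text embeddings."""
    sims = prop_feat @ class_text_feat.T / tau   # (P, C) cosine similarities
    best = sims.argmax(axis=1)                   # most similar class per proposal
    return [class_names[i] for i in best]
```

Because the class set lives entirely in the text embeddings, new categories can be added at inference time without retraining.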
Florence model effectively zero-shot transfers to these tasks
noticeable performance gap between zero-shot and supervised learning, especially for novel scenarios whose concepts/classes may not be covered by the pre-training dataset
ex. BCCD (blood cells photos), Chess Pieces (Chess board photos and various pieces)
however, the results are encouraging when compared with few-shot fine-tuning results
3.3. Modality (Vision & Language)
VQA
fine-tune the pre-trained model on the VQA task
as a common practice, the problem is cast as a classification task where each class corresponds to an answer
final pooling representations + MLP to predict the answer
use binary cross-entropy loss
inference : select the answer with the highest confidence
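A minimal sketch of the VQA head described above, assuming soft answer scores in [0, 1] as targets (standard VQA practice, since multiple annotators may give partial agreement) and a plain numpy sigmoid/BCE:

```python
import numpy as np

def vqa_bce_loss(logits, soft_targets):
    """Binary cross-entropy over the answer vocabulary: each answer class
    is an independent binary label with a soft target score in [0, 1].
    logits, soft_targets: (C,) over C candidate answers."""
    p = 1.0 / (1.0 + np.exp(-logits))            # per-answer confidence
    eps = 1e-9                                    # avoid log(0)
    return -np.mean(soft_targets * np.log(p + eps)
                    + (1 - soft_targets) * np.log(1 - p + eps))

def vqa_predict(logits, answers):
    """Inference: select the answer with the highest confidence."""
    return answers[int(np.argmax(logits))]
```

The MLP producing `logits` from the pooled multimodal representation is omitted here.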
we achieve the new state-of-the-art performance
SimVLM uses 1.8B image-text pairs, but we only use 900M data for pre-train and 20M for VLP
→ data efficiency of our approach
3.4. Time (Video)
Zero-Shot Text-to-Video Retrieval
how exactly the zero-shot retrieval is performed is not clearly explained
my guess: the video representation may be the average pool of the frame-level representations
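If that mean-pooling guess is right, zero-shot text-to-video retrieval would look roughly like this (the pooling choice is conjecture, not confirmed by the paper):

```python
import numpy as np

def video_embedding(frame_feats):
    """Average-pool per-frame image embeddings into one video embedding
    (the mean pooling is a conjecture about the paper's method), then
    re-normalize so cosine similarity is a dot product."""
    v = frame_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve(text_feat, video_feats):
    """Rank videos by cosine similarity to the (normalized) query text
    embedding; returns video indices, best match first."""
    sims = np.stack([video_embedding(f) for f in video_feats]) @ text_feat
    return np.argsort(-sims)
```

This requires no video-specific training at all, which is what would make the transfer "zero-shot".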
two image-text pretrained models, CLIP and Florence, outperform all the state-of-the-art methods
→ this suggests that the video data used for pre-training in those state-of-the-art methods may not be as rich or diverse as the image-text data used in Florence or CLIP
Video Action Recognition
fine-tune the model
our results are better than the state-of-the-art