Recently, vision foundation models pre-trained on large-scale web image-text datasets (CLIP, ALIGN) have shown zero-shot capability and efficient transfer learning
However, these models are trained to map image and text representations to each other, learning a cross-modal shared representation
→ they only perform well on image-to-text mapping tasks such as classification, retrieval, and tagging
Seeing this limitation, the authors say they arrived at the following question
"What is the foundation model for computer vision?"
In short, the authors' answer: it should consist of a pre-trained model + adapters, so that it can perform diverse vision tasks
The paper's 3 main contributions are as follows
- categorizes diverse computer vision tasks along 3 axes
space, time, modality
- proposes Florence, which can perform vision tasks across the space-time-modality space
The 4 core components of Florence: data curation, model pre-training, task adaptation, training infrastructure
- Florence shows strong performance on diverse vision tasks
Below, a brief summary of only the parts I found important
1. Space-Time-Modality Space
Diverse computer vision tasks can be categorized along 3 axes
- space : from coarse (ex. scene-level classification) to fine-grained (ex. object detection)
- time : from static (ex. images) to dynamic (ex. videos)
- modality : from RGB only to multiple senses (ex. language, depth)
2. Approach
Introduces the 4 core components the authors focused on when building Florence
2.1. Dataset Curation
- collected a total of 3 billion images & raw descriptions from the internet
- data filtering
hash-based near-duplicate image removal, small-size image removal, image-text relevance
- data selection
to improve balance, informativeness, learnability
- post-filtering
for legal, ethical constraints
Through the above process, they built a dataset of 900 million image-text pairs
→ named FLD-900M (FLorenceDataset)
900M images with 900M free-form texts (ranging from one word or phrase to full sentences)
9.7M unique queries, 7.5B tokens
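The hash-based near-duplicate removal step can be sketched with a tiny "average hash"; the paper does not specify which hash function it uses, so the scheme below is only illustrative:

```python
import numpy as np

def average_hash(img, size=8):
    """Tiny perceptual 'average hash' sketch for near-duplicate removal:
    downsample a grayscale image to size x size (nearest-neighbor here for
    simplicity) and threshold each pixel at the mean. Near-duplicate images
    collide or differ in only a few bits."""
    h, w = img.shape
    ys = np.arange(size) * h // size          # nearest-neighbor row indices
    xs = np.arange(size) * w // size          # nearest-neighbor column indices
    small = img[np.ix_(ys, xs)]               # size x size thumbnail
    return (small > small.mean()).astype(np.uint8).ravel()  # 64-bit signature

def hamming(h1, h2):
    """Number of differing bits between two hash signatures."""
    return int((h1 != h2).sum())
```

Images whose signatures are within a small Hamming distance of each other would be treated as near-duplicates and filtered.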
2.2. Model Pre-training
Unified Image-Text Contrastive Learning
Since the data is web-scale, multiple images can carry identical captions
→ pre-train with the UniCL objective rather than CLIP's
Through experiments, the authors observed that long, content-rich language descriptions are more beneficial for image-text representation learning than short ones
So they built prompt templates to augment short descriptions
(ex. "A photo of the [WORD]", "A cropped photo of [WORD]")
During training, one of the templates is randomly sampled for augmentation
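The template augmentation above can be sketched as follows; the two template strings follow the paper's examples, but the full template set and exact sampling scheme are assumptions here:

```python
import random

# Hypothetical template set in the spirit of the paper's two examples;
# the actual set used by Florence is larger and not reproduced here.
TEMPLATES = [
    "A photo of the {}.",
    "A cropped photo of {}.",
]

def augment_description(short_text: str, rng: random.Random) -> str:
    """Expand a short description by sampling one template at random,
    as done during the first pre-training stage."""
    template = rng.choice(TEMPLATES)
    return template.format(short_text)
```

For example, `augment_description("dog", random.Random(0))` returns one of the two templated strings for "dog".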
However, a generated language prompt is not a precise description of the image
Through experiments, the authors found that generated language prompts hurt retrieval and vision-language task performance
To prevent this, the model is trained in 2 stages
stage 1 : train the model on all data, including the augmented texts
stage 2 : train the model with the augmented data excluded
image size : 224 x 224 → 384 x 384
language description max length : 76
batch size : 24,576
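A minimal sketch of a UniCL-style objective, assuming L2-normalized features and a target matrix in which any image/text pair sharing the same caption id counts as a positive (details of the actual Florence implementation, e.g. the learnable temperature, are omitted):

```python
import numpy as np

def unicl_loss(img_feat, txt_feat, caption_ids, tau=0.07):
    """Unified contrastive loss sketch: unlike CLIP's strict one-to-one
    targets, images sharing a caption id are all positives for that text.
    img_feat, txt_feat: (N, D) L2-normalized; caption_ids: (N,) ints."""
    logits = img_feat @ txt_feat.T / tau                    # (N, N) similarities
    pos = (caption_ids[:, None] == caption_ids[None, :]).astype(float)
    targets = pos / pos.sum(axis=1, keepdims=True)          # multi-positive rows

    def xent(lg, tg):
        lg = lg - lg.max(axis=1, keepdims=True)             # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -(tg * logp).sum(axis=1).mean()

    # bidirectional: image-to-text and text-to-image
    # (pos is symmetric, so the same targets apply in both directions)
    return 0.5 * (xent(logits, targets) + xent(logits.T, targets))
```

With unique caption ids the target matrix collapses to the identity and this reduces to the standard CLIP objective.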
Transformer-based Florence Pre-trained Models
image encoder : hierarchical Vision Transformer
why use a hierarchical architecture? → to model the scale-invariant nature of images
use a modified Swin Transformer → CoSwin Transformer
the modifications reportedly reference CvT
image features are extracted via global average pooling
language encoder : 12-layer Transformer
2.3. Task Adaptation
rather than keeping the pre-trained image encoder frozen, it is fine-tuned for most tasks
Object-level Visual Representation Learning
built a large-scale object detection dataset so Florence can learn fine-grained (object-level) representations
→ FLOD-9M (FLorence Object detection Dataset)
FLOD-9M = open-source object detection datasets (COCO, LVIS, OpenImages, Objects365) + a pseudo-labeled dataset (ImageNet-22K with pseudo bounding boxes)
freeze the image encoder and pre-train only the Dynamic Head adapter on FLOD-9M
Fine-Grained V+L Representation Learning
uses the METER adapter so fine-grained vision-language representations can be learned
trained with an image-text matching loss and a masked language modeling loss, then fine-tuned on downstream tasks like VQA
the image encoder is presumably not frozen here
Adaptation to Video Recognition
Video CoSwin adapter
image tokenization layer (2D) → video tokenization layer (3D)
2D patch merging operator → 3D convolution-based patch merging operator
2D shifted window design → 3D shifted local windows in self-attention layers
the Video CoSwin adapter lets Florence handle video tasks as well
to avoid memory issues, a dynamic window size strategy is applied
(relatively small window size in early stages of CoSwin, and large window sizes in its later stages)
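One common way to realize the 2D → 3D conversion above is weight inflation: replicate the pre-trained 2D kernel along time and rescale, so the 3D layer initially reproduces the 2D output on static clips. Whether Florence initializes exactly this way is an assumption; this sketch shows the standard trick:

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel of shape (out, in, kh, kw) into a 3D one
    of shape (out, in, t, kh, kw) by replicating along the new temporal
    axis and dividing by t, so summing over time recovers the 2D kernel."""
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t
```

Applied to the patch-merging convolution, a video of identical frames then yields the same merged features as the original 2D layer.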
2.4. Training Infrastructure
There are 2 challenges in training Florence
→ reducing the memory cost, increasing the throughput
reducing the memory cost → ZeRO, activation checkpointing, gradient cache
increasing the throughput → mixed-precision training
- Zero Redundancy Optimizer (ZeRO)
partitions the optimizer states, gradients and parameters across the GPUs and each partition is only updated locally
- Activation Checkpointing
for checkpointed model component (multi-head attention), reruns a forward pass during backward pass
internal gradients in the component do not need to be stored in the forward pass
- Gradient Cache
factor the contrastive loss by breaking the large batch gradient update into several sub-updates that can fit into GPU memory
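The gradient-cache idea can be demonstrated end to end with a linear encoder and a softmax loss that couples the whole batch; this is a toy numpy sketch of the technique, not Florence's actual training code:

```python
import numpy as np

def softmax_xent_grad(z, y):
    """Gradient of the mean cross-entropy loss wrt logits z (N, C),
    with integer labels y (N,)."""
    z = z - z.max(axis=1, keepdims=True)       # numerical stability
    p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0
    return p / len(y)

def grad_cache_update(X, W, y, chunk=2):
    """Gradient-cache sketch for a linear encoder z = X @ W:
    1) forward in chunks, caching only the (small) features z;
    2) compute dL/dz once on the full batch, since the loss couples all rows;
    3) re-walk the chunks to accumulate dL/dW, never holding the whole
       batch's intermediate state at once."""
    chunks = [slice(i, i + chunk) for i in range(0, len(X), chunk)]
    z = np.concatenate([X[s] @ W for s in chunks])   # step 1: cached features
    dz = softmax_xent_grad(z, y)                     # step 2: full-batch loss grad
    dW = np.zeros_like(W)
    for s in chunks:                                 # step 3: per-chunk backward
        dW += X[s].T @ dz[s]
    return dW
```

The chunked update is mathematically identical to the full-batch gradient, which is why the technique allows contrastive training with batch sizes that would not otherwise fit in GPU memory.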
- Mixed-precision Training
different operations are trained with different numerical precisions
numerically less stable operations (layer normalization) → float-32
other operations → float-16
3. Experiments
It is not that Florence itself, as a foundation model, can do everything out of the box
Rather: take the UniCL-pre-trained Florence, attach an adapter for each downstream task, and fine-tune; that is what yields the good performance
Except for classification and retrieval, which CLIP can already do, an adapter must be trained for each task
and for every task except object detection, the image encoder is fine-tuned together with the adapter
(object detection can run zero-shot: freeze the image encoder and train only the adapter)
In other words, think of Florence as a good weight-initialization starting point
3.1. Classification & Retrieval
Zero-shot Transfer in Classification
Florence outperforms state-of-the-art methods on 9 of 12 tasks
Linear Probe in Classification
freeze image encoder and only fine-tune the linear layer on the downstream datasets
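A linear probe keeps the encoder frozen and fits only a linear head on its features; the closed-form ridge fit below is a stand-in for the actual softmax training (an assumption made here so the fit has a closed form):

```python
import numpy as np

def linear_probe(feats, labels, n_classes, lam=1e-2):
    """Fit a linear classifier on frozen encoder features via ridge
    regression onto one-hot labels (a simplification of softmax training).
    feats: (N, D); labels: (N,) ints. Returns W of shape (D + 1, C)."""
    F = np.hstack([feats, np.ones((len(feats), 1))])   # append a bias column
    Y = np.eye(n_classes)[labels]                      # one-hot targets
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ Y)

def probe_predict(feats, W):
    """Predict classes with the fitted linear head."""
    F = np.hstack([feats, np.ones((len(feats), 1))])
    return (F @ W).argmax(axis=1)
```

The encoder never receives gradients; only `W` is learned per downstream dataset.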
our results are consistently better than existing state-of-the-art results, except for two datasets: CIFAR10, CIFAR100
why? → the input image resolution is quite low
ImageNet-1K Fine-tune Evaluation
train on task-specific data using the same pre-training loss (UniCL)
our result is slightly worse than SOTA, but their model and data scale are both 3× larger
Few-shot Cross-domain Classification
append a single linear layer as an adapter head to our image encoder
previous work employs ensembles and transductive learning
we employ a single model and no transduction on the test data
yet we achieve higher results without any "bells and whistles"
Image-Text Retrieval
for fine-tuning retrieval, we continuously train our image and text encoders on the target image-text pair data
zero-shot Florence matches or outperforms all prior zero-shot results on these two datasets
our results are superior to all previous fine-tuning results on the two datasets
3.2. Space (Object Detection)
Fine-tuning
evaluate fine-tuning on three popular object detection datasets
→ COCO, Object365, Visual Genome
Florence establishes new results in these main benchmarks of object detection
Zero-shot Transfer
freeze the image encoder and pre-train the Dynamic Head on FLOD-9M
→ detection pre-training only updates the object adapter, and does not affect the fused feature representations learned from large-scale image-text pairs
inference : get object proposals from pre-trained image encoder + Dynamic Head and apply zero-shot classification for each object proposal
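The zero-shot classification step above can be sketched as matching each proposal's visual embedding against class-name text embeddings by cosine similarity (proposal and text feature extraction is assumed already done; the temperature value is illustrative):

```python
import numpy as np

def classify_proposals(prop_feat, class_text_feat, class_names, tau=0.07):
    """Assign a zero-shot label to each object proposal.
    prop_feat: (P, D) L2-normalized proposal embeddings;
    class_text_feat: (C, D) L2-normalized class-name text embeddings."""
    sims = prop_feat @ class_text_feat.T / tau   # (P, C) cosine similarities
    best = sims.argmax(axis=1)                   # most similar class per proposal
    return [class_names[i] for i in best]
```

Because the class set lives entirely in the text embeddings, new categories can be added at inference time without retraining.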
Florence model effectively zero-shot transfers to these tasks
noticeable performance gap between zero-shot and supervised learning, especially for novel scenarios whose concepts/classes may not be covered by the pre-training dataset
ex. BCCD (blood cells photos), Chess Pieces (Chess board photos and various pieces)
however, the results are encouraging when compared with few-shot fine-tuning results
3.3. Modality (Vision & Language)
VQA
fine-tune the pre-trained model on the VQA task
as a common practice, the problem is cast as a classification task where each class corresponds to an answer
final pooling representations + MLP to predict the answer
use binary cross-entropy loss
inference : select the answer with the highest confidence
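A minimal sketch of the VQA head described above, assuming soft answer scores in [0, 1] as targets (standard VQA practice, since multiple annotators may give partial agreement) and a plain numpy sigmoid/BCE:

```python
import numpy as np

def vqa_bce_loss(logits, soft_targets):
    """Binary cross-entropy over the answer vocabulary: each answer class
    is an independent binary label with a soft target score in [0, 1].
    logits, soft_targets: (C,) over C candidate answers."""
    p = 1.0 / (1.0 + np.exp(-logits))            # per-answer confidence
    eps = 1e-9                                    # avoid log(0)
    return -np.mean(soft_targets * np.log(p + eps)
                    + (1 - soft_targets) * np.log(1 - p + eps))

def vqa_predict(logits, answers):
    """Inference: select the answer with the highest confidence."""
    return answers[int(np.argmax(logits))]
```

The MLP producing `logits` from the pooled multimodal representation is omitted here.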
we achieve the new state-of-the-art performance
SimVLM uses 1.8B image-text pairs, but we only use 900M data for pre-train and 20M for VLP
→ data efficiency of our approach
3.4. Time (Video)
Zero-Shot Text-to-Video Retrieval
how exactly the zero-shot retrieval is performed is not clearly explained
my guess: the video representation may be the average pool of the frame-level representations
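If that mean-pooling guess is right, zero-shot text-to-video retrieval would look roughly like this (the pooling choice is conjecture, not confirmed by the paper):

```python
import numpy as np

def video_embedding(frame_feats):
    """Average-pool per-frame image embeddings into one video embedding
    (the mean pooling is a conjecture about the paper's method), then
    re-normalize so cosine similarity is a dot product."""
    v = frame_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve(text_feat, video_feats):
    """Rank videos by cosine similarity to the (normalized) query text
    embedding; returns video indices, best match first."""
    sims = np.stack([video_embedding(f) for f in video_feats]) @ text_feat
    return np.argsort(-sims)
```

This requires no video-specific training at all, which is what would make the transfer "zero-shot".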
two image-text pretrained models, CLIP and Florence, outperform all the state-of-the-art methods
→ this suggests that the video data used for pre-training in those state-of-the-art methods may not be as rich or diverse as the image-text data used in Florence or CLIP
Video Action Recognition
fine-tune the model
our results are better than the state-of-the-art