
[2021 arXiv] Florence: A New Foundation Model for Computer Vision #220

@Jasonlee1995


Recently, vision foundation models pre-trained on large-scale web image-text datasets (CLIP, ALIGN) have shown zero-shot capability and efficient transfer learning.

However, these models are trained so that image and text representations map to each other, learning a cross-modal shared representation
→ so they only perform well on image-to-text mapping tasks such as classification, retrieval, and tagging

Seeing this limitation, the authors say the following question came to mind:
"What is the foundation model for computer vision?"

Their answer, in short: it should be built as a pre-trained model + adapters, so that it can handle a wide variety of vision tasks.

The paper's 3 main contributions are:

  1. classify the many computer vision tasks along 3 axes
    space, time, modality
  2. propose Florence, which can perform vision tasks across the space-time-modality space
    Florence's 4 key components : data curation, model pre-training, task adaptation, training infrastructure
  3. Florence shows strong performance on a wide range of vision tasks

Below is a brief summary of the parts I consider important.

1. Space-Time-Modality Space

[Figure: computer vision tasks arranged along the space, time, and modality axes]

The full range of computer vision tasks can be classified along 3 axes:

  1. space : from coarse (ex. scene-level classification) to fine-grained (ex. object detection)
  2. time : from static (ex. images) to dynamic (ex. videos)
  3. modality : from RGB only to multiple senses (ex. language, depth)

2. Approach

The 4 core components that went into building Florence:

2.1. Dataset Curation

  1. collect a total of 3 billion images with raw descriptions from the internet
  2. data filtering
    hash-based near-duplicate image removal, small-size image removal, image-text relevance
  3. data selection
    to improve balance, informativeness, learnability
  4. post-filtering
    for legal, ethical constraints

the above process yields a 900-million image-text-pair dataset
→ named FLD-900M (FLorenceDataset)
900M images with 900M free-form texts (ranging from one word or phrase to sentences)
9.7M unique queries, 7.5B tokens
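As a rough illustration, the hash-based near-duplicate removal in step 2 of the pipeline could look like the sketch below. The paper does not say which hash it uses; an average hash (aHash) over a downsampled grayscale grid is assumed here, and `dedup` is a hypothetical helper name.

```python
# Hedged sketch of hash-based near-duplicate image removal (curation step 2).
# The actual hash used for FLD-900M is unspecified; an average hash (aHash)
# over a downsampled grayscale grid is assumed for illustration.

def average_hash(pixels):
    """pixels: flat list of grayscale values from a downsampled image."""
    mean = sum(pixels) / len(pixels)
    # Each bit records whether a cell is at least as bright as the mean.
    return sum(1 << i for i, p in enumerate(pixels) if p >= mean)

def dedup(images):
    """Keep the first image seen for each hash value; drop near-duplicates."""
    seen, kept = set(), []
    for name, pixels in images:
        h = average_hash(pixels)
        if h not in seen:
            seen.add(h)
            kept.append(name)
    return kept
```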

2.2. Model Pre-training

Unified Image-Text Contrastive Learning

because the data is web-scale, multiple images can carry identical captions
→ pre-train with the UniCL objective rather than CLIP's

the authors observed experimentally that rich, long language descriptions are more beneficial for image-text representation learning than short ones
therefore, prompt templates are built to augment short descriptions
(ex. "A photo of the [WORD]", "A cropped photo of [WORD]")
during training, one of the templates is randomly sampled for each augmentation
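The template augmentation itself is simple; a sketch (only the two templates quoted above are included, the paper's full template list is not reproduced in this summary):

```python
import random

# Sketch of prompt-template augmentation for short descriptions.
# Only the two templates quoted above are listed; the paper uses more.
TEMPLATES = [
    "A photo of the {}.",
    "A cropped photo of {}.",
]

def augment(short_description):
    """Expand a short text (e.g. a single query word) with a random template."""
    return random.choice(TEMPLATES).format(short_description)
```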

however, a generated language prompt is not a precise description of the image
the authors found experimentally that generated prompts hurt performance on retrieval and vision-language tasks
to prevent this, the model is trained in 2 stages
stage 1 : train on all data, including the augmented texts
stage 2 : continue training with the augmented data excluded
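A minimal NumPy sketch of the UniCL idea, assuming unit-norm image features `v`, text features `u`, and integer `labels` where two samples share a label exactly when their captions are identical. Unlike CLIP's one-hot targets, every sample sharing a caption counts as a positive:

```python
import numpy as np

def unicl_loss(v, u, labels, temperature=0.07):
    """Bidirectional contrastive loss with label-aware (multi-positive) targets."""
    logits = v @ u.T / temperature                       # (B, B) similarities
    pos = (labels[:, None] == labels[None, :]).astype(float)
    # image-to-text: average log-likelihood over all positive texts
    log_p_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    i2t = -(pos * log_p_i2t).sum(axis=1) / pos.sum(axis=1)
    # text-to-image: same, with the roles of images and texts swapped
    log_p_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    t2i = -(pos * log_p_t2i).sum(axis=1) / pos.sum(axis=1)
    return float((i2t + t2i).mean() / 2)
```

When all labels are distinct (no shared captions), `pos` becomes the identity matrix and this reduces to the CLIP loss.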

image size : 224 x 224 → 384 x 384
language description max length : 76
batch size : 24,576

Transformer-based Florence Pre-trained Models

image encoder : hierarchical Vision Transformer
why a hierarchical architecture? → to model the scale-invariant nature of images
a modified Swin Transformer is used → CoSwin Transformer
(the modifications follow CvT)
image features are extracted via global average pooling

language encoder : 12-layer Transformer

2.3. Task Adaptation

rather than using the pre-trained image encoder frozen, it is mostly fine-tuned

Object-level Visual Representation Learning

so that Florence can learn fine-grained representations, a large-scale object detection dataset is built
→ FLOD-9M (FLorence Object detection Dataset)
FLOD-9M = open-source object detection datasets (COCO, LVIS, OpenImages, Objects365) + a pseudo-labeled dataset (ImageNet-22K with pseudo bounding boxes)


the image encoder is frozen and only the Dynamic Head adapter is pre-trained on FLOD-9M

Fine-Grained V+L Representation Learning

the METER adapter is used so that fine-grained vision-language representations can be learned
trained with an image-text matching loss and a masked language modeling loss, then fine-tuned on downstream tasks such as VQA
(the image encoder is presumably not frozen here)

Adaptation to Video Recognition

Video CoSwin adapter
image tokenization layer (2D) → video tokenization layer (3D)
2D patch merging operator → 3D convolution-based patch merging operator
2D shifted window design → 3D shifted local windows in self-attention layers

the Video CoSwin adapter is used so that video tasks can be handled as well
to avoid memory issues, a dynamic window size strategy is applied
(relatively small window sizes in the early stages of CoSwin, larger ones in its later stages)
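The 2D→3D changes above are typically implemented by "inflating" pretrained 2D conv weights along a new temporal axis. The paper's exact initialization is not given in this summary; I3D-style temporal averaging (replicate along time and divide by T) is assumed in this sketch, so the 3D layer initially reproduces the 2D output on a static clip:

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel (C_out, C_in, H, W) to 3D (C_out, C_in, T, H, W).

    Replicating along time and dividing by t keeps the response to a clip of
    t identical frames equal to the original 2D response on one frame.
    """
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t
```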

2.4. Training Infrastructure

there are 2 challenges in training Florence
→ reducing the memory cost, increasing the throughput

reducing the memory cost → ZeRO, activation checkpointing, gradient cache
increasing the throughput → mixed-precision training

  • Zero Redundancy Optimizer (ZeRO)
    partitions the optimizer states, gradients, and parameters across the GPUs; each partition is updated only locally
  • Activation Checkpointing
    for checkpointed model component (multi-head attention), reruns a forward pass during backward pass
    internal gradients in the component do not need to be stored in the forward pass
  • Gradient Cache
    factor the contrastive loss by breaking the large batch gradient update into several sub-updates that can fit into GPU memory
  • Mixed-precision Training
    different operations run at different numerical precision
    numerically less stable operations (layer normalization) → float-32
    other operations → float-16
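The Gradient Cache idea can be illustrated with a toy linear encoder and a toy loss (not the actual contrastive loss): first embed the full batch without keeping activations and compute the loss gradient with respect to the embeddings, then re-run the encoder chunk by chunk and backpropagate the cached gradients, so only one chunk's activations are in memory at a time. The function name and loss below are illustrative:

```python
import numpy as np

def grad_cache_step(x, W, chunk=2):
    """Toy Gradient Cache: encoder f(x) = x @ W, toy loss L = sum(z ** 2).

    Pass 1: full-batch embeddings (no activation graph kept) -> dL/dz.
    Pass 2: re-embed small chunks and accumulate dL/dW chunk by chunk.
    """
    z = x @ W              # pass 1: embeddings for the whole large batch
    dz = 2 * z             # gradient of the toy loss w.r.t. embeddings (cached)
    dW = np.zeros_like(W)
    for s in range(0, len(x), chunk):
        xc, dzc = x[s:s + chunk], dz[s:s + chunk]
        dW += xc.T @ dzc   # pass 2: per-chunk forward + cached backward
    return dW
```

The accumulated `dW` equals the full large-batch gradient, which is what makes the sub-update decomposition exact rather than approximate.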

3. Experiments

Florence being a foundation model does not mean "one model does everything out of the box."
Rather: there is a Florence pre-trained with UniCL, and attaching a task-specific adapter and fine-tuning yields strong performance on each downstream task.
Except for classification and retrieval, which CLIP can already handle, every task requires training an adapter,
and every task except object detection also fine-tunes the image encoder
(object detection can be done zero-shot by freezing the image encoder and training only the adapter).

In other words, Florence is best thought of as a strong weight-initialization starting point.

3.1. Classification & Retrieval

Zero-shot Transfer in Classification

Florence outperforms on 9/12 tasks compared with state-of-the-art methods

Linear Probe in Classification

freeze image encoder and only fine-tune the linear layer on the downstream datasets

our results are consistently better than existing state-of-the-art results, except for two datasets: CIFAR10, CIFAR100
why? → the input image resolution is quite low

ImageNet-1K Fine-tune Evaluation

train on task-specific data using the same pre-training loss (UniCL)

our result is slightly worse than SOTA, but their model and data scale are both 3× larger

Few-shot Cross-domain Classification

append a single linear layer as an adapter head to our image encoder

previous work employs ensembles and transductive learning
we employ a single model and no transduction on the test data
yet we achieve higher results without any "bells and whistles"

Image-Text Retrieval

for fine-tuning retrieval, we continuously train our image and text encoders on the target image-text pair data

zero-shot Florence matches or outperforms all prior zero-shot results on these two datasets
our results are superior to all previous fine-tuning results on the two datasets

3.2. Space (Object Detection)

Fine-tuning

evaluate fine-tuning on three popular object detection datasets
→ COCO, Object365, Visual Genome

Florence establishes new results in these main benchmarks of object detection

Zero-shot Transfer

freeze the image encoder and pre-train the Dynamic Head on FLOD-9M
→ detection pre-training only updates the object adapter, and does not affect the fused feature representations learned from large-scale image-text pairs
inference : get object proposals from pre-trained image encoder + Dynamic Head and apply zero-shot classification for each object proposal
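The inference step described above amounts to the following sketch (unit-norm embeddings and names are illustrative, not the paper's API): the Dynamic Head supplies class-agnostic proposals, and each proposal embedding is matched against prompt-embedded class names, exactly as in zero-shot classification:

```python
import numpy as np

def classify_proposals(proposal_feats, class_text_feats, class_names):
    """Assign each object proposal the class whose text embedding is closest.

    proposal_feats: (N, D) unit-norm visual embeddings of the proposals.
    class_text_feats: (C, D) unit-norm text embeddings of class prompts.
    """
    sims = proposal_feats @ class_text_feats.T   # (N, C) cosine similarities
    return [class_names[i] for i in sims.argmax(axis=1)]
```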

Florence model effectively zero-shot transfers to these tasks

noticeable performance gap between zero-shot and supervised learning, especially for novel scenarios whose concepts/classes may not be covered by the pre-training dataset
ex. BCCD (blood cells photos), Chess Pieces (Chess board photos and various pieces)

however, the results are encouraging when compared with few-shot fine-tuning results

3.3. Modality (Vision & Language)

VQA

fine-tune the pre-trained model on the VQA task
as a common practice, the problem is cast as a classification task where each class corresponds to an answer
final pooling representations + MLP to predict the answer
use binary cross-entropy loss
inference : select the answer with the highest confidence
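The head described above, as a sketch (shapes and names are illustrative; the MLP is omitted here in favor of raw per-answer logits):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(logits, targets):
    """Binary cross-entropy over answer logits (one independent logit per answer)."""
    p = sigmoid(logits)
    return float(-(targets * np.log(p) + (1 - targets) * np.log(1 - p)).mean())

def predict_answer(logits, answers):
    """Inference: select the answer with the highest confidence."""
    return answers[int(np.argmax(logits))]
```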

we achieve the new state-of-the-art performance
SimVLM uses 1.8B image-text pairs, but we only use 900M data for pre-train and 20M for VLP
→ data efficiency of our approach

3.4. Time (Video)

Zero-Shot Text-to-Video Retrieval

how exactly the zero-shot transfer is done is not clearly explained
my guess: the video representation is the average pool of the frame-level representations
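A sketch of one plausible zero-shot setup, mean-pooling frame-level image embeddings into a single video embedding and retrieving by cosine similarity as in image-text retrieval (this is speculation, not a confirmed detail of the paper):

```python
import numpy as np

def video_embedding(frame_feats):
    """Speculative: average frame-level embeddings, renormalize to unit length."""
    v = frame_feats.mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve(text_feat, video_feats):
    """Return the index of the video whose embedding best matches the text."""
    sims = np.array([vf @ text_feat for vf in video_feats])
    return int(sims.argmax())
```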

two image-text pretrained models CLIP and Florence outperform all the state-of-the-art methods
→ reveals that the video data used for pre-training in these state-of-the-art methods may not be so rich or diverse as image-text data used in Florence or CLIP

Video Action Recognition

fine-tune the model

our results are better than the state-of-the-art
