
[2022 CVPR] RegionCLIP: Region-based Language-Image Pretraining #222

@Jasonlee1995

Description


With the advent of vision-language models such as CLIP and ALIGN, zero-shot classification became possible.

A natural follow-up question: if we can extract good object regions, shouldn't zero-shot object detection also be possible?

The authors tried a simple R-CNN-style object detector combined with pre-trained CLIP, and found that it performs poorly.

To diagnose the problem, they test two scenarios:

  1. using bounding boxes produced by an object region proposal network
    Figure 1 (a) → the localization quality of CLIP's scores is poor
  2. using ground-truth bounding boxes
    Figure 1 (b) → classification accuracy drops sharply

So why does CLIP perform poorly here?

CLIP was trained on (image, image-level text) data, not on (image region, region-level text) data.

In other words, it never learned fine-grained alignment, so the domain shift hurts its performance.

The simplest way to overcome this is to train CLIP on (image region, region-level text) data.

The authors use pre-trained CLIP to generate (image region, region-level text) data, and propose RegionCLIP, a model pre-trained on these pseudo region-text pairs.

The pseudo region-text pairs are noisy, but since they require no human annotation, the approach is scalable.

Only the parts I consider important are briefly summarized below.

1. Method

intuition

goal : zero-shot object detection

Object detection can be split into detection and recognition
detection → use an existing object region proposal network
recognition → train the model to do zero-shot classification well on image regions

How should the recognition model be trained?
Train the visual encoder so that image regions match their region descriptions
(the language encoder is not trained)
But there is no large-scale (image region, region-level text) data to train on

How can (image region, region-level text) data be obtained?
To do zero-shot object detection, the model must learn diverse object concepts
So using (image, image-level text) data fits the goal

Turning (image, image-level text) data into (image region, region-level text) data raises 2 problems

  1. (image, image-level text) data contains no fine-grained alignment like (image region, region-level text)
    → use pre-trained CLIP to create pseudo (image region, region-level text) data
  2. not every object in an image appears in the image-level text description
    → build an object concept pool and use templates to create region-level text descriptions
pseudo region-text pair dataset construction

how image region features are extracted

  1. extract image regions with off-the-shelf object localizers (e.g. RPN)
  2. extract region features with a vision encoder + a feature pooling method (e.g. RoIAlign)
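The two steps above can be sketched in NumPy; the average pooling here is a simplified stand-in for RoIAlign, and all names are hypothetical:

```python
import numpy as np

def extract_region_features(feature_map, boxes):
    """Toy stand-in for 'vision encoder + RoIAlign': crop each box
    from a backbone feature map and average-pool it to one vector."""
    feats = []
    for x1, y1, x2, y2 in boxes:
        crop = feature_map[y1:y2, x1:x2]       # spatial crop of the box
        feats.append(crop.mean(axis=(0, 1)))   # pool H x W -> (C,)
    feats = np.stack(feats)
    # L2-normalize, matching CLIP-style embeddings on the unit sphere
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

rng = np.random.default_rng(0)
fmap = rng.random((32, 32, 8))                 # (H, W, C) backbone features
region_feats = extract_region_features(fmap, [(0, 0, 16, 16), (8, 8, 24, 24)])
```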

how text features are extracted

  1. build an object concept pool with off-the-shelf language parsers
  2. generate text descriptions with CLIP prompt templates
  3. extract text features with the pre-trained CLIP text encoder
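The concept-pool and template steps can be sketched as follows; the concept list, the template set, and the toy encoder are all hypothetical stand-ins for the real components (a language parser and CLIP's text encoder):

```python
import hashlib
import numpy as np

# Hypothetical concept pool, as if parsed out of captions by a language parser
concepts = ["dog", "kite", "umbrella"]
# CLIP-style prompt templates (stand-ins for CLIP's actual template set)
templates = ["a photo of a {}.", "a photo of the {}."]

def toy_text_encoder(text, dim=8):
    """Deterministic toy embedding standing in for CLIP's text encoder."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).standard_normal(dim)
    return v / np.linalg.norm(v)

# One feature per concept: encode each filled template, average, re-normalize
# (averaging over templates follows common CLIP zero-shot practice)
concept_feats = []
for c in concepts:
    vs = np.stack([toy_text_encoder(t.format(c)) for t in templates])
    mean = vs.mean(axis=0)
    concept_feats.append(mean / np.linalg.norm(mean))
concept_feats = np.stack(concept_feats)
```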

Pseudo region-text pair data is then constructed with pre-trained CLIP, matching each region to its best-scoring concept
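A minimal sketch of the matching step, assuming both feature sets are L2-normalized so that a dot product is cosine similarity; the features here are hand-built toys:

```python
import numpy as np

def pseudo_pair(region_feats, concept_feats, concepts):
    """Assign each region the highest-scoring concept under the teacher:
    cosine similarity, since both feature sets are L2-normalized."""
    sims = region_feats @ concept_feats.T      # (num_regions, num_concepts)
    best = sims.argmax(axis=1)
    return [(i, concepts[j]) for i, j in enumerate(best)]

# Toy example: two orthogonal concept vectors, two regions leaning each way
concepts = ["dog", "kite"]
concept_feats = np.eye(2)
region_feats = np.array([[0.9, 0.1], [0.2, 0.8]])
region_feats /= np.linalg.norm(region_feats, axis=1, keepdims=True)
pairs = pseudo_pair(region_feats, concept_feats, concepts)
# → [(0, "dog"), (1, "kite")]
```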

region-based language-image pre-training

Pre-training uses 3 losses
→ region-text contrastive loss, region-text distillation loss, image-level contrastive loss


region-text contrastive loss
Difference from the CLIP loss: there is no text-side loss
(no contrastive loss between one text prompt and multiple images)
temperature $\tau = 0.01$ is used
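A rough NumPy sketch of this one-directional (region → text) loss with the stated temperature; the exact batching and negative set used in the paper may differ:

```python
import numpy as np

def region_text_contrastive(region_feats, text_feats, targets, tau=0.01):
    """Region -> text InfoNCE loss; unlike the symmetric CLIP loss there
    is no text -> region term. Features are assumed L2-normalized."""
    logits = region_feats @ text_feats.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two regions perfectly aligned with their matched concepts (indices 0 and 1)
r = np.array([[1.0, 0.0], [0.0, 1.0]])
t = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = region_text_contrastive(r, t, targets=[0, 1])   # near zero
```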


region-text distillation loss
KL divergence loss between the teacher model and the student model
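A minimal sketch of such a KL distillation term, assuming the teacher and student each produce logits over the same concept pool for a given region:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_logits, teacher_logits):
    """KL(teacher || student), averaged over regions: the student's
    distribution over concepts is pulled toward the teacher's."""
    p = softmax(teacher_logits)                # teacher soft targets
    log_q = np.log(softmax(student_logits))
    return (p * (np.log(p) - log_q)).sum(axis=1).mean()

same = distillation_loss(np.array([[2.0, 0.0]]), np.array([[2.0, 0.0]]))
diff = distillation_loss(np.array([[0.0, 2.0]]), np.array([[2.0, 0.0]]))
```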

image-level contrastive loss
the standard CLIP loss on image-text pairs rather than region-text pairs

2. Experiments

transfer learning for object detection

Trained on base categories with a class-wise weighted cross-entropy loss
(focal scaling applied to the base-category probabilities)
Focal scaling is used to keep the model from forgetting the object concepts learned during pre-training

To avoid classifying regions into the background category, the background-category embedding is fixed to all zeros
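The focal-scaled cross-entropy described above can be sketched as follows; the value of gamma is hypothetical, not taken from the paper:

```python
import numpy as np

def focal_scaled_ce(probs, target, gamma=0.5):
    """Cross-entropy down-weighted by (1 - p_target)^gamma: confident
    (easy) base-category predictions contribute less to the loss, which
    softens overfitting to the base categories."""
    p = probs[target]
    return -((1.0 - p) ** gamma) * np.log(p)

easy = focal_scaled_ce(np.array([0.95, 0.05]), target=0)  # confident, small loss
hard = focal_scaled_ce(np.array([0.55, 0.45]), target=0)  # uncertain, larger loss
```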


Table 1, 2
our detector outperforms the previously published SOTA on all metrics on COCO and LVIS
these detection results are achieved by using a single pre-trained backbone, with standard data augmentation and 1x training schedule
→ our region-based vision-language pre-training has learned better alignment between image regions and object concepts, and thus facilitates open-vocabulary object detection

fully supervised object detection

Table 3
the detector initialized by our pre-trained visual backbone largely outperforms the baselines that are initialized by ImageNet and CLIP backbones
→ our proposed pre-training method helps the fully supervised detector converge faster and achieves better performance at 1x schedule

zero-shot inference for object detection

Zero-shot inference uses the geometric mean of the RPN objectness score and the category confidence score
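A minimal sketch of this scoring rule, assuming a plain (unweighted) geometric mean:

```python
import numpy as np

def zero_shot_score(objectness, category_probs):
    """Final per-category detection score for one box: geometric mean of
    the RPN objectness score and the category confidence."""
    return np.sqrt(objectness * category_probs)

obj = 0.81                        # RPN objectness for one box
probs = np.array([0.64, 0.04])    # per-category confidences
scores = zero_shot_score(obj, probs)
# → [0.72, 0.18]
```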


Table 4
using GT (ground-truth bounding boxes) as region proposals
our pre-trained model outperforms CLIP baseline by a clear margin across datasets

using RPN (used in pre-training) as region proposals
our model still clearly outperforms CLIP and OVR

→ our pre-training method with region-text alignment improves the visual recognition ability for image regions

ablation study

Table 5 - effect of different pre-training supervision
the additional supervision from image-text pairs can further improve the performance
we suspect that image-text pairs provide extra contextual information from global image description which compensates our created region descriptions

Table 6

Table 6 - effect of region proposal quality during pre-training
random boxes hurt zero-shot inference while preserving comparable performance in transfer learning
zero-shot inference benefits from higher quality of proposals, but the gap becomes smaller when human supervision is available to fine-tune the model

Table 7

Table 7 - effect of pre-training dataset and concept pool
using COCO Cap dataset or using the COCO concepts achieves better zero-shot inference performance
we hypothesize that COCO Cap has a smaller domain gap to COCO detection dataset

model pre-trained on CC3M achieves significant boost on transfer learning
we conjecture that the model learns more generic visual representation from a larger number of images in CC3M

Table 8

Table 8 - effect of different losses
distillation-only achieves results close to contrastive + distillation on zero-shot inference
contrastive + distillation achieves best performance on transfer learning

distillation loss : helps to inherit the visual-semantic knowledge from the teacher model
contrastive loss : enforces more discriminative representations for transfer learning


Table 9 - effect of using different teacher and student model
large teacher model → improve zero-shot performance, same transfer learning performance
large student model → same zero-shot performance, improve transfer learning performance

zero-shot performance relies on the teacher model that guides the region-text alignment
transfer learning performance is more likely constrained by the capacity of student model

Table 10 - effect of focal scaling during transfer learning
with focal scaling, the fine-tuned detector achieves a better balance between novel categories and base categories
we conjecture that the detector overfits to the small set of base categories in COCO, which hurts the generalization on novel categories
→ focal scaling effectively alleviates the potential overfitting

visualization

Figure 3 - results of zero-shot inference with GT boxes and 65 categories from COCO dataset
our model predicts more reasonable categories than CLIP
→ our proposed region-based vision-language pre-training can help to recognize image regions precisely


Figure 4 - zero-shot inference with GT boxes and 1203 categories from LVIS dataset
our model recognizes the image regions more correctly than CLIP
our model can also predict reasonable categories with top-3 scores
our model can still recognize the image region as visually similar concepts
