With the advent of vision-language models such as CLIP and ALIGN, zero-shot classification became possible
A natural question follows: if we can extract good object regions, shouldn't zero-shot object detection also be possible?
The authors tried a simple R-CNN-style object detector + pre-trained CLIP and confirmed that it performs poorly
To diagnose the problem, they tested two scenarios
- using bounding boxes from an object region proposal network
Figure 1 (a) → the localization quality of CLIP scores is poor
- using ground-truth bounding boxes
Figure 1 (b) → classification accuracy drops sharply
So why does CLIP perform poorly?
CLIP was trained on (image, image-level text) data, not on (image region, region-level text) data
In other words, it never learned fine-grained alignment, so performance suffers from this domain shift
The simplest way to overcome this is to train CLIP on (image region, region-level text) data
The authors propose RegionCLIP, which uses pre-trained CLIP to generate (image region, region-level text) data and is pre-trained on these pseudo region-text pairs
Pseudo region-text pair data is noisy, but scalable since it requires no human annotation
Below is a brief summary of the parts that seem most important
1. Method
intuition
goal : zero-shot object detection
object detection can be split into detection and recognition
detection → use an existing object region proposal network
recognition → train a model to do zero-shot classification well on image regions
how should the recognition model be trained?
train the visual encoder so that image regions match their region descriptions
(the language encoder is not trained)
but no large-scale (image region, region-level text) data exists for training
how can we obtain (image region, region-level text) data?
zero-shot object detection requires learning diverse object concepts
i.e., using (image, image-level text) data fits this goal
two problems arise in turning (image, image-level text) data into (image region, region-level text) data
- (image, image-level text) data contains no fine-grained alignment like (image region, region-level text)
→ use pre-trained CLIP to create pseudo (image region, region-level text) data
- not every object in an image is covered by its image-level text description
→ build an object concept pool and use templates to create region-level text descriptions
pseudo region-text pair dataset construction
how image region features are extracted
- extract image regions with off-the-shelf object localizers (e.g., RPN)
- extract region features with the vision encoder + a feature pooling method (e.g., RoIAlign)
how text features are extracted
- build an object concept pool with off-the-shelf language parsers
- generate text descriptions with CLIP prompt templates
- extract text features with the pre-trained CLIP text encoder
pseudo region-text pair data is constructed using pre-trained CLIP
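The pipeline above can be sketched as follows. This is a minimal NumPy stand-in: the random feature arrays are placeholders for real CLIP text features and RoIAlign-pooled region features, and the concept pool is a toy example, not the one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# toy concept pool (in the paper, parsed from captions by a language parser)
concepts = ["cat", "dog", "skateboard", "pizza"]
prompts = [f"a photo of a {c}" for c in concepts]  # CLIP-style prompt template

# placeholders for CLIP text features of each prompt (C=4, D=8) and
# RoIAlign-pooled region features of each RPN proposal (N=3 regions)
text_feats = l2_normalize(rng.normal(size=(len(concepts), 8)))
region_feats = l2_normalize(rng.normal(size=(3, 8)))

# match each region to its most similar concept → pseudo region-text pairs
sim = region_feats @ text_feats.T       # cosine similarity, shape (N, C)
pseudo_labels = sim.argmax(axis=1)      # index of the best concept per region
pseudo_pairs = [(i, prompts[j]) for i, j in enumerate(pseudo_labels)]
```

Each region is paired with the prompt of its highest-scoring concept; the resulting pairs are noisy but require no human labels.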
region-based language-image pre-training
pre-training with 3 losses
→ region-text contrastive loss, region-text distillation loss, image-level contrastive loss
region-text contrastive loss
difference from the CLIP loss : there is no text-side loss
(no contrastive loss between a single text prompt and the images)
temperature $\tau = 0.01$ is used
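A minimal sketch of such a region-side-only contrastive loss (NumPy, assuming L2-normalized features; the exact batching and negative sampling in the paper may differ):

```python
import numpy as np

def region_text_contrastive_loss(region_feats, text_feats, labels, tau=0.01):
    # cross-entropy over texts for each region; no symmetric text-side term
    logits = region_feats @ text_feats.T / tau           # (N_regions, N_texts)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# with perfectly matched orthonormal features the loss is ~0
loss = region_text_contrastive_loss(np.eye(3), np.eye(3), np.array([0, 1, 2]))
```

The small temperature sharpens the softmax, so even modest similarity gaps produce near-one-hot matching distributions.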
region-text distillation loss
teacher model과 student model간의 KL divergence loss
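A sketch of the distillation term (NumPy; the logits here are hypothetical region-concept similarity scores from the teacher and student):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_text_distillation_loss(student_logits, teacher_logits):
    # KL(teacher || student), averaged over regions
    p = softmax(teacher_logits)   # teacher's soft region-concept distribution
    q = softmax(student_logits)   # student's distribution
    return (p * (np.log(p) - np.log(q))).sum(axis=1).mean()
```

The loss is zero only when the student reproduces the teacher's matching distribution exactly.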
image-level contrastive loss
the CLIP loss on image-text pairs rather than region-text pairs
2. Experiments
transfer learning for object detection
trained on base categories with a class-wise weighted cross-entropy loss
(focal scaling applied to the base-category probability)
focal scaling is used to prevent forgetting the object concepts learned by the pre-trained model
to avoid classifying regions as background, the background category is fixed to an all-zero embedding
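The focal-scaled cross-entropy can be sketched like this (NumPy; `gamma` is an assumed value for illustration, not taken from the paper):

```python
import numpy as np

def focal_scaled_ce(probs, labels, gamma=0.5):
    # cross-entropy scaled by (1 - p_t) ** gamma, where p_t is the predicted
    # probability of the ground-truth base category; confident predictions
    # are down-weighted, which curbs overfitting to base categories
    p_t = probs[np.arange(len(labels)), labels]
    return (-((1.0 - p_t) ** gamma) * np.log(p_t)).mean()

probs = np.array([[0.9, 0.05, 0.05],
                  [0.8, 0.10, 0.10]])
labels = np.array([0, 0])
plain_ce = (-np.log(probs[np.arange(2), labels])).mean()
focal_ce = focal_scaled_ce(probs, labels)
```

Since the scaling factor is below 1 for any well-classified sample, the focal loss is always smaller than the plain cross-entropy on those samples.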
Table 1, 2
our detector outperforms previous published SOTA on all metrics on COCO and LVIS
these detection results are achieved by using a single pre-trained backbone, with standard data augmentation and 1x training schedule
→ our region-based vision-language pre-training has learned better alignment between image regions and object concepts, and thus facilitates open-vocabulary object detection
fully supervised object detection
Table 3
the detector initialized by our pre-trained visual backbone largely outperforms the baselines that are initialized by ImageNet and CLIP backbones
→ our proposed pre-training method helps the fully supervised detector converge faster and achieves better performance at 1x schedule
zero-shot inference for object detection
zero-shot inference uses the geometric mean of the RPN objectness score and the category confidence score
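A sketch of this scoring rule (NumPy; the shapes are illustrative):

```python
import numpy as np

def zero_shot_score(objectness, category_probs):
    # geometric mean of the RPN objectness score (N,) and the per-category
    # confidence scores (N, C) for each region proposal
    return np.sqrt(objectness[:, None] * category_probs)

scores = zero_shot_score(np.array([0.25]), np.array([[0.25, 1.0]]))
```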
Table 4
using GT (ground-truth bounding boxes) as region proposals
our pre-trained model outperforms CLIP baseline by a clear margin across datasets
using RPN (used in pre-training) as region proposals
our model still clearly outperforms CLIP and OVR
→ our pre-training method with region-text alignment improves the visual recognition ability for image regions
ablation study
Table 5 - effect of different pre-training supervision
the additional supervision from image-text pairs can further improve the performance
we suspect that image-text pairs provide extra contextual information from global image description which compensates our created region descriptions
Table 6 - effect of region proposal quality during pre-training
random boxes hurt zero-shot inference while preserving comparable performance in transfer learning
zero-shot inference benefits from higher quality of proposals, but the gap becomes smaller when human supervision is available to fine-tune the model
Table 7 - effect of pre-training dataset and concept pool
using COCO Cap dataset or using the COCO concepts achieves better zero-shot inference performance
we hypothesize that COCO Cap has a smaller domain gap to COCO detection dataset
model pre-trained on CC3M achieves significant boost on transfer learning
we conjecture that the model learns more generic visual representation from a larger number of images in CC3M
Table 8 - effect of different losses
distillation-only achieves results close to those of contrastive + distillation on zero-shot inference
contrastive + distillation achieves best performance on transfer learning
distillation loss : helps to inherit the visual-semantic knowledge from the teacher model
contrastive loss : enforces more discriminative representations for transfer learning
Table 9 - effect of using different teacher and student model
large teacher model → improve zero-shot performance, same transfer learning performance
large student model → same zero-shot performance, improve transfer learning performance
zero-shot performance relies on the teacher model that guides the region-text alignment
transfer learning performance is more likely constrained by the capacity of student model
Table 10 - effect of focal scaling during transfer learning
with focal scaling, the fine-tuned detector achieves a better balance between novel categories and base categories
we conjecture that the detector overfits to the small set of base categories in COCO, which hurts the generalization on novel categories
→ focal scaling effectively alleviates the potential overfitting
visualization
Figure 3 - results of zero-shot inference with GT boxes and 65 categories from COCO dataset
our model predicts more reasonable categories than CLIP
→ our proposed region-based vision-language pre-training can help to recognize image regions precisely
Figure 4 - zero-shot inference with GT boxes and 1203 categories from LVIS dataset
our model recognizes the image regions more correctly than CLIP
our model can also predict reasonable categories with top-3 scores
our model can still recognize the image region as visually similar concepts