With the advent of vision-language models such as CLIP and ALIGN, zero-shot classification became possible
A natural question follows: if we can extract good object regions, shouldn't zero-shot object detection also be possible?
The authors tried a simple R-CNN-style object detector + pre-trained CLIP and confirmed that it performs poorly
To diagnose the problem, they tested two scenarios
- using bounding boxes from an object region proposal network
Figure 1 (a) → the localization quality of CLIP scores is poor
- using ground-truth bounding boxes
Figure 1 (b) → classification accuracy drops sharply
So why does CLIP perform poorly?
CLIP was trained on (image, image-level text) data, not on (image region, region-level text) data
In other words, it never learned fine-grained alignment, so performance suffers from this domain shift
The simplest way to overcome this is to train CLIP on (image region, region-level text) data
The authors propose RegionCLIP, which uses pre-trained CLIP to generate (image region, region-level text) data and is pre-trained on these pseudo region-text pairs
Pseudo region-text pair data is noisy, but scalable since it requires no human annotation
Below is a brief summary of the parts that seem most important
1. Method
intuition
goal : zero-shot object detection
object detection can be split into detection and recognition
detection → use an existing object region proposal network
recognition → train a model to do zero-shot classification well on image regions
how should the recognition model be trained?
train the visual encoder so that image regions match their region descriptions
(the language encoder is not trained)
but no large-scale (image region, region-level text) data exists for training
how can we obtain (image region, region-level text) data?
zero-shot object detection requires learning diverse object concepts
i.e., using (image, image-level text) data fits this goal
two problems arise in turning (image, image-level text) data into (image region, region-level text) data
- (image, image-level text) data contains no fine-grained alignment like (image region, region-level text)
→ use pre-trained CLIP to create pseudo (image region, region-level text) data
- not every object in an image is covered by its image-level text description
→ build an object concept pool and use templates to create region-level text descriptions
pseudo region-text pair dataset construction
how image region features are extracted
- extract image regions with off-the-shelf object localizers (e.g., RPN)
- extract region features with the vision encoder + a feature pooling method (e.g., RoIAlign)
how text features are extracted
- build an object concept pool with off-the-shelf language parsers
- generate text descriptions with CLIP prompt templates
- extract text features with the pre-trained CLIP text encoder
pseudo region-text pair data is constructed using pre-trained CLIP
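The pipeline above can be sketched as follows. This is a minimal NumPy stand-in: the random feature arrays are placeholders for real CLIP text features and RoIAlign-pooled region features, and the concept pool is a toy example, not the one used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# toy concept pool (in the paper, parsed from captions by a language parser)
concepts = ["cat", "dog", "skateboard", "pizza"]
prompts = [f"a photo of a {c}" for c in concepts]  # CLIP-style prompt template

# placeholders for CLIP text features of each prompt (C=4, D=8) and
# RoIAlign-pooled region features of each RPN proposal (N=3 regions)
text_feats = l2_normalize(rng.normal(size=(len(concepts), 8)))
region_feats = l2_normalize(rng.normal(size=(3, 8)))

# match each region to its most similar concept → pseudo region-text pairs
sim = region_feats @ text_feats.T       # cosine similarity, shape (N, C)
pseudo_labels = sim.argmax(axis=1)      # index of the best concept per region
pseudo_pairs = [(i, prompts[j]) for i, j in enumerate(pseudo_labels)]
```

Each region is paired with the prompt of its highest-scoring concept; the resulting pairs are noisy but require no human labels.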
region-based language-image pre-training
pre-training with 3 losses
→ region-text contrastive loss, region-text distillation loss, image-level contrastive loss
region-text contrastive loss
difference from the CLIP loss : there is no text-side loss
(no contrastive loss between a single text prompt and the images)
temperature $\tau = 0.01$ is used
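A minimal sketch of such a region-side-only contrastive loss (NumPy, assuming L2-normalized features; the exact batching and negative sampling in the paper may differ):

```python
import numpy as np

def region_text_contrastive_loss(region_feats, text_feats, labels, tau=0.01):
    # cross-entropy over texts for each region; no symmetric text-side term
    logits = region_feats @ text_feats.T / tau           # (N_regions, N_texts)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# with perfectly matched orthonormal features the loss is ~0
loss = region_text_contrastive_loss(np.eye(3), np.eye(3), np.array([0, 1, 2]))
```

The small temperature sharpens the softmax, so even modest similarity gaps produce near-one-hot matching distributions.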
region-text distillation loss
teacher model과 student model간의 KL divergence loss
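A sketch of the distillation term (NumPy; the logits here are hypothetical region-concept similarity scores from the teacher and student):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_text_distillation_loss(student_logits, teacher_logits):
    # KL(teacher || student), averaged over regions
    p = softmax(teacher_logits)   # teacher's soft region-concept distribution
    q = softmax(student_logits)   # student's distribution
    return (p * (np.log(p) - np.log(q))).sum(axis=1).mean()
```

The loss is zero only when the student reproduces the teacher's matching distribution exactly.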
image-level contrastive loss
the CLIP loss on image-text pairs rather than region-text pairs
2. Experiments
transfer learning for object detection
trained on base categories with a class-wise weighted cross-entropy loss
(focal scaling applied to the base-category probability)
focal scaling is used to prevent forgetting the object concepts learned by the pre-trained model
to avoid classifying regions as background, the background category is fixed to an all-zero embedding
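The focal-scaled cross-entropy can be sketched like this (NumPy; `gamma` is an assumed value for illustration, not taken from the paper):

```python
import numpy as np

def focal_scaled_ce(probs, labels, gamma=0.5):
    # cross-entropy scaled by (1 - p_t) ** gamma, where p_t is the predicted
    # probability of the ground-truth base category; confident predictions
    # are down-weighted, which curbs overfitting to base categories
    p_t = probs[np.arange(len(labels)), labels]
    return (-((1.0 - p_t) ** gamma) * np.log(p_t)).mean()

probs = np.array([[0.9, 0.05, 0.05],
                  [0.8, 0.10, 0.10]])
labels = np.array([0, 0])
plain_ce = (-np.log(probs[np.arange(2), labels])).mean()
focal_ce = focal_scaled_ce(probs, labels)
```

Since the scaling factor is below 1 for any well-classified sample, the focal loss is always smaller than the plain cross-entropy on those samples.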
Table 1, 2
our detector outperforms previous published SOTA on all metrics on COCO and LVIS
these detection results are achieved by using a single pre-trained backbone, with standard data augmentation and 1x training schedule
→ our region-based vision-language pre-training has learned better alignment between image regions and object concepts, and thus facilitates open-vocabulary object detection
fully supervised object detection
Table 3
the detector initialized by our pre-trained visual backbone largely outperforms the baselines that are initialized by ImageNet and CLIP backbones
→ our proposed pre-training method helps the fully supervised detector converge faster and achieves better performance at 1x schedule
zero-shot inference for object detection
zero-shot inference uses the geometric mean of the RPN objectness score and the category confidence score
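A sketch of this scoring rule (NumPy; the shapes are illustrative):

```python
import numpy as np

def zero_shot_score(objectness, category_probs):
    # geometric mean of the RPN objectness score (N,) and the per-category
    # confidence scores (N, C) for each region proposal
    return np.sqrt(objectness[:, None] * category_probs)

scores = zero_shot_score(np.array([0.25]), np.array([[0.25, 1.0]]))
```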
Table 4
using GT (ground-truth bounding boxes) as region proposals
our pre-trained model outperforms CLIP baseline by a clear margin across datasets
using RPN (used in pre-training) as region proposals
our model still clearly outperforms CLIP and OVR
→ our pre-training method with region-text alignment improves the visual recognition ability for image regions
ablation study
Table 5 - effect of different pre-training supervision
the additional supervision from image-text pairs can further improve the performance
we suspect that image-text pairs provide extra contextual information from global image description which compensates our created region descriptions
Table 6 - effect of region proposal quality during pre-training
random boxes hurt zero-shot inference while preserving comparable performance in transfer learning
zero-shot inference benefits from higher quality of proposals, but the gap becomes smaller when human supervision is available to fine-tune the model
Table 7 - effect of pre-training dataset and concept pool
using COCO Cap dataset or using the COCO concepts achieves better zero-shot inference performance
we hypothesize that COCO Cap has a smaller domain gap to COCO detection dataset
model pre-trained on CC3M achieves significant boost on transfer learning
we conjecture that the model learns more generic visual representation from a larger number of images in CC3M
Table 8 - effect of different losses
distillation-only achieves results close to those of contrastive + distillation on zero-shot inference
contrastive + distillation achieves best performance on transfer learning
distillation loss : helps to inherit the visual-semantic knowledge from the teacher model
contrastive loss : enforces more discriminative representations for transfer learning
Table 9 - effect of using different teacher and student model
large teacher model → improve zero-shot performance, same transfer learning performance
large student model → same zero-shot performance, improve transfer learning performance
zero-shot performance relies on the teacher model that guides the region-text alignment
transfer learning performance is more likely constrained by the capacity of student model
Table 10 - effect of focal scaling during transfer learning
with focal scaling, the fine-tuned detector achieves a better balance between novel categories and base categories
we conjecture that the detector overfits to the small set of base categories in COCO, which hurts the generalization on novel categories
→ focal scaling effectively alleviates the potential overfitting
visualization
Figure 3 - results of zero-shot inference with GT boxes and 65 categories from COCO dataset
our model predicts more reasonable categories than CLIP
→ our proposed region-based vision-language pre-training can help to recognize image regions precisely
Figure 4 - zero-shot inference with GT boxes and 1203 categories from LVIS dataset
our model recognizes the image regions more correctly than CLIP
our model can also predict reasonable categories with top-3 scores
our model can still recognize the image region as visually similar concepts