Skip to content

This is the third party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection.

License

Notifications You must be signed in to change notification settings

longzw1997/Open-GroundingDino

Repository files navigation

This is the third party implementation of the paper Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection by Zuwei Long and Wei Li.

You can use this code to fine-tune a model on your own dataset, or start pretraining a model from scratch.

Supported Features

Official release version The version we replicated
Inference
Train (Object Detection data)
Train (Grounding data)
Slurm multi-machine support
Training acceleration strategy

Setup

We conduct our model testing using the following versions: Python 3.7.11, PyTorch 1.11.0, and CUDA 11.3. It is possible that other versions are also available.

  1. Clone this repository.
git clone https://github.com/longzw1997/Open-GroundingDino.git && cd Open-GroundingDino/
  1. Install the required dependencies.
pip install -r requirements.txt 
cd models/GroundingDINO/ops
python setup.py build install
python test.py
cd ../../..
  1. Download pre-trained model and BERT weights, then modify the corresponding paths in the train/test script.

Dataset

For training, we use the odvg data format to support both OD data and VG data.
Before model training begins, you need to convert your dataset into odvg format, see data_format.md | datasets_mixed_odvg.json | coco2odvg.py | grit2odvg for more details.

For testing, we use coco format, which currently only supports OD datasets.

mixed dataset
{
  "train": [
    {
      "root": "path/V3Det/",
      "anno": "path/V3Det/annotations/v3det_2023_v1_all_odvg.jsonl",
      "label_map": "path/V3Det/annotations/v3det_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/LVIS/train2017/",
      "anno": "path/LVIS/annotations/lvis_v1_train_odvg.jsonl",
      "label_map": "path/LVIS/annotations/lvis_v1_train_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/Objects365/train/",
      "anno": "path/Objects365/objects365_train_odvg.json",
      "label_map": "path/Objects365/objects365_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/coco_2017/train2017/",
      "anno": "path/coco_2017/annotations/coco2017_train_odvg.jsonl",
      "label_map": "path/coco_2017/annotations/coco2017_label_map.json",
      "dataset_mode": "odvg"
    },
    {
      "root": "path/GRIT-20M/data/",
      "anno": "path/GRIT-20M/anno/grit_odvg_620k.jsonl",
      "dataset_mode": "odvg"
    }, 
    {
      "root": "path/flickr30k/images/flickr30k_images/",
      "anno": "path/flickr30k/annotations/flickr30k_entities_odvg_158k.jsonl",
      "dataset_mode": "odvg"
    }
  ],
  "val": [
    {
      "root": "path/coco_2017/val2017",
      "anno": "config/instances_val2017.json",
      "label_map": null,
      "dataset_mode": "coco"
    }
  ]
}
example for odvg dataset
# For OD
{"filename": "000000391895.jpg", "height": 360, "width": 640, "detection": {"instances": [{"bbox": [359.17, 146.17, 471.62, 359.74], "label": 3, "category": "motorcycle"}, {"bbox": [339.88, 22.16, 493.76, 322.89], "label": 0, "category": "person"}, {"bbox": [471.64, 172.82, 507.56, 220.92], "label": 0, "category": "person"}, {"bbox": [486.01, 183.31, 516.64, 218.29], "label": 1, "category": "bicycle"}]}}
{"filename": "000000522418.jpg", "height": 480, "width": 640, "detection": {"instances": [{"bbox": [382.48, 0.0, 639.28, 474.31], "label": 0, "category": "person"}, {"bbox": [234.06, 406.61, 454.0, 449.28], "label": 43, "category": "knife"}, {"bbox": [0.0, 316.04, 406.65, 473.53], "label": 55, "category": "cake"}, {"bbox": [305.45, 172.05, 362.81, 249.35], "label": 71, "category": "sink"}]}}

# For VG
{"filename": "014127544.jpg", "height": 400, "width": 600, "grounding": {"caption": "Homemade Raw Organic Cream Cheese for less than half the price of store bought! It's super easy and only takes 2 ingredients!", "regions": [{"bbox": [5.98, 2.91, 599.5, 396.55], "phrase": "Homemade Raw Organic Cream Cheese"}]}}
{"filename": "012378809.jpg", "height": 252, "width": 450, "grounding": {"caption": "naive : Heart graphics in a notebook background", "regions": [{"bbox": [93.8, 47.59, 126.19, 77.01], "phrase": "Heart graphics"}, {"bbox": [2.49, 1.44, 448.74, 251.1], "phrase": "a notebook background"}]}}

Config

config/cfg_odvg.py                   # for backbone, batch size, LR, freeze layers, etc.
config/datasets_mixed_odvg.json      # support mixed dataset for both OD and VG

Training

  • Datasets: before starting the training, you need to modify the config/datasets_mixed_example.json according to data_format.md.
  • Configs: defaults to using coco_val2017 for evaluation.
    • If you are evaluating with your own test set, you need to convert the test data to coco format (not the ovdg format) and modify the config to set use_coco_eval = False (The COCO dataset has 80 classes used for training but 90 categories in total, so there is a built-in mapping in the code).
    • Also, add(or update) the label_list in the config with your own class names like label_list=['dog', 'cat', 'person'].
- use_coco_eval = True
+ use_coco_eval = False
+ label_list=['dog', 'cat', 'person']
  • Train/Eval:
# train/eval on torch.distributed.launch:
bash train_dist.sh  ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
bash test_dist.sh  ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}

# train/eval on slurm cluster:
bash train_slurm.sh  ${PARTITION} ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
bash test_slurm.sh  ${PARTITION} ${GPU_NUM} ${CFG} ${DATASETS} ${OUTPUT_DIR}
# e.g.  check train_slurm.sh for more details
# bash train_slurm.sh v100_32g 32 config/cfg_odvg.py config/datasets_mixed_odvg.json ./logs
# bash train_slurm.sh v100_32g 8 config/cfg_coco.py config/datasets_od_example.json ./logs

Results and Models

Name Pretrain data Task mAP on COCO Ckpt Misc
GroundingDINO-T
(offical)
O365,GoldG,Cap4M zero-shot 48.4
(zero-shot)
model -
GroundingDINO-T
(fine-tune)
O365,GoldG,Cap4M finetune
w/ coco
57.3
(fine-tune)
model cfg | log
GroundingDINO-T
(pretrain)
COCO,O365,LIVS,V3Det,
GRIT-200K,Flickr30k(total 1.8M)
zero-shot 55.1
(zero-shot)
model cfg | log

Inference

Because the model architecture has not changed, you only need to install GroundingDINO library and then run inference_on_a_image.py to inference your images.

python tools/inference_on_a_image.py \
  -c tools/GroundingDINO_SwinT_OGC.py \
  -p path/to/your/ckpt.pth \
  -i ./figs/dog.jpeg \
  -t "dog" \
  -o output
Prompt Official ckpt COCO ckpt 1.8M ckpt
dog
cat

Acknowledgments

Provided codes were adapted from:

Citation

@misc{Open Grounding Dino,
  author = {Zuwei Long, Wei Li},
  title = {Open Grounding Dino:The third party implementation of the paper Grounding DINO},
  howpublished = {\url{https://github.com/longzw1997/Open-GroundingDino}},
  year = {2023}
}

Contact

  • longzuwei at sensetime.com
  • liwei1 at sensetime.com

Feel free to contact we if you have any suggestions or questions. Bugs found are also welcome. Please create a pull request if you find any bugs or want to contribute code.