PyTorch implementation and pretrained models for DINO. For details, see Emerging Properties in Self-Supervised Vision Transformers.
[arXiv]
This codebase has been developed with:
- python 3.9
- pytorch 1.12.0
- CUDA 11.3
- torchvision 0.13.0
Make sure to install the requirements:
pip3 install -r requirements.txt
The dino package should be included in the Python module search path:
export PYTHONPATH="${PYTHONPATH}:/path/to/your/dino"
The dataset you intend to pretrain on should be structured as follows:
patch_pretraining/
└──imgs/
├── patch_1.jpg
├── patch_2.jpg
├── patch_2.jpg
└── ...
Where patch_pretraining/imgs/ is the directory of patches (e.g. in .jpg format) extracted using HS2P, used to pretrain the first Transformer block (ViT_patch).
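As a quick sanity check of the layout, a minimal Python sketch such as the one below can count the extracted patches (the directory path is just an illustrative placeholder, adjust it to your setup):
from pathlib import Path

# hypothetical location of the HS2P-extracted patches; adjust to your setup
patch_dir = Path("patch_pretraining/imgs")

# collect every .jpg patch and report how many were found
patches = sorted(patch_dir.glob("*.jpg"))
print(f"found {len(patches)} patches in {patch_dir}")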
In case you want to run hierarchical pretraining, you need to structure your data as follows:
region_pretraining/
├── slide_1_region_1.pt
├── slide_1_region_2.pt
└── ...
Where region_pretraining/ is the directory of pre-extracted region-level features for each region, generated using python3 dino/extract_features.py. Each *.pt file is a [npatch × 384]-sized Tensor, which contains the sequence of pre-extracted ViT_patch features for each [patch_size × patch_size] patch in a given region. This folder is used to pretrain the intermediate Transformer block (ViT_region).
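To verify the pre-extracted features, a short sketch like the following (the filename is a placeholder) loads one tensor and confirms its [npatch × 384] shape:
import torch

# placeholder filename; any *.pt file in region_pretraining/ works
features = torch.load("region_pretraining/slide_1_region_1.pt", map_location="cpu")

# expected shape: [npatch, 384], one 384-dim ViT_patch embedding per patch
print(features.shape)
assert features.ndim == 2 and features.shape[1] == 384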
In the following Python commands, make sure to replace {gpu} with the number of GPUs available for pretraining.
Update the config file dino/config/patch.yaml to match your local setup.
Then kick off distributed pretraining of a vanilla ViT-S/16:
python3 -m torch.distributed.run --nproc_per_node={gpu} dino/patch.py
Alternatively, you can check notebooks/vanilla_dino.ipynb.
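As a concrete example of the launch command above, on a node with 4 GPUs (the count of 4 is purely illustrative) it becomes:
python3 -m torch.distributed.run --nproc_per_node=4 dino/patch.py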
Update the config file dino/config/region.yaml to match your local setup.
Then kick off distributed pretraining of a ViT-S/4096_256:
python3 -m torch.distributed.run --nproc_per_node={gpu} dino/region.py
Alternatively, you can check notebooks/hierarchical_dino.ipynb.
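If you are unsure how many GPUs are available on your machine, a quick check with PyTorch (already a dependency of this codebase) prints the value to substitute for {gpu}:
python3 -c "import torch; print(torch.cuda.device_count())"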
