- We’ve provided all the prepared data in Google Drive. Simply download the files and place them in the `annotations/` directory. You’ll then be ready to run and test the code.
- Download the ScanNet dataset by following the ScanNet instructions.
- Extract object masks using a pretrained 3D detector:
  - Use Mask3D for instance segmentation. We used the checkpoint pretrained on ScanNet200.
  - The complete predicted results (especially the masks) for the train/validation sets are too large to share (~40GB). We’ve shared the post-processed results:
    - Unzip the `mask3d_inst_seg.tar.gz` file.
    - Each file under `mask3d_inst_seg` contains the predicted results for a single scene, including a list of segmented instances with their labels and segmented indices (a sketch for inspecting these files follows below).
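  A minimal sketch of how one might inspect a single per-scene prediction file. The JSON format, the keys `label` and `segments`, and the filename are assumptions for illustration only; adapt them to the actual contents of `mask3d_inst_seg`.

  ```python
  import json

  def inspect_scene(path):
      """Print the predicted instances of one scene (hypothetical file format)."""
      with open(path) as f:
          instances = json.load(f)  # assumed: a list of predicted instances
      for i, inst in enumerate(instances):
          # Assumed keys: a predicted class label and the indices of the
          # scene points assigned to this instance.
          print(i, inst["label"], len(inst["segments"]))

  # inspect_scene("mask3d_inst_seg/scene0000_00.json")  # hypothetical filename
  ```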
- Process object masks and prepare annotations:
  - If you use Mask3D for instance segmentation, set `segment_result_dir` in `run_prepare.sh` to the output directory of Mask3D.
  - If you use the downloaded `mask3d_inst_seg` directly, set `segment_result_dir` to None and set `inst_seg_dir` to the path of `mask3d_inst_seg`.
  - Run: `bash preprocess/run_prepare.sh`
- Extract 3D features using a pretrained 3D encoder:
  - Follow Uni3D to extract 3D features for each instance. We used the pretrained model uni3d-g.
  - We've also provided modified code for feature extraction in this forked repository. Set the `data_dir` here to the path to `${processed_data_dir}/pcd_all` (`processed_data_dir` is an intermediate directory set in `run_prepare.sh`). After preparing the environment, run `bash scripts/inference.sh`. A quick sanity check on `pcd_all` is sketched below.
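  A small check one could run before launching `bash scripts/inference.sh`, to confirm that `data_dir` points at the prepared per-scene point clouds. The `.pt` extension and the use of `torch.load` are assumptions; adapt them to whatever `run_prepare.sh` actually writes into `pcd_all`.

  ```python
  from pathlib import Path

  import torch

  # Hypothetical sanity check on the prepared point-cloud directory.
  data_dir = Path("/path/to/processed_data/pcd_all")  # ${processed_data_dir}/pcd_all
  files = sorted(data_dir.glob("*.pt"))                # assumed file extension
  print(f"found {len(files)} scene files")
  if files:
      sample = torch.load(files[0], map_location="cpu")
      print(files[0].name, type(sample))
  ```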
- Extract 2D features using a pretrained 2D encoder:
  - We followed OpenScene's code to calculate the mapping between 3D points and 2D image pixels. This allows each object to be projected onto multi-view images. Based on the projected masks on the images, we extract and merge DINOv2 features from multi-view images for each object (a rough sketch of this idea is given below).
  - [TODO] Detailed implementation will be released.
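  Since the detailed implementation has not been released yet, the sketch below only illustrates the projection-and-pooling idea under stated assumptions: a pinhole camera with a 3×3 intrinsics matrix and a 4×4 camera-to-world pose, and DINOv2 patch features stored on an `Hp × Wp × C` grid (patch size 14). All function names, dictionary keys, and shapes are hypothetical and not taken from this repository.

  ```python
  import numpy as np

  def project_points(points_world, pose_c2w, intrinsics, image_hw):
      """Project Nx3 world-space points into pixel coordinates of one view."""
      h, w = image_hw
      w2c = np.linalg.inv(pose_c2w)                     # world -> camera
      pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
      pts_cam = (w2c @ pts_h.T).T[:, :3]
      in_front = pts_cam[:, 2] > 0                      # keep points in front of the camera
      z = np.clip(pts_cam[:, 2:3], 1e-6, None)          # avoid division by zero
      uv = (intrinsics @ pts_cam.T).T[:, :2] / z
      inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
      return uv, in_front & inside

  def pool_object_feature(object_points, views, patch_size=14):
      """Merge DINOv2 patch features over the pixels an object projects onto,
      averaged across the views where it is visible. Each view is a dict with
      assumed keys: "pose" (4x4 camera-to-world), "intrinsics" (3x3),
      "image_hw", and "patch_feats" (Hp x Wp x C patch-feature grid)."""
      per_view = []
      for v in views:
          uv, ok = project_points(object_points, v["pose"], v["intrinsics"], v["image_hw"])
          if not ok.any():
              continue                                  # object not visible in this view
          hp, wp, _ = v["patch_feats"].shape
          # Map pixel coordinates to DINOv2 patch-grid coordinates.
          pu = np.clip((uv[ok, 0] / patch_size).astype(int), 0, wp - 1)
          pv = np.clip((uv[ok, 1] / patch_size).astype(int), 0, hp - 1)
          per_view.append(v["patch_feats"][pv, pu].mean(axis=0))
      return np.mean(per_view, axis=0) if per_view else None
  ```

  Here the multi-view merge is a plain average over the views in which the object is visible; the released code may weight, filter, or merge views differently.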