Skip the data preparation

  • We’ve provided all the prepared data on Google Drive. Download the files and place them in the annotations/ directory, and you’ll be ready to run and test the code.

Prepare data

  • Download the ScanNet dataset by following the ScanNet instructions.

  • Extract object masks using a pretrained 3D detector:

    • Use Mask3D for instance segmentation. We used the checkpoint pretrained on ScanNet200.
    • The complete predicted results (especially the masks) for the train/validation sets are too large to share (~40GB), so we’ve shared the post-processed results instead:
      • Unzip the mask3d_inst_seg.tar.gz file.
      • Each file under mask3d_inst_seg contains the predicted results for a single scene, including a list of segmented instances with their labels and segmented indices.
  • Process object masks and prepare annotations:

    • If you use Mask3D for instance segmentation, set the segment_result_dir in run_prepare.sh to the output directory of Mask3D.
    • If you use the downloaded mask3d_inst_seg directly, set segment_result_dir to None and set inst_seg_dir to the path of mask3d_inst_seg.
    • Run: bash preprocess/run_prepare.sh
  • Extract 3D features using a pretrained 3D encoder:

    • Follow Uni3D to extract 3D features for each instance. We used the pretrained model uni3d-g.
    • We've also provided modified code for feature extraction in this forked repository. Set the data_dir here to the path to ${processed_data_dir}/pcd_all (processed_data_dir is an intermediate directory set in run_prepare.sh). After preparing the environment, run bash scripts/inference.sh.
  • Extract 2D features using a pretrained 2D encoder:

    • We followed OpenScene's code to calculate the mapping between 3D points and 2D image pixels. This allows each object to be projected onto multi-view images. Based on the projected masks on the images, we extract and merge DINOv2 features from multi-view images for each object.

    • [TODO] Detailed implementation will be released.
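
Until the detailed implementation is released, the 2D feature step described above can be pictured with the following minimal sketch. It is not the authors' code and does not reproduce OpenScene's mapping: it assumes a simple pinhole camera, omits any depth-based occlusion check, and treats the per-view patch features (for example, from DINOv2) as already computed. All function names, the argument layout, and the choice to weight each view by its number of visible points are assumptions made for illustration. Plain Python is used so the sketch runs without dependencies; a real pipeline would operate on arrays or tensors.

```python
from typing import List, Optional, Sequence, Tuple

Feature = List[float]


def project_point(
    p_world: Sequence[float],
    world_to_cam: Sequence[Sequence[float]],
    intrinsics: Tuple[float, float, float, float],
) -> Optional[Tuple[float, float]]:
    """Project a 3D world point to pixel coordinates (u, v) with a pinhole model.

    world_to_cam is a 4x4 row-major matrix; intrinsics is (fx, fy, cx, cy).
    Returns None when the point lies behind the camera.
    """
    x, y, z = p_world
    cam = [r[0] * x + r[1] * y + r[2] * z + r[3] for r in world_to_cam[:3]]
    if cam[2] <= 0:
        return None
    fx, fy, cx, cy = intrinsics
    return fx * cam[0] / cam[2] + cx, fy * cam[1] / cam[2] + cy


def view_feature(
    points: Sequence[Sequence[float]],
    world_to_cam: Sequence[Sequence[float]],
    intrinsics: Tuple[float, float, float, float],
    image_size: Tuple[int, int],
    patch_size: int,
    patch_feats: Sequence[Sequence[Feature]],
) -> Tuple[Optional[Feature], int]:
    """Average the patch features hit by an object's projected points in one view.

    patch_feats[r][c] is the feature of the patch at patch row r, column c.
    Returns (feature, number_of_points_inside_the_image).
    NOTE: no occlusion test is performed (an assumption of this sketch).
    """
    width, height = image_size
    acc = [0.0] * len(patch_feats[0][0])
    hits = 0
    for p in points:
        uv = project_point(p, world_to_cam, intrinsics)
        if uv is None:
            continue
        u, v = uv
        if not (0 <= u < width and 0 <= v < height):
            continue
        feat = patch_feats[int(v) // patch_size][int(u) // patch_size]
        for i, value in enumerate(feat):
            acc[i] += value
        hits += 1
    if hits == 0:
        return None, 0
    return [a / hits for a in acc], hits


def merge_views(view_results: Sequence[Tuple[Optional[Feature], int]]) -> Optional[Feature]:
    """Merge per-view features, weighting each view by its visible point count."""
    valid = [(f, n) for f, n in view_results if f is not None and n > 0]
    total = sum(n for _, n in valid)
    if total == 0:
        return None
    merged = [0.0] * len(valid[0][0])
    for feat, n in valid:
        for i, value in enumerate(feat):
            merged[i] += value * n / total
    return merged
```

Under these assumptions, one would call view_feature once per image in which the object appears and pass the collected results to merge_views to obtain a single feature vector per object. Views in which no point of the object lands inside the image contribute nothing to the merged feature.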