We provide models for scale-consistent depth estimation and view synthesis with 3D Gaussian splatting.
- The depth models are trained with the following procedure:
  - Initialize the monocular feature with Depth Anything V2 and the multi-view Transformer with UniMatch.
  - Train the full DepthSplat model end-to-end on the RealEstate10K dataset.
  - Fine-tune the pre-trained depth model with ground-truth depth supervision on ScanNet, DeMoN, TartanAir, and VKITTI2.
- All depth models are fine-tuned with two images as input, at a training resolution of 352x640.
- The scale of the predicted depth is aligned with the translation scale of the camera poses (illustrated in the sketch below).
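To make the last point concrete, here is a minimal sketch in plain PyTorch (independent of the repo's actual API) of what pose-aligned depth means in practice: points back-projected from the predicted depth can be transformed with the raw relative camera pose, with no per-image scale or shift fitting. The intrinsics, depth values, and relative pose below are placeholders.

```python
import torch

def backproject(depth: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Back-project a depth map (H, W) into camera-frame 3D points (H, W, 3)."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pixels = torch.stack([u.float(), v.float(), torch.ones(H, W)], dim=-1)  # homogeneous pixel coords
    rays = pixels @ torch.linalg.inv(K).T                                   # unit-depth viewing rays
    return rays * depth.unsqueeze(-1)

# Placeholder intrinsics and depth at the 352x640 fine-tuning resolution.
K = torch.tensor([[500.0,   0.0, 320.0],
                  [  0.0, 500.0, 176.0],
                  [  0.0,   0.0,   1.0]])
depth0 = torch.full((352, 640), 2.0)   # stand-in for the model's depth prediction for view 0
R = torch.eye(3)                       # relative rotation, view 0 -> view 1
t = torch.tensor([0.1, 0.0, 0.0])      # relative translation, same scale as the predicted depth

pts_cam0 = backproject(depth0, K)      # (352, 640, 3) points in view 0's camera frame
pts_cam1 = pts_cam0 @ R.T + t          # directly valid in view 1's frame: no scale/shift alignment
```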
| Model | Monocular | Multi-View | Params (M) | Download |
|---|---|---|---|---|
| depthsplat-depth-small | ViT-S | 1-scale | 36 | download |
| depthsplat-depth-base | ViT-B | 2-scale | 111 | download |
| depthsplat-depth-large | ViT-L | 2-scale | 338 | download |
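As a quick sanity check after downloading, the parameter counts in the table can be reproduced from a checkpoint file with a few lines of PyTorch. The filename and the possible "model" key below are assumptions about the checkpoint layout; adjust them to whatever the downloaded file actually contains.

```python
import torch

# Filename is illustrative; point this at whichever checkpoint you downloaded.
ckpt = torch.load("depthsplat-depth-base.pth", map_location="cpu")

# Checkpoints are commonly either a raw state_dict or a dict wrapping one under a
# key such as "model"; the key name here is an assumption.
state_dict = ckpt.get("model", ckpt) if isinstance(ckpt, dict) else ckpt

num_params = sum(p.numel() for p in state_dict.values() if torch.is_tensor(p))
print(f"{num_params / 1e6:.0f}M parameters")  # should roughly match the Params (M) column
```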
- The Gaussian splatting models are trained on the RealEstate10K and/or DL3DV datasets at a resolution of 256x256 or 256x448.
- We plan to release more high-resolution models in the future.
| Model | Monocular | Multi-View | Params (M) | Download |
|---|---|---|---|---|
| depthsplat-gs-small-re10k-256x256 | ViT-S | 1-scale | 37 | download |
| depthsplat-gs-base-re10k-256x256 | ViT-B | 2-scale | 117 | download |
| depthsplat-gs-large-re10k-256x256 | ViT-L | 2-scale | 360 | download |
| depthsplat-gs-base-dl3dv-256x448 | ViT-B | 2-scale | 117 | download |
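The resolution suffix in each Gaussian splatting model name is assumed here to encode the training resolution as height x width. The helper below is a hypothetical convenience (plain PyTorch, not part of the repo) for resizing context views to the resolution a given checkpoint was trained at; confirm the height/width convention against the repo's data config before relying on it.

```python
import torch
import torch.nn.functional as F

def resize_to_model_resolution(views: torch.Tensor, model_name: str) -> torch.Tensor:
    """Resize context views (V, 3, H, W) to the resolution encoded in the model name.

    Assumes the trailing '<H>x<W>' suffix (e.g. '256x448') is height x width,
    matching the training resolutions listed above.
    """
    h, w = (int(s) for s in model_name.rsplit("-", 1)[-1].split("x"))
    return F.interpolate(views, size=(h, w), mode="bilinear", align_corners=False)

views = torch.rand(2, 3, 360, 640)  # two context views, e.g. frames from RealEstate10K
views_dl3dv = resize_to_model_resolution(views, "depthsplat-gs-base-dl3dv-256x448")
print(tuple(views_dl3dv.shape))     # (2, 3, 256, 448)
```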