Click Here to Download Pre-trained Models behind the above visualizations
We design and implement Grendel-GS, a distributed implementation of 3D Gaussian Splatting training. We aim to help 3DGS achieve its scaling laws with distributed system support, just as the achievements of current LLMs rely on distributed systems.
By using Grendel, your 3DGS training can leverage multiple GPUs to train significantly faster, hold substantially more Gaussians in GPU memory, and ultimately reconstruct larger-area, higher-resolution scenes with better PSNR. Grendel-GS retains the original algorithm, making it a direct and safe replacement for the original 3DGS implementation in any Gaussian Splatting workflow or application.
For example, with 4 GPUs, Grendel-GS allows you to:
- Train Mip360 >3.5 times faster.
- Directly train large-scale 4K scenes (Mega-NeRF Rubble) using >40 million Gaussians without OOM.
- Train the Tanks&Temples Truck scene to a test PSNR of 23.79 within merely 45 seconds (on 7000 images)
- 7.15.2024 - We now support gsplat as the CUDA backend during training!
🌟 Follow us for future updates! Interested in collaborating or contributing? Email us!
Here is a diagram showing why you may need distributed gaussian splatting training like our Grendel-GS' techniques:
This repo and its dependency, our customized distributed version of the rendering CUDA code (diff-gaussian-rasterization), are both forks of the original 3DGS implementation. Therefore, the usage is generally very similar to the original 3DGS.
The two main differences are:
- We support training on multiple GPUs, using the torchrun command-line utility provided by PyTorch to launch jobs.
- We support batch sizes greater than 1, with the --bsz argument flag used to specify the batch size.
The repository contains submodules, thus please check it out with
git clone git@github.com:nyu-systems/Grendel-GS.git --recursive
Ensure you have Conda, a GPU with a compatible driver, and a CUDA environment installed on your machine as prerequisites. Then install PyTorch, Torchvision, Plyfile, and tqdm, which are essential packages. Make sure the PyTorch version is >= 1.10 so that torchrun is available for distributed training. Finally, compile and install the two dependent CUDA repos, diff-gaussian-rasterization and simple-knn, which contain our customized CUDA kernels for rendering and related operations.
We provide a yml file for easy environment setup. However, you should choose the versions to match your local running environment.
conda env create --file environment.yml
conda activate gaussian_splatting
NOTES: We kept additional dependencies minimal compared to the original 3DGS. For environment setup issues, refer to the issues section of the original 3DGS repo first.
We use the COLMAP format to load datasets. Therefore, please download and unzip COLMAP datasets before training, for example the Mip360 dataset and the 4 scenes from Tanks&Temples and DeepBlending.
For single-GPU non-distributed training with batch size of 1:
python train.py -s <path to COLMAP dataset> --eval
For 4 GPU distributed training and batch size of 4:
torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py --bsz 4 -s <path to COLMAP dataset> --eval
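When a job is launched with the torchrun command above, each of the 4 processes learns its identity from environment variables that torchrun sets (RANK, LOCAL_RANK, WORLD_SIZE). The sketch below shows one way a script could read them. The helper name read_torchrun_env and the single-process fallback values are illustrative assumptions, not code taken from train.py.

```python
import os
from typing import Mapping, NamedTuple, Optional


class DistEnv(NamedTuple):
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (often the GPU id)
    world_size: int  # total number of processes in the job


def read_torchrun_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Hypothetical helper: read the variables torchrun exports.

    Falls back to a single-process setup (rank 0 of 1) when the
    variables are absent, e.g. when started with plain `python`.
    """
    env = os.environ if env is None else env
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


# Example: what the last of 4 single-node processes would see.
print(read_torchrun_env({"RANK": "3", "LOCAL_RANK": "3", "WORLD_SIZE": "4"}))
```

With --nnodes=1 and --nproc-per-node=4, RANK and LOCAL_RANK coincide and WORLD_SIZE is 4.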
Command Line Arguments for train.py
Path to the source directory containing a COLMAP data set.
Path where the trained model and logs should be stored (/tmp/gaussian_splatting by default).
Add this flag to use a MipNeRF360-style training/test split for evaluation.
The batch size (the number of camera views) used in a single training step. 1 by default.
The CUDA backend to use in training. Valid options include diff (diff-gaussian-rasterization) and gsplat. diff by default.
The mode for scaling the learning rate with larger batch sizes. sqrt by default.
Keep all ground-truth images from the dataset on the GPU, rather than loading each image on the fly at each training step. If the dataset is large, preload_dataset_to_gpu will lead to OOM; when the dataset is small, preload_dataset_to_gpu can speed up training slightly by avoiding some CPU-GPU communication.
Number of total iterations to train for, 30_000 by default.
Space-separated iterations at which the training script computes L1 and PSNR over the test set, 7000 30000 by default.
Space-separated iterations at which the training script saves the Gaussian model, 7000 30000 <iterations> by default.
Space-separated iterations at which to store a checkpoint for continuing later, saved in the model directory.
Path to a saved checkpoint to continue training from.
Add this flag to use white background instead of black (default), e.g., for evaluation of NeRF Synthetic dataset.
Order of spherical harmonics to be used (no larger than 3). 3 by default.
Spherical harmonics features learning rate, 0.0025 by default.
Opacity learning rate, 0.05 by default.
Scaling learning rate, 0.005 by default.
Rotation learning rate, 0.001 by default.
Number of steps (from 0) where position learning rate goes from initial to final. 30_000 by default.
Initial 3D position learning rate, 0.00016 by default.
Final 3D position learning rate, 0.0000016 by default.
Position learning rate multiplier (cf. Plenoxels), 0.01 by default.
Iteration where densification starts, 500 by default.
Iteration where densification stops, 15_000 by default.
Limit that decides if points should be densified based on 2D position gradient, 0.0002 by default.
How frequently to densify, 100 (every 100 iterations) by default.
How frequently to reset opacity, 3_000 by default.
Influence of SSIM on total loss from 0 to 1, 0.2 by default.
Percentage of scene extent (0--1) a point must exceed to be forcibly densified, 0.01 by default.
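Two of the learning-rate arguments above are easier to follow with numbers. The sketch below shows (1) a square-root rule for scaling a base learning rate with batch size, matching the sqrt default, and (2) a log-linear decay of the position learning rate from its initial to its final value. Both function names are hypothetical, the linear mode is included only for comparison and is an assumption, and the schedule assumes the log-linear interpolation used by the original 3DGS code while omitting the delay multiplier and any scene-scale factor. It is not a copy of train.py.

```python
import math


def scale_lr_for_batch(base_lr: float, bsz: int, mode: str = "sqrt") -> float:
    """Hypothetical helper: scale a learning rate for batch size `bsz`."""
    if bsz < 1:
        raise ValueError("bsz must be >= 1")
    if mode == "sqrt":
        return base_lr * math.sqrt(bsz)  # square-root scaling rule
    if mode == "linear":                 # assumed alternative, for comparison
        return base_lr * bsz
    raise ValueError(f"unknown mode: {mode}")


def position_lr(step: int,
                lr_init: float = 0.00016,
                lr_final: float = 0.0000016,
                max_steps: int = 30_000) -> float:
    """Log-linear interpolation from lr_init (step 0) to lr_final (max_steps)."""
    t = min(max(step / max_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return math.exp((1.0 - t) * math.log(lr_init) + t * math.log(lr_final))


# Opacity learning rate 0.05 at batch size 4 under the sqrt rule.
print(scale_lr_for_batch(0.05, 4))
# Position learning rate at the start, midpoint, and end of the schedule.
for step in (0, 15_000, 30_000):
    print(step, position_lr(step))
```

Because the interpolation is done in log space, the midpoint value is the geometric mean of the initial and final rates rather than their average.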
python render.py -s <path to COLMAP dataset> --model_path <path to folder of saving model>
Command Line Arguments for render.py
Path to the trained model directory you want to create renderings for.
Flag to skip rendering the training set.
Flag to skip rendering the test set.
If point cloud models were saved in a distributed manner during training, set this flag to load all of them.
Flag to omit any text written to standard out pipe.
The below parameters will be read automatically from the model path, based on what was used for training. However, you may override them by providing them explicitly on the command line.
Path to the source directory containing a COLMAP or Synthetic NeRF data set.
Alternative subdirectory for COLMAP images (images by default).
Add this flag to use a MipNeRF360-style training/test split for evaluation.
The training/test split ratio in the whole dataset for evaluation. llffhold=8 means 1/8 is used as test set and others are used as train set.
Add this flag to use white background instead of black (default), e.g., for evaluation of NeRF Synthetic dataset.
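To make the llffhold split described above concrete, here is a minimal sketch. It assumes that, with views in sorted order, every llffhold-th view goes to the test set and the rest are used for training, as in the original 3DGS loader. The function name split_train_test is hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_train_test(views: Sequence[T], llffhold: int = 8) -> Tuple[List[T], List[T]]:
    """Hypothetical helper: hold out every `llffhold`-th view for testing."""
    train = [v for i, v in enumerate(views) if i % llffhold != 0]
    test = [v for i, v in enumerate(views) if i % llffhold == 0]
    return train, test


# 16 views named by index: views 0 and 8 become the test set.
train_views, test_views = split_train_test(list(range(16)), llffhold=8)
print(len(train_views), len(test_views), test_views)
```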
For interactive rendering, please refer to GaussFusion, which also supports rendering two checkpoints with interactive controls.
python metrics.py --model_path <path to folder of saving model>
Command Line Arguments for metrics.py
Space-separated list of model paths for which metrics should be computed.
If you are currently using the original 3DGS codebase for training in your application, you can effortlessly switch to our codebase because we haven't made any algorithmic changes. This will allow you to train faster and successfully train larger, higher-precision scenes without running out of memory (OOM) within a reasonable time frame.
It is worth noting that we only support the training functionality; this repository does not include the interactive viewer, network viewer, or colmap features from the original 3DGS. We are actively developing to support more features. Please let us know your needs or directly contribute to our project. Thank you!
30k Train Time(min) | stump | bicycle | kitchen | room | counter | garden | bonsai |
---|---|---|---|---|---|---|---|
1 GPU + Batch Size=1 | 24.03 | 30.18 | 25.58 | 22.45 | 21.6 | 30.15 | 19.18 |
4 GPU + Batch Size=1 | 9.07 | 11.67 | 9.53 | 8.93 | 8.82 | 10.85 | 8.03 |
4 GPU + Batch Size=4 | 5.22 | 6.47 | 6.98 | 6.18 | 5.98 | 6.48 | 5.28 |
30k Test PSNR | stump | bicycle | kitchen | room | counter | garden | bonsai |
---|---|---|---|---|---|---|---|
1 GPU + Batch Size=1 | 26.61 | 25.21 | 31.4 | 31.4 | 28.93 | 27.27 | 32.01 |
4 GPU + Batch Size=1 | 26.65 | 25.19 | 31.41 | 31.38 | 28.98 | 27.28 | 31.92 |
4 GPU + Batch Size=4 | 26.59 | 25.17 | 31.37 | 31.32 | 28.98 | 27.2 | 31.94 |
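The speedup and quality claims can be checked directly against the two tables above. The short script below copies the table values, then computes per-scene speedups relative to the single-GPU baseline and the largest PSNR difference. The variable and function names are illustrative only.

```python
SCENES = ["stump", "bicycle", "kitchen", "room", "counter", "garden", "bonsai"]
BASELINE = "1 GPU + Batch Size=1"

# 30k training time in minutes, copied from the first table.
TRAIN_MIN = {
    "1 GPU + Batch Size=1": [24.03, 30.18, 25.58, 22.45, 21.6, 30.15, 19.18],
    "4 GPU + Batch Size=1": [9.07, 11.67, 9.53, 8.93, 8.82, 10.85, 8.03],
    "4 GPU + Batch Size=4": [5.22, 6.47, 6.98, 6.18, 5.98, 6.48, 5.28],
}

# 30k test PSNR, copied from the second table.
TEST_PSNR = {
    "1 GPU + Batch Size=1": [26.61, 25.21, 31.4, 31.4, 28.93, 27.27, 32.01],
    "4 GPU + Batch Size=1": [26.65, 25.19, 31.41, 31.38, 28.98, 27.28, 31.92],
    "4 GPU + Batch Size=4": [26.59, 25.17, 31.37, 31.32, 28.98, 27.2, 31.94],
}


def speedups(config: str) -> dict:
    """Per-scene training-time speedup of `config` over the baseline."""
    return {s: b / t for s, b, t in zip(SCENES, TRAIN_MIN[BASELINE], TRAIN_MIN[config])}


def max_psnr_gap(config: str) -> float:
    """Largest absolute per-scene PSNR difference from the baseline."""
    return max(abs(b - p) for b, p in zip(TEST_PSNR[BASELINE], TEST_PSNR[config]))


for config in TRAIN_MIN:
    if config == BASELINE:
        continue
    s = speedups(config)
    print(f"{config}: min speedup {min(s.values()):.2f}x, "
          f"max speedup {max(s.values()):.2f}x, "
          f"max PSNR gap {max_psnr_gap(config):.2f} dB")
```

Every scene trains more than 3.5 times faster with 4 GPUs and batch size 4, and no scene's test PSNR moves by as much as 0.1 dB.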
- Download and unzip the Mip360 dataset.
- Activate the appropriate conda/python environment.
- To execute all experiments and generate this table, run the following command:
bash examples/mip360/eval_all_mip360.sh <path_to_save_experiment_results> <path_to_mip360_dataset>
Configuration | 50k Training Time | Memory Per GPU (GB) | PSNR |
---|---|---|---|
bicycle + 1 GPU + Batch Size=1 | 2h 38min | 37.18 | 23.78 |
bicycle + 4 GPU + Batch Size=1 | 0h 50min | 10.39 | 23.79 |
garden + 1 GPU + Batch Size=1 | 2h 49min | 29.87 | 26.06 |
garden + 4 GPU + Batch Size=1 | 0h 50min | 7.88 | 26.06 |
Unlike the typical approach of downsampling the Mip360 dataset by a factor of four before training, our system can train directly at full resolution. The bicycle and garden images have resolutions of 4946x3286 and 5187x3361, respectively. In the table above, moving from 1 GPU to 4 GPUs cuts training time by more than 3x and memory usage per GPU by roughly 3.6x to 3.8x, without sacrificing quality.
Set up the dataset and Python environment as outlined previously, then execute the following:
bash examples/mip360_4k/eval_mip360_4k.sh <path_to_save_experiment_results> <path_to_mip360_dataset>
Configuration | 7k Training Time | 7k test PSNR | 30k Training Time | 30k test PSNR |
---|---|---|---|---|
train + 4 GPU + Batch Size=8 | 44s | 19.37 | 3min 30s | 21.87 |
truck + 4 GPU + Batch Size=8 | 45s | 23.79 | 3min 39s | 25.35 |
The Tanks&Temples dataset includes the train and truck scenes, with resolutions of 980x545 and 979x546, respectively. Using 4 GPUs, we trained these small scenes to a reasonable quality in just 45 seconds (7k iterations). In the original Gaussian Splatting paper, a test PSNR of 18.892 and 23.506 at 7K iterations was considered good on train and truck, respectively. Our results are comparable to these benchmarks.
Set up the Tanks&Temples and DeepBlending datasets and the Python environment as outlined previously, then execute the following:
bash examples/train_truck_1k/eval_train_truck_1k.sh <path_to_save_experiment_results> <path_to_tandb_dataset>
(TODO: check these scripts have no side-effects)
- Hardware: 4x 40GB NVIDIA A100 GPUs
- Interconnect: Fully-connected Bidirectional 25GB/s NVLINK
- We will release our optimized cuda kernels within gaussian splatting soon for further speed up.
- We will support gsplat later as another choice of our cuda kernel backend.
Our system design, analysis of large-batch training dynamics, and insights from scaling up are all documented in the paper below:
On Scaling Up 3D Gaussian Splatting Training
Hexu Zhao¹, Haoyang Weng¹*, Daohan Lu¹*, Ang Li², Jinyang Li¹, Aurojit Panda¹, Saining Xie¹ (* co-second authors)
¹New York University, ²Pacific Northwest National Laboratory
@misc{zhao2024scaling3dgaussiansplatting,
title={On Scaling Up 3D Gaussian Splatting Training},
author={Hexu Zhao and Haoyang Weng and Daohan Lu and Ang Li and Jinyang Li and Aurojit Panda and Saining Xie},
year={2024},
eprint={2406.18533},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.18533},
}
Please use "black" with default settings to format the code if you want to contribute.
conda install black==24.4.2
Distributed under the Apache License, Version 2.0. See LICENSE.txt for more information.
- Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, July 2023. URL: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/.