We present Query3D, a novel method for open-vocabulary 3D scene querying in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs).

Abstract: This paper introduces a novel method for open-vocabulary 3D scene querying in autonomous driving by combining Language Embedded 3D Gaussians with Large Language Models (LLMs). We propose utilizing LLMs to generate both contextually canonical phrases and helping positive words for enhanced segmentation and scene interpretation. Our method leverages GPT-3.5 Turbo as an expert model to create a high-quality text dataset, which we then use to fine-tune smaller, more efficient LLMs for on-device deployment. Our comprehensive evaluation on the WayveScenes101 dataset demonstrates that LLM-guided segmentation significantly outperforms traditional approaches based on predefined canonical phrases. Notably, our fine-tuned smaller models achieve performance comparable to larger expert models while maintaining faster inference times. Through ablation studies, we discover that the effectiveness of helping positive words correlates with model scale, with larger models better equipped to leverage additional semantic information. This work represents a significant advancement towards more efficient, context-aware autonomous driving systems, effectively bridging 3D scene representation with high-level semantic querying while maintaining practical deployment considerations.
This repository contains code for LLM-powered semantic querying of 3D Gaussian Splatting scenes. Built upon the LEGaussian framework, this project extends the capabilities by integrating Large Language Models (LLMs) like Qwen and Llama for semantic scene understanding.
Our project enables semantic querying of 3D Gaussian Splatting scenes through LLM integration: users can query scene elements and generate targeted visualizations via natural language.
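As a rough illustration of the idea described above, the sketch below prompts an LLM to produce contextually canonical phrases and helping positive words for a user query. It assumes the OpenAI Python client; the prompt wording and the `expand_query` helper are illustrative, not the exact prompts used in the paper.

```python
# Hypothetical sketch: ask an LLM for canonical phrases (generic distractor
# categories) and helping positive words (synonyms/related terms) for a query.
# Prompt wording is illustrative, not the exact prompt used in the paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def expand_query(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "You assist open-vocabulary 3D scene segmentation for autonomous driving."},
            {"role": "user",
             "content": f"For the query '{query}', list contextually canonical phrases and "
                        "helping positive words, one per line."},
        ],
    )
    return response.choices[0].message.content

print(expand_query("pedestrian crossing the road"))
```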
The main dataset used in this project is WayveScenes 101, which can be downloaded from the official repository.
For convenience, we provide pre-processed data used in our paper:
- Subset of scenes used in experiments
- Pre-extracted language features
- Pre-computed codebooks
Download link: Google Drive Link
After downloading, place the data in the following structure:
data/
├── wayvescene/
│ ├── scene_xxx/
│ │ ├── images/
│ │ ├── sparse/
│ │ ├── xxx_encoding_indices.pt
│ │ └── xxx_codebook.pt
Note: Please cite the original WayveScenes paper if you use their dataset in your research.
This project uses two Docker environments: one for the main Gaussian Splatting framework and another for LLM finetuning.
- Clone the repository:
git clone https://github.com/AmirhoseinCh/Query-3DGS-LLM.git
cd Query-3DGS-LLM
- Build the main Docker image:
docker build -t legaussians .
- Run the container:
./run.sh
The run.sh script will set up the Docker environment and start the container with all necessary dependencies installed.
For LLM finetuning, we use unsloth, which requires a separate environment:
- Navigate to the unsloth directory:
cd unsloth
- Build the unsloth Docker image:
docker build -t unsloth .
- Run the unsloth container:
./run_unsloth.sh
This will set up the environment specifically for finetuning LLMs (Qwen and Llama models).
Note: Make sure you have Docker installed on your system before proceeding with either installation.
We extract and process features from multi-view images following these steps:
- Extract dense CLIP and DINO features from multi-view images
- Concatenate them into dense per-pixel features
- Quantize the features and save:
  - Feature indices (xxx_encoding_indices.pt)
  - Codebook (xxx_codebook.pt)
To preprocess the images:
cd preprocess
python quantize_features.py --config configs/wayvescene/xxx.cfg
Configuration files for specific scenes can be found in ./preprocess/configs/wayvescene. You can modify these configs for other scenes or datasets.
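For intuition, here is a minimal sketch of the kind of output the quantization step produces, assuming a simple k-means codebook; the actual quantize_features.py implementation may differ.

```python
# Minimal sketch of feature quantization (assumes k-means; the real script may
# use a different quantizer). Dense CLIP+DINO features are clustered into a
# codebook, and each pixel stores an index into that codebook.
import torch
from sklearn.cluster import KMeans

def quantize(features: torch.Tensor, codebook_size: int = 128):
    """features: (N, D) dense features, e.g. per-pixel CLIP and DINO concatenated."""
    kmeans = KMeans(n_clusters=codebook_size, n_init=10).fit(features.numpy())
    codebook = torch.from_numpy(kmeans.cluster_centers_)   # (codebook_size, D)
    indices = torch.from_numpy(kmeans.labels_).long()      # (N,)
    return indices, codebook

features = torch.randn(10000, 896)  # illustrative: 512-d CLIP + 384-d DINO
indices, codebook = quantize(features)
torch.save(indices, "xxx_encoding_indices.pt")
torch.save(codebook, "xxx_codebook.pt")
```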
Train the model using the train.py script. Config files specify:
- Data and output paths
- Training hyperparameters
- Test set
- Language feature indices path
python train.py --config configs/wayvescene/xxx.cfg
Training configs for the WayveScenes 101 dataset are located in ./configs/wayvescene.
You can render multiple scenes using the batch rendering script:
./render_scenes.sh
The rendering process generates:
- RGB images
- Relevancy maps of text queries
- Segmentation masks
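As a rough illustration of how a relevancy map can be derived from rendered language features and a text query, the following sketch scores each pixel against the query embedding using canonical phrases as negatives (LERF-style pairwise softmax). It assumes the OpenAI CLIP package; the repository's actual scoring code may differ.

```python
# Rough sketch: relevancy of rendered per-pixel language features against a text
# query, using canonical phrases as negatives. Illustrative only.
import torch
import clip  # assumes the OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

@torch.no_grad()
def encode(texts):
    tokens = clip.tokenize(texts).to(device)
    feats = model.encode_text(tokens).float()
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def relevancy_map(rendered_feats, query, canonical=("object", "things", "stuff", "texture")):
    """rendered_feats: (H, W, D) language features rendered from the Gaussians."""
    feats = rendered_feats.to(device)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    pos_sim = feats @ encode([query]).T          # (H, W, 1)
    neg_sim = feats @ encode(list(canonical)).T  # (H, W, C)
    # Pairwise softmax between the query and each canonical phrase, then take the min.
    rel = torch.exp(pos_sim) / (torch.exp(pos_sim) + torch.exp(neg_sim))
    return rel.min(dim=-1).values                # (H, W) relevancy map
```

A segmentation mask can then be obtained by thresholding the relevancy map (e.g., relevancy > 0.5).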
Run evaluation on the rendered results using:
./eval_scenes.sh
This script will evaluate the model's performance across all specified scenes and generate metrics including:
- Mean accuracy
- Mean IoU
- Mean precision
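For reference, a minimal sketch of how these mask metrics can be computed from binary predicted and ground-truth masks; eval_scenes.sh may compute them differently.

```python
# Illustrative per-query mask metrics from boolean predicted/ground-truth masks.
import torch

def mask_metrics(pred: torch.Tensor, gt: torch.Tensor):
    """pred, gt: boolean (H, W) segmentation masks for one query."""
    tp = (pred & gt).sum().float()
    fp = (pred & ~gt).sum().float()
    fn = (~pred & gt).sum().float()
    tn = (~pred & ~gt).sum().float()
    iou = tp / (tp + fp + fn + 1e-8)
    precision = tp / (tp + fp + 1e-8)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return iou.item(), precision.item(), accuracy.item()
```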
We support multiple LLM models:
- Qwen 2.5 series (0.5B, 1.5B, 3B, 7B)
- Llama series (1B, 3B, 8B)
The project uses unsloth for efficient LLM finetuning:
- Start the unsloth Docker container:
cd unsloth
./run_unsloth.sh
- Run the finetuning process using the provided Jupyter notebook: finetune.ipynb
The notebook contains all necessary steps and instructions for finetuning both Qwen and Llama models.
Finetuned models will be saved in the respective output directories based on the model size and type (e.g., outputs-3B/ for the Qwen 2.5 3B model).
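For orientation, below is a minimal sketch of loading one of these models with unsloth for LoRA finetuning. The model name and hyperparameters are illustrative; finetune.ipynb contains the full, authoritative pipeline.

```python
# Illustrative sketch of loading a model with unsloth for LoRA finetuning;
# see finetune.ipynb for the actual pipeline used in this project.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-3B-Instruct",  # illustrative; any supported Qwen/Llama checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,              # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
# The LoRA-wrapped model can then be trained on the GPT-3.5 Turbo generated
# dataset and saved to the matching output directory (e.g., outputs-3B/).
```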
We welcome contributions! Please feel free to submit a Pull Request.
If you use this code in your research, please cite our work:
@misc{chahe2024query3dllmpoweredopenvocabularyscene,
  title={Query3D: LLM-Powered Open-Vocabulary Scene Segmentation with Language Embedded 3D Gaussian},
  author={Amirhosein Chahe and Lifeng Zhou},
  year={2024},
  eprint={2408.03516},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2408.03516},
}
This work builds upon the LEGaussian implementation. We thank the original authors for making their code available.
See the LICENSE file for details.