Chat-Scene

We build a multi-modal large language model for 3D scene understanding, excelling in tasks such as 3D grounding, captioning, and question answering.

🔥 Ranked 1st on the ScanRefer Benchmark (Sept. 2024)

leaderboard link

🔥 Ranked 1st on the Scan2Cap Benchmark (Sept. 2024)

leaderboard link

News

[2024.09] 🔥 Chat-Scene has been accepted by NeurIPS 2024! [paper]

[2024.08] 🔥 We release Chat-Scene, capable of processing both 3D point clouds and 2D multi-view images for improved 3D scene understanding, leading to significant advancements in grounding and captioning performance.

[2024.04] We release a refined implementation (v2.1), which achieves better performance on grounding, captioning, and QA tasks. The code is available in branch v2.1.

[2023.12] We release Chat-3D v2 [paper], introducing object identifiers for enhanced object referencing and grounding in 3D scenes. The original code is available in branch v2.0.

[2023.08] We release Chat-3D [paper] [code], an LLM-based dialogue system for 3D scenes.

🔥 Chat-Scene vs Chat-3D v2

Performance Comparison

| Model | ScanRefer Acc@0.25 | ScanRefer Acc@0.5 | Multi3dRefer F1@0.25 | Multi3dRefer F1@0.5 | Scan2Cap CIDEr@0.5 | Scan2Cap B-4@0.5 | ScanQA CIDEr | ScanQA B-4 | SQA3D EM |
|---|---|---|---|---|---|---|---|---|---|
| v2.0 | 35.9 | 30.4 | - | - | 28.1 | 15.5 | 77.1 | 7.3 | - |
| v2.1 | 42.5 | 38.4 | 45.1 | 41.6 | 63.9 | 31.8 | 87.6 | 14.0 | 54.7 |
| Chat-Scene | 55.5 | 50.2 | 57.1 | 52.4 | 77.1 | 36.3 | 87.7 | 14.3 | 54.6 |

*The v2.1 and Chat-Scene results are based on single models without task-specific finetuning.

Main Changes
New features in Chat-Scene
- Introduce a 2D token for each object, with 2D representations extracted from multi-view images using DINOv2.
- Enable processing of 2D ego-centric video using a tracking-based detector when 3D input is unavailable.
New features in v2.1 (Chat-Scene is built upon v2.1)
- LLM backbone: Vicuna v0 -> Vicuna v1.5 + LoRA.
- Training scheme: three-stage training -> one-stage joint training.
- Detector: PointGroup -> Mask3D.
- Code Optimization:
  - batch size: 1 -> 32.
  - Simplified training and evaluation processes.

🔨 Preparation

Prepare the environment:

```
conda create -n chat-scene python=3.9.17
conda activate chat-scene
conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
```
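
After installation, it can help to confirm that the pinned versions were actually picked up. The following is a minimal sketch, not part of this repository: the helper name `check_versions` is hypothetical, and it assumes the conda `pytorch` package registers itself under the distribution name `torch`, which is the usual case.

```python
from importlib import metadata

# Versions pinned in the install commands above.
EXPECTED = {"torch": "2.2.1", "torchvision": "0.17.1", "torchaudio": "2.2.1"}


def check_versions(expected: dict) -> dict:
    """Compare installed package versions against the expected pins.

    Hypothetical helper for illustration; returns a status string per package.
    """
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            report[package] = "missing"
            continue
        # Ignore local build suffixes such as "+cu118" when comparing.
        base = installed.split("+")[0]
        report[package] = "ok" if base == wanted else f"mismatch ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_versions(EXPECTED).items():
        print(f"{package}: {status}")
```

Run it inside the activated `chat-scene` environment; any line that does not report `ok` suggests the environment differs from the one described here.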

Download LLM backbone:
- We use Vicuna-7B v1.5 in our experiments, which can be downloaded from Hugging Face.
- Change the llama_model_path in config.py to the path of vicuna-7b-v1.5.
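
Before training, a quick check that the configured path points at a downloaded model directory can save a failed run. This is a sketch under assumptions: the helper `looks_like_model_dir` is hypothetical, and it relies only on the Hugging Face convention that a model directory contains a `config.json` file. It does not verify the weights themselves.

```python
from pathlib import Path


def looks_like_model_dir(path: str) -> bool:
    """Return True if `path` is a directory that contains a config.json file.

    Hypothetical helper for illustration; not part of this repository.
    """
    directory = Path(path)
    return directory.is_dir() and (directory / "config.json").is_file()


if __name__ == "__main__":
    # Placeholder path: replace with the value you set for llama_model_path.
    print(looks_like_model_dir("/path/to/vicuna-7b-v1.5"))
```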
Annotations and extracted features:

Please follow the instructions in preprocess.

🤖 Training and Inference

Training
- Modify run.sh:
```
train_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=False
```
  Explanation of "train_tag" and "val_tag"
  - Use # to separate different datasets
  - Datasets:
    - scanrefer: ScanRefer Dataset
    - scan2cap: Scan2Cap Dataset
    - scanqa: ScanQA Dataset
    - sqa3d: SQA3D Dataset
    - multi3dref: Multi3dRefer Dataset
    - nr3d_caption: A captioning dataset derived from Nr3D.
    - obj_align: A dataset derived from ScanRefer to align the object identifiers with object tokens.
- Run: bash scripts/run.sh
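
The tag format above can be sketched in a few lines of Python. This is an illustrative helper, not the parsing code used by this repository: `parse_tag` and `KNOWN_DATASETS` are hypothetical names, and the set of valid names is taken only from the list above.

```python
# Dataset names listed in this README; hypothetical constant for illustration.
KNOWN_DATASETS = {
    "scanrefer", "scan2cap", "scanqa", "sqa3d",
    "multi3dref", "nr3d_caption", "obj_align",
}


def parse_tag(tag: str) -> list:
    """Split a '#'-separated tag string and reject unknown dataset names."""
    names = [name.strip() for name in tag.split("#") if name.strip()]
    unknown = [name for name in names if name not in KNOWN_DATASETS]
    if unknown:
        raise ValueError(f"Unknown dataset name(s): {unknown}")
    return names


if __name__ == "__main__":
    train_tag = "scanrefer#scan2cap#scanqa#sqa3d#multi3dref#nr3d_caption#obj_align"
    print(parse_tag(train_tag))
```

Checking a tag string this way before launching a run catches misspelled dataset names early, for example `scanrefer#scan2caps`.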

Inference

- Modify run.sh (we provide the pretrained checkpoint on Google Drive):
```
val_tag="scanrefer#scan2cap#scanqa#sqa3d#multi3dref"
evaluate=True
pretrained_path="/path/to/pretrained_model.pth"
```
- Run: bash scripts/run.sh

📄 Citation

If you find this project useful in your research, please consider citing:

@article{huang2024chat,
  title={Chat-scene: Bridging 3d scene and large language models with object identifiers},
  author={Huang, Haifeng and Chen, Yilun and Wang, Zehan and Huang, Rongjie and Xu, Runsen and Wang, Tai and Liu, Luping and Cheng, Xize and Zhao, Yang and Pang, Jiangmiao and others},
  journal={Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada},
  year={2024}
}
@article{wang2023chat,
  title={Chat-3d: Data-efficiently tuning large language model for universal dialogue of 3d scenes},
  author={Wang, Zehan and Huang, Haifeng and Zhao, Yang and Zhang, Ziang and Zhao, Zhou},
  journal={arXiv preprint arXiv:2308.08769},
  year={2023}
}

Stay tuned for our project. 🔥

If you have any questions or suggestions, feel free to drop us an email ([email protected], [email protected]) or open an issue.

😊 Acknowledgement

We thank the following open-source projects:

(Multi-modal) LLMs: LLaMA, Vicuna, VideoChat, LEO

3D Datasets: ScanNet, ScanRefer, ReferIt3D, Scan2Cap, ScanQA, SQA3D, Multi3dRefer

Detectors: PointGroup, Mask3D, DEVA

Representations: ULIP, Uni3D, DINOv2

3D Models: vil3dref, OpenScene
