[Paper] [Arxiv] [Checkpoints] [Data] [Website]
Dec 16 2025: We released the preprint and Project Page for Sparse-LaViDa, an efficient optimization technique for training and sampling from unified multi-modal dLLMs based on LaViDa.
Oct 2025: We open-sourced LaViDa-O, a state-of-the-art unified multi-modal model built on LaViDa.
Sep 2025: We released the preprint for LaViDa-O, an extension of LaViDa to visual generation tasks.
Aug 2025: Our work was accepted to NeurIPS 2025 as a Spotlight Paper!
conda create --name lavida python=3.13
conda activate lavida
pip install -e .[train]
cd eval
pip install -e .
cd ../
pip install trl==0.17.0
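Optionally, you can verify that the environment was built correctly before downloading any checkpoints. The short check below only relies on PyTorch and Transformers and is not specific to LaViDa.

# Optional environment check: confirms PyTorch, Transformers, and GPU visibility.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("cuda available:", torch.cuda.is_available())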
For easy reproducibility, inference, and testing, we provide a Transformers-compatible checkpoint that does not require the source code to run. Please download the checkpoints from Huggingface.
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and model weights from the downloaded checkpoint folder
tokenizer = AutoTokenizer.from_pretrained('./lavida-llada-v1.0-instruct/')
model = AutoModelForCausalLM.from_pretrained('./lavida-llada-v1.0-instruct/', torch_dtype=torch.bfloat16)

# The image processor is attached to the model's vision tower
image_processor = model.get_vision_tower().image_processor

# Align the embedding table with the tokenizer vocabulary and tie weights
model.resize_token_embeddings(len(tokenizer))
model.tie_weights()
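As a quick usage sketch, the snippet below shows how the tokenizer and image processor loaded above can be used to prepare inputs. The image path and prompt are placeholders, and it assumes the image processor follows the standard Transformers image-processor interface; the actual diffusion-based generation loop is implemented in predict.py.

# Hedged sketch: preprocess placeholder inputs for the checkpoint loaded above.
# The diffusion sampling loop itself lives in predict.py.
from PIL import Image

image = Image.open("example.jpg").convert("RGB")   # placeholder image path
prompt = "Describe this image."                    # placeholder prompt

# Standard Transformers-style preprocessing (assumed interface)
pixel_values = image_processor(image, return_tensors="pt")["pixel_values"].to(torch.bfloat16)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

print(pixel_values.shape, input_ids.shape)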
Please download the checkpoints from Huggingface and organize them in the following structure (a scripted download sketch follows the listing):
<repo root>
--lavida-ckpts # create this folder via mkdir
--lavida-llada-hd # jacklishufan/lavida-llada-v1.0-instruct
--lavida-dream-hd # jacklishufan/lavida-dream-v1.0-instruct
--lavida-llada-hd-fim # jacklishufan/lavida-llada-1.0-fim
--lavida-llada-hd-reason # hbXNov/lavida-llada-reason
--lavida-llada-lowres # jacklishufan/lavida-llada-1.0-lowres
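If you prefer scripting the download, the sketch below uses huggingface_hub.snapshot_download to populate the folder layout above; downloading the repos manually from their Hugging Face pages works just as well.

# Convenience sketch: download the released checkpoints into the layout above.
from huggingface_hub import snapshot_download

checkpoints = {
    "lavida-ckpts/lavida-llada-hd": "jacklishufan/lavida-llada-v1.0-instruct",
    "lavida-ckpts/lavida-dream-hd": "jacklishufan/lavida-dream-v1.0-instruct",
    "lavida-ckpts/lavida-llada-hd-fim": "jacklishufan/lavida-llada-1.0-fim",
    "lavida-ckpts/lavida-llada-hd-reason": "hbXNov/lavida-llada-reason",
    "lavida-ckpts/lavida-llada-lowres": "jacklishufan/lavida-llada-1.0-lowres",
}
for local_dir, repo_id in checkpoints.items():
    snapshot_download(repo_id=repo_id, local_dir=local_dir)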
Run the following scripts to perform standard inference and text infilling:
python predict.py
python predict_fim.py
| Model | MME | MMMU | MMB | Latency (s/image) |
|---|---|---|---|---|
| LaViDa-Dream | 1463.5 | 42.6 | 73.8 | 1.13 |
| LaViDa-LLaDa | 1365.6 | 43.3 | 70.5 | 1.32 |
| MMaDa | 1410.7 | 30.2 | 68.5 | 3.93 |
(Speed measurements were conducted with generation length = 32 and steps = 16.)
The evaluation scripts are under the eval folder. Please use the following script to reproduce the main results on MMMU.
bash eval/run.sh lavida-ckpts/lavida-llada-hd --tasks mmmu_val # for LaViDa-LLaDa
bash eval/run_dream.sh lavida-ckpts/lavida-dream-hd --tasks mmmu_val # for LaViDa-Dream
To reproduce results on other datasets, simply replace mmmu_val with the corresponding task name. To evaluate image captioning on COCO, run:
bash eval/run_coco.sh lavida-ckpts/lavida-llada-hd
| Model | KV Cache | CIDEr | Latency | NFE |
|---|---|---|---|---|
| LaViDa-LLaDa | off | 110.2 | 6.65 | 100% |
| LaViDa-LLaDa | on | 107.8 | 2.01 | 100% |
| LaViDa-LLaDa | off | 108.5 | 3.57 | 50% |
| LaViDa-LLaDa | on | 104.4 | 1.32 | 50% |
| LLaVa-1.6-7B (Baseline) | on | 96.7 | 1.67 | 100% |
We find that the low-resolution model is slightly faster than the HD model and has stronger performance on some tasks (e.g., COCO captioning). We provide the inference script as well.
bash eval/run_coco_lowres.sh lavida-ckpts/lavida-llada-lowres
The expected data folder structure looks like the following
<repo root>
--data
--pretrain # LCS-558K
-- images
-- blip_laion_cc_sbu_558k.json
--Open-LLaVA-NeXT
-- ai2d
-- ...
-- open-llava-next
--infovqa-v1
--VQAv2_train
- Download LCS-558K and place it in data/pretrain
- Download all datasets from Open-LLaVA-NeXT and place them in data/Open-LLaVA-NeXT
- Download the remaining datasets from our Huggingface. This dataset contains three subfolders (a layout sanity-check sketch follows this list):
  - infovqa-v1 -> put under data/
  - VQAv2_train -> put under data/
  - open-llava-next -> put under data/Open-LLaVA-NeXT/, merging with the existing folder of the same name
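The optional sketch below sanity-checks that the expected folders exist before launching training; the path list is taken directly from the structure above.

# Optional sanity check: verify the expected data layout before training.
from pathlib import Path

expected = [
    "data/pretrain/images",
    "data/pretrain/blip_laion_cc_sbu_558k.json",
    "data/Open-LLaVA-NeXT/open-llava-next",
    "data/infovqa-v1",
    "data/VQAv2_train",
]
for p in expected:
    print(("ok      " if Path(p).exists() else "MISSING ") + p)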
Pretrain (Stage 1) Scripts:
scripts/train/exps/cluster/pretrain_llada.sh
scripts/train/exps/cluster/pretrain_dream.sh
Finetune (Stage 2) Scripts:
scripts/train/exps/cluster/llada-hd-llada-s2.sh
scripts/train/exps/cluster/llada-hd-dream-s2.sh
To launch the finetuning scripts, change the BASE_RUN_NAME variable in the shell scripts to the path of your stage 1 checkpoint. If you want to launch stage 2 training directly, we provide pretrained stage 1 checkpoints at the Stage-1-LLaDa and Stage-1-Dream links.
We observed a bug with DeepSpeed ZeRO-3 that breaks inference during validation. Hence, if you want to start a training run with eval results logged to wandb, please use ZeRO-2.
It can be found in the Huggingface collection.
The script is in scripts/train/exps/cluster/llada-hd-llada-s3-fim.sh
This repo is largely based on LLaVA-NeXT. We use LMMS-Eval for evaluation.
@inproceedings{lilavida,
title={LaViDa: A Large Diffusion Model for Vision-Language Understanding},
author={Li, Shufan and Kallidromitis, Konstantinos and Bansal, Hritik and Gokul, Akash and Kato, Yusuke and Kozuka, Kazuki and Kuen, Jason and Lin, Zhe and Chang, Kai-Wei and Grover, Aditya},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025}
}
@article{li2025lavida,
title={LaViDa-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation},
author={Li, Shufan and Gu, Jiuxiang and Liu, Kangning and Lin, Zhe and Wei, Zijun and Grover, Aditya and Kuen, Jason},
journal={arXiv preprint arXiv:2509.19244},
year={2025}
}

