Awesome Visual Autoregressive

This is a curated list of recent work on visual autoregressive modeling, covering image, video, 3D, and multi-modal generation, among other areas. It aims to include all the latest relevant papers on visual autoregressive modeling to save you time. Suggestions and pull requests are welcome!

Survey

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Image Generation

MaskGIT: Masked Generative Image Transformer
MAGVIT: Masked Generative Video Transformer
RQ-VAE: Autoregressive Image Generation using Residual Quantization
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
MAGVIT-v2: Language Model Beats Diffusion: Tokenizer is Key to Visual Generation
LlamaGen: Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
VAR: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
MAR: Autoregressive Image Generation without Vector Quantization
SAR: Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling
STAR: Scale-wise Text-to-image generation via Auto-Regressive representations
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
Taming Scalable Visual Tokenizer for Autoregressive Image Generation
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
DARL: Denoising Autoregressive Representation Learning
TiTok: An Image is Worth 32 Tokens for Reconstruction and Generation
XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation
ImageFolder: Autoregressive Image Generation with Folded Tokens
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation
CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
RAR: Randomized Autoregressive Visual Generation
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
CAR: Controllable Autoregressive Modeling for Visual Generation
CCA: Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
Scalable Autoregressive Image Generation with Mamba
ControlVAR: Exploring Controllable Visual Autoregressive Modeling
DnD-Transformer: A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-grained Image Generation
EditAR: Unified Conditional Generation with Autoregressive Models
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization
XTRA: Sample- and Parameter-Efficient Auto-Regressive Image Models
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling
PAR: Parallelized Autoregressive Visual Generation
NPP: Next Patch Prediction for Autoregressive Visual Generation
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
IAR: Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
ViTok: Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
Fractal Generative Models
IGTR: Autoregressive Image Generation Guided by Chains of Thought
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
UniTok: A Unified Tokenizer for Visual Generation and Understanding
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
FAR: Frequency Autoregressive Image Generation with Continuous Tokens
DAR: Direction-Aware Diagonal Autoregressive Image Generation

Video Generation

Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models
NOVA: Autoregressive Video Generation without Vector Quantization
CausVid: From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
An Empirical Study of Autoregressive Pre-training from Videos
CTF: Taming Teacher Forcing for Masked Autoregressive Video Generation
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling

3D Generation

SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
TAR3D: Creating High-quality 3D Assets via Next-Part Prediction
ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

Multi-Modal

Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
Show-o: One Single Transformer To Unify Multimodal Understanding and Generation
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
DreamLLM: Synergistic Multimodal Comprehension and Creation
LMFusion (LlamaFusion): Adapting Pretrained Language Models for Multimodal Generation
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
Chameleon: Mixed-Modal Early-Fusion Foundation Models
Emu3: Next-Token Prediction is All You Need
Liquid: Language Models are Scalable Multi-modal Generators
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
JetFormer: An Autoregressive Generative Model of Raw Images and Text
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Dual Diffusion for Unified Image Generation and Understanding
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation

Autonomous Driving

DrivingWorld: Constructing World Model for Autonomous Driving via Video GPT
DrivingGPT: Unifying Driving World Modeling and Planning with Multi-modal Autoregressive Transformers
