This is a curated list of recent visual autoregressive modeling works, covering image, video, 3D, and multimodal generation, among other topics. It aims to collect the latest relevant papers on visual autoregressive modeling to save you time. Suggestions and pull requests are welcome!
-
Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey
-
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
-
RQ-VAE: Autoregressive Image Generation using Residual Quantization
-
Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization
-
MAGVIT-v2: Language Model Beats Diffusion: Tokenizer is Key to Visual Generation
-
LlamaGen: Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
-
VAR: Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction
-
MAR: Autoregressive Image Generation without Vector Quantization
-
SAR: Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling
-
STAR: Scale-wise Text-to-image generation via Auto-Regressive representations
-
Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens
-
Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis
-
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
-
Taming Scalable Visual Tokenizer for Autoregressive Image Generation
-
Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation
-
TiTok: An Image is Worth 32 Tokens for Reconstruction and Generation
-
XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation
-
ImageFolder: Autoregressive Image Generation with Folded Tokens
-
DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation
-
M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation
-
CoDe: Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient
-
VAR-CLIP: Text-to-Image Generator with Visual Auto-Regressive Modeling
-
RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
-
MARS: Mixture of Auto-Regressive Models for Fine-grained Text-to-image Synthesis
-
FlowAR: Scale-wise Autoregressive Image Generation Meets Flow Matching
-
CAR: Controllable Autoregressive Modeling for Visual Generation
-
CCA: Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment
-
ControlVAR: Exploring Controllable Visual Autoregressive Modeling
-
DnD-Transformer: A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Fine-grained Image Generation
-
EditAR: Unified Conditional Generation with Autoregressive Models
-
LiteVAR: Compressing Visual Autoregressive Modelling with Efficient Attention and Quantization
-
XTRA: Sample- and Parameter-Efficient Auto-Regressive Image Models
-
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis
-
ZipAR: Accelerating Auto-regressive Image Generation through Spatial Locality
-
E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling
-
NPP: Next Patch Prediction for Autoregressive Visual Generation
-
Distilled Decoding 1: One-step Sampling of Image Auto-regressive Models with Flow Matching
-
X-Prompt: Towards Universal In-Context Image Generation in Auto-Regressive Vision Language Foundation Models
-
StarGen: A Spatiotemporal Autoregression Framework with Video Diffusion Model for Scalable and Controllable Scene Generation
-
IAR: Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction
-
ViTok: Learnings from Scaling Visual Tokenizers for Reconstruction and Generation
-
FlexTok: Resampling Images into 1D Token Sequences of Flexible Length
-
IGTR: Autoregressive Image Generation Guided by Chains of Thought
-
Beyond Next-Token: Next-X Prediction for Autoregressive Visual Generation
-
UniTok: A Unified Tokenizer for Visual Generation and Understanding
-
FlexVAR: Flexible Visual Autoregressive Modeling without Residual Prediction
-
FAR: Frequency Autoregressive Image Generation with Continuous Tokens
-
DAR: Direction-Aware Diagonal Autoregressive Image Generation
-
Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
-
DiCoDe: Diffusion-Compressed Deep Tokens for Autoregressive Video Generation with Language Models
-
NOVA: Autoregressive Video Generation without Vector Quantization
-
CausVid: From Slow Bidirectional to Fast Autoregressive Video Diffusion Models
-
An Empirical Study of Autoregressive Pre-training from Videos
-
CTF: Taming Teacher Forcing for Masked Autoregressive Video Generation
-
Next Block Prediction: Video Generation via Semi-Autoregressive Modeling
-
SAR3D: Autoregressive 3D Object Generation and Understanding via Multi-scale 3D VQVAE
-
TAR3D: Creating High-quality 3D Assets via Next-Part Prediction
-
ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
-
Show-o: One Single Transformer To Unify Multimodal Understanding and Generation
-
SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
-
LMFusion (LlamaFusion): Adapting Pretrained Language Models for Multimodal Generation
-
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
-
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
-
TokenFlow: Unified Image Tokenizer for Multimodal Understanding and Generation
-
JetFormer: An Autoregressive Generative Model of Raw Images and Text
-
VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
-
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
-
Dual Diffusion for Unified Image Generation and Understanding
-
VARGPT: Unified Understanding and Generation in a Visual Autoregressive Multimodal Large Language Model
-
QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation
