Junhao Cheng¹†, Liang Hou², Xin Tao², Jing Liao¹

¹City University of Hong Kong   ²Kling Team, Kuaishou Technology

†This work was conducted during the author's internship at Kling Team, Kuaishou Technology.
We pioneer Video-Next-Event Prediction (VNEP), extending text-based next-event prediction to dynamic video responses. This shift from telling to showing enables more intuitive and customized answers for procedural learning and creative exploration.
To tackle VNEP, we propose VANS, a model that aligns a Vision-Language Model (VLM) with a Video Diffusion Model (VDM) through our Joint-GRPO post-training approach. Our method bridges the semantic-to-visual gap between the VLM and the VDM, enabling high-quality next-event prediction and video generation.
*Overview figures: VANS architecture (dual-path processing with a VLM for reasoning and a VDM for generation); Joint-GRPO (two-stage co-steering optimization).*
VANS Architecture: processes the input video and question through dual pathways (sketched below):
- VLM Path: Performs instruction-grounded reasoning to generate textual captions
- VDM Path: Synthesizes videos conditioned on semantic captions and visual context
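To make the data flow concrete, here is a minimal sketch of one dual-path inference step. All names here (`vlm.generate_caption`, `vdm.generate_video`, `NextEventPrediction`) are hypothetical placeholders for illustration, not the actual VANS API:

```python
# Minimal sketch of VANS's dual-path inference (hypothetical API, for illustration only).
from dataclasses import dataclass

@dataclass
class NextEventPrediction:
    caption: str        # textual description of the predicted next event (VLM output)
    video_frames: list  # generated video frames (VDM output)

def predict_next_event(vlm, vdm, input_frames, question):
    # VLM path: instruction-grounded reasoning over the input video + question
    # produces a textual caption of the predicted next event.
    caption = vlm.generate_caption(frames=input_frames, instruction=question)

    # VDM path: synthesize the answer video, conditioned on BOTH the semantic
    # caption and the visual context so appearance and identity stay consistent.
    video_frames = vdm.generate_video(prompt=caption, context_frames=input_frames)

    return NextEventPrediction(caption=caption, video_frames=video_frames)
```

The key design point is that the VDM conditions on the caption and the input frames together, so the generated event stays visually coherent with the source video rather than following the text alone.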
Joint-GRPO: our two-stage reinforcement learning approach (see the sketch after this list):
- Stage 1: Visualization-friendly VLM tuning - optimizes captions for visual plausibility
- Stage 2: Context-faithful VDM adaptation - ensures semantic alignment and visual coherence
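Both stages build on GRPO-style group-relative advantages: for each input, a group of rollouts is sampled (captions in Stage 1, videos in Stage 2), each rollout is scored by that stage's reward, and rewards are normalized within the group. Below is a minimal sketch of that normalization, assuming standard GRPO; the reward values in the example are made up for illustration:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Standard GRPO advantage: z-score each rollout's reward within its group.

    `rewards` holds the scores of all rollouts sampled for ONE input; rollouts
    scoring above the group mean get positive advantage, below it negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four captions sampled for one input, scored by a Stage-1
# visual-plausibility reward (illustrative values).
print(group_relative_advantages([0.9, 0.4, 0.6, 0.1]))
```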
The same input video with different questions leads to diverse future predictions:

Example 1 — three generated continuations of one input video:
- "What if she gets burned in her daily life?"
- "What if she gets burned in an exaggerated movie?"
- "What if she eats something spicy in an exaggerated movie?"
Example 2 — three generated continuations of one input video:
- "Show her reaction if she sees her grandson."
- "Show her reaction if she sees her husband."
- "Show her reaction if she sees the personification of death."
To set up the environment for inference, run the following commands:

```bash
git clone https://github.com/KlingTeam/VANS.git
cd VANS
conda create -n VANS python=3.12 -y
conda activate VANS
pip install -r requirements.txt
# Install the bundled qwen-vl-utils with decord support for video decoding
cd vans/models_mllm/qwen-vl-utils
pip install -e .[decord]
cd ../../..  # back to the repo root
```

To get started, download the VANS base models:
- Qwen2.5-VL-3B - The Vision-Language Model
- Wan2.1-T2V-1.3B - The Video Diffusion Model
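Both base checkpoints are hosted on Hugging Face; one way to fetch them is with `huggingface_hub`, as sketched below. The `checkpoints/` target directories are an assumption, so check the repo's configs for the paths it actually expects:

```python
# Download the two base checkpoints from Hugging Face.
# NOTE: the local_dir paths below are assumptions, not paths mandated by VANS.
from huggingface_hub import snapshot_download

snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct",
                  local_dir="checkpoints/Qwen2.5-VL-3B-Instruct")
snapshot_download("Wan-AI/Wan2.1-T2V-1.3B",
                  local_dir="checkpoints/Wan2.1-T2V-1.3B")
```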
Then download the complete VANS model:
VANS Model Download (Coming Soon)
To run the local Gradio demo:

```bash
python app.py
```

TODO:
- Release VANS-Data-100K dataset
- Release VANS model
- Release training code
- Release inference code
- Release paper
If you find our work helpful, please consider giving us a star 🌟 and a citation 📝:
```bibtex
@article{cheng2025video,
  title={Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO},
  author={Cheng, Junhao and Hou, Liang and Tao, Xin and Liao, Jing},
  journal={arXiv preprint arXiv:2511.16669},
  year={2025}
}
```