Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

[📜 Paper] [⭐️ Project Page] [🤗 Model]

(Overview figure)

⭐️ Introduction

While significant research has focused on developing embodied reasoning capabilities with Vision-Language Models (VLMs) or on integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning and VLA policy learning by introducing Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability: a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodiment-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

🗞️ News

  • 2025-10-13: 🤖 We release the Vlaser VLM models (Vlaser-2B and Vlaser-8B) as well as the VLA model (Vlaser-2B-VLA) on 🤗 Vlaser (see the download sketch below).
  • 2025-10-13: 🤖 We release the training and inference code of the Vlaser VLM, built on InternVL3.
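As a convenience, the released checkpoints can be fetched from the Hugging Face Hub with `huggingface_hub`. This is a minimal sketch only: the repository ID used here (`OpenGVLab/Vlaser-2B`) is an assumption based on the model names above, so please check the 🤗 Vlaser collection for the actual paths.

```python
# Minimal download sketch (assumed repo ID; verify against the 🤗 Vlaser collection).
from huggingface_hub import snapshot_download

# Download the 2B VLM checkpoint to a local directory.
local_dir = snapshot_download(
    repo_id="OpenGVLab/Vlaser-2B",          # assumed repo ID
    local_dir="./checkpoints/Vlaser-2B",
)
print(f"Checkpoint downloaded to {local_dir}")
```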

📆 Todo

  • Release Vlaser-2B and Vlaser-8B checkpoints for VLM embodied reasoning.
  • Release Vlaser-2B-VLA model for end-to-end robot control in SimplerEnv (WidowX and Google Robot).
  • Release the training and evaluation code for Vlaser VLMs.
  • Release the training and evaluation code for Vlaser VLAs.
  • Release the Dataset Generation Pipeline.
  • Release the Vlaser-6M Dataset.

Vlaser VLM Quick Start

Please refer to Vlaser_VLM for details.
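For a quick impression before diving into Vlaser_VLM, here is a minimal single-image inference sketch. It assumes the released VLM follows the InternVL3 chat interface on which it is built (loading via `AutoModel` with `trust_remote_code=True` and generating with `model.chat(...)`), and it uses an assumed checkpoint path (`OpenGVLab/Vlaser-2B`); treat Vlaser_VLM as the authoritative usage guide.

```python
# Minimal single-image inference sketch, assuming the InternVL3-style chat API.
import torch
from PIL import Image
from torchvision import transforms as T
from transformers import AutoModel, AutoTokenizer

path = "OpenGVLab/Vlaser-2B"  # assumed repo ID; see the 🤗 Vlaser collection

model = AutoModel.from_pretrained(
    path, torch_dtype=torch.bfloat16, trust_remote_code=True
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=False)

# Preprocess one image to the 448x448 input used by InternVL-style vision encoders.
preprocess = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),  # ImageNet stats
])
image = Image.open("example.jpg").convert("RGB")
pixel_values = preprocess(image).unsqueeze(0).to(torch.bfloat16).cuda()

# Ask an embodied-reasoning style question about the image.
question = "<image>\nWhich object should the robot pick up to set the table?"
response = model.chat(
    tokenizer, pixel_values, question, generation_config=dict(max_new_tokens=256)
)
print(response)
```

The official Vlaser_VLM instructions may differ in details such as dynamic tiling of high-resolution inputs, so use this only as a starting point.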

🎫 License

This project is released under the MIT License.

🖊️ Citation

If you find this work helpful in your research, please consider giving this repo a star ⭐ and citing our paper:

@article{luo2025visual,
  title={Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces},
  author={Luo, Gen and Yang, Ganlin and Gong, Ziyang and Chen, Guanzhou and Duan, Haonan and Cui, Erfei and Tong, Ronglei and Hou, Zhi and Zhang, Tianyi and Chen, Zhe and others},
  journal={arXiv preprint arXiv:2506.00123},
  year={2025}
}
