
RoboBrain-X0: A Unified Cross-Embodiment Vision-Language-Action Model for Token Reasoning and Action Generation.

  ⭐️ Project   |   🤗 Hugging Face   |   🤖 ModelScope  

OmniSAT: Compact Action Token, Faster Auto Regression for RoboBrain-X0 Pretraining.

🚀 RoboBrain 2.0: See Better, Think Harder, Do Smarter beyond RoboBrain 1.0.

⭐️ Reason-RFT: Core Post-Training Strategy for Embodied Visual Reasoning in RoboBrain 2.0.

🎯 RoboOS: An Efficient Open-Source Multi-Robot Coordination System for RoboBrain.

🌍 RoboBrain 1.0: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete.

💬 If you have any questions, feel free to contact us via WeChat or RedNote.

🔥 Overview

We are thrilled to introduce RoboBrain-X0, a cross-embodiment foundation model designed to overcome the limitations of single-robot systems when transferring skills across heterogeneous embodiments. By representing end-effector poses in the SE(3) task space and coupling them with a Unified Action Vocabulary (UAV) and an action tokenizer, RoboBrain-X0 achieves efficient zero-shot generalization and complex task decomposition. Its Grouped Residual Vector Quantizer (GRVQ) maps continuous control sequences from robots with diverse degrees of freedom and mechanical structures into a shared discrete space of action primitives, ensuring semantic consistency and transferability across embodiments such as the AgileX and R1-Lite dual-arm wheeled robots and the Franka single-arm system. Through embodiment-conditioned prompting, the model supports flexible decoding from multi-view RGB-D inputs to embodiment-specific action execution, significantly reducing training and inference overhead. RoboBrain-X0 delivers state-of-the-art performance on embodied reasoning tasks, laying a robust foundation for versatile real-world robotic agents and for advancing embodied intelligence research.
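
To make the quantization idea concrete, the sketch below shows grouped residual vector quantization in PyTorch: each group of action dimensions is quantized in several residual stages against its own codebook, yielding a short sequence of discrete token indices per control step. The class name, default sizes, and nearest-neighbor lookup are illustrative assumptions, not the actual OmniSAT / GRVQ implementation.

# Illustrative grouped residual vector quantization (assumed hyperparameters;
# not the actual OmniSAT / GRVQ code).
import torch
import torch.nn as nn

class GroupedResidualVQ(nn.Module):
    def __init__(self, action_dim=14, num_groups=2, num_stages=2, codebook_size=256):
        super().__init__()
        assert action_dim % num_groups == 0
        self.group_dim = action_dim // num_groups
        self.num_groups, self.num_stages = num_groups, num_stages
        # One codebook per (group, residual stage).
        self.codebooks = nn.Parameter(
            torch.randn(num_groups, num_stages, codebook_size, self.group_dim))

    def forward(self, actions):
        # actions: (batch, action_dim) normalized continuous control targets.
        tokens, recon = [], torch.zeros_like(actions)
        for g in range(self.num_groups):
            lo, hi = g * self.group_dim, (g + 1) * self.group_dim
            residual = actions[:, lo:hi].clone()
            for s in range(self.num_stages):
                codebook = self.codebooks[g, s]            # (codebook_size, group_dim)
                idx = torch.cdist(residual, codebook).argmin(dim=-1)
                quantized = codebook[idx]                  # nearest code vector per sample
                residual = residual - quantized            # pass the residual to the next stage
                recon[:, lo:hi] = recon[:, lo:hi] + quantized
                tokens.append(idx)
        # (batch, num_groups * num_stages) discrete action tokens, plus the reconstruction.
        return torch.stack(tokens, dim=-1), recon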

🗞️ News

📆 Todo

  • Release model checkpoint for RoboBrain-X0-Preview based on Fast Tokenizer
  • Release quick inference example for RoboBrain-X0
  • Release training and evaluation codes for RoboBrain-X0
  • Release the OmniSAT-based full version of RoboBrain-X0 (coming soon)

🚀 Features

RoboBrain-X0 supports unified modeling of heterogeneous embodiments and offers zero-shot generalization and complex task decomposition. Building on the multimodal foundation of RoboBrain 2.0, it further incorporates real-world robot motion data on top of the RoboBrain 2.0 training data. By unifying vision, language, and motion modeling, it achieves cross-embodiment generalization and adaptation, providing integrated capabilities from perception to execution.

⭐️ Architecture

This model consists of RoboBrain 2.0 and OmniSAT (the action tokenizer). Building on RoboBrain 2.0, the model is trained on a large amount of real-world robotics data and embodied reasoning data, giving it general robotic manipulation capabilities. The action token sequences output by the model are converted into low-level robot control signals by our own action tokenizer. Model details are as follows:

  • Multimodal Input: The model accepts single-image, multi-image, and text inputs (covering pointing, object maneuverability, trajectory, and subtask execution scenarios) and produces outputs of varying dimensions depending on the input scenario.
  • Action Generation and Execution: OmniSAT converts the action token sequences produced by the model into multi-degree-of-freedom (DoF) action sequences, which ultimately drive the robot to complete the task (an illustrative decode step is sketched after this list).
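
To complement the quantizer sketch in the Overview, the snippet below illustrates the reverse direction: given a sequence of token indices and per-group, per-stage codebooks, a continuous multi-DoF action is recovered by summing the selected code vectors. The shapes and token ordering are assumptions and do not describe the OmniSAT API.

# Illustrative detokenization of action tokens (assumed shapes; not the OmniSAT API).
import torch

def decode_action_tokens(tokens, codebooks):
    # tokens:    (batch, num_groups * num_stages) integer code indices from the model head
    # codebooks: (num_groups, num_stages, codebook_size, group_dim) learned code vectors
    num_groups, num_stages, _, group_dim = codebooks.shape
    actions = torch.zeros(tokens.shape[0], num_groups * group_dim)
    for g in range(num_groups):
        for s in range(num_stages):
            idx = tokens[:, g * num_stages + s]
            actions[:, g * group_dim:(g + 1) * group_dim] += codebooks[g, s][idx]
    return actions  # (batch, action_dim) continuous control targets for the robot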

🤗 Model Zoo

Models | Checkpoint | Description
RoboBrain-X0-Preview | 🤗 BAAI/RoboBrain-X0-Preview | Preview version of RoboBrain-X0
RoboBrain-X0-FlagOS | 🤗 FlagRelease/RoboBrain-X0-FlagOS | Multi-chip version of RoboBrain-X0
RoboBrain-X0-Dataset | 🤗 BAAI/RoboBrain-X0-Dataset | Training dataset for RoboBrain-X0
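
As a hedged example of fetching a checkpoint, the preview model can be downloaded with the huggingface_hub client; the local directory below is an arbitrary choice, and loading and serving are handled by the scripts in this repository.

# Download the preview checkpoint from Hugging Face (the target directory is an
# illustrative choice; loading/serving is handled by this repo's scripts).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="BAAI/RoboBrain-X0-Preview",
    local_dir="./checkpoints/RoboBrain-X0-Preview",
)
print(f"Checkpoint downloaded to: {local_dir}")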

🛠️ Setup

# Pull Docker Image.
docker pull harbor.baai.ac.cn/flagrelease-public/robobrain_x0_flagscale
# Backup link:
# docker pull harbor.baai.ac.cn/flagrelease-public/robobrain_x0_flagscale:cuda12.4.1-cudnn9.5.0-python3.12-torch2.6.0-time250928-ssh

# Run Container.
docker run -itd \
  --name robotics_pretrain \
  --privileged \
  --gpus all \
  --net=host \
  --ipc=host \
  --device=/dev/infiniband \
  --shm-size 512g \
  --ulimit memlock=-1 \
  -v /nfs/hcr/models/:/models \
  ghcr.io/robobrain-roboos-robotic/robotics_pretrain_flagscale:cuda12.4.1-cudnn9.5.0-python3.12-torch2.6.0-time250928-ssh

🤖 Data Processing

Take the open-source Agilex data from RoboBrain-X0-Dataset as an example.

  1. Run normal.py to generate the normalization parameter file:
python data_process/normal.py

This step produces the normalization parameter file and plots of the action data distribution (a hedged sketch of the kind of statistics involved appears after these steps).

  2. Update data_process/data_process.sh with the correct file path and number of processes, then run it to generate the training data in parallel:
bash data_process/data_process.sh
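
For orientation, the sketch below shows the per-dimension statistics that such a normalization step typically computes and how they would be applied before tokenization; the statistic choice, output file name, and array shapes are assumptions rather than the exact behavior of data_process/normal.py.

# Illustrative normalization statistics (assumed behavior; see data_process/normal.py
# for the actual implementation and output format).
import json
import numpy as np

def compute_action_stats(actions):
    # actions: (num_steps, action_dim) continuous robot actions from the dataset.
    return {
        "mean": actions.mean(axis=0).tolist(),
        "std": (actions.std(axis=0) + 1e-6).tolist(),  # avoid division by zero
        "min": actions.min(axis=0).tolist(),
        "max": actions.max(axis=0).tolist(),
    }

def normalize(actions, stats):
    # Scale each action dimension so the tokenizer sees comparable ranges.
    return (actions - np.array(stats["mean"])) / np.array(stats["std"])

if __name__ == "__main__":
    demo_actions = np.random.uniform(-1.0, 1.0, size=(1000, 14))  # synthetic placeholder data
    with open("action_norm_stats.json", "w") as f:                # hypothetical output file name
        json.dump(compute_action_stats(demo_actions), f, indent=2)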

🤖 Training

cd /root/robotics_pretrain/flag-scale
conda activate flagscale-train

python run.py \
  --config-path ./examples/qwen2_5_vl/conf \
  --config-name train_3b_action_S6_subtask_agilex_eval5_demo \
  action=run

🔍 Evaluation

Evaluation on real robots adopts a server-client architecture, with a control PC acting as the data transfer hub: the model is deployed on a remote server (responsible for inference), while the client (running on the robot side) sends the robot's state and camera images to the server. After inference, the server returns the generated actions to the client, which then drives the robot to execute them.

Real-Robot Evaluation Architecture Overview

The system operates on a simple yet effective server-client model:

  1. Server (Inference Side): A remote machine (usually with a powerful GPU) where the trained model is deployed.
    • It receives the robot's state and images.
    • It performs inference to decide on the next sequence of actions.
    • It sends the generated actions back to the client in the HTTP response.
  2. Client (Robot Side): The Python script running on the robot's control computer.
    • It continuously captures the robot's current state (e.g., end-effector pose, joint angles) and images from its cameras.
    • It packages this observation data into a JSON payload.
    • It sends the data to the remote inference server via an HTTP POST request.

This decoupling allows complex, resource-intensive models to run without high-end computing hardware attached directly to the robot; a minimal client sketch follows the diagram below.

┌──────────────────┐      ┌───────────────────────────┐      ┌────────────────────┐
│                  │      │      Control PC (Client)  │      │                    │
│   Real Robot     ├─────►│ 1. Get State & Images     ├─────►│  Inference Server  │
│                  │      │ 2. Send Request           │      │  (Model Deployed)  │
│                  │      │                           │      │                    │
│                  │◄─────┤ 4. Execute Actions        │◄─────┤ 3. Return Actions  │
└──────────────────┘      └───────────────────────────┘      └────────────────────┘
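
Below is a minimal sketch of the client-side loop described above, using the requests library. The server URL, JSON field names, and the way actions are forwarded to the robot controller are illustrative assumptions; the platform-specific clients under Examples implement the real protocol.

# Minimal client-side sketch (endpoint, payload keys, and robot interface are assumptions;
# see the platform-specific example clients for the real protocol).
import base64
import time
import requests

SERVER_URL = "http://<inference-server-ip>:8000/predict"  # hypothetical endpoint

def get_observation():
    # Placeholder: on a real robot this reads proprioceptive state and camera frames.
    state = [0.0] * 7                                     # e.g., end-effector pose or joint angles
    image_jpeg = b""                                      # raw JPEG bytes from a camera driver
    return {
        "state": state,
        "images": {"front": base64.b64encode(image_jpeg).decode("utf-8")},
        "instruction": "pick up the red block",
    }

def main():
    while True:
        response = requests.post(SERVER_URL, json=get_observation(), timeout=10.0)
        response.raise_for_status()
        actions = response.json()["actions"]              # assumed key: a chunk of action vectors
        for action in actions:
            pass                                          # forward each action to the robot controller
        time.sleep(0.05)                                  # illustrative control-loop pacing

if __name__ == "__main__":
    main()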

How to Run a Server

Each server can be configured and run independently. The general steps are listed below, followed by a hedged sketch of a minimal server loop.

  1. Choose a server script (e.g., serve_franka.py).
  2. Configure the mode: Open the script and set the SUBTASK_MODE variable:
    • SUBTASK_MODE = False for standard action generation.
    • SUBTASK_MODE = True for subtask + action generation.
  3. Verify paths: Ensure that MODEL_PATH and STATS_PATH (if applicable) point to the correct files for your chosen mode.
  4. Start the server:
    # Example for Franka server
    python serve_franka.py
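
For orientation, here is a hedged sketch of the server side using Flask. The route, payload keys, and stubbed model call are assumptions; serve_franka.py and the other server scripts define the actual interface, and only SUBTASK_MODE and MODEL_PATH mirror the configuration variables mentioned above.

# Hedged sketch of an inference server (framework, route, and payload keys are assumptions;
# serve_franka.py and its siblings define the real interface).
from flask import Flask, jsonify, request

MODEL_PATH = "/models/RoboBrain-X0-Preview"   # illustrative path; set to your checkpoint
SUBTASK_MODE = False                          # False: actions only; True: subtask + actions

app = Flask(__name__)

def run_inference(observation):
    # Placeholder for the real pipeline: a model loaded from MODEL_PATH would encode the
    # images and state, generate action tokens, and detokenize them into an action chunk.
    return [[0.0] * 7 for _ in range(16)]

@app.route("/predict", methods=["POST"])
def predict():
    observation = request.get_json()
    result = {"actions": run_inference(observation)}
    if SUBTASK_MODE:
        result["subtask"] = "reach toward the target object"   # illustrative subtask text
    return jsonify(result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)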

Examples

Inference code examples are provided for four robot platforms: A2D, Franka, Agilex, and R1-Lite (ROS1 and ROS2 versions).

  • Agilex
  • Franka
  • R1-Lite
  • A2D

📑 Citation

If you find this project useful, please consider citing our work.

@misc{lyu2025omnisatcompactactiontoken,
      title={OmniSAT: Compact Action Token, Faster Auto Regression}, 
      author={Huaihai Lyu and Chaofan Chen and Senwei Xie and Pengwei Wang and Xiansheng Chen and Shanghang Zhang and Changsheng Xu},
      year={2025},
      eprint={2510.09667},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.09667}, 
}
@article{RoboBrain2.0TechnicalReport,
    title={RoboBrain 2.0 Technical Report},
    author={BAAI RoboBrain Team},
    journal={arXiv preprint arXiv:2507.02029},
    year={2025}
}
@article{zhou2025roborefer,
    title={RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics},
    author={Zhou, Enshen and An, Jingkun and Chi, Cheng and Han, Yi and Rong, Shanyu and Zhang, Chi and Wang, Pengwei and Wang, Zhongyuan and Huang, Tiejun and Sheng, Lu and others},
    journal={arXiv preprint arXiv:2506.04308},
    year={2025}
}
@article{Reason-RFT,
    title={Reason-rft: Reinforcement fine-tuning for visual reasoning},
    author={Tan, Huajie and Ji, Yuheng and Hao, Xiaoshuai and Lin, Minglan and Wang, Pengwei and Wang, Zhongyuan and Zhang, Shanghang},
    journal={arXiv preprint arXiv:2503.20752},
    year={2025}
}
@article{RoboBrain1.0,
    title={Robobrain: A unified brain model for robotic manipulation from abstract to concrete},
    author={Ji, Yuheng and Tan, Huajie and Shi, Jiayu and Hao, Xiaoshuai and Zhang, Yuan and Zhang, Hengyuan and Wang, Pengwei and Zhao, Mengdi and Mu, Yao and An, Pengju and others},
    journal={arXiv preprint arXiv:2502.21257},
    year={2025}
}



