Skip to content

xiaomi-research/shuffle-r1

Repository files navigation

[ICLR 2026] Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle

Paper Hugging Face Model Hugging Face Dataset

[2026.01] Shuffle-R1 is accepted by ICLR 2026!

[2025.08] Our code, model checkpoint and dataset are open-sourced, Check them out!

Official code repository of Shuffle-R1.

Project website: https://xenozlh.github.io/Shuffle-R1/

Introduction

Shuffle-R1 is a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces two key modules:

  • Pairwise Trajectory Sampling (PTS)
  • Advantage-based Batch Shuffle (ABS)

Experiments across multiple reasoning benchmarks demonstrate that our framework consistently outperforms strong RL baselines with minimal computational overhead. Specifically, Shuffle-R1 achieves superior performance against GRPO while using only half of the training steps under same settings.

TL;DR: We propose Shuffle-R1, a simple and effective RL post-training framework for MLLM that significantly improves RL training efficiency and model performance.

Release

  • model checkpoints (3B and 7B)
  • datasets
  • training scripts
  • inference scripts via Transformers and vLLM
  • evaluation scripts

Framework Overview

Framework Overview

Performance Overview

Model MathVerse MathVision MathVista (mini) WeMath (loose) HallusionBench ChartQA Avg.
Qwen2.5-VL-3B 34.8 21.9 58.4 51.7 59.8 73.1 49.9
Qwen2.5-VL-7B 42.6 25.8 67.4 63.5 65.2 79.8 57.4
Shuffle-R1-3B 44.2 26.8 70.4 66.5 69.2 79.9 59.5
Shuffle-R1-7B 53.9 30.0 77.0 72.3 71.0 84.1 64.7

All models are evaluated under CoT prompt.

Try our model

3B checkpoint link: Shuffle-R1-Qwen-3B 7B checkpoint link: Shuffle-R1-Qwen-7B

Using Transformers

The process is the same as Qwen2.5-VL. Note that it is better to add a "Thinking prompt" at the begining of user query.

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "path/to/your/checkpoint"

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_path)

system_prompt = """
You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image"},
            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Using vLLM

We also provide scripts for vLLM inference. You can run the following command to inference with vLLM:

python inference/infer_vllm.py \
    --model path/to/your/checkpoint \
    --output-dir path/to/save/outputs \
    --input-file path/to/your/input/file.jsonl \
    --tensor-parallel-size 1 \
    --min-pixels 262144 \
    --max-pixels 4194304 \
    --max-model-len 8192 \
    --temperature 0.5

The inference scripts support batch inference. You can organize your inference data in a jsonl file like this:

[
    {'image_path': "path/to/image/1", 'question': "question 1"},
    {'image_path': "path/to/image/2", 'question': "question 2"},
    ...
] 

Install

Our code is based on EasyR1. Our code follows a non-intrusive design, which keeps the original functions of EasyR1 unchanged.

For environment installation, you can:

git clone https://github.com/xiaomi-research/shuffle-r1.git
cd shuffle-r1
pip install -e .

Training

Dataset Preparation

Download our training data at here.

The training data contains 2.1k samples from Geometry3K and 27k random selected samples from MM-EUREKA dataset. Each sample in the dataset follows the format below:

{
    "problem": "your problem",  # type: str
    "images": [{"bytes": image_bytes, "path": None}],  # type: list[dict]
    "answer": "your answer",  # type: str
    "source": "data source"  # type: str, not used in training
}

Custom Dataset Format

Supported dataset format is the same as EasyR1. You can organize your dataset in the same format as illustrated above. Refer to here for more information.

Training Script

bash examples/qwen2_5_vl_3b.sh  # 3B model training 
bash examples/qwen2_5_vl_7b.sh  # 7B model training

All training are conducted on 8x H800-80G GPUs.

Evaluation

Download the evaluation benchmark at here. We use Gemini-2.0-flash to evaluate the model response for certain benchmakrs. Make sure to adjust the llm_eval_score_retry function in evaluation/utils/model_parser.py to enable your own API service before evaluation.

cd evaluation
bash eval.sh  # start evaluation

Refer to evaluation/eval.sh for more details.

Acknowledgement

Our work benefit from the following open-source projects:

Citation

If you find our work useful for your research, please consider citing:

@misc{zhu2025shuffler1,
      title={Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle}, 
      author={Linghao Zhu, Yiran Guan, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Bin Qin, Jian Luan, Yuliang Liu, Xiang Bai},
      year={2025},
      eprint={2508.05612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.05612}, 
}

About

Official code repository of Shuffle-R1

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages