[2026.01] Shuffle-R1 is accepted by ICLR 2026!
[2025.08] Our code, model checkpoints, and dataset are open-sourced. Check them out!
Official code repository of Shuffle-R1.
Project website: https://xenozlh.github.io/Shuffle-R1/
Shuffle-R1 is a simple yet principled framework that improves RL fine-tuning efficiency by dynamically restructuring trajectory sampling and batch composition. It introduces two key modules:
- Pairwise Trajectory Sampling (PTS)
- Advantage-based Batch Shuffle (ABS)
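The exact PTS and ABS procedures are specified in the paper; as a rough, hypothetical illustration of the batch-recomposition idea behind ABS, the sketch below ranks rollout trajectories by advantage magnitude and rebuilds batches from the ranked list (the ranking criterion here is our simplifying assumption, not the paper's formula):

```python
import random

def advantage_based_shuffle(trajectories, batch_size):
    # Illustrative sketch only: rank rollouts by |advantage| (assumption:
    # larger |advantage| carries a stronger learning signal) and recompose
    # batches from the ranked list. The actual ABS criterion used by
    # Shuffle-R1 is defined in the paper.
    ranked = sorted(trajectories, key=lambda t: abs(t["advantage"]), reverse=True)
    return [ranked[i:i + batch_size] for i in range(0, len(ranked), batch_size)]

# Toy rollout buffer: 8 trajectories with random advantages.
trajectories = [{"id": i, "advantage": random.uniform(-1, 1)} for i in range(8)]
batches = advantage_based_shuffle(trajectories, batch_size=4)
```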
Experiments across multiple reasoning benchmarks demonstrate that our framework consistently outperforms strong RL baselines with minimal computational overhead. In particular, Shuffle-R1 surpasses GRPO while using only half the training steps under the same settings.
TL;DR: We propose Shuffle-R1, a simple and effective RL post-training framework for MLLMs that significantly improves RL training efficiency and model performance.
- model checkpoints (3B and 7B)
- datasets
- training scripts
- inference scripts via Transformers and vLLM
- evaluation scripts
| Model | MathVerse | MathVision | MathVista (mini) | WeMath (loose) | HallusionBench | ChartQA | Avg. |
|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 34.8 | 21.9 | 58.4 | 51.7 | 59.8 | 73.1 | 49.9 |
| Qwen2.5-VL-7B | 42.6 | 25.8 | 67.4 | 63.5 | 65.2 | 79.8 | 57.4 |
| Shuffle-R1-3B | 44.2 | 26.8 | 70.4 | 66.5 | 69.2 | 79.9 | 59.5 |
| Shuffle-R1-7B | 53.9 | 30.0 | 77.0 | 72.3 | 71.0 | 84.1 | 64.7 |
All models are evaluated with a CoT prompt.
- 3B checkpoint link: Shuffle-R1-Qwen-3B
- 7B checkpoint link: Shuffle-R1-Qwen-7B
The inference process is the same as for Qwen2.5-VL. Note that it is better to add a "thinking prompt" at the beginning of the user query.
import torch  # needed for torch.bfloat16 below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
model_path = "path/to/your/checkpoint"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)
system_prompt = """
You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}.
"""
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "path/to/your/image"},
{"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
],
}
]
text = processor.apply_chat_template(
messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=2048)  # leave room for <think> reasoning
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
We also provide scripts for vLLM inference. Run the following command to perform inference with vLLM:
python inference/infer_vllm.py \
--model path/to/your/checkpoint \
--output-dir path/to/save/outputs \
--input-file path/to/your/input/file.jsonl \
--tensor-parallel-size 1 \
--min-pixels 262144 \
--max-pixels 4194304 \
--max-model-len 8192 \
--temperature 0.5
The inference script supports batch inference. Organize your inference data in a JSONL file, one JSON object per line:

{"image_path": "path/to/image/1", "question": "question 1"}
{"image_path": "path/to/image/2", "question": "question 2"}
...
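As a minimal sketch, the input file can be produced with the standard-library `json` module (the file name `input.jsonl` and the two sample entries below are placeholders for illustration):

```python
import json

# Hypothetical samples matching the expected keys ("image_path", "question").
samples = [
    {"image_path": "path/to/image/1", "question": "question 1"},
    {"image_path": "path/to/image/2", "question": "question 2"},
]

# JSONL layout: one JSON object per line.
with open("input.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```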
Our code is based on EasyR1 and follows a non-intrusive design that leaves the original EasyR1 functionality unchanged.
For environment installation, you can:
- Refer to the official instructions.
- Use the Dockerfile to build the environment.
- Use the pre-built Docker image directly.
git clone https://github.com/xiaomi-research/shuffle-r1.git
cd shuffle-r1
pip install -e .
Download our training data here.
The training data contains 2.1k samples from Geometry3K and 27k randomly selected samples from the MM-EUREKA dataset. Each sample in the dataset follows the format below:
{
"problem": "your problem", # type: str
"images": [{"bytes": image_bytes, "path": None}], # type: list[dict]
"answer": "your answer", # type: str
"source": "data source" # type: str, not used in training
}
Supported dataset format is the same as EasyR1. You can organize your dataset in the same format as illustrated above. Refer to here for more information.
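As a quick sanity check when preparing your own data, a sample can be validated against the schema above with a small helper (this is our own sketch, not the official EasyR1 validator; the placeholder bytes stand in for real image data):

```python
def validate_sample(sample: dict) -> bool:
    # Check one sample against the expected schema:
    # problem (str), images (list of {"bytes", "path"} dicts), answer (str).
    return (
        isinstance(sample.get("problem"), str)
        and isinstance(sample.get("answer"), str)
        and isinstance(sample.get("images"), list)
        and all(
            isinstance(img, dict) and "bytes" in img and "path" in img
            for img in sample["images"]
        )
    )

sample = {
    "problem": "your problem",
    "images": [{"bytes": b"\x89PNG...", "path": None}],  # placeholder bytes
    "answer": "your answer",
    "source": "demo",  # not used in training
}
```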
bash examples/qwen2_5_vl_3b.sh # 3B model training
bash examples/qwen2_5_vl_7b.sh # 7B model training
All training runs are conducted on 8x H800-80G GPUs.
Download the evaluation benchmarks here. We use Gemini-2.0-flash to judge model responses for certain benchmarks. Make sure to adapt the llm_eval_score_retry function in evaluation/utils/model_parser.py to your own API service before evaluation.
cd evaluation
bash eval.sh # start evaluation
Refer to evaluation/eval.sh for more details.
Our work benefits from the following open-source projects:
If you find our work useful for your research, please consider citing:
@misc{zhu2025shuffler1,
title={Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle},
author={Linghao Zhu and Yiran Guan and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Bin Qin and Jian Luan and Yuliang Liu and Xiang Bai},
year={2025},
eprint={2508.05612},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.05612},
}
