1. School of Computer Science, Shanghai Jiao Tong University
2. Ant Group
3. Zhongguancun Academy
4. Shanghai Innovation Institute
📃 Paper | 🤗 Models & Training Datasets & ZoomBench
Recent "Thinking-with-Images" methods improve fine-grained perception by iteratively zooming into regions of interest during inference, but incur high latency due to repeated tool calls and visual re-encoding. In this work, we present the ZwZ models (2/4/7/8B), which achieve SOTA performance on multimodal perception benchmarks among open-source models. In addition, we present ZoomBench, a hybrid-annotated benchmark of 845 VQA samples spanning six fine-grained perceptual dimensions, together with a dual-view protocol that quantifies the global–regional "zooming gap".
We propose Region-to-Image Distillation (R2I), which transforms zooming from an inference-time tool into a training-time primitive. We:
- Zoom in to micro-cropped regions and let strong teacher models generate high-quality VQA data
- Distill this region-grounded supervision back to the full image with explicit bounding-box overlays
- Enable smaller student models to achieve single-glance fine-grained perception without tool use
This idea is summarized by the name "Zooming without Zooming". The first "Zooming" refers to the training-time primitive: we zoom into micro-regions to synthesize fine-grained training data. The second "Zooming" denotes the inference-time tool use we seek to bypass.
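The three steps above can be sketched as a small data-synthesis loop over toy records. This is a minimal illustration only: every function and field name below (`teacher_qa`, `distill_to_full_image`, `bbox_overlay`, ...) is our own invention, not the repo's actual API.

```python
# Minimal sketch of R2I data synthesis over toy records. Every function and
# field name here is illustrative; it is NOT the repo's actual API.

def teacher_qa(region):
    # Stand-in for a strong teacher model (e.g. Qwen3-VL-235B) that writes a
    # fine-grained QA pair about a zoomed-in micro-crop.
    return {
        "question": f"What color is the {region['object']} in the marked box?",
        "answer": region["color"],
    }

def distill_to_full_image(image_path, bbox, region):
    # Pair the region-grounded QA with the FULL image plus an explicit
    # bounding-box overlay, so the student answers in a single glance
    # without cropping at inference time.
    qa = teacher_qa(region)      # step 1: zoom in, let the teacher annotate
    return {
        "image": image_path,     # full image, not the crop
        "bbox_overlay": bbox,    # step 2: explicit bbox overlay on the full view
        **qa,                    # step 3: region-grounded supervision
    }

sample = distill_to_full_image(
    "/path/images/sa1b/000001.jpg",
    bbox=(120, 80, 260, 210),
    region={"object": "street sign", "color": "green"},
)
print(sample["answer"])  # green
```

The key design point is that the student is always trained on the full image plus a bbox overlay, never on the crop itself, so no tool call is needed at inference.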
- 🎯 Superior Accuracy: Achieves SOTA performance on perception benchmarks among open-source models
- ⚡ Single-Pass Efficiency: Requires only one forward pass, eliminating inference-time tool-calling overhead
- 📈 Broad Improvements: Enhances not only perception benchmarks but also out-of-distribution generalization on visual reasoning, GUI agents, and AIGC detection
- 🔍 ZoomBench: A comprehensive benchmark with 845 samples across 6 fine-grained dimensions, featuring multiple evaluation protocols
| Model | Base | Download |
|---|---|---|
| ZwZ-2B | Qwen3-VL-2B | 🤗 inclusionAI/ZwZ-2B |
| ZwZ-4B | Qwen3-VL-4B | 🤗 inclusionAI/ZwZ-4B |
| ZwZ-7B | Qwen2.5-VL-7B | 🤗 inclusionAI/ZwZ-7B |
| ZwZ-8B | Qwen3-VL-8B | 🤗 inclusionAI/ZwZ-8B |
Our Region-to-Image distilled training data (74K samples): 🤗 inclusionAI/ZwZ-RL-VQA
Source image pools:
- SA-1B, LAION, MetaCLIP, Visual Genome, CC12M, STPLS3D (we take only a small subset of images from each pool; most of the high-resolution images come from train-0000-of-0013.parquet in https://modelscope.cn/datasets/Tongyi-DataEngine/SA1B-Paired-Captions-Images)
Question Generator: Qwen3-VL-235B-A22B-Instruct
Answer Generators: Qwen3-VL-235B-A22B-Instruct, GLM-4.5V
We introduce 🤗 ZoomBench, a challenging benchmark for fine-grained multimodal perception:
- 845 high-quality samples across 6 perceptual dimensions:
  - Fine-Grained Counting
  - OCR (text & symbol recognition)
  - Color Attributes
  - Structural Attributes
  - Material Attributes
  - Object Identification
- Dual-View Protocol: Each sample includes both the full image and the cropped region to quantify the "zooming gap"
- Attention Map Analysis: Evaluates whether the model grounds its predictions on task-relevant image regions, from an interpretability perspective
- Hybrid Construction: Generated by Gemini-2.5-Pro and human-verified, for both quality and scalability
- High Difficulty: The average accuracy of Qwen2.5-VL-7B is only 42.5%
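The dual-view protocol's "zooming gap" can be understood as per-sample accuracy on the cropped-region view minus accuracy on the full-image view. The sketch below is illustrative only; the record field names (`pred_full`, `pred_crop`) are our assumptions, not ZoomBench's actual schema.

```python
# Illustrative "zooming gap" computation: crop-view accuracy minus
# full-view accuracy. Field names are assumptions, not ZoomBench's schema.

def accuracy(records, view):
    correct = sum(r[view] == r["answer"] for r in records)
    return correct / len(records)

def zooming_gap(records):
    # A large positive gap means the model answers correctly when handed
    # the zoomed crop but fails on the same question over the full image.
    return accuracy(records, "pred_crop") - accuracy(records, "pred_full")

records = [
    {"answer": "B", "pred_full": "A", "pred_crop": "B"},
    {"answer": "C", "pred_full": "C", "pred_crop": "C"},
    {"answer": "D", "pred_full": "B", "pred_crop": "D"},
    {"answer": "A", "pred_full": "A", "pred_crop": "C"},
]
print(zooming_gap(records))  # 0.75 - 0.50 = 0.25
```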
git clone https://github.com/inclusionAI/Zooming-without-Zooming.git
cd Zooming-without-Zooming
pip install -r requirements.txt
git clone https://github.com/facebookresearch/sam3.git
cd sam3
pip install -e . # please refer to the official repo of SAM3 for detailed installation
cd ../EasyR1
pip install -e . # please refer to the official repo of EasyR1 for detailed installation

The pipeline supports checkpointing: each step can be executed independently and resumed from any stage. Note that we use Qwen3-VL-235B and SAM3 to obtain meaningful cropped images, and Kimi-K2 to extract the majority answer.
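The majority-answer step works over candidate answers from multiple generators. In the pipeline this extraction is delegated to Kimi-K2; the plain-Python sketch below only illustrates the underlying idea, with a tie treated as "no consensus".

```python
# Illustrative majority vote over answers from multiple generators.
# The actual pipeline delegates this extraction to Kimi-K2.
from collections import Counter

def majority_answer(candidates):
    """Return the most common answer, or None when the top two are tied."""
    counts = Counter(candidates).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # no consensus -> discard the sample
    return counts[0][0]

print(majority_answer(["B", "B", "A"]))  # B
print(majority_answer(["A", "B"]))       # None
```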
cd Zooming-without-Zooming/data_synthesis
export MLLM_KEY="your_mllm_key"
export MLLM_URL="your_mllm_url"
export KIMI_KEY="your_llm_key"
export KIMI_URL="your_llm_url"
## step 1
# --image_folders supports multiple folders; replace with your own path (a folder containing only images)
python create_crops.py \
--api_key "$MLLM_KEY" \
--api_url "$MLLM_URL" \
--image_folders "/path/images/sa1b" \
--output_jsonl "generated_bboxes_sa1b.jsonl"
## step 2
python create_questions.py \
--api_key "$MLLM_KEY" \
--api_url "$MLLM_URL" \
--input_files "generated_bboxes_sa1b.jsonl" \
--output_file "generated_questions.jsonl" \
--crop_output_dir "/path/images/crops" # Replace with your own path
## step 3
bash qwen_serve.sh
python create_answers.py \
--api_key "$MLLM_KEY" \
--api_url "$MLLM_URL" \
--kimi_api_key "$KIMI_KEY" \
--kimi_api_url "$KIMI_URL" \
--input_file "generated_questions.jsonl" \
--output_file "validated_vqa.jsonl" \
--bbox_output_dir "/path/images/bbox_images" # Replace with your own path
## step 4
python convert_jsonl2parquet.py \
--input_file "validated_vqa.jsonl" \
--output_file "validated_vqa.parquet"

We also provide an end-to-end data synthesis script.
cd Zooming-without-Zooming/data_synthesis
export MLLM_KEY="your_mllm_key"
export MLLM_URL="your_mllm_url"
export KIMI_KEY="your_llm_key"
export KIMI_URL="your_llm_url"
bash qwen_serve.sh
python create_vqa.py \
--api_key "$MLLM_KEY" \
--api_url "$MLLM_URL" \
--kimi_api_key "$KIMI_KEY" \
--kimi_api_url "$KIMI_URL" \
--image_folders "/path/images/sa1b" \
--crop_output_dir "/path/images/crops" \
--bbox_output_dir "/path/images/bbox_images" \
--output_parquet "validated_vqa.parquet" \
--output_jsonl "validated_vqa.jsonl"

You can use the generated "validated_vqa.parquet" as the training dataset, or use ours: download train.parquet and the images in inclusionAI/ZwZ-RL-VQA.
Then, you can start training!
cd Zooming-without-Zooming/EasyR1
# For a single node with 16 GPUs in total: 4 GPUs for the reward model, 12 GPUs for training
bash reward.sh
bash single_node.sh # Remember to change the training data path to your own path. You can also add your own eval dataset.
# For multi node
# First, you need to have one or more reward model service URLs. You can refer to reward.sh to deploy them yourself, then update the URLs in example/reward_function/perception_multinode.py.
bash multi_node.sh
# Merge checkpoint to Hugging Face format
python scripts/model_merger.py --local_dir ./verl_exp/qwen3_vl_8b_perception/global_step_140/actor

First, convert the benchmark data to the format below and save it as a JSON file:
{
"images": [
"/path/your_file/000002.png"
],
"query": "How many table lamps are in the image? Select from the following choices.\n(A) 0\n(B) 2\n(C) 1\n(D) 3",
"response": "C"
}
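Before running evaluation, it can help to sanity-check that every converted entry matches this schema. The validator below is a sketch we wrote for illustration (`check_entry` is not part of the repo); the keys mirror the example above.

```python
# Hedged sketch: validate that a converted benchmark entry matches the
# expected schema. check_entry is illustrative, not part of the repo.
import json

REQUIRED_KEYS = {"images", "query", "response"}

def check_entry(entry):
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(entry["images"], list):
        raise TypeError("'images' must be a list of file paths")
    return True

entry = json.loads("""{
  "images": ["/path/your_file/000002.png"],
  "query": "How many table lamps are in the image? Select from the following choices.\\n(A) 0\\n(B) 2\\n(C) 1\\n(D) 3",
  "response": "C"
}""")
print(check_entry(entry))  # True
```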
We also provide an example script for the conversion of our ZoomBench:
cd utils
python convert_benchmark.py

Then, you can evaluate the model using the following script (remember to modify the benchmark dataset path):
cd mm-eval
# Benchmark scores (including the dual-view evaluation)
bash run_baseline.sh
cd ../utils
# Attention Map Coverage
python eval_coverage.py

This project builds upon:
- Qwen2.5-VL and Qwen3-VL for base models
- EasyR1 for RL training framework
We also sincerely thank Zhiheng Wang (SJTU, Shanghai AI Lab) for his insightful and valuable suggestions.
For questions or collaborations, please contact:
- Lai Wei: waltonfuture@sjtu.edu.cn
- Liangbo He: liangbo.hlb@antgroup.com
- Jun Lan: yelan.lj@antgroup.com
- Zhuosheng Zhang: zhangzs@sjtu.edu.cn
- Weiran Huang: weiran.huang@sjtu.edu.cn
Please note that the code may contain minor bugs related to dataset paths. We appreciate any feedback or contributions. Thank you for your understanding and support!
@article{wei2026zooming,
title={Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception},
author={Wei, Lai and He, Liangbo and Lan, Jun and Dong, Lingzhong and Cai, Yutong and Li, Siyuan and Zhu, Huijia and Wang, Weiqiang and Kong, Linghe and Wang, Yue and Zhang, Zhuosheng and Huang, Weiran},
journal={arXiv preprint arXiv:2602.11858},
year={2026}
}
