- 2025.10.10: We release our model, dataset, benchmarks, and code.
- 2025.10.10: We release our paper.
Spatial reasoning remains a fundamental challenge for Vision-Language Models (VLMs), with current approaches struggling to achieve robust performance despite recent advances. We identify that this limitation stems from a critical gap: existing methods attempt to learn spatial reasoning directly without establishing the hierarchical foundations of perception and understanding. To address this challenge, we present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks, constructed through a standardized pipeline that ensures systematic coverage across modalities. Building on this dataset, we design a three-stage progressive training framework that (1) establishes spatial perception through object localization, (2) develops spatial understanding through multi-dimensional spatial tasks, and (3) strengthens complex reasoning via reinforcement learning with verifiable rewards. This approach yields SpatialLadder, a 3B-parameter model that achieves state-of-the-art performance on spatial reasoning benchmarks, with 23.4% average improvement over the base model, surpassing GPT-4o by 20.8% and Gemini-2.0-Flash by 10.1%. Notably, SpatialLadder maintains strong generalization with 7.2% improvement on out-of-domain benchmarks, demonstrating that progressive training from perception to reasoning is essential for robust spatial intelligence.
Effective spatial reasoning requires diverse, high-quality training data spanning from basic perception to complex reasoning. We introduce SpatialLadder-26k, comprising 26,610 samples across four complementary task categories that form a complete spatial learning curriculum.
We implement SpatialLadder-3B using Qwen2.5-VL-3B as the foundation model and evaluate it on six benchmarks across in-domain and out-of-domain settings. The results demonstrate that SpatialLadder-3B achieves state-of-the-art performance on in-domain evaluation benchmarks, while also delivering substantial improvements on out-of-domain datasets, thereby validating the effectiveness and generalizability of our training corpus.
git clone https://github.com/ZJU-REAL/SpatialLadder.git
conda create -n spatial-ladder python=3.10 -y
conda activate spatial-ladder
cd SpatialLadder
bash setup.sh
The model is trained on SpatialLadder-26k, which we constructed using a standardized annotation pipeline based on ScanNet. Please make sure to download and prepare the dataset, and place the files in the VLM-R1/data/images folder before starting training.
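As a rough illustration, the commands below sketch one way to pull the dataset from the Hugging Face Hub and place it where training expects it. The repository ID is a placeholder (use the dataset ID listed on our Hugging Face page), and the resulting layout should match the VLM-R1/data/images folder described above.
# Hypothetical download sketch -- replace <org>/<dataset-id> with the actual
# SpatialLadder-26k dataset ID from our Hugging Face page.
pip install -U "huggingface_hub[cli]"
huggingface-cli download <org>/<dataset-id> --repo-type dataset --local-dir VLM-R1/data
# Make sure the image files end up under VLM-R1/data/images before starting training.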
To train the model through all stages automatically:
cd VLM-R1/run_scripts
bash run_spld_all.sh
This will sequentially execute the Stage 1, Stage 2, and Stage 3 training processes. Each stage must complete successfully before the next one begins.
For manual control or debugging purposes, you can run each training stage individually:
cd VLM-R1/run_scripts
bash run_spld_stage1.sh # Stage 1
bash run_spld_stage1_2.sh # Stage 2
bash run_spld_stage1_2_cs.sh # Cold Start
bash run_spld_stage1_2_cs_stage3.sh # Stage 3
Note: Make sure to run the stages in the correct order, as each stage depends on the outputs of the previous stages.
The supported evaluation datasets include VSI-Bench, SPBench, CV-Bench, SPAR-Bench, and ViewSpatial-Bench. Before running evaluations, please make sure to download the required datasets and update their paths in the eval_spld/evaluator.py file.
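If the benchmarks you need are hosted on the Hugging Face Hub, a loop like the sketch below can fetch them in one pass. The repository IDs are placeholders (each benchmark has its own official Hub ID), and whatever local directory you choose is what the paths in eval_spld/evaluator.py should point to.
# Hypothetical sketch -- substitute the official Hub ID for each benchmark.
for BENCH in VSI-Bench SPBench CV-Bench SPAR-Bench ViewSpatial-Bench; do
  huggingface-cli download "<org>/${BENCH}" --repo-type dataset --local-dir "eval_data/${BENCH}"
done
# Then update the corresponding dataset paths in VLM-R1/eval_spld/evaluator.py.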
To evaluate the trained model:
cd VLM-R1/eval_spld
bash run_eval.sh
This will run the evaluation pipeline using the default configuration.
To modify evaluation settings, edit the run_eval.sh script directly:
MODEL_NAMES=("qwenvl_3b")
TASK=("VSI-Bench")
SUPPORTED_TASKS=("VSI-Bench" "SPBench-SI" "SPBench-MI" "SPAR-Bench" "ViewSpatial-Bench" "CV-Bench")
...
Note: Ensure your model checkpoint path is correct and the evaluation data is properly prepared before running the evaluation script. The SpatialLadder-3B checkpoint is available in our Hugging Face repository.
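For instance, to evaluate a trained checkpoint on a different benchmark, you might adjust the two variables as sketched below. The checkpoint path is a placeholder, and whether MODEL_NAMES expects a short model name or a full path depends on how run_eval.sh resolves models, so treat this as an illustration rather than the exact configuration.
# Hypothetical edit to run_eval.sh: point MODEL_NAMES at your checkpoint (placeholder path)
# and pick any entry from SUPPORTED_TASKS.
MODEL_NAMES=("path/to/SpatialLadder-3B-checkpoint")
TASK=("SPBench-SI")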
Our training framework is built upon TRL and VLM-R1. We sincerely thank the developers of these projects for their valuable contributions to the open-source community.
If you find SpatialLadder useful, please consider citing our work:
@misc{li2025spatialladderprogressivetrainingspatial,
title={SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models},
author={Hongxing Li and Dingming Li and Zixuan Wang and Yuchen Yan and Hang Wu and Wenqi Zhang and Yongliang Shen and Weiming Lu and Jun Xiao and Yueting Zhuang},
year={2025},
eprint={2510.08531},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.08531},
}
