Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Our method, Evict3R, bounds the growing key–value (KV) cache of StreamVGGT by introducing a layer-wise token eviction framework.
- [2025/9/27] Code released.
- [2025/9/22] Paper released on arXiv.
Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key–value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
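To make the idea concrete, here is a minimal sketch of attention-based KV-cache eviction for a single layer. This is an illustration under assumptions, not the released implementation: it assumes the eviction budget `P` is the fraction of cached tokens to keep, and that tokens are ranked by the attention mass they have accumulated.

```python
import numpy as np

def evict_tokens(keys, values, attn_scores, budget=0.5):
    """Keep the top `budget` fraction of cached tokens ranked by
    accumulated attention mass; evict the rest.

    keys, values: (num_tokens, dim) arrays (one layer's KV cache)
    attn_scores:  (num_tokens,) attention mass each token has received
    """
    num_keep = max(1, int(budget * len(attn_scores)))
    keep = np.argsort(attn_scores)[-num_keep:]  # most-attended tokens
    keep.sort()                                 # preserve temporal order
    return keys[keep], values[keep]

# Toy cache: 8 tokens, 4-dim
rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.3])
K2, V2 = evict_tokens(K, V, scores, budget=0.5)
print(K2.shape)  # (4, 4)
```

With `budget=0.5`, the cache (and thus peak KV memory per layer) is halved while the most-attended tokens survive.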
- Clone the repository

```bash
git clone https://github.com/soroush-mim/evict3r.git
cd evict3r
```

- Create the conda environment

```bash
conda create -n evict3r python=3.11 cmake=3.14.0
conda activate evict3r
```

- Install requirements

```bash
pip install -r requirements.txt
conda install 'llvm-openmp<16'
```

Please download the StreamVGGT checkpoint from Hugging Face or Tsinghua Cloud.
Please refer to MonST3R and Spann3R to prepare the Sintel, KITTI, 7-Scenes, and Neural-RGBD datasets.
The overall folder structure should be organized as follows:
```
evict3r
├── ckpt/
│   ├── model.pt
│   └── checkpoints.pth
├── config/
│   └── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   └── train/
│       ├── processed_arkitscenes
│       └── ...
└── src/
    └── ...
```
The evaluation code follows MonST3R, CUT3R, VGGT and StreamVGGT.
```bash
cd src
bash eval/monodepth/run.sh
```

Results will be saved in `eval_results/monodepth/${data}_${model_name}/metric.json`.

```bash
bash eval/video_depth/run.sh
```

Results will be saved in `eval_results/video_depth/${data}_${model_name}/result_scale.json`.

```bash
bash eval/mv_recon/run.sh
```

Results will be saved in `eval_results/mv_recon/${model_name}_${ckpt_name}/logs_all.txt`.
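The run scripts above write one result file per run. A small helper to collect them could look like the sketch below; it assumes each `metric.json` maps metric names to scalar values (the actual schema may differ), and the demo builds a toy directory layout mirroring `eval_results/monodepth/`.

```python
import json
import tempfile
from pathlib import Path

def summarize(results_root):
    """Collect metric.json files under a results directory and return
    one entry per run, keyed by the run's directory name."""
    rows = {}
    for path in sorted(Path(results_root).glob("*/metric.json")):
        rows[path.parent.name] = json.loads(path.read_text())
    return rows

# Demo with a toy layout and made-up numbers:
# eval_results/monodepth/${data}_${model_name}/metric.json
root = Path(tempfile.mkdtemp()) / "eval_results" / "monodepth"
run = root / "sintel_evict3r"
run.mkdir(parents=True)
(run / "metric.json").write_text(json.dumps({"abs_rel": 0.32}))
print(summarize(root))  # {'sintel_evict3r': {'abs_rel': 0.32}}
```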
The inference_video.py script allows you to run Evict3R on video files to generate 3D reconstructions and depth maps. The script automatically extracts frames from the input video, processes them through the StreamVGGT model with token eviction, and creates 3D visualizations.
```bash
# Basic inference
python inference_video.py --video path/to/your/video.mp4

# Run inference with custom frame sampling and visualization settings
python inference_video.py \
    --video path/to/your/video.mp4 \
    --out_dir custom_output_directory \
    --fps_interval 1.0 \
    --conf_thres 2.5 \
    --show_cam \
    --mask_black_bg \
    --eviction \
    --P 0.5 \
    --temp 0.5
```

To analyze the attention patterns during inference:
```bash
# Extract attention maps from specific layers (e.g., layers 0, 5, 11)
python inference_video.py \
    --video path/to/your/video.mp4 \
    --attn_layers "0,5,11" \
    --out_dir output_with_attention
```

- `--video`: Path to the input video file (required)
- `--ckpt`: Path to the StreamVGGT checkpoint (default: automatic download from Hugging Face)
- `--out_dir`: Output directory for results (default: `"output_streamvggt"`)
- `--fps_interval`: Extract one frame every N seconds (default: 2.5)
- `--conf_thres`: Confidence threshold for 3D visualization (default: 3.0)
- `--show_cam`: Show camera poses in the 3D visualization
- `--mask_black_bg`: Mask black background pixels
- `--mask_white_bg`: Mask white background pixels
- `--mask_sky`: Apply a sky segmentation mask
- `--attn_layers`: Comma-separated layer indices for attention visualization
- `--no_3d_viz`: Skip 3D GLB file generation
- `--device`: Computing device (default: `"cuda"`)
- `--eviction`: Enable token eviction
- `--P`: Eviction budget
- `--temp`: Temperature for per-layer budget allocation
Quick 3D reconstruction from video (eviction with budget 0.5 and temperature 0.5):

```bash
python inference_video.py --video demo.mp4 --fps_interval 0.5 --eviction --P 0.5 --temp 0.5
```

High-quality reconstruction with camera visualization:

```bash
python inference_video.py --video demo.mp4 --conf_thres 4.0 --show_cam --mask_black_bg
```

Research analysis with attention maps:

```bash
python inference_video.py --video demo.mp4 --attn_layers "0,3,6,9,11" --out_dir research_output
```

Our code is based on the following brilliant repositories:
DUSt3R MonST3R Spann3R CUT3R VGGT Point3R StreamVGGT
Many thanks to these authors!
If you find this project helpful, please consider citing the following paper:
@misc{mahdi2025evict3rtrainingfreetokeneviction,
title={Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers},
author={Soroush Mahdi and Fardin Ayar and Ehsan Javanmardi and Manabu Tsukada and Mahdi Javanmardi},
year={2025},
eprint={2509.17650},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.17650},
}
