Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers
Soroush Mahdi, Fardin Ayar, Ehsan Javanmardi, Manabu Tsukada, Mahdi Javanmardi
Our method, Evict3R, bounds the growing key–value (KV) cache of StreamVGGT by introducing a layer-wise token eviction framework.
- [2025/9/27] Code released.
- [2025/9/22] Paper released on arXiv.
Streaming visual transformers like StreamVGGT achieve strong 3D perception but suffer from unbounded growth of key–value (KV) memory, which limits scalability. We propose a training-free, inference-time token eviction policy that bounds memory by discarding redundant tokens while keeping the most informative ones. Our method uses significantly less memory with little to no drop in accuracy: on 7-Scenes with long sequences it reduces peak memory from 18.63 GB to 9.39 GB while accuracy and completeness drop by only 0.003. Under strict memory budgets, eviction enables denser frame sampling, which improves reconstruction accuracy compared to the baseline. Experiments across video depth estimation (Sintel, KITTI), 3D reconstruction (7-Scenes, NRGBD), and camera pose estimation (Sintel, TUM-dynamics) show that our approach closely matches StreamVGGT at a fraction of the memory and makes long-horizon streaming inference more practical.
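To make the idea concrete, here is a minimal sketch of attention-based KV-cache eviction for a single layer. This is an illustration under assumptions, not the released implementation: it assumes the eviction budget `P` is the fraction of cached tokens to keep, and that tokens are ranked by the attention mass they have accumulated.

```python
import numpy as np

def evict_tokens(keys, values, attn_scores, budget=0.5):
    """Keep the top `budget` fraction of cached tokens ranked by
    accumulated attention mass; evict the rest.

    keys, values: (num_tokens, dim) arrays (one layer's KV cache)
    attn_scores:  (num_tokens,) attention mass each token has received
    """
    num_keep = max(1, int(budget * len(attn_scores)))
    keep = np.argsort(attn_scores)[-num_keep:]  # most-attended tokens
    keep.sort()                                 # preserve temporal order
    return keys[keep], values[keep]

# Toy cache: 8 tokens, 4-dim
rng = np.random.default_rng(0)
K, V = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
scores = np.array([0.9, 0.1, 0.8, 0.05, 0.7, 0.2, 0.6, 0.3])
K2, V2 = evict_tokens(K, V, scores, budget=0.5)
print(K2.shape)  # (4, 4)
```

With `budget=0.5`, the cache (and thus peak KV memory per layer) is halved while the most-attended tokens survive.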
- Clone the repository

```bash
git clone https://github.com/soroush-mim/evict3r.git
cd evict3r
```

- Create the conda environment

```bash
conda create -n evict3r python=3.11 cmake=3.14.0
conda activate evict3r
```

- Install requirements

```bash
pip install -r requirements.txt
conda install 'llvm-openmp<16'
```

Please download the StreamVGGT checkpoint from Hugging Face or Tsinghua Cloud.
Please refer to MonST3R and Spann3R to prepare the Sintel, KITTI, 7-Scenes, and Neural-RGBD datasets.
The overall folder structure should be organized as follows:
```
evict3r
├── ckpt/
│   ├── model.pt
│   └── checkpoints.pth
├── config/
│   └── ...
├── data/
│   ├── eval/
│   │   ├── 7scenes
│   │   ├── bonn
│   │   ├── kitti
│   │   ├── neural_rgbd
│   │   ├── nyu-v2
│   │   ├── scannetv2
│   │   └── sintel
│   └── train/
│       ├── processed_arkitscenes
│       └── ...
└── src/
    └── ...
```
The evaluation code follows MonST3R, CUT3R, VGGT and StreamVGGT.
```bash
cd src
bash eval/monodepth/run.sh
```

Results will be saved in `eval_results/monodepth/${data}_${model_name}/metric.json`.

```bash
bash eval/video_depth/run.sh
```

Results will be saved in `eval_results/video_depth/${data}_${model_name}/result_scale.json`.

```bash
bash eval/mv_recon/run.sh
```

Results will be saved in `eval_results/mv_recon/${model_name}_${ckpt_name}/logs_all.txt`.
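The run scripts above write one result file per run. A small helper to collect them could look like the sketch below; it assumes each `metric.json` maps metric names to scalar values (the actual schema may differ), and the demo builds a toy directory layout mirroring `eval_results/monodepth/`.

```python
import json
import tempfile
from pathlib import Path

def summarize(results_root):
    """Collect metric.json files under a results directory and return
    one entry per run, keyed by the run's directory name."""
    rows = {}
    for path in sorted(Path(results_root).glob("*/metric.json")):
        rows[path.parent.name] = json.loads(path.read_text())
    return rows

# Demo with a toy layout and made-up numbers:
# eval_results/monodepth/${data}_${model_name}/metric.json
root = Path(tempfile.mkdtemp()) / "eval_results" / "monodepth"
run = root / "sintel_evict3r"
run.mkdir(parents=True)
(run / "metric.json").write_text(json.dumps({"abs_rel": 0.32}))
print(summarize(root))  # {'sintel_evict3r': {'abs_rel': 0.32}}
```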
The inference_video.py script allows you to run Evict3R on video files to generate 3D reconstructions and depth maps. The script automatically extracts frames from the input video, processes them through the StreamVGGT model with token eviction, and creates 3D visualizations.
```bash
# Basic inference
python inference_video.py --video path/to/your/video.mp4

# Run inference with custom frame sampling and visualization settings
python inference_video.py \
    --video path/to/your/video.mp4 \
    --out_dir custom_output_directory \
    --fps_interval 1.0 \
    --conf_thres 2.5 \
    --show_cam \
    --mask_black_bg \
    --eviction \
    --P 0.5 \
    --temp 0.5
```

To analyze the attention patterns during inference:
```bash
# Extract attention maps from specific layers (e.g., layers 0, 5, 11)
python inference_video.py \
    --video path/to/your/video.mp4 \
    --attn_layers "0,5,11" \
    --out_dir output_with_attention
```

- `--video`: Path to the input video file (required)
- `--ckpt`: Path to the StreamVGGT checkpoint (default: automatic download from Hugging Face)
- `--out_dir`: Output directory for results (default: `"output_streamvggt"`)
- `--fps_interval`: Extract one frame every N seconds (default: 2.5)
- `--conf_thres`: Confidence threshold for 3D visualization (default: 3.0)
- `--show_cam`: Show camera poses in the 3D visualization
- `--mask_black_bg`: Mask black background pixels
- `--mask_white_bg`: Mask white background pixels
- `--mask_sky`: Apply a sky segmentation mask
- `--attn_layers`: Comma-separated layer indices for attention visualization
- `--no_3d_viz`: Skip 3D GLB file generation
- `--device`: Computing device (default: `"cuda"`)
- `--eviction`: Enable token eviction
- `--P`: Eviction budget
- `--temp`: Temperature for per-layer budget allocation
Quick 3D reconstruction from video (eviction with budget 0.5 and temperature 0.5):

```bash
python inference_video.py --video demo.mp4 --fps_interval 0.5 --eviction --P 0.5 --temp 0.5
```

High-quality reconstruction with camera visualization:

```bash
python inference_video.py --video demo.mp4 --conf_thres 4.0 --show_cam --mask_black_bg
```

Research analysis with attention maps:

```bash
python inference_video.py --video demo.mp4 --attn_layers "0,3,6,9,11" --out_dir research_output
```

Our code is based on the following brilliant repositories:
DUSt3R MonST3R Spann3R CUT3R VGGT Point3R StreamVGGT
Many thanks to these authors!
If you find this project helpful, please consider citing the following paper:
@misc{mahdi2025evict3rtrainingfreetokeneviction,
title={Evict3R: Training-Free Token Eviction for Memory-Bounded Streaming Visual Geometry Transformers},
author={Soroush Mahdi and Fardin Ayar and Ehsan Javanmardi and Manabu Tsukada and Mahdi Javanmardi},
year={2025},
eprint={2509.17650},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2509.17650},
}
