We are currently preparing the code for public release. Stay tuned! 🚀
The code and pre-trained models will be released soon, including:
- Training and inference scripts
- Pre-trained weights for all datasets
- Dataset preparation guidelines
- Evaluation benchmarks
⭐ Star this repository to get notified when the code is released!
UniSurgSAM is a universal promptable video object segmentation (PVOS) framework designed for reliable surgical video segmentation. It supports visual, textual, and audio prompts within a unified architecture, enabling flexible human-AI interaction for computer-assisted surgery.
- 🎯 Multi-Modal Prompts: Visual, textual, and audio prompts within a unified architecture (see the usage sketch after this list)
- ⚡ Real-Time Performance: 55 FPS with linguistic prompts / 68 FPS with visual prompts
- 🏥 Clinical Reliability: Presence-aware decoding to suppress hallucinations
- 🔍 Multi-Granular: Whole-object, part-level, and subpart segmentation
- 🔄 Closed-Loop Design: Automatic failure recovery via adaptive state transition
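Since the code is not yet public, the snippet below is only a hypothetical sketch of what multi-modal prompting might look like. The package name `unisurgsam`, the `UniSurgSAM` class, `from_pretrained`, `segment_video`, the file paths, and the prompt schema are all illustrative assumptions, not the released interface.

```python
# Hypothetical sketch -- none of these names come from the released code.
from unisurgsam import UniSurgSAM  # assumed package/class name

model = UniSurgSAM.from_pretrained("checkpoints/unisurgsam.pt")  # assumed checkpoint path

# Any one of the three prompt modalities can initialize segmentation:
prompt = {"type": "text", "value": "grasper, jaw only"}        # textual, part-level
# prompt = {"type": "points", "value": [(412, 288)]}           # visual (click on target)
# prompt = {"type": "audio", "value": "prompts/command.wav"}   # audio (spoken command)

masks = model.segment_video("videos/surgery_clip.mp4", prompt=prompt)
```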
UniSurgSAM employs a two-stage paradigm with decoupled decoders (a control-flow sketch follows the list):
- Stage I: Multi-modal promptable initialization with RPAD (Reliable Presence-Aware Decoding)
- Stage II: Boundary-aware long-term tracking (BLT) with diversity-driven memory
- AST: Adaptive state transition for closed-loop failure recovery
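To make the control flow concrete, here is a minimal, self-contained sketch of how the two stages and AST could interact. It assumes a scalar tracking confidence and injected placeholder callables (`init_with_prompt`, `track`, `recovery_threshold` are illustrative, not the actual implementation):

```python
from typing import Any, Callable, Iterable, Optional, Tuple

def run_two_stage(
    frames: Iterable[Any],
    prompt: Any,
    init_with_prompt: Callable[[Any, Any], Any],      # Stage I: promptable init (RPAD)
    track: Callable[[Any, Any], Tuple[Any, float]],   # Stage II: BLT -> (mask, confidence)
    recovery_threshold: float = 0.5,                  # assumed scalar AST trigger
) -> list:
    """Control-flow sketch of the two-stage paradigm; all callables are placeholders."""
    masks = []
    state: Optional[Any] = None
    for frame in frames:
        if state is None:
            # Stage I: initialize from the (visual/textual/audio) prompt.
            # RPAD is presence-aware, so it can report that the target is
            # absent instead of hallucinating a mask.
            state = init_with_prompt(frame, prompt)
        # Stage II: boundary-aware long-term tracking against a
        # diversity-driven memory of past target appearances.
        mask, confidence = track(state, frame)
        if confidence < recovery_threshold:
            # AST: adaptive state transition -- on suspected failure, drop
            # the tracking state and re-initialize from the original prompt.
            state = None
        masks.append(mask)
    return masks
```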
For more details, please refer to our paper. Check out our project page for video demonstrations and detailed results.
If you find our work useful, please consider citing:
@article{liu2025unisurgsam,
  title={UniSurgSAM: A Universal Promptable Model for Reliable Surgical Video Segmentation},
  author={Liu, Haofeng and Wang, Ziyue and Kong, Alex Y. W. and Qin, Guanyi and Gao, Mingqi and Low, Chang Han and Chan, Lap Yan Lennon and Jin, Yueming},
  journal={arXiv preprint},
  year={2026}
}

For questions or collaborations, please contact:
- Yueming Jin: ymjin@nus.edu.sg
- Haofeng Liu: haofeng.liu@u.nus.edu
This project is licensed under the MIT License - see the LICENSE file for details.