Xiaohan Zhang*
Tavis Shore*
Chen Chen
Oscar Mendez
Simon Hadfield
Safwan Wshah
Vermont Artificial Intelligence Laboratory (VaiL)
Centre for Vision, Speech, and Signal Processing (CVSSP)
University of Central Florida
Locus Robotics
| Backbone | Params (M) | FLOPs (G) | Dims | R@1 | R@5 | R@10 |
|---|---|---|---|---|---|---|
| ConvNeXt-T | 28 | 4.5 | 768 | 1.36 | 4.34 | 7.95 |
| ConvNeXt-B | 89 | 15.4 | 1024 | 3.14 | 8.14 | 13.22 |
| ViT-B | 86 | 17.6 | 768 | 3.30 | 8.92 | 13.96 |
| ViT-L | 307 | 60.6 | 1024 | 9.62 | 23.42 | 32.73 |
| DINOv2-B | 86 | 152 | 768 | 17.37 | 36.14 | 46.96 |
| DINOv2-L | 304 | 507 | 1024 | 27.49 | 51.96 | 63.13 |
| VLM | R@1 | R@5 | R@10 |
|---|---|---|---|
| Without Re-ranking | 27.49 | 51.96 | 63.13 |
| Gemini 2.5 Flash Lite | 23.54 | 48.39 | 63.13 |
| Gemini 2.5 Flash | 30.21 | 53.04 | 63.13 |
| | R@1 | R@5 | R@10 |
|---|---|---|---|
| 0 | 24.47 | 48.16 | 60.99 |
| 0.1 | 26.98 | 51.34 | 61.92 |
| 0.3 | 27.49 | 51.96 | 63.13 |
| 0.5 | 24.89 | 52.03 | 62.66 |
| Model | R@1 | R@5 | R@10 |
|---|---|---|---|
| U1652 [zheng2020university] | 1.20 | - | - |
| LPN w/o drone [wang2021each] | 0.74 | - | - |
| LPN w/ drone [wang2021each] | 0.81 | - | - |
| DINOv2-L | 24.66 | 48.00 | 59.02 |
| + Drone Data | 27.49 | 51.96 | 63.13 |
| + VLM Re-rank (Ours) | 30.21 | 53.04 | 63.13 |
Create and activate the conda environment:

```shell
conda env create -n ENV -f requirements.yaml && conda activate ENV
```

Before running Stage 1, configure your dataset paths:
- Navigate to the `/config/` directory.
- Open the `default.yaml` file (or copy it to a new file).
- Replace the placeholder values (e.g., `DATA_ROOT`) with the actual paths to your dataset and related files.
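As a rough sketch, the edited config might look like the following. Only `DATA_ROOT` is mentioned above; the other keys and paths here are illustrative assumptions, so check `default.yaml` for the actual schema:

```yaml
# Illustrative sketch only -- keys other than DATA_ROOT are assumptions,
# and the paths are placeholders for your own setup.
DATA_ROOT: /data/geolocalization/benchmark   # root folder of the dataset
OUTPUT_DIR: /data/geolocalization/outputs    # where checkpoints/answers are written
```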
Once your configuration file is ready, you can train Stage 1 using:
```shell
python stage_1.py --config YOUR_CONFIG_FILE_NAME
```

You can also download our pre-trained weights here.
To run Stage 2, you need to:
- Open the `stage_2.py` file.
- Replace the relevant placeholders (e.g., the path to the answer file from Stage 1 and your Gemini API key).
- Ensure any other required directories or options are correctly set.
Then, simply run:
```shell
python stage_2.py
```

This performs re-ranking with a Vision-Language Model (VLM) on top of the initial retrieval results. It writes an `LLM_re_ranked_answer.txt` file to the answer directory, along with a `reasons.json` containing the reasoning behind each re-ranking decision.
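The two-stage pipeline above, embedding retrieval followed by VLM re-ranking, can be viewed as a score-fusion step over the top-K candidates. The sketch below is a self-contained illustration under assumed inputs, not the actual `stage_2.py` logic: the candidate ids, the VLM ordering, and the fusion weight are all hypothetical.

```python
# Hypothetical sketch of VLM re-ranking as score fusion.
# Not the repo's actual implementation -- inputs and weight are illustrative.

def rerank(retrieval_scores, vlm_order, weight=0.3):
    """Fuse retrieval similarities with a VLM's preference ordering.

    retrieval_scores: {candidate_id: similarity}, higher is better.
    vlm_order: candidate ids as ranked by the VLM, best first.
    weight: how strongly the VLM ordering influences the final ranking.
    """
    k = len(vlm_order)
    # Convert the VLM's ordering into a score in (0, 1], best candidate = 1.
    vlm_scores = {cid: (k - rank) / k for rank, cid in enumerate(vlm_order)}
    fused = {
        cid: (1 - weight) * score + weight * vlm_scores.get(cid, 0.0)
        for cid, score in retrieval_scores.items()
    }
    return sorted(fused, key=fused.get, reverse=True)

# Example: the VLM prefers candidate "b" even though retrieval ranked "a" first.
scores = {"a": 0.92, "b": 0.90, "c": 0.55}
print(rerank(scores, vlm_order=["b", "a", "c"], weight=0.3))  # prints ['b', 'a', 'c']
```

With `weight=0` the fused ranking reduces to pure retrieval order, which mirrors the "Without Re-ranking" row in the results table above.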
