Caption Generation for Enhancing Road Scene Reasoning in VLMs

⚠️ Project Note: Work in Progress (WiP). This repository contains ongoing research code and is under active development. Contact: for questions, collaborations, or discussions about this project, reach out at khursdmy@fel.cvut.cz

Generate question–answer pairs (QAs) for driving scenes, with caption generation planned for the future, to fine-tune Vision–Language Models (VLMs) so that they better describe and reason about automotive scenarios.

🚀 Setup & Installation

Create and activate a Python virtual environment:

python3.11 -m venv roadcap-gen
source roadcap-gen/bin/activate

The project uses Git submodules to integrate external tools such as DriveLM, so the repository must be cloned recursively:

# Option A: Cloning for the first time
git clone --recurse-submodules https://github.com/dmitrykhursen/RoadCap-Gen.git
cd RoadCap-Gen

# Option B: If you already cloned normally
git submodule update --init --recursive

Install project dependencies:

```bash
pip install --upgrade pip
pip install -r requirements.txt
# pip install -e .
```

▶️ Run the Project

Run the training script with a Hydra config (example for LoRA training):

python scripts/02_finetuning/train.py model=llava dataset=qa_dataset training=lora experiment_name=qa_train_debug
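
The arguments after the script path are Hydra-style `key=value` overrides. As a rough illustration of how the command line breaks down, the sketch below splits them into a dictionary using plain Python. It does not use Hydra, and `parse_overrides` is a hypothetical helper that is not part of Hydra or of this repository. Whether `model`, `dataset`, and `training` are config groups under `configs/` is an assumption based on the command above.

```python
def parse_overrides(args):
    """Turn ['model=llava', ...] into {'model': 'llava', ...}.

    Hypothetical helper for illustration only; Hydra's own override
    grammar is richer than plain key=value pairs.
    """
    overrides = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {arg!r}")
        overrides[key] = value
    return overrides


# The overrides from the example command above.
args = [
    "model=llava",
    "dataset=qa_dataset",
    "training=lora",
    "experiment_name=qa_train_debug",
]
config_choices = parse_overrides(args)
print(config_choices)
```

Changing a single pair, for example `experiment_name`, is then enough to launch a differently named run without editing any config file.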

Pipeline Overview (follows the pipeline visualization; out of date with the code)

- QAs_Generation: generate pseudo ground-truth question–answer pairs from driving scenes.
- VLM_Finetuning: fine-tune VLMs in two modes:
  - Simple mode – standard visual-text supervision
  - Extended mode – with an auxiliary latent-space loss to capture geometric information
- VLM_Eval: evaluate VLMs via:
  - DriveLM Benchmark – tested on their HF server (GT answers not accessible)
  - NLP Evaluation – split data into train/validation sets and compute metrics such as BLEU, CIDEr, ROUGE, LLaMA/ChatGPT scores
- VLM_Inference (optional): run inference on new driving scenes.
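
To make the NLP evaluation step more concrete, here is a minimal sketch of a train/validation split followed by one simple text-overlap score. It is an illustration under stated assumptions, not the evaluation code of this repository: the sample data, `split_train_val`, and `unigram_precision` are all hypothetical, and the score is only the clipped 1-gram precision that BLEU builds on (full BLEU also combines higher-order n-grams and a brevity penalty).

```python
import random
from collections import Counter


def split_train_val(samples, val_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and return (train, val)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def unigram_precision(prediction, reference):
    """Clipped unigram precision: the 1-gram component of BLEU."""
    pred_tokens = prediction.lower().split()
    if not pred_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(pred_tokens)
    # Each predicted token is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[token]) for token, count in pred_counts.items())
    return clipped / len(pred_tokens)


# Hypothetical QA pairs, made up for this example.
qa_pairs = [
    {"question": "What is the car ahead doing?", "answer": "the car turns left"},
    {"question": "Is the traffic light red?", "answer": "yes"},
    {"question": "Are there pedestrians?", "answer": "two pedestrians are crossing"},
    {"question": "What is the weather?", "answer": "it is raining"},
    {"question": "Is the lane free?", "answer": "no"},
]

train, val = split_train_val(qa_pairs, val_fraction=0.2, seed=0)
score = unigram_precision("the car is turning left", "the car turns left")
print(len(train), len(val), score)
```

In practice an established metric implementation would replace `unigram_precision`; the sketch only shows where the split and the scoring step sit relative to each other.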
