Generate QAs (and captions in the future) for driving scenes to fine-tune Vision–Language Models (VLMs), helping them better describe and reason about automotive scenarios.
Create and activate a Python virtual environment:

```bash
python3.11 -m venv roadcap-gen
source roadcap-gen/bin/activate
```

The project uses Git submodules to integrate external tools such as DriveLM, so you must clone recursively:
```bash
# Option A: cloning for the first time
git clone --recurse-submodules https://github.com/dmitrykhursen/RoadCap-Gen.git
cd RoadCap-Gen

# Option B: if you already cloned normally
git submodule update --init --recursive
```
Install project dependencies:
```bash
pip install --upgrade pip
pip install -r requirements.txt
# pip install -e .
```

Run the training script through its Hydra config (example for LoRA training):

```bash
python scripts/02_finetuning/train.py model=llava dataset=qa_dataset training=lora experiment_name=qa_train_debug
```
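The `key=value` tokens on the command line are Hydra config-group overrides: `training=lora` selects a `lora.yaml` file from the `training` config group. As a rough sketch, such a file could look like the following (the path and all field names here are assumptions for illustration, not the repository's actual config):

```yaml
# conf/training/lora.yaml  (hypothetical path and fields)
lr: 2e-4
epochs: 3
lora:
  r: 16
  alpha: 32
  dropout: 0.05
  target_modules: [q_proj, v_proj]
```

Individual fields can then also be overridden from the command line, e.g. `training.lora.r=8`.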
The pipeline consists of four modules:

- QAs_Generation – generate pseudo ground-truth question–answer pairs from driving scenes
- VLM_Finetuning – fine-tune VLMs in two modes:
  - Simple mode – standard visual–text supervision
  - Extended mode – adds an auxiliary latent-space loss to capture geometric information
- VLM_Eval – evaluate VLMs via:
  - DriveLM Benchmark – tested on their HF server (GT answers are not accessible)
  - NLP Evaluation – split the data into train/validation sets and compute metrics such as BLEU, CIDEr, ROUGE, and LLaMA/ChatGPT scores
- VLM_Inference (optional) – run inference on new driving scenes
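To make the NLP evaluation step concrete, here is a minimal sketch of a BLEU-1-style score (clipped unigram precision with a brevity penalty). It is not the project's evaluation code; real runs should use an established library such as `sacrebleu` or `pycocoevalcap`:

```python
import math
from collections import Counter

def bleu1(candidate: str, reference: str) -> float:
    """Clipped unigram precision with a brevity penalty (BLEU-1 sketch)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    # A candidate word is credited at most as many times as it
    # appears in the reference (clipping).
    overlap = sum((Counter(cand) & Counter(ref)).values())
    precision = overlap / len(cand)
    # The brevity penalty discourages overly short answers.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * precision

print(round(bleu1("the car turns left", "the car is turning left"), 3))  # → 0.584
```

CIDEr and ROUGE follow the same pattern of comparing n-gram overlap between a generated answer and the pseudo ground truth, while LLaMA/ChatGPT scores instead ask an LLM to judge answer quality.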
