
VSP-LLM (Visual Speech Processing incorporated with LLMs)

Introduction

WORK IN PROGRESS: This code is forked from https://github.com/Sally-SH/VSP-LLM. We are updating it for the specific use case of using AV-HuBERT for Visual Speech Recognition.

Model checkpoint

You can find the checkpoint of our model here. Move the checkpoint to the checkpoints directory.

Preparation

conda create -n vsr-llm python=3.9 -y
conda activate vsr-llm
git clone https://github.com/rishabhjain16/VSR-LLM.git
cd VSR-LLM
(If your pip version > 24.1, please run "pip install --upgrade pip==24.0")
pip install -r requirements.txt
cd fairseq
pip install --editable ./
pip install pip==24.0 
pip install hydra-core==1.0.7 
pip install omegaconf==2.0.4 
pip install numpy==1.23.0
pip install -U bitsandbytes
pip install protobuf==3.20
 
  • Download the AV-HuBERT pre-trained model AV-HuBERT Large (LRS3 + VoxCeleb2) from here.
  • Download your preferred LLM from Hugging Face. The code supports multiple LLMs.

Move the AV-HuBERT pre-trained model checkpoint and the LLM checkpoint to checkpoints.

Using Different LLMs

This codebase now supports using different LLMs from Hugging Face. To use a different LLM:

  1. Download your preferred LLM from Hugging Face
  2. Specify the LLM type and path in your configuration:
# For training
python train.py ... \
  --llm-ckpt-path /path/to/your/llm \
  --llm-type llama3  # or mistral, vicuna, etc.

# For decoding
python decode.py ... \
  --llm-ckpt-path /path/to/your/llm \
  --llm-type llama3  # or mistral, vicuna, etc.

The code will automatically:

  • Detect the appropriate embedding dimensions for the model
  • Configure the tokenizer correctly for the model
  • Set up the right LoRA configuration for fine-tuning

Supported LLM Types

  • llama: LLaMA, LLaMA-2, LLaMA-3 models
  • mistral: Mistral models
  • vicuna: Vicuna models
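As a rough illustration of how such type-based configuration could work (the values and function name below are hypothetical, not taken from the repo's actual code), the --llm-type flag might key into a lookup of per-model embedding dimensions and LoRA target modules:

```python
# Hypothetical sketch: map a supported --llm-type value to the settings
# the code would need (embedding width and LoRA target modules).
# All concrete values here are illustrative assumptions.
LLM_CONFIGS = {
    "llama":   {"embed_dim": 4096, "lora_targets": ["q_proj", "v_proj"]},
    "mistral": {"embed_dim": 4096, "lora_targets": ["q_proj", "v_proj"]},
    "vicuna":  {"embed_dim": 4096, "lora_targets": ["q_proj", "v_proj"]},
}

def resolve_llm_config(llm_type: str) -> dict:
    """Return settings for a supported LLM type; raise on unknown types."""
    try:
        return LLM_CONFIGS[llm_type]
    except KeyError:
        raise ValueError(
            f"Unsupported --llm-type: {llm_type!r}; "
            f"expected one of {sorted(LLM_CONFIGS)}"
        )

print(resolve_llm_config("mistral")["lora_targets"])  # -> ['q_proj', 'v_proj']
```

A lookup like this keeps the per-model details in one place, so adding a new LLM family only requires a new table entry rather than scattered if/else branches.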

Data preprocessing

Follow the Auto-AVSR preparation to preprocess the LRS3 dataset.
Then, follow the AV-HuBERT preparation from step 3 to create a manifest for the LRS3 dataset.

Generate visual speech unit and cluster counts file

Follow the steps in clustering to create:

  • {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 25Hz for AV-HuBERT features by default.
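Conceptually, deduplicating speech units means collapsing each run of identical frame-level cluster IDs into one unit plus a run length, which is the information a cluster-counts file carries. A minimal run-length sketch of that step (the function name and exact file format are illustrative assumptions, not the repo's clustering code):

```python
# Illustrative sketch: collapse consecutive duplicate cluster IDs from a
# frame-aligned pseudo-label sequence into (units, counts) pairs.
def dedup_units(frames):
    """Run-length encode a sequence of cluster IDs."""
    units, counts = [], []
    for c in frames:
        if units and units[-1] == c:
            counts[-1] += 1      # same cluster as previous frame: extend run
        else:
            units.append(c)      # new cluster: start a new run
            counts.append(1)
    return units, counts

# At 25 Hz, AV-HuBERT features often repeat a cluster over adjacent frames:
print(dedup_units([7, 7, 7, 3, 3, 9]))  # -> ([7, 3, 9], [3, 2, 1])
```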

Dataset layout

.
├── lrs3
│     ├── lrs3_video_seg24s               # Preprocessed video and audio data
│     └── lrs3_text_seg24s                # Preprocessed text data
├── muavic_dataset                        # Mix of VSR data and VST(En-X) data
│     ├── train.tsv                       # List of audio and video path for training
│     ├── train.wrd                       # List of target label for training
│     ├── train.cluster_counts            # List of clusters to deduplicate speech units in training
│     ├── test.tsv                        # List of audio and video path for testing
│     ├── test.wrd                        # List of target label for testing
│     └── test.cluster_counts             # List of clusters to deduplicate speech units in testing
└── test_data
      ├── vsr
      │    └── en
      │        ├── test.tsv 
      │        ├── test.wrd  
      │        └── test.cluster_counts           
      └── vst
           └── en
               ├── es
               :   ├── test.tsv
               :   ├── test.wrd 
               :   └── test.cluster_counts
               └── pt
                   ├── test.tsv
                   ├── test.wrd 
                   └── test.cluster_counts

Test data

The test manifest is provided in labels. Replace the LRS3 path in the manifest file with the path to your preprocessed LRS3 dataset using the following command:

cd src/dataset
python replace_path.py --lrs3 /path/to/lrs3

The modified test manifest is then saved in dataset
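The substitution this step performs amounts to rewriting the dataset-root prefix in each manifest line. A minimal sketch of that idea (the tsv layout and placeholder paths below are illustrative assumptions, and replace_path.py itself may work differently):

```python
# Illustrative sketch: swap the dataset root prefix in one tab-separated
# manifest line. Paths and column layout are assumptions for illustration.
def rewrite_manifest_line(line: str, old_root: str, new_root: str) -> str:
    """Replace every occurrence of old_root with new_root in a manifest line."""
    return line.replace(old_root, new_root)

line = "id1\t/placeholder/lrs3/seg1.mp4\t/placeholder/lrs3/seg1.wav"
print(rewrite_manifest_line(line, "/placeholder/lrs3", "/data/lrs3"))
# -> id1	/data/lrs3/seg1.mp4	/data/lrs3/seg1.wav
```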

Training

Open the training script (scripts/train.sh) and replace these variables:

# path to train dataset dir
DATA_PATH=???

# path where output trained models will be located
OUT_PATH=???
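For example, filled in with local paths (these values are illustrative; use your own):

```shell
# path to train dataset dir
DATA_PATH=/data/vsr-llm/dataset

# path where output trained models will be located
OUT_PATH=/data/vsr-llm/checkpoints/run1
```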

Run the training script:

$ bash scripts/train.sh

Decoding

Open the decoding script (scripts/decode.sh) and replace these variables:

# language direction (e.g. 'en' for the VSR task / 'en-es' for the En-to-Es VST task)
LANG=???

# path to the trained model
MODEL_PATH=???

# path where decoding results and scores will be located
OUT_PATH=???
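For example, to decode the En-to-Es VST test set (values are illustrative; use your own paths):

```shell
# language direction (e.g. 'en' for VSR / 'en-es' for En-to-Es VST)
LANG=en-es

# path to the trained model
MODEL_PATH=/data/vsr-llm/checkpoints/run1/checkpoint_best.pt

# path where decoding results and scores will be located
OUT_PATH=/data/vsr-llm/decode/run1
```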

Run the decoding script:

$ bash scripts/decode.sh
