WORK IN PROGRESS: This code is forked from https://github.com/Sally-SH/VSP-LLM. We are working to update it for the specific use case of using AV-HuBERT for Visual Speech Recognition.
You can find the checkpoint of our model here.
Move the checkpoint to the checkpoints directory.
conda create -n vsr-llm python=3.9 -y
conda activate vsr-llm
git clone https://github.com/rishabhjain16/VSR-LLM.git
cd VSR-LLM
(If your pip version > 24.1, please run "pip install --upgrade pip==24.0")
pip install -r requirements.txt
cd fairseq
pip install --editable ./
pip install pip==24.0
pip install hydra-core==1.0.7
pip install omegaconf==2.0.4
pip install numpy==1.23.0
pip install -U bitsandbytes
pip install protobuf==3.20
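After installation, a quick way to confirm the environment is consistent is to import the core dependencies. This one-liner is just a sanity check, assuming PyTorch was pulled in via requirements.txt:

# verify that the pinned dependencies import cleanly
python -c "import torch, fairseq, hydra, omegaconf, bitsandbytes; print('environment OK')"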
- Download the AV-HuBERT pre-trained model, AV-HuBERT Large (LRS3 + VoxCeleb2), from here.
- Download your preferred LLM from Hugging Face. The code supports multiple LLMs:
  - LLaMA models: LLaMA-2-7b, LLaMA-3
  - Mistral models: Mistral-7B
  - And other compatible LLMs from Hugging Face

Move the AV-HuBERT pre-trained model checkpoint and the LLM checkpoint to the checkpoints directory.
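As an illustration, one possible way to fetch an LLM and lay out the checkpoints directory (the Hugging Face repo ID, checkpoint filename, and folder names below are examples, not requirements of this codebase):

# download an LLM with the Hugging Face CLI (gated models such as LLaMA require `huggingface-cli login` first)
huggingface-cli download meta-llama/Meta-Llama-3-8B --local-dir checkpoints/Meta-Llama-3-8B
# place the AV-HuBERT checkpoint alongside it, e.g.
mv large_vox_iter5.pt checkpoints/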
This codebase now supports using different LLMs from Hugging Face. To use a different LLM:
- Download your preferred LLM from Hugging Face
- Specify the LLM type and path in your configuration:
# For training
python train.py ... \
--llm-ckpt-path /path/to/your/llm \
--llm-type llama3 # or mistral, vicuna, etc.
# For decoding
python decode.py ... \
--llm-ckpt-path /path/to/your/llm \
--llm-type llama3 # or mistral, vicuna, etc.
The code will automatically:
- Detect the appropriate embedding dimensions for the model
- Configure the tokenizer correctly for the model
- Set up the right LoRA configuration for fine-tuning
Supported --llm-type values:
- llama: LLaMA, LLaMA-2, LLaMA-3 models
- mistral: Mistral models
- vicuna: Vicuna models
Follow the Auto-AVSR preparation to preprocess the LRS3 dataset.
Then, follow the AV-HuBERT preparation from step 3 to create the manifest of the LRS3 dataset.
Follow the steps in clustering to create {train,valid}.km frame-aligned pseudo label files. The label_rate is the same as the feature frame rate used for clustering, which is 25 Hz for AV-HuBERT features by default.
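Before training, it can be worth checking that each utterance in the manifest has a matching pseudo-label sequence. This snippet assumes the AV-HuBERT-style manifest format, where the first line of the .tsv is the data root (hence the -1):

for split in train valid; do
  tsv_lines=$(($(wc -l < ${split}.tsv) - 1))   # utterances in the manifest (first line is the data root)
  km_lines=$(wc -l < ${split}.km)              # pseudo-label sequences produced by clustering
  echo "${split}: ${tsv_lines} utterances, ${km_lines} label sequences"
done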
.
├── lrs3
│   ├── lrs3_video_seg24s          # Preprocessed video and audio data
│   └── lrs3_text_seg24s           # Preprocessed text data
├── muavic_dataset                 # Mix of VSR data and VST(En-X) data
│   ├── train.tsv                  # List of audio and video path for training
│   ├── train.wrd                  # List of target label for training
│   ├── train.cluster_counts       # List of clusters to deduplicate speech units in training
│   ├── test.tsv                   # List of audio and video path for testing
│   ├── test.wrd                   # List of target label for testing
│   └── test.cluster_counts        # List of clusters to deduplicate speech units in testing
└── test_data
    ├── vsr
    │   └── en
    │       ├── test.tsv
    │       ├── test.wrd
    │       └── test.cluster_counts
    └── vst
        └── en
            ├── es
            │   ├── test.tsv
            │   ├── test.wrd
            │   └── test.cluster_counts
            └── pt
                ├── test.tsv
                ├── test.wrd
                └── test.cluster_counts
The test manifest is provided in the labels directory. You need to replace the LRS3 path in the manifest file with your preprocessed LRS3 dataset path using the following command:
cd src/dataset
python replace_path.py --lrs3 /path/to/lrs3
The modified test manifest is then saved in the dataset directory.
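To confirm the replacement worked, you can inspect the rewritten manifest in the dataset directory. This check assumes the AV-HuBERT-style format where the first line of the .tsv is the data root; the path below is the placeholder you passed to replace_path.py:

head -n 2 test.tsv                 # the first line should now point at your LRS3 directory
grep -c "/path/to/lrs3" test.tsv   # counts lines that reference your LRS3 path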
Open the training script (scripts/train.sh) and replace these variables:
# path to train dataset dir
DATA_PATH=???
# path where output trained models will be located
OUT_PATH=???
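Filled in, these variables might look like the following (both paths are placeholders for your own setup):

# path to train dataset dir
DATA_PATH=/data/vsr/dataset
# path where output trained models will be located
OUT_PATH=/data/vsr/exp/vsr-llm-run1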
Run the training script:
$ bash scripts/train.sh
Open the decoding script (scripts/decode.sh) and replace these variables:
# language direction (e.g. 'en' for the VSR task / 'en-es' for the En-to-Es VST task)
LANG=???
# path to the trained model
MODEL_PATH=???
# path where decoding results and scores will be located
OUT_PATH=???
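For example, a VSR run on English might use values like these (all paths are placeholders; the checkpoint filename depends on how your training run saved it):

LANG=en
MODEL_PATH=/data/vsr/exp/vsr-llm-run1/checkpoints/checkpoint_best.pt
OUT_PATH=/data/vsr/exp/vsr-llm-run1/decode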
Run the decoding script:
$ bash scripts/decode.sh