Python scripts for a speech processing pipeline with Voice Activity Detection (VAD), Spoken Language Identification (SLI), and Automatic Speech Recognition (ASR). Our use case involves using VAD to detect time regions in a language documentation recording where someone is speaking, then using SLI to classify each region as either English (eng) or Muruwari (zmu), and then using an English ASR model to transcribe regions detected as English. The pipeline outputs an ELAN .eaf file with the following tier structure: `_vad`, `_sli`, and `_asr`.
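Once the pipeline has run, the resulting tiers can be inspected programmatically. Here is a minimal sketch, assuming the `pympi-ling` package is installed (the pipeline's own scripts may use different tooling):

```python
# Inspect the _vad/_sli/_asr tiers of an .eaf file produced by the pipeline
from pympi.Elan import Eaf

eaf = Eaf("data/toy-example/raw/hello-goodbye.eaf")

for tier in ["_vad", "_sli", "_asr"]:
    if tier in eaf.get_tier_names():
        # Each annotation is a (start_ms, end_ms, value) tuple
        for start, end, value in eaf.get_annotation_data_for_tier(tier):
            print(f"{tier}: {start}-{end} ms -> {value!r}")
```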
- Install Docker, if necessary

- Clone this repository and change into the directory:

  ```bash
  git clone https://github.com/CoEDL/vad-sli-asr.git
  cd vad-sli-asr
  ```
- Launch the Docker image:

  a. CPU (use this if you are not sure about the GPU option):

  ```bash
  docker-compose run --rm cpu
  ```

  b. GPU pass-through (tested with CUDA 11.3, cuDNN 8200, Docker 20.10.7, and the NVIDIA container toolkit installed on the host machine):

  ```bash
  docker-compose run --rm gpu
  ```
See the commands and commentary in the `Dockerfile` for the full set of dependencies.
We do not have permission to release the Muruwari audio and transcriptions used in the paper. For illustrative purposes, we have included a toy example in the `data` folder of me (Nay San) saying some words, alternating between French and English:
```
├── data
│   ├── toy-example
│   │   ├── raw/                 <- Deployment data, to be passed through the vad-sli-asr pipeline
│   │   │   ├── hello-goodbye.wav
│   │   ├── clips/               <- Training data (for SLI; one folder per language)
│   │   │   ├── eng/             <- .wav files (English utterances)
│   │   │   │   ├── eng-01.wav   <- hello
│   │   │   │   ├── eng-02.wav   <- goodbye
│   │   │   ├── fra/             <- .wav files (French utterances)
│   │   │   │   ├── fra-01.wav   <- bonjour
│   │   │   │   ├── fra-02.wav   <- au revoir
│   │   ├── eng-sentences.tsv    <- transcriptions of English clips for ASR training
│   │   ├── eng-texts.txt        <- text file of (unrelated) English sentences for language model training (optional)
```
Detect speech regions in `data/toy-example/raw/hello-goodbye.wav` and (by default) write the detected regions as annotations on a `_vad` tier in a side-car ELAN file, `data/toy-example/raw/hello-goodbye.eaf`:

```bash
python scripts/run_vad-by-silero.py data/toy-example/raw/hello-goodbye.wav
```
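Under the hood, the script uses the Silero VAD model. A minimal sketch of the core detection step (assuming 16 kHz mono audio; the exact thresholds and options used by `run_vad-by-silero.py` may differ):

```python
# Detect speech regions with Silero VAD via torch.hub
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("data/toy-example/raw/hello-goodbye.wav", sampling_rate=16000)

# Each region is a dict with 'start' and 'end' sample offsets
for region in get_speech_timestamps(wav, model, sampling_rate=16000):
    print(f"speech: {region['start'] / 16000:.2f}-{region['end'] / 16000:.2f} s")
```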
Use the utterances supplied in `data/toy-example/clips` to train a logistic regression classifier, with SpeechBrain embeddings as features and folder names (e.g. `eng`, `fra`) as training labels, and save the trained classifier to `data/toy-example/eng-fra_classifier.pkl`:

```bash
python scripts/train_sli-by-sblr.py data/toy-example/clips data/toy-example/eng-fra_classifier.pkl
```
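Conceptually, the training step looks like the following sketch: embed each clip with a pre-trained SpeechBrain encoder, use the folder name as the label, and fit a scikit-learn logistic regression. The specific encoder named here (the VoxLingua107 language-ID model) is an assumption for illustration; check `train_sli-by-sblr.py` for the encoder the script actually uses.

```python
# Train a logistic regression language classifier on SpeechBrain embeddings
import pickle
from pathlib import Path

from sklearn.linear_model import LogisticRegression
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa"  # assumed encoder
)

feats, labels = [], []
for wav_path in Path("data/toy-example/clips").glob("*/*.wav"):
    signal = encoder.load_audio(str(wav_path))
    # encode_batch returns (batch, 1, dim); flatten to a 1-D feature vector
    emb = encoder.encode_batch(signal.unsqueeze(0)).squeeze().detach().numpy()
    feats.append(emb)
    labels.append(wav_path.parent.name)  # folder name, e.g. 'eng' or 'fra'

clf = LogisticRegression(max_iter=1000).fit(feats, labels)

with open("data/toy-example/eng-fra_classifier.pkl", "wb") as f:
    pickle.dump(clf, f)
```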
Use the trained classifier at `data/toy-example/eng-fra_classifier.pkl` to classify the speech regions associated with `data/toy-example/raw/hello-goodbye.wav` (i.e. those on the `_vad` tier of `data/toy-example/raw/hello-goodbye.eaf`) and write the classified regions onto the `_sli` tier:

```bash
python scripts/run_sli-by-sblr.py data/toy-example/eng-fra_classifier.pkl data/toy-example/raw/hello-goodbye.wav
```
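The classification step pairs the `_vad` annotations with the audio. A minimal sketch of the idea (same assumed encoder as in the training sketch; assumes mono 16 kHz audio):

```python
# Classify each _vad region and print its predicted language
import pickle

import torchaudio
from pympi.Elan import Eaf
from speechbrain.pretrained import EncoderClassifier

encoder = EncoderClassifier.from_hparams(
    source="speechbrain/lang-id-voxlingua107-ecapa"  # assumed encoder
)
with open("data/toy-example/eng-fra_classifier.pkl", "rb") as f:
    clf = pickle.load(f)

wav, sr = torchaudio.load("data/toy-example/raw/hello-goodbye.wav")
eaf = Eaf("data/toy-example/raw/hello-goodbye.eaf")

for start_ms, end_ms, _ in eaf.get_annotation_data_for_tier("_vad"):
    clip = wav[:, int(start_ms / 1000 * sr):int(end_ms / 1000 * sr)]
    emb = encoder.encode_batch(clip).squeeze().detach().numpy()
    print(f"{start_ms}-{end_ms} ms: {clf.predict([emb])[0]}")
```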
Use the pre-trained `facebook/wav2vec2-large-robust-ft-swbd-300h` model from the HuggingFace Hub to transcribe regions of interest in the audio file `data/toy-example/raw/hello-goodbye.wav`. By default, regions of interest (ROIs) are those on the `_sli` tier matching the regular expression `eng`; note the optional `--roi_tier` and `--roi_filter` arguments:

```bash
python scripts/run_asr-by-w2v2.py \
    facebook/wav2vec2-large-robust-ft-swbd-300h \
    data/toy-example/raw/hello-goodbye.wav \
    --roi_tier _sli \
    --roi_filter eng
```
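For each region of interest, the script performs standard wav2vec 2.0 CTC transcription. A minimal sketch of that core step with the `transformers` library (greedy decoding over the whole file, assuming 16 kHz audio; the script itself works region by region):

```python
# Transcribe audio with a wav2vec 2.0 CTC model (greedy decoding, no LM)
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_name = "facebook/wav2vec2-large-robust-ft-swbd-300h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

wav, sr = torchaudio.load("data/toy-example/raw/hello-goodbye.wav")
if sr != 16000:  # wav2vec 2.0 expects 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, 16000)

inputs = processor(wav.squeeze(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```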
We have uploaded various models developed for the Muruwari project to Zenodo.

```bash
# Download the train-100 model from Zenodo (model trained on 100% of the data)
wget -O tmp/train-100.zip "https://zenodo.org/record/6456264/files/train-100.zip?download=1"

# Unzip the model to tmp/train-100
unzip tmp/train-100.zip -d tmp/

# Use the downloaded model to transcribe the toy-example audio
python scripts/run_asr-by-w2v2.py \
    tmp/train-100 \
    data/toy-example/raw/hello-goodbye.wav
```
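Before running the pipeline with a downloaded checkpoint, you can sanity-check that it loads as a local HuggingFace model directory (a sketch, assuming the unzipped folder contains the usual config and processor files):

```python
# Quick check that the Zenodo checkpoint loads as a local model directory
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("tmp/train-100")
model = Wav2Vec2ForCTC.from_pretrained("tmp/train-100")
print(model.config.model_type)  # expect 'wav2vec2'
```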
Note: these commands will not work well (or at all) with the toy example, given how small the dataset is; they are provided to illustrate usage only.
```bash
# Fine-tune a pre-trained wav2vec 2.0 model without a language model.
# Arguments: starting checkpoint, output directory, training-data TSV,
# evaluation-data TSV (same TSV used twice here for illustration only!)
python scripts/train_asr-by-w2v2-ft.py \
    facebook/wav2vec2-large-robust-ft-swbd-300h \
    data/toy-example/my-fine-tuned-model \
    data/toy-example/eng-sentences.tsv \
    data/toy-example/eng-sentences.tsv
```
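The two TSV arguments point to the training and evaluation data. As a quick way to eyeball such a file with pandas (the assumption that it is a tab-separated table pairing clips with transcriptions is ours; check `train_asr-by-w2v2-ft.py` for the exact columns it expects):

```python
# Peek at the training TSV
import pandas as pd

df = pd.read_csv("data/toy-example/eng-sentences.tsv", sep="\t")
print(df.head())
print(f"{len(df)} rows")
```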
```bash
# Build a 2-gram language model using KenLM
lmplz -o 2 < data/toy-example/eng-texts.txt > data/toy-example/eng-2gram.arpa

# Add an end-of-sentence token to make the ARPA file compatible with pyctcdecode
python scripts/helpers/add_eos-to-arpa.py \
    data/toy-example/eng-2gram.arpa \
    data/toy-example/eng-2gram_corrected.arpa
```
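The correction is needed because `lmplz` emits an ARPA file whose unigrams include `<s>` but not `</s>`, while pyctcdecode expects both. A sketch of the kind of fix the helper applies (adapted from the approach in HuggingFace's "Boosting Wav2Vec2 with n-grams" guide; `add_eos-to-arpa.py` may differ in detail):

```python
# Mirror the <s> unigram entry as </s> and bump the 1-gram count by one
with open("data/toy-example/eng-2gram.arpa") as fin, \
     open("data/toy-example/eng-2gram_corrected.arpa", "w") as fout:
    added_eos = False
    for line in fin:
        if not added_eos and "ngram 1=" in line:
            count = line.strip().split("=")[-1]
            fout.write(line.replace(f"={count}", f"={int(count) + 1}"))
        elif not added_eos and "<s>" in line:
            fout.write(line)
            fout.write(line.replace("<s>", "</s>"))  # duplicate as </s>
            added_eos = True
        else:
            fout.write(line)
```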
```bash
# Fine-tune a pre-trained wav2vec 2.0 model with a 2-gram language model.
# Arguments as above, plus --lm_arpa pointing to the corrected language model file
python scripts/train_asr-by-w2v2-ft.py \
    facebook/wav2vec2-large-robust-ft-swbd-300h \
    data/toy-example/my-fine-tuned-model \
    data/toy-example/eng-sentences.tsv \
    data/toy-example/eng-sentences.tsv \
    --lm_arpa data/toy-example/eng-2gram_corrected.arpa
```
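At inference time, the `--lm_arpa` option enables LM-boosted beam-search decoding via pyctcdecode. A minimal sketch of that decoding path (assuming 16 kHz audio; the script's exact integration may differ):

```python
# Decode wav2vec 2.0 logits with a KenLM ARPA model via pyctcdecode
import torch
import torchaudio
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_dir = "data/toy-example/my-fine-tuned-model"
processor = Wav2Vec2Processor.from_pretrained(model_dir)
model = Wav2Vec2ForCTC.from_pretrained(model_dir)

# Vocabulary in index order; wav2vec 2.0 uses '|' as the word delimiter
vocab = processor.tokenizer.get_vocab()
labels = [tok.replace("|", " ") for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
decoder = build_ctcdecoder(labels, kenlm_model_path="data/toy-example/eng-2gram_corrected.arpa")

wav, sr = torchaudio.load("data/toy-example/raw/hello-goodbye.wav")
inputs = processor(wav.squeeze(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

print(decoder.decode(logits[0].numpy()))
```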