This repository contains steps to train NVIDIA/tacotron2 on a multi-speaker Hindi-language dataset.
You can listen to the demo here.
- Pre-requisites
- Setup
- 2.1 PyTorch from source
- 2.2 Apex
- 2.3 Other Python requirements
- Dataset
- Data Preprocessing
- 4.1 OpenSLR
- 4.1.1 Upsample to 22050 Hz
- 4.1.2 Creating train and test text files
- 4.1.3 Update hparams.py
- 4.2 IIIT-Hyd
- 4.2.1 Downsample to 22050 Hz
- 4.2.2 Creating train and test text files
- 4.2.3 Update hparams.py
- Training
- Inference
- Related repos
- Acknowledgements
- NVIDIA GPU
- NVIDIA CUDA installation. More on it here.
2.1 PyTorch from source
Clone the repository
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive --jobs 0
Build and install
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop
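A quick optional check that the build works (an illustrative snippet, not part of the original steps):
import torch
print(torch.__version__)          # version string of the freshly built torch
print(torch.cuda.is_available())  # should print True on a working CUDA setup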
2.2 Apex
Apex is used for mixed-precision and distributed training.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
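A minimal import check that the Apex extensions built correctly (illustrative):
from apex import amp  # raises ImportError if the Apex build failed
print('apex OK')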
2.3 Other Python requirements
The remaining Python dependencies are listed in requirements.txt:
pip3 install -r requirements.txt
This version of Tacotron 2 is trained with two different datasets; you can choose either of them.
- This is a high-quality multi-speaker Hindi speech dataset from OpenSLR
- The Hindi speech dataset is split into train and test sets with 95.05 hours and 5.55 hours of audio respectively
- There are 4506 and 386 unique sentences taken from Hindi stories in the train and test sets, respectively, with no overlap of sentences. The train set contains utterances from a set of 59 speakers, and the test set contains speakers from a disjoint set of 19 speakers
- The audio files are sampled at 8 kHz with 16-bit encoding. The total vocabulary size of the train and test sets is 6542
- This dataset consists of single-speaker samples from IIIT-Hyd
- There are 9368 samples available for training
- The data has a sampling rate of 48kHz
# Download train dataset
wget https://www.openslr.org/resources/103/Hindi_train.tar.gz
# Extract train dataset
tar xvf Hindi_train.tar.gz
# Download test dataset
wget https://www.openslr.org/resources/103/Hindi_test.tar.gz
# Extract test dataset
tar xvf Hindi_test.tar.gz
# Copy the data to dataset folder
mkdir HindiDataset
mv train HindiDataset/
mv test HindiDataset/
Request the dataset here.
# Copy the data to dataset folder
mkdir HindiDataset
mv Dataset HindiDataset/train_raw
4.1 OpenSLR
The OpenSLR data provides a transcription.txt file, which needs to be converted into the filelist format expected by Tacotron 2 training.
4.1.1 Upsample to 22050 Hz
python3 upsampler.py HindiDataset/train/audio/
python3 upsampler.py HindiDataset/test/audio/
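If you need to adapt this step, below is a minimal sketch of the resampling, assuming each .wav is rewritten in place at 22050 Hz (the actual upsampler.py may differ):

import os
import sys
import librosa
import soundfile as sf

audio_dir = sys.argv[1]  # e.g. HindiDataset/train/audio/
for name in os.listdir(audio_dir):
    if not name.endswith('.wav'):
        continue
    path = os.path.join(audio_dir, name)
    audio, _ = librosa.load(path, sr=22050)  # resample 8 kHz -> 22050 Hz on load
    sf.write(path, audio, 22050)             # overwrite with the resampled audio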
4.1.2 Creating train and test text files
Run filelist_creator.py to create the train and test text files:
python3 filelist_creator.py HindiDataset/
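Tacotron 2 filelists contain one wav_path|transcript pair per line. A minimal sketch of the conversion, assuming transcription.txt holds tab-separated utterance IDs and Hindi text (the actual filelist_creator.py may handle more cases):

import os
import sys

root = sys.argv[1]  # e.g. HindiDataset/
os.makedirs('filelists', exist_ok=True)
for split in ('train', 'test'):
    with open(os.path.join(root, split, 'transcription.txt'), encoding='utf-8') as src, \
         open(f'filelists/openslr_hindi_{split}.txt', 'w', encoding='utf-8') as dst:
        for line in src:
            utt_id, text = line.rstrip('\n').split('\t', 1)  # assumed tab-separated
            wav_path = os.path.join(root, split, 'audio', utt_id + '.wav')
            dst.write(f'{wav_path}|{text}\n')  # Tacotron 2 filelist format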
4.1.3 Update hparams.py
Open this file and change training_files and validation_files accordingly:
training_files='filelists/openslr_hindi_train.txt',
validation_files='filelists/openslr_hindi_test.txt',
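Since the audio was resampled to 22050 Hz, the default sampling_rate in hparams.py should already match. For Hindi text, a language-neutral cleaner may also be needed instead of the English default (a suggestion, not part of the original steps):
sampling_rate=22050,
text_cleaners=['basic_cleaners'],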
4.2 IIIT-Hyd
This dataset provides an annotations.csv file, which needs to be converted into the filelist format expected by Tacotron 2 training.
4.2.1 Downsample the data to 22050 Hz
mkdir -p HindiDataset/train
python3 format_changer.py HindiDataset/train_raw/ HindiDataset/train/
4.2.2 Create train and test text files
cp annotations.csv filelists/iiit-hyd_hindi_train.txt
cp annotations.csv filelists/iiit-hyd_hindi_test.txt
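Note that annotations.csv is copied verbatim, so it must already be in the wav_path|text filelist format. A quick sanity check (illustrative):

with open('filelists/iiit-hyd_hindi_train.txt', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        parts = line.rstrip('\n').split('|')
        assert len(parts) == 2 and parts[0].endswith('.wav'), f'bad line {i}: {line!r}'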
4.2.3 Update hparams.py
Open this file and change training_files and validation_files accordingly:
training_files='filelists/iiit-hyd_hindi_train.txt',
validation_files='filelists/iiit-hyd_hindi_test.txt',
Training from a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored when warm-starting.
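The layers skipped on --warm_start are controlled by ignore_layers in hparams.py; the NVIDIA/tacotron2 default skips only the text embedding, which is what makes warm-starting across languages possible:
ignore_layers=['embedding.weight'],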
Download the NVIDIA pre-trained Tacotron 2 model and warm-start from it:
python3 train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
Alternatively, you can train a model from scratch:
python3 train.py --output_directory=outdir --log_directory=logdir
For multi-GPU distributed training with mixed precision:
python3 -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
Model accuracy and alignment can be monitored using TensorBoard:
tensorboard --logdir=outdir/logdir
- Download waveglow from here
- Download pre-trained openslr hindi from here
- Download pre-trained iiit-hyd hindi from here
N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
jupyter notebook --ip=127.0.0.1 --port=31337
Run inference.ipynb.
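The notebook follows the standard NVIDIA/tacotron2 inference flow. A condensed sketch is below; the checkpoint paths, the input sentence, and the cleaner choice are illustrative (Hindi text should not go through the English cleaners):

import torch
import numpy as np
from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()
model = load_model(hparams)
model.load_state_dict(torch.load('outdir/checkpoint_10000')['state_dict'])  # illustrative path
model.cuda().eval()

waveglow = torch.load('waveglow_256channels.pt')['model']  # illustrative path
waveglow.cuda().eval()

text = 'नमस्ते'  # illustrative Hindi input
sequence = np.array(text_to_sequence(text, ['basic_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

# synthesize a mel-spectrogram, then vocode it to audio
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)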
The vocoders supported in this repository are WaveGlow, MelGAN, and HiFiGAN.
First, install NeMo:
git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
./reinstall.sh
Use the inference.ipynb notebook to try the different vocoders.
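As an illustration of swapping in a NeMo vocoder, a HiFi-GAN checkpoint can be loaded and run on the Tacotron 2 mel output; the pretrained model name below is an assumption that varies with the NeMo release, and mel_outputs_postnet is the tensor from the inference sketch above. Keep the note about matching mel representations in mind:

import torch
from nemo.collections.tts.models import HifiGanModel

# load a pretrained HiFi-GAN vocoder (model name may vary; check your NeMo version)
vocoder = HifiGanModel.from_pretrained(model_name='tts_hifigan')
vocoder.eval()
with torch.no_grad():
    audio = vocoder.convert_spectrogram_to_audio(spec=mel_outputs_postnet)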
NVIDIA/tacotron2: the original work this repository is based on.
WaveGlow: a faster-than-real-time flow-based generative network for speech synthesis.
nv-wavenet: a faster-than-real-time WaveNet implementation.
This implementation uses code from the following repos: Keith Ito and Prem Seetharaman, as described in our code.
We are inspired by Ryuichi Yamamoto's Tacotron PyTorch implementation.
We are thankful to the Tacotron 2 paper authors, especially Jonathan Shen, Yuxuan Wang and Zongheng Yang.