This repository contains steps to train NVIDIA/tacotron2 on a multi-speaker Hindi-language dataset.
You can listen to the demo here.
- Pre-requisites
- Setup
- 2.1 PyTorch from source
- 2.2 Apex
- 2.3 Other Python requirements
- Dataset
- Data Preprocessing
- 4.1 OpenSLR
- 4.1.1 Upsample to 22050 Hz
- 4.1.2 Creating train and test text files
- 4.1.3 Update hparams.py
- 4.2 IIIT-Hyd
- 4.2.1 Downsample to 22050 Hz
- 4.2.2 Creating train and test text files
- 4.2.3 Update hparams.py
- Training
- Inference
- Related repos
- Acknowledgements
- NVIDIA GPU
- NVIDIA CUDA installation. More on it here.
2.1 PyTorch from source
Clone the repository
git clone --recursive https://github.com/pytorch/pytorch
cd pytorch
# if you are updating an existing checkout
git submodule sync
git submodule update --init --recursive --jobs 0
Build and install
export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
python setup.py develop
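A quick optional check that the build works (an illustrative snippet, not part of the original steps):
import torch
print(torch.__version__)          # version string of the freshly built torch
print(torch.cuda.is_available())  # should print True on a working CUDA setup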
2.2 Apex
Apex is used for mixed-precision and distributed training.
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
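A minimal import check that the Apex extensions built correctly (illustrative):
from apex import amp  # raises ImportError if the Apex build failed
print('apex OK')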
2.3 Other Python requirements
The remaining Python dependencies are listed in requirements.txt:
pip3 install -r requirements.txt
This version of Tacotron 2 is trained with two different datasets; you can choose either of them.
- This is a high-quality multi-speaker Hindi speech dataset from OpenSLR
- The Hindi speech dataset is split into train and test sets with 95.05 hours and 5.55 hours of audio respectively
- There are 4506 and 386 unique sentences taken from Hindi stories in the train and test sets, respectively, with no overlap of sentences. The train set contains utterances from a set of 59 speakers, and the test set contains speakers from a disjoint set of 19 speakers
- The audio files are sampled at 8 kHz with 16-bit encoding. The total vocabulary size of the train and test sets is 6542
- This dataset consists of single-speaker samples from IIIT-Hyd
- There are 9368 samples available for training
- The data has a sampling rate of 48kHz
# Download train dataset
wget https://www.openslr.org/resources/103/Hindi_train.tar.gz
# Extract train dataset
tar xvf Hindi_train.tar.gz
# Download test dataset
wget https://www.openslr.org/resources/103/Hindi_test.tar.gz
# Extract test dataset
tar xvf Hindi_test.tar.gz
# Copy the data to dataset folder
mkdir HindiDataset
mv train HindiDataset/
mv test HindiDataset/
Request the dataset here.
# Copy the data to dataset folder
mkdir HindiDataset
mv Dataset HindiDataset/train_raw
4.1 OpenSLR
The OpenSLR data provides a transcription.txt file, which needs to be converted into the filelist format expected by Tacotron 2 training.
4.1.1 Upsample to 22050 Hz
python3 upsampler.py HindiDataset/train/audio/
python3 upsampler.py HindiDataset/test/audio/
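If you need to adapt this step, below is a minimal sketch of the resampling, assuming each .wav is rewritten in place at 22050 Hz (the actual upsampler.py may differ):

import os
import sys
import librosa
import soundfile as sf

audio_dir = sys.argv[1]  # e.g. HindiDataset/train/audio/
for name in os.listdir(audio_dir):
    if not name.endswith('.wav'):
        continue
    path = os.path.join(audio_dir, name)
    audio, _ = librosa.load(path, sr=22050)  # resample 8 kHz -> 22050 Hz on load
    sf.write(path, audio, 22050)             # overwrite with the resampled audio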
4.1.2 Creating train and test text files
Run filelist_creator.py to create the train and test text files:
python3 filelist_creator.py HindiDataset/
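Tacotron 2 filelists contain one wav_path|transcript pair per line. A minimal sketch of the conversion, assuming transcription.txt holds tab-separated utterance IDs and Hindi text (the actual filelist_creator.py may handle more cases):

import os
import sys

root = sys.argv[1]  # e.g. HindiDataset/
os.makedirs('filelists', exist_ok=True)
for split in ('train', 'test'):
    with open(os.path.join(root, split, 'transcription.txt'), encoding='utf-8') as src, \
         open(f'filelists/openslr_hindi_{split}.txt', 'w', encoding='utf-8') as dst:
        for line in src:
            utt_id, text = line.rstrip('\n').split('\t', 1)  # assumed tab-separated
            wav_path = os.path.join(root, split, 'audio', utt_id + '.wav')
            dst.write(f'{wav_path}|{text}\n')  # Tacotron 2 filelist format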
4.1.3 Update hparams.py
Open this file and change training_files and validation_files accordingly:
training_files='filelists/openslr_hindi_train.txt',
validation_files='filelists/openslr_hindi_test.txt',
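Since the audio was resampled to 22050 Hz, the default sampling_rate in hparams.py should already match. For Hindi text, a language-neutral cleaner may also be needed instead of the English default (a suggestion, not part of the original steps):
sampling_rate=22050,
text_cleaners=['basic_cleaners'],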
4.2 IIIT-Hyd
This dataset provides an annotations.csv file, which needs to be converted into the filelist format expected by Tacotron 2 training.
4.2.1 Downsample the data to 22050 Hz
mkdir -p HindiDataset/train
python3 format_changer.py HindiDataset/train_raw/ HindiDataset/train/
4.2.2 Create train and test text files
cp annotations.csv filelists/iiit-hyd_hindi_train.txt
cp annotations.csv filelists/iiit-hyd_hindi_test.txt
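Note that annotations.csv is copied verbatim, so it must already be in the wav_path|text filelist format. A quick sanity check (illustrative):

with open('filelists/iiit-hyd_hindi_train.txt', encoding='utf-8') as f:
    for i, line in enumerate(f, 1):
        parts = line.rstrip('\n').split('|')
        assert len(parts) == 2 and parts[0].endswith('.wav'), f'bad line {i}: {line!r}'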
4.2.3 Update hparams.py
Open this file and change training_files and validation_files accordingly:
training_files='filelists/iiit-hyd_hindi_train.txt',
validation_files='filelists/iiit-hyd_hindi_test.txt',
Training from a pre-trained model can lead to faster convergence. By default, the dataset-dependent text embedding layers are ignored when warm-starting.
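The layers skipped on --warm_start are controlled by ignore_layers in hparams.py; the NVIDIA/tacotron2 default skips only the text embedding, which is what makes warm-starting across languages possible:
ignore_layers=['embedding.weight'],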
Download the NVIDIA pre-trained Tacotron 2 model and warm-start from it:
python3 train.py --output_directory=outdir --log_directory=logdir -c tacotron2_statedict.pt --warm_start
Alternatively, you can train a model from scratch:
python3 train.py --output_directory=outdir --log_directory=logdir
For multi-GPU distributed training with mixed precision:
python3 -m multiproc train.py --output_directory=outdir --log_directory=logdir --hparams=distributed_run=True,fp16_run=True
Model accuracy and alignment can be monitored using TensorBoard:
tensorboard --logdir=outdir/logdir
- Download waveglow from here
- Download pre-trained openslr hindi from here
- Download pre-trained iiit-hyd hindi from here
N.b. When performing Mel-Spectrogram to Audio synthesis, make sure Tacotron 2 and the Mel decoder were trained on the same mel-spectrogram representation.
jupyter notebook --ip=127.0.0.1 --port=31337
Run inference.ipynb.
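The notebook follows the standard NVIDIA/tacotron2 inference flow. A condensed sketch is below; the checkpoint paths, the input sentence, and the cleaner choice are illustrative (Hindi text should not go through the English cleaners):

import torch
import numpy as np
from hparams import create_hparams
from train import load_model
from text import text_to_sequence

hparams = create_hparams()
model = load_model(hparams)
model.load_state_dict(torch.load('outdir/checkpoint_10000')['state_dict'])  # illustrative path
model.cuda().eval()

waveglow = torch.load('waveglow_256channels.pt')['model']  # illustrative path
waveglow.cuda().eval()

text = 'नमस्ते'  # illustrative Hindi input
sequence = np.array(text_to_sequence(text, ['basic_cleaners']))[None, :]
sequence = torch.from_numpy(sequence).cuda().long()

# synthesize a mel-spectrogram, then vocode it to audio
mel_outputs, mel_outputs_postnet, _, alignments = model.inference(sequence)
with torch.no_grad():
    audio = waveglow.infer(mel_outputs_postnet, sigma=0.666)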
The vocoders supported in this repository are WaveGlow, MelGAN, and HiFiGAN.
First, install NeMo:
git clone https://github.com/NVIDIA/NeMo.git
cd NeMo
./reinstall.sh
Use the inference.ipynb notebook to try the different vocoders.
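As an illustration of swapping in a NeMo vocoder, a HiFi-GAN checkpoint can be loaded and run on the Tacotron 2 mel output; the pretrained model name below is an assumption that varies with the NeMo release, and mel_outputs_postnet is the tensor from the inference sketch above. Keep the note about matching mel representations in mind:

import torch
from nemo.collections.tts.models import HifiGanModel

# load a pretrained HiFi-GAN vocoder (model name may vary; check your NeMo version)
vocoder = HifiGanModel.from_pretrained(model_name='tts_hifigan')
vocoder.eval()
with torch.no_grad():
    audio = vocoder.convert_spectrogram_to_audio(spec=mel_outputs_postnet)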
NVIDIA/tacotron2: the original work this repository is based on.
WaveGlow: a faster-than-real-time flow-based generative network for speech synthesis.
nv-wavenet: a faster-than-real-time WaveNet implementation.
This implementation uses code from the following repos: Keith Ito and Prem Seetharaman, as described in our code.
We are inspired by Ryuichi Yamamoto's Tacotron PyTorch implementation.
We are thankful to the Tacotron 2 paper authors, especially Jonathan Shen, Yuxuan Wang and Zongheng Yang.