- Setup
- Text Preprocessing (Phonetic Conversion and Normalization for Turkish)
- Data Preparation
- Training FastPitch from scratch (Spectrogram Generator)
- Fine-tuning the model with HiFi-GAN (Waveform Generator)
- Inference
This repository contains a Dockerfile that extends the PyTorch 21.02-py3 NGC container and bundles the required dependencies. To build your own container, choose a PyTorch container from NVIDIA PyTorch Container Versions and create a Dockerfile in the following format:
FROM nvcr.io/nvidia/pytorch:21.02-py3
WORKDIR /path/to/working/directory/text2speech/
COPY requirements.txt .
RUN pip install -r requirements.txt
- Build and run docker
Go to the /path/to/working/directory/text2speech/docker directory:
$ docker build --no-cache -t torcht2s .
$ docker run -it --rm --gpus all -p 2222:8888 -v /path/to/working/directory/text2speech:/path/to/working/directory/text2speech torcht2s
- Register the environment as a Jupyter kernel and launch Jupyter Notebook
$ python -m ipykernel install --user --name=torcht2s
$ jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
- Open a browser on your local machine and navigate to
http://127.0.0.1:2222/?token=${TOKEN}
then enter the token printed in your terminal.
Training a speech synthesis model requires audio recordings paired with the phoneme sequences that represent them. For that reason, the first step encodes the input text into a list of symbols. In this study, the symbols are Turkish characters and phonemes. Turkish is a phonetic language: words are written as they are pronounced, so the character sequence of a word already describes its pronunciation. In non-phonetic languages such as English, pronunciation has to be expressed with phonemes. To synthesize Turkish speech with English data, the words in the English dataset must first be phonetically transcribed into Turkish.
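The encoding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: `TOY_LEXICON`, `text_to_symbols`, and `symbols_to_ids` are hypothetical names, and the lexicon entry is a toy example rather than an entry copied from cmudict_tr. Words found in the lexicon are replaced by phonemes; all other words fall back to their characters.

```python
# Hypothetical toy lexicon: word -> phoneme list (illustrative entry only).
TOY_LEXICON = {
    "merhaba": ["m", "e", "r", "h", "a", "b", "a"],
}


def text_to_symbols(text, lexicon):
    """Encode text as a flat list of symbols.

    Words found in the lexicon become phonemes; other words fall back
    to their characters. A single space symbol separates words.
    """
    symbols = []
    for i, word in enumerate(text.split()):
        if i > 0:
            symbols.append(" ")
        symbols.extend(lexicon.get(word, list(word)))
    return symbols


def symbols_to_ids(symbols, symbol_table):
    """Map symbols to integer IDs, the form a model consumes."""
    return [symbol_table[s] for s in symbols]


seq = text_to_symbols("merhaba dünya", TOY_LEXICON)
# Build an ID table from the symbols seen here; a real system uses a fixed table.
table = {s: i for i, s in enumerate(sorted(set(seq)))}
ids = symbols_to_ids(seq, table)
```

The character fallback reflects the point made above: because Turkish spelling is phonetic, an out-of-lexicon word can still be encoded from its letters.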
- This study uses cmudict_tr and heteronyms_tr. CMUDict (Turkish phonetic lexicon) is a dictionary that gives the phonetic transcription of about 1.5M Turkish words.
- The following symbols are the phonemes used to represent Turkish pronunciation.
valid_symbols = ['1', '1:', '2', '2:', '5', 'a', 'a:', 'b', 'c', 'd', 'dZ', 'e', 'e:', 'f', 'g', 'gj', 'h', 'i', 'i:', 'j',
'k', 'l', 'm', 'n', 'N', 'o', 'o:', 'p', 'r', 's', 'S', 't', 'tS', 'u', 'u:', 'v', 'y', 'y:', 'z', 'Z']
- Text normalization converts text from written form into its verbalized form, and it is an essential preprocessing step before text-to-speech synthesis. It ensures that TTS can handle all input texts without skipping unknown symbols. Text normalization is applied for Turkish utterances.
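As a rough illustration of what normalization involves, the sketch below lowercases text with Turkish casing rules, verbalizes one- and two-digit numbers, and collapses whitespace. It is not the normalizer used in this repository; the function names are hypothetical, and the scope (numbers 0 to 99 only) is a simplifying assumption.

```python
import re

ONES = ["sıfır", "bir", "iki", "üç", "dört", "beş", "altı", "yedi", "sekiz", "dokuz"]
TENS = ["", "on", "yirmi", "otuz", "kırk", "elli", "altmış", "yetmiş", "seksen", "doksan"]


def turkish_lower(text):
    # Python's str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so map the Turkish dotted/dotless pair explicitly before lowercasing.
    return text.replace("İ", "i").replace("I", "ı").lower()


def number_to_words(n):
    """Verbalize an integer in the range 0..99."""
    if n < 10:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text):
    # Expand standalone 1-2 digit numbers; longer numbers are left untouched here.
    text = re.sub(r"(?<!\d)\d{1,2}(?!\d)",
                  lambda m: number_to_words(int(m.group())), text)
    text = turkish_lower(text)
    # Collapse runs of whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()
```

A production normalizer would also need to cover larger numbers, dates, abbreviations, and currency, which this sketch deliberately leaves out.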
To speed up training, mel-spectrograms and pitch can be computed during the pre-processing step and read directly from disk during training. Follow these steps to use a custom dataset.
- Prepare a directory with .wav files and filelists (training/validation split of the data) with transcripts and paths to the .wav files under the
text2speech/Fastpitch/dataset/
location. Each filelist should list a single utterance per line as:
<audio file path>|<transcript>
- Run the pre-processing script to calculate pitch and mels with
text2speech/Fastpitch/data_preperation.ipynb
$ python prepare_dataset.py \
--wav-text-filelists dataset/tts_data.txt \
--n-workers 16 \
--batch-size 1 \
--dataset-path dataset \
--extract-pitch \
--f0-method pyin \
--extract-mels
- Prepare filelists with paths to the pre-calculated pitch by running
create_picth_text_file(manifest_path)
from text2speech/Fastpitch/data_preperation.ipynb
Each filelist should list a single utterance per line as:
<mel or wav file path>|<pitch file path>|<text>|<speaker_id>
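The conversion between the two filelist formats above can be sketched as follows. This is not the notebook's `create_picth_text_file`; `build_pitch_lines` is a hypothetical helper, and it assumes that pitch files are stored as `pitch/<audio stem>.pt` and that the dataset has a single speaker with ID 0.

```python
from pathlib import Path


def build_pitch_lines(filelist_lines, pitch_dir="pitch", speaker_id=0):
    """Turn '<audio path>|<transcript>' lines into
    '<audio path>|<pitch path>|<transcript>|<speaker_id>' lines."""
    out = []
    for line in filelist_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first '|' only, so the transcript is kept whole.
        audio_path, transcript = line.split("|", 1)
        # Assumption: pitch is stored as <pitch_dir>/<audio stem>.pt
        pitch_path = Path(pitch_dir) / (Path(audio_path).stem + ".pt")
        out.append(f"{audio_path}|{pitch_path.as_posix()}|{transcript}|{speaker_id}")
    return out
```

For example, `build_pitch_lines(["wavs/a001.wav|merhaba dünya"])` yields one line with four `|`-separated fields in the order shown above.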
The complete dataset has the following structure:
./dataset
├── mels
├── pitch
├── wavs
├── tts_data.txt # train + val
├── tts_data_train.txt
├── tts_data_val.txt
├── tts_pitch_data.txt # train + val
├── tts_pitch_data_train.txt
├── tts_pitch_data_val.txt
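The structure above contains a combined filelist next to separate training and validation filelists. One way such a split could be produced is sketched below; `split_filelist`, the 10% validation fraction, and the fixed seed are assumptions made for illustration, not values taken from this repository.

```python
import random


def split_filelist(lines, val_fraction=0.1, seed=0):
    """Shuffle filelist lines reproducibly and split them into (train, val)."""
    lines = [line for line in lines if line.strip()]
    shuffled = lines[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Applying the same function with the same seed to the lines of `tts_pitch_data.txt` would give the matching pitch filelists, provided both files list the utterances in the same order.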
The training will produce a FastPitch model capable of generating mel-spectrograms from raw text. It will be serialized as a single .pt
checkpoint file, along with a series of intermediate checkpoints.
$ python train.py --cuda --amp --p-arpabet 1.0 --dataset-path dataset \
--output saved_fastpicth_models/ \
--training-files dataset/tts_pitch_data_train.txt \
--validation-files dataset/tts_pitch_data_val.txt \
--epochs 1000 --learning-rate 0.001 --batch-size 32 \
--load-pitch-from-disk
The last step converts the spectrogram into a waveform. A model that generates speech from a spectrogram is called a vocoder.
Some mel-spectrogram generators are prone to model bias. Because their spectrograms differ from the true data on which HiFi-GAN was trained, the quality of the generated audio might suffer. To overcome this problem, a HiFi-GAN model can be fine-tuned on the outputs of a particular mel-spectrogram generator so that it adapts to this bias. In this section, HiFi-GAN is fine-tuned on FastPitch outputs.
- Generate mel-spectrograms for all utterances in the dataset with the FastPitch model
- Copy the best-performing FastPitch .pt checkpoint file into the
text2speech/Hifigan/data/pretrained_fastpicth_model/
directory.
- Copy the manifest file
tts_pitch_data.txt
into the text2speech/Hifigan/data/
directory.
$ python extract_mels.py --cuda \
-o data/mels-fastpitch-tr22khz \
--dataset-path /text2speech/Fastpitch/dataset \
--dataset-files data/tts_pitch_data.txt \
--load-pitch-from-disk \
--checkpoint-path data/pretrained_fastpicth_model/FastPitch_checkpoint.pt -bs 16
Mel-spectrograms should now be prepared in the text2speech/Hifigan/data/mels-fastpitch-tr22khz
directory.
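Before starting fine-tuning, it can be useful to confirm that every utterance in the manifest has a generated mel-spectrogram. The sketch below is a hypothetical check, not part of this repository; it assumes each mel file is named after the stem of the audio file with a `.pt` extension, which should be verified against the actual contents of the output directory.

```python
from pathlib import Path


def missing_mels(manifest_lines, mels_dir, ext=".pt"):
    """Return the stems of utterances that have no mel file in mels_dir."""
    mels_dir = Path(mels_dir)
    missing = []
    for line in manifest_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # The first '|'-separated field is the audio (or mel) file path.
        stem = Path(line.split("|", 1)[0]).stem
        # Assumption: mel files are stored as <mels_dir>/<stem><ext>
        if not (mels_dir / (stem + ext)).exists():
            missing.append(stem)
    return missing
```

An empty result for the lines of `data/tts_pitch_data.txt` checked against `data/mels-fastpitch-tr22khz` would indicate that extraction covered the whole manifest.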
The fine-tuning script will load an existing HiFi-GAN model and run several epochs of training using spectrograms generated in the last step.
- Fine-tune HiFi-GAN on the FastPitch outputs
This step will produce another .pt
HiFi-GAN model checkpoint file fine-tuned to the particular FastPitch model.
- Create a new folder
results
in the text2speech/Hifigan
directory.
$ nohup python train.py --cuda --output /results/hifigan_tr22khz \
--epochs 1000 --dataset_path /Fastpitch/dataset \
--input_mels_dir /data/mels-fastpitch-tr22khz \
--training_files /Fastpitch/dataset/tts_data.txt \
--validation_files /Fastpitch/dataset/tts_data.txt \
--fine_tuning --fine_tune_lr_factor 3 --batch_size 16 \
--learning_rate 0.0003 --lr_decay 0.9998 --validation_interval 10 > log.txt
- Open another terminal and follow the log:
$ tail -f log.txt
Run the following command to synthesize audio from raw text with the mel-spectrogram generator:
$ python inference.py --cuda \
--hifigan /Hifigan/results/hifigan_tr22khz/hifigan_gen_checkpoint.pt \
--fastpitch /Fastpitch/saved_fastpicth_models/FastPitch_checkpoint.pt \
-i test_text.txt \
-o wavs/
The speech is generated from a file passed with the -i
argument.
The output audio will be stored in the path specified by the -o
argument.
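After inference, the generated files can be inspected quickly with a short script such as the one below. `summarize_wavs` is a hypothetical helper, and it assumes the output files are PCM-encoded: the standard `wave` module cannot read floating-point WAV files, so a different audio library would be needed if the inference script writes that format.

```python
import wave
from pathlib import Path


def summarize_wavs(wav_dir):
    """Return (file name, sample rate in Hz, duration in seconds) per .wav file."""
    rows = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        # wave.open only supports PCM-encoded WAV files.
        with wave.open(str(path), "rb") as w:
            rate = w.getframerate()
            rows.append((path.name, rate, w.getnframes() / rate))
    return rows
```

Calling `summarize_wavs("wavs")` on the output directory gives one tuple per file, which makes it easy to spot empty or unexpectedly short clips and to confirm the sample rate.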