INTERSPEECH 2018 paper: link
We apply capsule networks to capture the spatial relationships and pose information of speech spectrogram features along both the frequency and time axes. Our proposed end-to-end SR system with capsule networks, evaluated on the one-second speech commands dataset, achieves better results than baseline CNN models on both clean and noise-added test sets.
- 20 JAN 2019: Other baseline Keyword Spotting (KWS) models are also provided in the CNN code.
The code is implemented in Python 2 (2.7.12).
You should be able to import the following libraries:
tqdm, numpy(1.14.1), termcolor, scipy, sklearn, scikits
tensorflow(1.6.0), keras(2.1.4)
pip install numpy
pip install termcolor
pip install scipy
pip install scikit-learn
pip install tensorflow-gpu==1.6.0
pip install keras==2.1.4
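After installing, it can help to confirm that the pinned versions are the ones Python actually imports. The following is a minimal sketch, not part of this repository: `check_environment` and `REQUIRED` are hypothetical names, the versions are the ones quoted above, and `scikits` is left out because its exact module name is not stated here.

```python
import importlib

# Versions quoted in this README; entries set to None are only import-checked.
REQUIRED = {
    "numpy": "1.14.1",
    "tensorflow": "1.6.0",
    "keras": "2.1.4",
    "tqdm": None,
    "termcolor": None,
    "scipy": None,
    "sklearn": None,
}


def check_environment(required=REQUIRED):
    """Return {module_name: status}; status is 'ok', 'missing' or 'version X != Y'."""
    report = {}
    for name, wanted in sorted(required.items()):
        try:
            module = importlib.import_module(name)
        except ImportError:
            report[name] = "missing"
            continue
        found = getattr(module, "__version__", None)
        if wanted is not None and found != wanted:
            report[name] = "version {} != {}".format(found, wanted)
        else:
            report[name] = "ok"
    return report
```

Calling `check_environment()` and printing the returned dictionary lists any package that is missing or installed at a different version.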
We use the 'Google Speech Command Dataset'. See the blog and the Download Link for details.
- Download the dataset from the above link and unzip it. (In our case, we unzip it into a folder named 'Google_Speech_Command'.)
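The unzip step can also be scripted. The sketch below is an assumption-laden illustration rather than code from this repository: `prepare_dataset` is a hypothetical helper that extracts a trusted archive and then moves every word folder into a 'Google_Speech_Command' subfolder, leaving folders that start with an underscore (such as the background noise folder) and plain files at the top level.

```python
import os
import shutil
import tarfile


def prepare_dataset(archive_path, root_dir, command_dir_name="Google_Speech_Command"):
    """Extract the archive into root_dir and move word folders into a subfolder.

    Returns the sorted list of folder names that were moved.
    Only use this on an archive you trust, since extractall writes its members as-is.
    """
    if not os.path.isdir(root_dir):
        os.makedirs(root_dir)
    with tarfile.open(archive_path) as archive:  # compression is auto-detected
        archive.extractall(root_dir)

    command_dir = os.path.join(root_dir, command_dir_name)
    if not os.path.isdir(command_dir):
        os.makedirs(command_dir)

    moved = []
    for entry in sorted(os.listdir(root_dir)):
        source = os.path.join(root_dir, entry)
        # Skip the target folder itself, underscore-prefixed folders, and plain files.
        if entry == command_dir_name or entry.startswith("_") or not os.path.isdir(source):
            continue
        shutil.move(source, os.path.join(command_dir, entry))
        moved.append(entry)
    return moved
```

The archive path and the destination directory are whatever you chose when downloading; neither is fixed by this repository.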
To add noise to the original dataset, we use MATLAB with voicebox, a MATLAB library. We run the MATLAB code on a local Windows machine and upload the results to a Linux server.
- Unzip the downloaded Google speech command dataset.
- Create a new folder named 'Google_Speech_Command' and move the command folders into it. The folder structure will then look like:
speech_commands_v0.01.tar
|-- [_background_noise_]
|-- Google_Speech_Command
| |-- bed
| |-- bird
: :
| '-- zero
|-- testing_list
|-- validation_list
'-- etc.
- Change 'data_path' in the MATLAB code and run it. It will create a new folder and save the noise-added audio files there.
noise_wave_generate.m
- You can also change 'SNR' in the code to generate noisy audio files at other signal-to-noise ratios.
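For readers without MATLAB, the idea behind mixing noise at a target SNR can be sketched in NumPy. This is not the repository's `noise_wave_generate.m` and does not reproduce voicebox: it assumes SNR is defined from the mean power of the whole utterance, and `add_noise_at_snr` is a hypothetical name, so its output will differ from the MATLAB pipeline.

```python
import numpy as np


def add_noise_at_snr(speech, noise, snr_db, rng=None):
    """Mix `noise` into `speech` so that the mixture has the requested SNR in dB."""
    rng = np.random.RandomState(0) if rng is None else rng
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Tile the noise if it is shorter than the utterance, then cut a random segment.
    if len(noise) < len(speech):
        repeats = int(np.ceil(len(speech) / float(len(noise))))
        noise = np.tile(noise, repeats)
    start = rng.randint(0, len(noise) - len(speech) + 1)
    segment = noise[start:start + len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(segment ** 2)
    if speech_power == 0 or noise_power == 0:
        return speech.copy()  # nothing to scale against

    # Pick gain so that 10*log10(speech_power / (gain**2 * noise_power)) == snr_db.
    gain = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + gain * segment
```

With `snr_db=5`, the added noise carries roughly one third of the speech power, since 10^(5/10) is about 3.16.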
Extract speech features from the raw audio files and save them as .npy files. Adjust the '--noise_name' argument to match the noise data you generated.
cd core
python feature_generation.py
feature_saved
|-- TEST
| |-- fbank
| | |-- clean
| | '-- [noise names]_SNR5
| '-- label
|-- TRAIN
| |-- fbank
| | |-- clean
| | '-- [noise names]_SNR5
| '-- label
'-- VALID
|-- fbank
| |-- clean
| '-- [noise names]_SNR5
'-- label
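The `fbank` folders above suggest log mel filterbank features. As an illustration of what such a feature is, here is a NumPy sketch. It is not `feature_generation.py`: the window length, step, FFT size and number of filters are assumptions chosen for 16 kHz audio, and `log_fbank` and `mel_filterbank` are hypothetical names.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular mel filters, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising slope
            bank[i, k] = (k - left) / float(center - left)
        for k in range(center, right):     # falling slope
            bank[i, k] = (right - k) / float(right - center)
    return bank


def log_fbank(signal, sample_rate=16000, win_len=0.025, win_step=0.010,
              n_filters=40, n_fft=512):
    """Return log mel filterbank energies, shape (n_frames, n_filters)."""
    signal = np.asarray(signal, dtype=np.float64)
    frame_len = int(round(win_len * sample_rate))
    frame_step = int(round(win_step * sample_rate))
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)), mode="constant")
    n_frames = 1 + (len(signal) - frame_len) // frame_step
    # Row r of `index` holds the sample positions of frame r.
    index = np.arange(frame_len)[None, :] + frame_step * np.arange(n_frames)[:, None]
    frames = signal[index] * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    energies = np.dot(power, mel_filterbank(n_filters, n_fft, sample_rate).T)
    return np.log(np.maximum(energies, np.finfo(float).eps))
```

Under these assumed settings a one-second clip sampled at 16 kHz gives a 98 x 40 array, which can be written to disk with `np.save`.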
For training and testing, go into the 'CNN' or 'CapsNet' folder and run the code. You can switch between modes with the '--is_training' argument.
cd CapsNet
python main.py -m=CapsNet --is_training='TRAIN' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32 --primary_veclen=4 --digit_veclen=4
Note that you should set the '--keep' argument to the number of the epoch that you want to test.
cd CapsNet
python main.py -m=CapsNet --is_training='TEST' -ex='0320_digitvec4' -d=0 --kernel=19 --primary_channel=32 --primary_veclen=4 --digit_veclen=4 --SNR=5 --keep=?
KWS models based on various kinds of Neural Networks (NNs) are also provided in CNN/model.py.
1. Deep Neural Network (DNN) based KWS model from
- G. Chen, C. Parada, and G. Heigold, “Small-footprint keyword spotting using deep neural networks.” in ICASSP, vol. 14. Citeseer, 2014, pp. 4087–4091.
Include 'ref_2014icassp_dnn' in ex_name to use the DNN model. For example:
```
python main.py --model='CNN' --ex_name='ref_2014icassp_dnn512' --is_training='TRAIN' --model_size_info 512 512 512
```
2. CNN based KWS model from
- T. N. Sainath and C. Parada, “Convolutional neural networks for small-footprint keyword spotting,” in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
Include 'ref_2015is_cnn' in ex_name to use the CNN model. For example:
```
python main.py --model='CNN' --ex_name='ref_2015is_cnn' --is_training='TRAIN' --model_size_info 21 8 94 1 1 2 3 6 4 94 1 1 1 1 32
```
3. Long Short-Term Memory (LSTM) based KWS model from
- M. Sun, A. Raju, G. Tucker, S. Panchapagesan, G. Fu, A. Mandal, S. Matsoukas, N. Strom, and S. Vitaladevuni, “Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting,” in Spoken Language Technology Workshop (SLT), 2016 IEEE. IEEE, 2016, pp. 474–480.
Include 'ref_rnn' in ex_name to use the LSTM model. For example:
```
python main.py --model='CNN' --ex_name='ref_rnn_lstm' --is_training='TRAIN' --model_size_info 64 32 0
```
4. Convolutional Recurrent Neural Network (CRNN) based KWS model from
- S. O. Arik, M. Kliegl, R. Child, J. Hestness, A. Gibiansky, C. Fougner, R. Prenger, and A. Coates, “Convolutional recurrent neural networks for small-footprint keyword spotting,” arXiv preprint arXiv:1703.05390, 2017.
Include 'ref_crnn' in ex_name to use the CRNN model. For example:
```
python main.py --model='CNN' --ex_name=ref_crnn --is_training='TRAIN' --model_size_info 32 20 5 8 2 2 32 1 64
```
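Because the baseline is chosen by a substring of `ex_name`, a typo in the experiment name can silently select a different model. The sketch below shows one way such substring dispatch could work. It is an assumption about the mechanism, not a copy of CNN/model.py: `select_model`, `EX_NAME_KEYS` and the returned labels are hypothetical.

```python
# Substrings quoted in this README, mapped to hypothetical model labels.
EX_NAME_KEYS = [
    ("ref_2014icassp_dnn", "dnn"),
    ("ref_2015is_cnn", "cnn"),
    ("ref_rnn", "lstm"),
    ("ref_crnn", "crnn"),
]


def select_model(ex_name, default=None):
    """Return the label of the first reference model whose key occurs in ex_name."""
    for key, label in EX_NAME_KEYS:
        if key in ex_name:
            return label
    return default  # no reference model matched


if __name__ == "__main__":
    for name in ("ref_2014icassp_dnn512", "ref_rnn_lstm", "ref_crnn", "0320_digitvec4"):
        print(name, "->", select_model(name))
```

Checking an experiment name this way before launching a long training run is a cheap guard against training the wrong baseline.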
Preprocessing source code is from https://github.com/zzw922cn/Automatic_Speech_Recognition.
Base capsule network Keras source code is from https://github.com/XifengGuo/CapsNet-Keras.
Jaesung Bae - Korea Advanced Institute of Science and Technology (KAIST)
contact: [email protected]