📖 LanMIT: A Toolkit for Improving Language Models in Low-resourced Speech Recognition based on Kaldi.

charlesliucn/LanMIT

 
 

Low-resourced Language Modeling based on Kaldi

This repository provides Kaldi users with a few useful scripts for language modeling, especially under low-resource conditions. The scripts are mainly based on the babel/s5d recipe in Kaldi's egs directory.

Most of the scripts are located in babel/s5d and wsj/s5/steps.

The scripts are not yet well organized. A document describing their usage in detail will be added later.


Main Contributions

  • Data Augmentation
    • Text Preprocessing for Lexicon Generation
    • Vocabulary Expansion Based on Word Frequency
    • Data Selection Based on Multiple Criteria
  • N-Gram Language Models based on SRILM
    • Linear Interpolation for N-Gram models
    • N-Gram Language Model for Rescoring
  • LSTM Language Model Based on TensorFlow
    • Word Vectors Pre-training for RNN/LSTM Language Model Training
    • LSTM Language Model for Rescoring
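
To make the "Linear Interpolation for N-Gram models" item concrete, here is a minimal Python sketch of the general technique. It is not taken from this repository's scripts, which rely on SRILM. The function names, the per-word probability lists, and the grid search are all assumptions made for illustration. The idea is that two models assign probabilities to the same held-out words, and the mixture weight is chosen to minimize held-out perplexity.

```python
import math


def interpolate(p_a, p_b, lam):
    """Mix two probabilities for the same word: lam * p_a + (1 - lam) * p_b."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * p_a + (1.0 - lam) * p_b


def perplexity(probs_a, probs_b, lam):
    """Perplexity of the interpolated model over aligned per-word probabilities."""
    log_sum = sum(
        math.log(interpolate(a, b, lam)) for a, b in zip(probs_a, probs_b)
    )
    return math.exp(-log_sum / len(probs_a))


def best_weight(probs_a, probs_b, steps=20):
    """Grid-search the weight (hypothetical helper); endpoints 0 and 1 are skipped."""
    grid = [i / steps for i in range(1, steps)]
    return min(grid, key=lambda lam: perplexity(probs_a, probs_b, lam))


if __name__ == "__main__":
    # Made-up held-out probabilities from two hypothetical n-gram models.
    model_a = [0.20, 0.05, 0.10]
    model_b = [0.02, 0.30, 0.10]
    lam = best_weight(model_a, model_b)
    print(lam, perplexity(model_a, model_b, lam))
```

In practice the per-word probabilities would come from scoring a held-out text with each model; the sketch only shows how the weight enters the mixture.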

Relevant Toolkits

  • XenC: an open-source tool for data selection in Natural Language Processing.
  • GloVe: Global Vectors for Word Representation.
  • SRILM: an Extensible Language Modeling Toolkit.
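
As a rough illustration of one data selection criterion, the sketch below ranks candidate sentences by cross-entropy difference, a criterion that XenC supports. It is only a toy under stated assumptions: it uses add-one smoothed unigram models instead of real n-gram models, the function names are hypothetical, and it does not reflect how this repository's scripts or XenC are implemented. Sentences that look more like the in-domain text than the general text receive lower scores and are kept first.

```python
import math
from collections import Counter


def unigram_model(sentences):
    """Count words over whitespace-tokenized sentences; return (counts, total)."""
    counts = Counter(w for s in sentences for w in s.split())
    return counts, sum(counts.values())


def cross_entropy(sentence, model, vocab_size):
    """Per-word cross-entropy of a sentence under an add-one smoothed unigram model."""
    counts, total = model
    words = sentence.split()
    log_prob = sum(
        math.log((counts[w] + 1) / (total + vocab_size)) for w in words
    )
    return -log_prob / max(len(words), 1)


def select(in_domain, general, candidates, keep):
    """Keep the `keep` candidates with the lowest in-domain minus general cross-entropy."""
    vocab = {w for s in in_domain + general + candidates for w in s.split()}
    m_in = unigram_model(in_domain)
    m_gen = unigram_model(general)

    def score(sentence):
        return (cross_entropy(sentence, m_in, len(vocab))
                - cross_entropy(sentence, m_gen, len(vocab)))

    return sorted(candidates, key=score)[:keep]


if __name__ == "__main__":
    # Made-up toy corpora for illustration only.
    in_domain = ["call me tomorrow", "call the office"]
    general = ["stock prices fell", "the market fell today"]
    candidates = ["call me today", "stock market prices"]
    print(select(in_domain, general, candidates, keep=1))
```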

Contact

If you have any questions, please send an e-mail to [email protected].


For more information about the Kaldi Speech Recognition Toolkit, please see Kaldi's official GitHub repository.
