diff --git a/wiki/diphones.md b/wiki/diphones.md
index 027a99d..975fc54 100644
--- a/wiki/diphones.md
+++ b/wiki/diphones.md
@@ -5,6 +5,35 @@ A *diphone* is the last part of one phoneme followed by the first part of
 another. Either phoneme could be silence, and they can be the same phoneme.
 Diphthongs include diphones in them.
 
+There are two variants of the diphone alignment system in PocketSphinx.
+
+The first one is **synthetic**, which builds diphone units automatically
+from context-dependent phone units (triphones) after reading their
+definitions from the `mdef` file. More precisely, it works as follows:
+for each possible pair of base phones, it tries to find two triphones:
+one triphone that has the first base phone as main phone and the second
+base phone as RC (right context), and another triphone that has the
+first base phone as LC (left context) and the second base phone as
+main phone. Then it takes the last senone from the first found
+triphone and the first two senones from the second found triphone.
+Use the `-diphones=synthetic` command-line parameter to enable this variant.
+
+The second one is **trained**, which uses a pretrained acoustic model
+where diphones are defined as context-independent units.
+The model was trained on the "clean" subset of the
+[LibriSpeech](http://www.openslr.org/12/) ASR corpus and
+contains 899 diphones. It requires the dictionary to use diphone
+units as well. The version of the CMU Sphinx `en-us` dictionary
+with diphones was created with
+[this](https://github.com/akreal/diphones/blob/master/scripts/dict.py)
+script, and it is used automatically when the trained
+diphones acoustic model is chosen. Use the `-diphones=trained` command-line
+parameter to enable this variant.
+
+Additionally, you can use the `-diphones=yes` command-line parameter,
+which is currently an alias for the trained variant of
+the diphone alignment system.
+
 [This list of the top 4,800 words by frequency in English speech](http://ucrel.lancs.ac.uk/bncfreq/lists/2_2_spokenvwritten.txt)
 was used with [CMUDICT](http://www.speech.cs.cmu.edu/cgi-bin/cmudict)
 to create the
@@ -13,6 +42,7 @@ by approximate prevalence.
 
 ![diphones](/data/diphones.png)
 
+```
 UH_R 2.376%,
 AH_N 2.083%,
 T_SIL 1.863%,
@@ -1065,3 +1095,4 @@ ZH_V 0.003%,
 ZH_W 0.003%,
 ZH_Y 0.003%,
 ZH_Z 0.003%.
+```
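
The synthetic variant described in the added text can be illustrated with a small Python sketch. It is not the PocketSphinx implementation. The `Triphone` class, the `find_first` and `build_synthetic_diphones` helpers, and the senone IDs are all hypothetical. The sketch assumes three senones per triphone, ignores the word-position field that `mdef` entries carry, and skips any phone pair for which either triphone is missing, which the text above does not specify.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Triphone:
    """Hypothetical, simplified stand-in for one triphone entry from an mdef file."""
    base: str                       # main (base) phone
    lc: str                         # left-context phone
    rc: str                         # right-context phone
    senones: Tuple[int, int, int]   # senone IDs of the three HMM states


def find_first(triphones: Iterable[Triphone], base: str,
               lc: Optional[str] = None, rc: Optional[str] = None) -> Optional[Triphone]:
    """Return the first triphone matching the base phone and any given context."""
    for t in triphones:
        if t.base == base and lc in (None, t.lc) and rc in (None, t.rc):
            return t
    return None


def build_synthetic_diphones(base_phones: List[str],
                             triphones: List[Triphone]) -> Dict[str, Tuple[int, ...]]:
    """Map each diphone name (e.g. "AH_N") to a sequence of three senone IDs."""
    diphones: Dict[str, Tuple[int, ...]] = {}
    for first, second in product(base_phones, repeat=2):
        # Triphone with `first` as main phone and `second` as right context.
        left = find_first(triphones, base=first, rc=second)
        # Triphone with `second` as main phone and `first` as left context.
        right = find_first(triphones, base=second, lc=first)
        if left is None or right is None:
            continue  # assumption: pairs without both triphones are skipped
        # Last senone of the first triphone, then first two of the second.
        diphones[f"{first}_{second}"] = (left.senones[-1],) + right.senones[:2]
    return diphones


# Toy data; the senone IDs are made up for illustration.
toy_triphones = [
    Triphone("AH", "K", "N", (10, 11, 12)),
    Triphone("N", "AH", "T", (20, 21, 22)),
    Triphone("T", "N", "SIL", (30, 31, 32)),
]
toy_diphones = build_synthetic_diphones(["AH", "N", "T"], toy_triphones)
print(toy_diphones)
```

With this toy input, only `AH_N` and `N_T` have both required triphones, so only those two diphones are produced.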