Compiling system dictionaries

This document describes how to compile system dictionaries, assuming you are in the root directory of this repository.

You can compile system dictionaries from language resources in the MeCab format. The simplest way is to use publicly available resources such as UniDic or IPADIC.

Here, we use unidic-mecab-2.1.2 as an example.

$ wget "https://clrd.ninjal.ac.jp/unidic_archive/cwj/2.1.2/unidic-mecab-2.1.2_src.zip" -O "./unidic-mecab-2.1.2_src.zip" --no-check-certificate
$ unzip unidic-mecab-2.1.2_src.zip
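
Before compiling, it can help to confirm that the extracted directory contains the four input files the compile command expects. The following is a minimal sketch, not part of Vibrato: `check_resources` is a hypothetical helper name, and the file list is taken from the compile command in this document.

```shell
# Hypothetical helper (not provided by Vibrato): checks that a resource
# directory contains the four MeCab-format files used by the compile step.
# Prints each missing path to stderr and returns non-zero if any is absent.
check_resources() {
    dir="$1"
    missing=0
    for f in lex.csv matrix.def unk.def char.def; do
        if [ ! -f "$dir/$f" ]; then
            echo "missing: $dir/$f" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_resources unidic-mecab-2.1.2_src && echo "resources look complete"` should print the message only when all four files are present.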

To compile the system dictionary from the resource, run the following command.

$ cargo run --release -p compile -- \
    -l unidic-mecab-2.1.2_src/lex.csv \
    -m unidic-mecab-2.1.2_src/matrix.def \
    -u unidic-mecab-2.1.2_src/unk.def \
    -c unidic-mecab-2.1.2_src/char.def \
    -o system.dic.zst
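
As a quick sanity check on the result, you can inspect the first bytes of the output file. This sketch assumes, based only on the `.zst` extension, that the output is a Zstandard-compressed file, whose frames begin with the magic bytes `28 b5 2f fd`. `is_zstd` is a hypothetical helper name, not a Vibrato command.

```shell
# Hypothetical helper: returns success if the file starts with the
# Zstandard frame magic number (bytes 28 b5 2f fd).
# Assumption: the compiled dictionary is zstd-compressed, as the
# .zst extension suggests.
is_zstd() {
    magic=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
    [ "$magic" = "28b52ffd" ]
}
```

Running `is_zstd system.dic.zst && echo "looks like a zstd file"` after the compile step gives a cheap signal that the output was written; it does not validate the dictionary contents.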

Instead of using publicly available pre-trained resources, you can train parameters from your own corpus as described in train.md.

Accelerating your dictionaries

Vibrato supports editing your dictionary to achieve faster tokenization. See map.md.