Spell checking library for ZHFST/BHFST spellers, with case handling and tokenization support. (Spell checking derived from hfst-ospell)

Licenses: Apache-2.0 (LICENSE-APACHE), GPL-3.0 (LICENSE-GPL), MIT (LICENSE-MIT)
divvun/divvunspell

divvunspell

An implementation of hfst-ospell in Rust, with added features like tokenization, case handling, and parallelisation.

Building and installing commandline tools

# For the `divvunspell` binary:
cargo install divvunspell-bin

# For `thfst-tools` binary (most people can skip this one):
cargo install thfst-tools

# To build the development version from this source, cd into the relevant directory and:
cargo install --path .

Building with gpt2 support on macOS aarch64

(Skip this section unless you are experimenting with gpt2 support.)

Clone this repo, then:

brew install libtorch
LIBTORCH=/opt/homebrew/opt/libtorch cargo build --features gpt2 --bin divvunspell

No Rust?

curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
rustup default stable
cargo build --release

divvunspell

Usage:

Usage: divvunspell SUBCOMMAND [OPTIONS]

Optional arguments:
  -h, --help  print help message

Available subcommands:
  suggest   get suggestions for provided input
  tokenize  print input in word-separated tokenized form
  predict   predict next words using GPT2 model

$ divvunspell suggest -h
Usage: divvunspell suggest [OPTIONS]

Positional arguments:
  inputs                 words to be processed

Optional arguments:
  -h, --help             print help message
  -a, --archive ARCHIVE  BHFST or ZHFST archive to be used
  -S, --always-suggest   always show suggestions even if word is correct
  -w, --weight WEIGHT    maximum weight limit for suggestions
  -n, --nbest NBEST      maximum number of results
  --no-reweighting       disables reweighting algorithm (makes results more like hfst-ospell)
  --no-recase            disables recasing algorithm (makes results more like hfst-ospell)
  --json                 output in JSON format
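
As a rough illustration of how the options above combine, here is a minimal sketch. The archive name se.bhfst and the sample word are placeholders (assumptions, not files shipped with this repo), and the commands are skipped when the binary or archive is missing.

```shell
# Hypothetical archive path; point ARCHIVE at a real BHFST or ZHFST speller archive.
ARCHIVE="${ARCHIVE:-se.bhfst}"

if command -v divvunspell >/dev/null 2>&1 && [ -f "$ARCHIVE" ]; then
  # At most 5 suggestions for one word, printed as JSON.
  divvunspell suggest --archive "$ARCHIVE" --nbest 5 --json "sámegiella"
  # Show suggestions even if the word is accepted as correct.
  divvunspell suggest -a "$ARCHIVE" -S "sámegiella"
  SUGGEST_STATUS=ran
else
  echo "divvunspell or $ARCHIVE not available; skipping" >&2
  SUGGEST_STATUS=skipped
fi
```

Adding --no-reweighting and --no-recase to either command should give results closer to hfst-ospell, as described in the help text.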

To debug divvunspell's behaviour, enable Rust's logging by setting the environment variable RUST_LOG=trace on the command line.
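
For example, the variable can be set for a single invocation by prefixing the command. This is a sketch only: the archive name and the input word are placeholders, and the command is skipped if divvunspell or the archive is not present.

```shell
# Hypothetical archive path; replace with a real speller archive.
ARCHIVE="${ARCHIVE:-se.bhfst}"

if command -v divvunspell >/dev/null 2>&1 && [ -f "$ARCHIVE" ]; then
  # RUST_LOG applies to this one command only; it is not exported to the shell.
  RUST_LOG=trace divvunspell suggest --archive "$ARCHIVE" "sámegiella"
  TRACE_STATUS=ran
else
  echo "divvunspell or $ARCHIVE not available; skipping" >&2
  TRACE_STATUS=skipped
fi
```

Use `export RUST_LOG=trace` instead if every later command in the session should log at trace level.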

accuracy

Building:

cd accuracy/
cargo install --path .

The resulting binary, accuracy, is placed in $HOME/.cargo/bin/; make sure that directory is on your PATH.
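
One way to check this in a POSIX shell is sketched below. It only changes PATH for the current session; making it permanent depends on your shell's startup files, which are not covered here.

```shell
# Add Cargo's bin directory to PATH for this session if it is not already there.
case ":$PATH:" in
  *":$HOME/.cargo/bin:"*) ;;
  *) export PATH="$HOME/.cargo/bin:$PATH" ;;
esac

# Print where the accuracy binary resolves, or a note if it is not installed.
command -v accuracy || echo "accuracy not found on PATH" >&2
```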

Usage:

divvunspell-accuracy 1.0.0-beta.1
Accuracy testing for DivvunSpell.

USAGE:
    accuracy [OPTIONS] [ARGS]

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

OPTIONS:
    -c <config>             Provide JSON config file to override test defaults
    -o <JSON-OUTPUT>        The file path for the JSON report output
    -w <max-words>          Truncate typos list to max number of words specified
    -t <TSV-OUTPUT>         The file path for the TSV line append

ARGS:
    <WORDS>    The 'input -> expected' list in tab-delimited value file (TSV)
    <ZHFST>    Use the given ZHFST file

thfst-tools

Convert hfst and zhfst files to thfst and bhfst formats.

  • thfst: byte-aligned hfst for fast and efficient loading and memory mapping, required to run divvunspell on ARM processors
  • bhfst: thfst files wrapped in a box container; in the case of zhfst files converted to bhfst, the metadata file (index.xml in the zhfst archive) is converted to a json file for faster and leaner processing by the divvunspell library.

Usage:

thfst-tools 1.0.0-alpha.5
Tromsø-Helsinki Finite State Transducer toolkit.

USAGE:
    thfst-tools <SUBCOMMAND>

FLAGS:
    -h, --help       Prints help information
    -V, --version    Prints version information

SUBCOMMANDS:
    bhfst-info         Print metadata for BHFST
    help               Prints this message or the help of the given subcommand(s)
    hfst-to-thfst      Convert an HFST file to THFST
    thfsts-to-bhfst    Convert a THFST acceptor/errmodel pair to BHFST
    zhfst-to-bhfst     Convert a ZHFST file to BHFST
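
The arguments of each conversion subcommand are not listed above, so rather than guess them, this sketch uses the help subcommand from the listing to print them. It is skipped if thfst-tools is not installed.

```shell
if command -v thfst-tools >/dev/null 2>&1; then
  # Print the detailed help for each conversion subcommand.
  for sub in hfst-to-thfst thfsts-to-bhfst zhfst-to-bhfst; do
    thfst-tools help "$sub"
  done
  TOOLS_STATUS=ran
else
  echo "thfst-tools not installed; skipping" >&2
  TOOLS_STATUS=skipped
fi
```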

Speller testing

There's a prototype-level testing tool in support/accuracy-viewer. Use it like:

accuracy -o support/accuracy-viewer/public/report.json typos.txt sma.zhfst
cd support/accuracy-viewer
npm i && npm run dev

View the report at http://localhost:5000.

typos.txt is a TSV file with typos in the first column and the expected correction in the second. For more information, run accuracy --help.
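
A small typos.txt in that two-column layout can be produced as in the sketch below. The word pairs are made-up English placeholders, not real test data for any speller, and the awk check is just a convenience that is not part of the accuracy tool.

```shell
# Write three lines of the form: typo<TAB>expected correction.
# Placeholder word pairs for illustration only.
printf '%s\t%s\n' \
  'teh' 'the' \
  'recieve' 'receive' \
  'adress' 'address' > typos.txt

# Sanity check: every line must have exactly two tab-separated columns.
awk -F'\t' 'NF != 2 { bad = 1 } END { if (bad) exit 1 }' typos.txt \
  && echo "typos.txt looks well-formed"
```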

License

The crate divvunspell is licensed under either of

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT license (LICENSE-MIT)

at your option.

The divvunspell, thfst-tools and accuracy binaries are licensed under the GPL version 3 license.

More docs?

We have a GitHub Pages site for divvunspell with some more technical docs (WIP).
