Support for continuous script languages #33

1. The original repo does not make `min_term_length` configurable (see [this issue](https://github.com/LIAAD/yake/issues/68)); it is hardcoded as 3 [here](https://github.com/LIAAD/yake/blob/master/yake/datarepresentation.py#L159). We need it to be configurable, especially for continuous script languages such as Japanese, Chinese and Thai, where a single character can represent a whole word.

2. Japanese has words that end with the prolonged sound mark `ー`, such as `コンピューター` (computer), `テレビー` (television), `ツイッター` (twitter) and so on. Yake uses [web_tokenizer() from segtok.tokenizer](https://github.com/showheroes/yake/blob/master/yake/datarepresentation.py#L50), which splits off the trailing `ー`. So for continuous script languages we tokenize the text explicitly before passing it to Yake, and add a condition under which Yake's tokenizer splits words only on spaces.

- [Example solution](https://github.com/showheroes/yake/pull/1) in the LIAAD-forked repo
- https://github.com/LIAAD/yake/issues/87
- https://github.com/LIAAD/yake/issues/40
- https://github.com/LIAAD/yake/issues/41
- https://github.com/LIAAD/yake/issues/58
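
The two changes described above can be illustrated with a minimal sketch. It is not Yake's code: the function names `split_on_whitespace` and `candidate_terms` are hypothetical, and the input is assumed to have already been segmented by an external tokenizer that joined the words with spaces. It only shows why a configurable minimum term length and whitespace-only splitting matter for text like this.

```python
def split_on_whitespace(pre_tokenized_text):
    """Split already-segmented text on whitespace only.

    Hypothetical helper: no punctuation or symbol handling is applied,
    so a trailing prolonged sound mark stays attached to its word.
    """
    return pre_tokenized_text.split()


def candidate_terms(pre_tokenized_text, min_term_length=3):
    """Keep tokens whose length is at least min_term_length.

    Hypothetical filter; the default of 3 mirrors the hardcoded value
    mentioned in the issue, but this is not Yake's actual logic.
    """
    tokens = split_on_whitespace(pre_tokenized_text)
    return [t for t in tokens if len(t) >= min_term_length]


# Assumed output of an external Japanese tokenizer, joined with spaces.
segmented = "猫 が 魚 を 食べる"

# With a threshold of 3, the single-character words are dropped.
print(candidate_terms(segmented, min_term_length=3))

# With a threshold of 1, every word survives.
print(candidate_terms(segmented, min_term_length=1))

# The trailing prolonged sound mark is preserved by whitespace-only splitting.
print(split_on_whitespace("新しい コンピューター"))
```

With a fixed threshold of 3, `猫` (cat) and `魚` (fish) can never become candidates, which is the motivation for making the value configurable.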
