- The original repo does not provide a configurable min_term_length (see this issue); the value is hardcoded as 3 here. We need it to be configurable, especially for continuous-script languages such as Japanese, Chinese and Thai, where a single character can represent a whole word.
- Japanese has words that end with the ー symbol, like コンピューター (computer), テレビー (television), ツイッター (twitter) and so on. Yake uses web_tokenizer() from segtok.tokenizer, which splits off the trailing ー. So for continuous-script languages we tokenize the text explicitly before passing it to Yake, and introduce a condition under which the tokenizer in Yake only splits words on spaces.
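
A minimal sketch of the two changes described above, written without Yake or segtok so it runs on its own. The function names `split_on_spaces` and `filter_by_length` are hypothetical and are not part of Yake's API. The sketch also assumes that the input has already been segmented by an external tokenizer and joined with spaces, and that terms shorter than the threshold are simply discarded, which may differ from how Yake actually handles them.

```python
def split_on_spaces(pre_tokenized_text):
    """Split text that was already segmented upstream, on whitespace only.

    No further splitting is applied, so a trailing "ー" stays attached
    to its word.
    """
    return pre_tokenized_text.split()


def filter_by_length(tokens, min_term_length=3):
    """Keep tokens with at least min_term_length characters.

    The default of 3 mirrors the hardcoded value mentioned in this issue;
    dropping shorter tokens is an assumption made for illustration.
    """
    return [token for token in tokens if len(token) >= min_term_length]


# Hypothetical input: Japanese text segmented beforehand and joined with spaces.
text = "新しい コンピューター を 買う"
tokens = split_on_spaces(text)

# With the hardcoded threshold, the short words "を" and "買う" are lost.
default_terms = filter_by_length(tokens)

# With a configurable threshold of 1, every segmented word survives.
all_terms = filter_by_length(tokens, min_term_length=1)
```

With a threshold of 1, single-character words remain candidates, which is the behaviour requested for continuous-script languages.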