fix: Words ending with 's' are incorrectly tokenized. #9310
+65
−6
What problem does this PR solve?
Issue:
English words such as `has` and `Doris` are currently stemmed incorrectly, to `ha` and `dori`.
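A minimal reproduction with NLTK's Porter stemmer (the stemmer choice is an assumption for illustration; any plain suffix-stripping stemmer behaves the same way):

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("has"))    # -> "ha": the trailing "s" is stripped blindly
print(stemmer.stem("Doris"))  # -> "dori": proper nouns suffer the same fate
```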
Change:
The stemmer performs rule-based suffix stripping without checking whether the result is a real word. The lemmatizer, in contrast, uses part-of-speech information to produce accurate base forms. Since only one of the two is needed, I have removed the stemmer in favor of the lemmatizer.
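For illustration, a minimal sketch of the POS-aware approach (the helper `penn_to_wordnet` and the sample sentence are mine, not from the PR; the actual code may differ):

```python
from nltk import pos_tag, word_tokenize
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the WordNet POS the lemmatizer expects."""
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN  # default to noun for everything else

lemmatizer = WordNetLemmatizer()
# pos_tag needs the 'averaged_perceptron_tagger_eng' data mentioned below.
tokens = word_tokenize("Doris has two cats")
print([lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in pos_tag(tokens)])
# ['Doris', 'have', 'two', 'cat'] -- "has" and "Doris" survive intact
```

Because the lemmatizer only returns forms it can verify against WordNet, unknown words like `Doris` pass through unchanged instead of being mangled.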
This update requires adding the files downloaded via `nltk.download('averaged_perceptron_tagger_eng')` to the Docker image, placing them under the `/root/nltk_data/taggers` directory.
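A sketch of the corresponding build-time step (the exact Dockerfile wiring is up to the image; the `download_dir` argument below is an assumption matching the path above):

```python
import nltk

# Run once at image build time; NLTK unpacks the model under
# /root/nltk_data/taggers/, where pos_tag finds it at runtime.
nltk.download("averaged_perceptron_tagger_eng", download_dir="/root/nltk_data")
```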
Type of change