
fix: Words ending with 's' are incorrectly tokenized. #9310


Open

liuzhenghua wants to merge 1 commit into main

Conversation

liuzhenghua
Contributor

What problem does this PR solve?

Issue:
English words such as "has" and "Doris" are currently reduced incorrectly to "ha" and "dori".

Change:
The stemmer performs rule-based stemming without validating whether the resulting word is actually correct. In contrast, the lemmatizer uses part-of-speech information to produce accurate base forms. Since only one of them is needed, I have removed the stemmer in favor of the lemmatizer.
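
For illustration, a minimal sketch of the difference using plain NLTK (not the project's actual tokenizer code); it assumes the wordnet and averaged_perceptron_tagger_eng data packages are installed, and the word list and helper function are just examples:

```python
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag from pos_tag() to a WordNet POS constant."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

for word, tag in pos_tag(["Doris", "has"]):
    print(word,
          stemmer.stem(word),                                       # rule-based: dori, ha
          lemmatizer.lemmatize(word.lower(), to_wordnet_pos(tag)))  # POS-aware: doris, have
```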

This update requires adding the files downloaded via nltk.download('averaged_perceptron_tagger_eng') to the Docker image, placing them under the /root/nltk_data/taggers directory.
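
For reference, a sketch of that download step (the exact wiring in the Dockerfile and download_deps.py may differ; the target path is the one mentioned above):

```python
import nltk

# Fetch the English POS-tagger model used by nltk.pos_tag(); NLTK unpacks it
# into <download_dir>/taggers/averaged_perceptron_tagger_eng/.
nltk.download("averaged_perceptron_tagger_eng", download_dir="/root/nltk_data")
```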


Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@dosubot (bot) added the size:M (This PR changes 30-99 lines, ignoring generated files) and 🐞 bug (Something isn't working; pull request that fixes a bug) labels on Aug 7, 2025
@whhe
Contributor

whhe commented Aug 19, 2025

Besides the Dockerfile (I'm not sure about Dockerfile.scratch.oc9), download_deps.py also needs to be updated.
Also, the code comments here should be in English, to stay consistent with the rest of the codebase.

And I have another question: what would happen in tokenize if all other languages, such as Japanese and French, were treated as English?

@liuzhenghua
Contributor Author

Besides the Dockerfile (I'm not sure about Dockerfile.scratch.oc9), download_deps.py also needs to be updated. Also, the code comments here should be in English, to stay consistent with the rest of the codebase.

And I have another question: what would happen in tokenize if all other languages, such as Japanese and French, were treated as English?

The WordNetLemmatizer only supports the English lexicon. For other languages, it simply returns the original word. To support French, you need to use spaCy.
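
For example (a quick sketch, not code from this PR; the non-English words are arbitrary samples):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("has", pos="v"))        # -> have (English, found in WordNet)
print(lemmatizer.lemmatize("mangées", pos="v"))    # -> mangées (French, returned unchanged)
print(lemmatizer.lemmatize("食べました", pos="n"))   # -> 食べました (Japanese, returned unchanged)
```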

@whhe
Contributor

whhe commented Aug 19, 2025

Besides the Dockerfile (I'm not sure about Dockerfile.scratch.oc9), download_deps.py also needs to be updated. Also, the code comments here should be in English, to stay consistent with the rest of the codebase.
And I have another question: what would happen in tokenize if all other languages, such as Japanese and French, were treated as English?

The WordNetLemmatizer only supports the English lexicon. For other languages, it simply returns the original word. To support French, you need to use spaCy.

Thanks for your reply. So in fact the current cross-language search only supports Chinese and English, right? That's a bit different from what I expected.
