
fix: Words ending with 's' are incorrectly tokenized. #9310


Open

liuzhenghua wants to merge 1 commit into main

Conversation

liuzhenghua
Contributor

What problem does this PR solve?

Issue:
English words such as "has" and "Doris" are currently reduced incorrectly to "ha" and "dori".

Change:
The stemmer performs rule-based stemming without validating whether the resulting word is actually correct. In contrast, the lemmatizer uses part-of-speech information to produce accurate base forms. Since only one of them is needed, I have removed the stemmer in favor of the lemmatizer.
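
For illustration, a minimal sketch of the difference using plain NLTK (not the project's actual tokenizer code); it assumes the wordnet and averaged_perceptron_tagger_eng data packages are installed, and the word list and helper function are just examples:

```python
from nltk import pos_tag
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def to_wordnet_pos(treebank_tag: str) -> str:
    """Map a Penn Treebank tag from pos_tag() to a WordNet POS constant."""
    if treebank_tag.startswith("J"):
        return wordnet.ADJ
    if treebank_tag.startswith("V"):
        return wordnet.VERB
    if treebank_tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

for word, tag in pos_tag(["Doris", "has"]):
    print(word,
          stemmer.stem(word),                                       # rule-based: dori, ha
          lemmatizer.lemmatize(word.lower(), to_wordnet_pos(tag)))  # POS-aware: doris, have
```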

This update requires adding the files downloaded via nltk.download('averaged_perceptron_tagger_eng') to the Docker image, placing them under the /root/nltk_data/taggers directory.
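
For reference, a sketch of that download step (the exact wiring in the Dockerfile and download_deps.py may differ; the target path is the one mentioned above):

```python
import nltk

# Fetch the English POS-tagger model used by nltk.pos_tag(); NLTK unpacks it
# into <download_dir>/taggers/averaged_perceptron_tagger_eng/.
nltk.download("averaged_perceptron_tagger_eng", download_dir="/root/nltk_data")
```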


Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

@dosubot (bot) added the size:M (This PR changes 30-99 lines, ignoring generated files) and 🐞 bug (Something isn't working; pull request that fixes a bug) labels on Aug 7, 2025
@whhe
Contributor

whhe commented Aug 19, 2025

Besides the Dockerfile (I'm not sure about Dockerfile.scratch.oc9), download_deps.py also needs to be updated.
Also, the code comments here should be in English, to stay consistent with the rest of the codebase.

And I have another question: what would happen in tokenize if all other languages, such as Japanese and French, were treated as English?

@liuzhenghua
Contributor Author

Besides the Dockerfile (I'm not sure about Dockerfile.scratch.oc9), download_deps.py also needs to be updated. Also, the code comments here should be in English, to stay consistent with the rest of the codebase.

And I have another question: what would happen in tokenize if all other languages, such as Japanese and French, were treated as English?

The WordNetLemmatizer only supports the English lexicon. For other languages, it simply returns the original word. To support French, you need to use spaCy.
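
For example (a quick sketch, not code from this PR; the non-English words are arbitrary samples):

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("has", pos="v"))        # -> have (English, found in WordNet)
print(lemmatizer.lemmatize("mangées", pos="v"))    # -> mangées (French, returned unchanged)
print(lemmatizer.lemmatize("食べました", pos="n"))   # -> 食べました (Japanese, returned unchanged)
```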

@whhe
Contributor

whhe commented Aug 19, 2025

Besides the Dockerfile (I'm not sure about Dockerfile.scratch.oc9), download_deps.py also needs to be updated. Also, the code comments here should be in English, to stay consistent with the rest of the codebase.
And I have another question: what would happen in tokenize if all other languages, such as Japanese and French, were treated as English?

The WordNetLemmatizer only supports the English lexicon. For other languages, it simply returns the original word. To support French, you need to use spaCy.

Thanks for your reply. So in fact the current cross-language search only supports Chinese and English, right? That's a bit different from what I expected.
