[BUG] BM25 does not work when multithreading #5993
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving:
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
Fix BM25 stemmer thread-safety by avoiding shared instances

This PR resolves the concurrency bug in `ChromaBM25EmbeddingFunction`.

Key Changes
• Replaced the cached stemmer with a fresh instance created on each call.

Affected Areas
• chromadb/utils/embedding_functions/chroma_bm25_embedding_function.py

This summary was automatically generated by @propel-code-bot
```python
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)
```
[Maintainability] This change correctly addresses the thread-safety issue. To prevent future regressions where a developer might see this as a performance issue and try to 'optimize' it by re-introducing caching, it would be beneficial to add a comment explaining why a new stemmer and tokenizer are created on each call. This documents the non-obvious requirement that snowballstemmer instances are not thread-safe.
```suggestion
# A new stemmer and tokenizer are created for each call because the underlying
# snowballstemmer is not thread-safe and cannot be shared across threads.
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)
```
File: chromadb/utils/embedding_functions/chroma_bm25_embedding_function.py, line 84
In chromadb/utils/embedding_functions/schemas/bm25_tokenizer.py:

```python
@lru_cache(maxsize=1)
def get_english_stemmer() -> SnowballStemmer:
```
Can we make this a thread-local / context? Creation time might be low, but this is a memory leak over time.
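For reference, a thread-local variant along the lines suggested here could look as follows. This is a hypothetical sketch, not code from the PR: `ExpensiveStemmer` is a stand-in for the real snowballstemmer object, and only the function name mirrors the `get_english_stemmer` in the diff above.

```python
import threading

class ExpensiveStemmer:
    """Stand-in for snowballstemmer's English stemmer (assumed here to be
    costly enough to be worth caching per thread)."""
    def stemWord(self, token: str) -> str:
        return token.rstrip("s")  # toy stemming rule, for illustration only

_local = threading.local()

def get_english_stemmer() -> ExpensiveStemmer:
    # Each thread lazily creates and then reuses its own instance, so no
    # stemmer is ever shared across threads.
    stemmer = getattr(_local, "stemmer", None)
    if stemmer is None:
        stemmer = _local.stemmer = ExpensiveStemmer()
    return stemmer
```

The trade-off raised later in this thread applies: each per-thread instance stays alive for as long as its thread does.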
I'm not clear on how my current implementation will result in a memory leak. In my implementation, I believe the stemmer and tokenizer will go out of scope and be garbage-collected normally.
If I were to use threading.local, I suspect it might actually result in a minor memory leak: threading.local values are only cleaned up when their respective threads die, so if the threads never die (e.g. the user keeps long-lived worker threads), the threading.local values will never be cleaned up (depending on the implementation).
Oh, apologies, I misread the code. Sure, it's fine to keep it alive for the scope of execution. Although I don't follow how a well-formed thread-local is a memory leak; that does not make sense to me.
Let me reframe my argument: it's not that the threading.local values are truly uncollectible. The memory will be cleaned up if the threads die, but if the threads are long-lived, the memory is unlikely to shrink, and it behaves much like a leak. I don't think this is entirely desirable.
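The retention behavior argued here can be sketched with a toy example (illustrative only; `Cache` is a stand-in for a per-thread cached stemmer). A weakref lets us observe when the thread-local value actually becomes collectible:

```python
import gc
import threading
import weakref

class Cache:
    """Stand-in for an object cached in thread-local storage."""

local = threading.local()
ref_holder = {}
cached = threading.Event()
stop = threading.Event()

def worker() -> None:
    # The worker stashes an object in its thread-local storage and then
    # stays alive, as a long-lived pool thread would.
    local.value = Cache()
    ref_holder["ref"] = weakref.ref(local.value)
    cached.set()
    stop.wait()

t = threading.Thread(target=worker)
t.start()
cached.wait()

gc.collect()
# While the thread is alive, its thread-local value stays reachable.
alive_while_thread_runs = ref_holder["ref"]() is not None

stop.set()
t.join()
gc.collect()
# Once the thread dies, CPython releases its thread-local storage.
collected_after_thread_dies = ref_holder["ref"]() is None
```

If `stop` were never set, as with worker threads that live for the life of the process, the cached object would stay reachable indefinitely, which is the "behaves like a leak" growth pattern described above.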
We disagree on what a leak is in this context; I don't consider that a leak if done in a reasonable way. Anyway, discussing this further seems unnecessary, as I already mentioned that acquire/release within the scope is fine here and the object is lightweight. I simply misread your diff.

closes #5969
`ChromaBM25EmbeddingFunction` uses `snowballstemmer` to tokenize. Currently, all invocations of `ChromaBM25EmbeddingFunction` share the same `snowballstemmer` instance because it's created in `__init__` and wrapped in a `@cache`.

The issue is that `snowballstemmer` instances are not thread-safe. The tokenizer has internal state; if you use the same `snowballstemmer` object to tokenize concurrently, it will break.

Claude wrote me a small benchmark to measure how impactful it is to instantiate a new `snowballstemmer`. It's insignificant enough that I think it's reasonable to instantiate a new object per call.
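The failure mode can be illustrated with a toy stateful stemmer (a stand-in, not the real snowballstemmer). A barrier forces the thread interleaving that under real load happens only intermittently: both threads write the shared internal state before either reads it back.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class ToyStemmer:
    """Toy stand-in for snowballstemmer: it keeps internal state between
    processing steps, which is exactly what makes sharing one instance
    across threads unsafe."""
    def __init__(self):
        self.current = ""

    def stem(self, token, sync=None):
        self.current = token              # step 1: write internal state
        if sync is not None:
            sync.wait()                   # force both threads to interleave here
        return self.current.rstrip("s")   # step 2: read internal state back

words = ["cats", "dogs"]

# Unsafe: one shared instance. The barrier parks both threads between the
# write and the read, so each reads whichever write happened last.
shared = ToyStemmer()
barrier = threading.Barrier(2, timeout=5)
with ThreadPoolExecutor(max_workers=2) as pool:
    unsafe = list(pool.map(lambda w: shared.stem(w, barrier), words))

# Safe (the approach this PR takes): a fresh instance per call.
with ThreadPoolExecutor(max_workers=2) as pool:
    safe = list(pool.map(lambda w: ToyStemmer().stem(w), words))

# safe == ["cat", "dog"]; both unsafe results are identical, so at
# least one of them is wrong.
```

The safe variant pays the cost of one construction per call, which mirrors the benchmark conclusion above that instantiation overhead is insignificant.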
Testing
I created a new test for BM25EF. It does not pass on `main`, but it passes with my changes.
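The test body isn't shown here, but a concurrency regression test of this shape is a common pattern (a sketch; the helper name and parameters are mine, not the actual test): run the function serially to get a baseline, then require that concurrent runs over the same inputs produce identical results.

```python
from concurrent.futures import ThreadPoolExecutor

def assert_matches_serial(fn, inputs, workers=8, rounds=20):
    """Run fn over inputs both serially and concurrently and require
    identical results. Races are nondeterministic, so repeat a few rounds
    to raise the chance of catching an interleaving that corrupts state."""
    expected = [fn(x) for x in inputs]
    for _ in range(rounds):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            got = list(pool.map(fn, inputs))
        assert got == expected, "results diverged under concurrency"
```

In the PR's test, `fn` would presumably wrap the BM25 embedding call: on `main` the shared stemmer makes concurrent results diverge from the serial baseline, while with this change they match.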