
Conversation

@kylediaz (Contributor) commented Dec 9, 2025

closes #5969

ChromaBM25EmbeddingFunction uses snowballstemmer as part of its tokenizer. Currently, all invocations of ChromaBM25EmbeddingFunction share the same snowballstemmer instance, because the factory that creates it is memoized behind a cache decorator.

The issue is that snowballstemmer instances are not thread-safe: the stemmer mutates internal state while processing a word. If the same snowballstemmer object is used to tokenize concurrently, the results get corrupted.

class EnglishStemmer(BaseStemmer):
    '''
    This class implements the stemming algorithm defined by a snowball script.
    Generated from english.sbl by Snowball 3.0.1 - https://snowballstem.org/
    '''

    g_aeo = {u"a", u"e", u"o"}

    g_v = {u"a", u"e", u"i", u"o", u"u", u"y"}

    g_v_WXY = {u"a", u"e", u"i", u"o", u"u", u"y", u"w", u"x", u"Y"}

    g_valid_LI = {u"c", u"d", u"e", u"g", u"h", u"k", u"m", u"n", u"r", u"t"}

    B_Y_found = False
    I_p2 = 0
    I_p1 = 0

    def __r_prelude(self):
        self.B_Y_found = False # <-- mutates state
 ...
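The failure mode can be sketched with a stdlib-only toy. ToyStemmer and its two-step API below are invented for illustration and are not the real snowballstemmer interface; the point is that any object stashing per-call state in instance attributes gives wrong answers when two calls interleave, which is exactly what concurrent threads do.

```python
class ToyStemmer:
    """Strips a trailing 's', stashing per-call state in an instance attribute."""
    def __init__(self):
        self.current = ""  # mutable per-call state, analogous to B_Y_found above

    def set_word(self, word):
        self.current = word  # first step of a "call"

    def finish(self):
        # second step: reads whatever self.current holds *now*
        return self.current[:-1] if self.current.endswith("s") else self.current

stemmer = ToyStemmer()

# Sequential use is fine:
stemmer.set_word("cats")
assert stemmer.finish() == "cat"

# Interleaved use, as two threads could produce, corrupts the first call:
stemmer.set_word("cats")          # thread A starts its call
stemmer.set_word("dogs")          # thread B starts, clobbering A's state
result_for_a = stemmer.finish()   # thread A finishes with B's word
print(result_for_a)               # prints "dog", not "cat"
```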

Claude wrote me a small benchmark to measure how impactful it is to instantiate a new snowballstemmer.

Benchmarking Stemmer Creation Time

1. Direct _SnowballStemmerAdapter creation:
   Mean: 0.0004 ms
   Median: 0.0003 ms
   Min: 0.0002 ms
   Max: 0.0096 ms
   Std Dev: 0.0009 ms

The cost is insignificant enough that I think it's reasonable to instantiate a new object per call.
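A micro-benchmark of this shape can be reproduced with only the stdlib; StubAdapter below is a stand-in for the real _SnowballStemmerAdapter, so the absolute numbers will differ from the table above:

```python
import statistics
import time

class StubAdapter:
    """Stand-in for _SnowballStemmerAdapter (which wraps a snowballstemmer)."""
    def __init__(self):
        self.language = "english"

def bench_creation(n=10_000):
    samples_ms = []
    for _ in range(n):
        t0 = time.perf_counter()
        StubAdapter()
        samples_ms.append((time.perf_counter() - t0) * 1000)  # elapsed, in ms
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
        "stdev": statistics.stdev(samples_ms),
    }

stats = bench_creation()
print(f"Mean: {stats['mean']:.4f} ms  Median: {stats['median']:.4f} ms")
```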

Testing

I created a new test for BM25EF. It does not pass on main, but it passes with my changes.
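A test of this kind typically follows one pattern, sketched here with a pure stand-in function rather than the real ChromaBm25EmbeddingFunction: compute a single-threaded baseline, then drive the same callable from a ThreadPoolExecutor and assert every result matches.

```python
from concurrent.futures import ThreadPoolExecutor

def embed(text):
    # Stand-in for embedding one document; it keeps only per-call state,
    # so it is safe to call from many threads at once.
    return [hash(tok) % 1000 for tok in text.split()]

texts = [f"the quick brown fox {i}" for i in range(100)]
expected = [embed(t) for t in texts]  # single-threaded baseline

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(embed, texts))

# If shared mutable state were corrupted across threads, some result
# would diverge from the baseline.
assert results == expected
print("multithreaded results match the single-threaded baseline")
```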

github-actions bot commented Dec 9, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (readability, modularity, intuitiveness)?


@kylediaz kylediaz marked this pull request as ready for review December 9, 2025 01:03
propel-code-bot (Contributor) commented Dec 9, 2025

Fix BM25 stemmer thread-safety by avoiding shared instances

This PR resolves the concurrency bug in ChromaBm25EmbeddingFunction by ensuring each call constructs its own SnowballStemmer-backed Bm25Tokenizer, eliminating the shared, non-thread-safe stemmer that previously lived in @cache. It also adds a regression test that drives the embedder through a ThreadPoolExecutor to verify embeddings remain consistent when invoked from multiple threads.

Key Changes

• Replaced the cached get_english_stemmer() usage with per-call instantiation inside ChromaBm25EmbeddingFunction._encode, while storing the resolved stopword list in self._stopword_list.
• Removed the @lru_cache decorator from get_english_stemmer in bm25_tokenizer.py, so each invocation yields a fresh _SnowballStemmerAdapter.
• Added test_multithreaded_usage in chromadb/test/ef/test_chroma_bm25_embedding_function.py that exercises BM25 embedding across multiple threads and asserts result integrity.
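The shape of the fix can be sketched in isolation (the function bodies below are stand-ins; only the caching behavior matters here): dropping the memoization means no instance is ever shared between threads.

```python
from functools import lru_cache

# Before: one memoized instance for the whole process, shared across threads.
@lru_cache(maxsize=1)
def get_english_stemmer_cached():
    return object()  # stand-in for constructing a _SnowballStemmerAdapter

# After: a fresh instance per invocation, so nothing is shared between threads.
def get_english_stemmer():
    return object()

a, b = get_english_stemmer_cached(), get_english_stemmer_cached()
c, d = get_english_stemmer(), get_english_stemmer()
assert a is b      # cached: the same shared object every time
assert c is not d  # per-call: a distinct object every time
```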

Affected Areas

• chromadb/utils/embedding_functions/chroma_bm25_embedding_function.py
• chromadb/utils/embedding_functions/schemas/bm25_tokenizer.py
• chromadb/test/ef/test_chroma_bm25_embedding_function.py

This summary was automatically generated by @propel-code-bot

Comment on lines +82 to +84
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)

[Maintainability] This change correctly addresses the thread-safety issue. To prevent future regressions where a developer might see this as a performance issue and try to 'optimize' it by re-introducing caching, it would be beneficial to add a comment explaining why a new stemmer and tokenizer are created on each call. This documents the non-obvious requirement that snowballstemmer instances are not thread-safe.

Suggested change
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)
# A new stemmer and tokenizer are created for each call because the underlying
# snowballstemmer is not thread-safe and cannot be shared across threads.
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)



@lru_cache(maxsize=1)
def get_english_stemmer() -> SnowballStemmer:
@HammadB (Collaborator) commented Dec 9, 2025

Can we make this a threadlocal / context? Creation time might be low, but this is a memory leak over time.

@kylediaz (Contributor, Author) replied:

I'm not clear on how my current implementation would result in a memory leak. In my implementation, I believe the stemmer and tokenizer go out of scope after each call and are garbage-collected normally.

If I were to use threading.local, I suspect it might actually cause a minor memory leak: threading.local values are only cleaned up when their owning threads die, so if the threads never die (e.g. the user relies on long-lived worker threads), the threading.local values are never reclaimed (depending on the implementation).

@HammadB (Collaborator) commented Dec 9, 2025 via email

@kylediaz (Contributor, Author) commented Dec 9, 2025

> Oh apologies, I misread the code. Sure, that's fine to keep it alive for the scope of execution. Although I don't follow how well-formed thread local usage is a memory leak. That does not make sense to me.

Here's an example:

from concurrent.futures import ThreadPoolExecutor

denseEf = ChromaBm25EmbeddingFunction()

def handler():
    denseEf.embedSomething()  # this creates a new threading.local value per thread

executor = ThreadPoolExecutor(max_workers=100)
for _ in range(1000):
    executor.submit(handler)

...

# The executor keeps its worker threads alive, so all the threading.local
# values held by denseEf stay reserved.

Let me reframe my argument: it's not that the threading.local values are truly uncollectible. The memory will be reclaimed if the threads die, but if the threads are long-lived, the footprint is unlikely to shrink, so it behaves like a leak. I don't think that's entirely desirable.
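Both positions are visible in a small stdlib sketch (the stemmer here is just a placeholder object): the threading.local values are bounded at one per worker thread, but each one persists for its worker's lifetime rather than per call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()
created_in = set()  # idents of worker threads that built a "stemmer"
lock = threading.Lock()

def handler():
    if not hasattr(local, "stemmer"):
        local.stemmer = object()  # stand-in for a per-thread stemmer
        with lock:
            created_in.add(threading.get_ident())

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(handler) for _ in range(100)]
    for f in futures:
        f.result()
    # At most one stemmer exists per worker thread, regardless of task count,
    # and each stays alive as long as its worker thread does.
    assert 1 <= len(created_in) <= 4

# Only when the executor shuts down and its threads die are the locals freed.
print(f"stemmers created: {len(created_in)} (workers: 4, tasks: 100)")
```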

@HammadB (Collaborator) commented Dec 9, 2025

We disagree on what a leak is in this context; I don't consider that a leak if done in a reasonable way.

Anyway, discussing this further seems unnecessary, since I already said that acquire/release within the scope of the call is fine here and the object is lightweight.

I simply misread your diff.

@kylediaz (Contributor, Author) commented Dec 13, 2025

Merge activity

  • Dec 13, 8:27 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Dec 13, 8:31 PM UTC: Graphite couldn't merge this PR because it had merge conflicts.

@kylediaz kylediaz changed the base branch from kylediaz/_doc_fix_doc_404_due_to_path_case_sensitivity to graphite-base/5993 December 13, 2025 20:30
@kylediaz kylediaz changed the base branch from graphite-base/5993 to main December 13, 2025 20:31
@kylediaz kylediaz merged commit 76b2470 into main Dec 13, 2025
64 checks passed


Development

Successfully merging this pull request may close these issues.

[Bug]: Error if ChromaBm25EmbeddingFunction is used concurrently
