
Conversation

@kylediaz (Contributor) commented Dec 9, 2025

closes #5969

ChromaBM25EmbeddingFunction uses snowballstemmer as part of its tokenizer. Currently, all invocations of ChromaBM25EmbeddingFunction share the same snowballstemmer instance, because the factory that creates it is memoized behind a cache decorator.

The issue is that snowballstemmer instances are not thread-safe: the stemmer mutates internal state while processing a word. If the same snowballstemmer object is used to tokenize concurrently, the results get corrupted.

class EnglishStemmer(BaseStemmer):
    '''
    This class implements the stemming algorithm defined by a snowball script.
    Generated from english.sbl by Snowball 3.0.1 - https://snowballstem.org/
    '''

    g_aeo = {u"a", u"e", u"o"}

    g_v = {u"a", u"e", u"i", u"o", u"u", u"y"}

    g_v_WXY = {u"a", u"e", u"i", u"o", u"u", u"y", u"w", u"x", u"Y"}

    g_valid_LI = {u"c", u"d", u"e", u"g", u"h", u"k", u"m", u"n", u"r", u"t"}

    B_Y_found = False
    I_p2 = 0
    I_p1 = 0

    def __r_prelude(self):
        self.B_Y_found = False # <-- mutates state
 ...
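The failure mode can be sketched with a stdlib-only toy. ToyStemmer and its two-step API below are invented for illustration and are not the real snowballstemmer interface; the point is that any object stashing per-call state in instance attributes gives wrong answers when two calls interleave, which is exactly what concurrent threads do.

```python
class ToyStemmer:
    """Strips a trailing 's', stashing per-call state in an instance attribute."""
    def __init__(self):
        self.current = ""  # mutable per-call state, analogous to B_Y_found above

    def set_word(self, word):
        self.current = word  # first step of a "call"

    def finish(self):
        # second step: reads whatever self.current holds *now*
        return self.current[:-1] if self.current.endswith("s") else self.current

stemmer = ToyStemmer()

# Sequential use is fine:
stemmer.set_word("cats")
assert stemmer.finish() == "cat"

# Interleaved use, as two threads could produce, corrupts the first call:
stemmer.set_word("cats")          # thread A starts its call
stemmer.set_word("dogs")          # thread B starts, clobbering A's state
result_for_a = stemmer.finish()   # thread A finishes with B's word
print(result_for_a)               # prints "dog", not "cat"
```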

Claude wrote me a small benchmark to measure how impactful it is to instantiate a new snowballstemmer.

Benchmarking Stemmer Creation Time

1. Direct _SnowballStemmerAdapter creation:
   Mean: 0.0004 ms
   Median: 0.0003 ms
   Min: 0.0002 ms
   Max: 0.0096 ms
   Std Dev: 0.0009 ms

The cost is insignificant enough that I think it's reasonable to instantiate a new object per call.
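A micro-benchmark of this shape can be reproduced with only the stdlib; StubAdapter below is a stand-in for the real _SnowballStemmerAdapter, so the absolute numbers will differ from the table above:

```python
import statistics
import time

class StubAdapter:
    """Stand-in for _SnowballStemmerAdapter (which wraps a snowballstemmer)."""
    def __init__(self):
        self.language = "english"

def bench_creation(n=10_000):
    samples_ms = []
    for _ in range(n):
        t0 = time.perf_counter()
        StubAdapter()
        samples_ms.append((time.perf_counter() - t0) * 1000)  # elapsed, in ms
    return {
        "mean": statistics.mean(samples_ms),
        "median": statistics.median(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
        "stdev": statistics.stdev(samples_ms),
    }

stats = bench_creation()
print(f"Mean: {stats['mean']:.4f} ms  Median: {stats['median']:.4f} ms")
```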

Testing

I created a new test for BM25EF. It does not pass on main, but it passes with my changes.
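A test of this kind typically follows one pattern, sketched here with a pure stand-in function rather than the real ChromaBm25EmbeddingFunction: compute a single-threaded baseline, then drive the same callable from a ThreadPoolExecutor and assert every result matches.

```python
from concurrent.futures import ThreadPoolExecutor

def embed(text):
    # Stand-in for embedding one document; it keeps only per-call state,
    # so it is safe to call from many threads at once.
    return [hash(tok) % 1000 for tok in text.split()]

texts = [f"the quick brown fox {i}" for i in range(100)]
expected = [embed(t) for t in texts]  # single-threaded baseline

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(embed, texts))

# If shared mutable state were corrupted across threads, some result
# would diverge from the baseline.
assert results == expected
print("multithreaded results match the single-threaded baseline")
```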

github-actions bot commented Dec 9, 2025

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of unexpectedly high quality (readability, modularity, intuitiveness)?


@kylediaz kylediaz marked this pull request as ready for review December 9, 2025 01:03
propel-code-bot (Contributor) commented Dec 9, 2025

Fix BM25 stemmer thread-safety by avoiding shared instances

This PR resolves the concurrency bug in ChromaBm25EmbeddingFunction by ensuring each call constructs its own SnowballStemmer-backed Bm25Tokenizer, eliminating the shared, non-thread-safe stemmer that previously lived in @cache. It also adds a regression test that drives the embedder through a ThreadPoolExecutor to verify embeddings remain consistent when invoked from multiple threads.

Key Changes

• Replaced the cached get_english_stemmer() usage with per-call instantiation inside ChromaBm25EmbeddingFunction._encode, while storing the resolved stopword list in self._stopword_list.
• Removed the @lru_cache decorator from get_english_stemmer in bm25_tokenizer.py, so each invocation yields a fresh _SnowballStemmerAdapter.
• Added test_multithreaded_usage in chromadb/test/ef/test_chroma_bm25_embedding_function.py that exercises BM25 embedding across multiple threads and asserts result integrity.
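The shape of the fix can be sketched in isolation (the function bodies below are stand-ins; only the caching behavior matters here): dropping the memoization means no instance is ever shared between threads.

```python
from functools import lru_cache

# Before: one memoized instance for the whole process, shared across threads.
@lru_cache(maxsize=1)
def get_english_stemmer_cached():
    return object()  # stand-in for constructing a _SnowballStemmerAdapter

# After: a fresh instance per invocation, so nothing is shared between threads.
def get_english_stemmer():
    return object()

a, b = get_english_stemmer_cached(), get_english_stemmer_cached()
c, d = get_english_stemmer(), get_english_stemmer()
assert a is b      # cached: the same shared object every time
assert c is not d  # per-call: a distinct object every time
```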

Affected Areas

• chromadb/utils/embedding_functions/chroma_bm25_embedding_function.py
• chromadb/utils/embedding_functions/schemas/bm25_tokenizer.py
• chromadb/test/ef/test_chroma_bm25_embedding_function.py

This summary was automatically generated by @propel-code-bot

Comment on lines +82 to +84
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)

[Maintainability] This change correctly addresses the thread-safety issue. To prevent future regressions where a developer might see this as a performance issue and try to 'optimize' it by re-introducing caching, it would be beneficial to add a comment explaining why a new stemmer and tokenizer are created on each call. This documents the non-obvious requirement that snowballstemmer instances are not thread-safe.

Suggested change
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)
# A new stemmer and tokenizer are created for each call because the underlying
# snowballstemmer is not thread-safe and cannot be shared across threads.
stemmer = get_english_stemmer()
tokenizer = Bm25Tokenizer(stemmer, self._stopword_list, self.token_max_length)
tokens = tokenizer.tokenize(text)



@lru_cache(maxsize=1)
def get_english_stemmer() -> SnowballStemmer:
@HammadB (Collaborator) commented Dec 9, 2025

Can we make this a threadlocal / context? Creation time might be low, but this is a memory leak over time.

@kylediaz (Contributor, Author) replied:

I'm not clear on how my current implementation would result in a memory leak. In my implementation, I believe the stemmer and tokenizer go out of scope after each call and are garbage-collected normally.

If I were to use threading.local, I suspect it might actually cause a minor memory leak: threading.local values are only cleaned up when their owning threads die, so if the threads never die (e.g. the user relies on long-lived worker threads), the threading.local values are never reclaimed (depending on the implementation).

@HammadB (Collaborator) commented Dec 9, 2025 via email

@kylediaz (Contributor, Author) commented Dec 9, 2025

> Oh apologies, I misread the code. Sure, that's fine to keep it alive for the scope of execution. Although I don't follow how well-formed thread local usage is a memory leak. That does not make sense to me.

Here's an example:

from concurrent.futures import ThreadPoolExecutor

denseEf = ChromaBm25EmbeddingFunction()

def handler():
    denseEf.embedSomething()  # this creates a new threading.local value per thread

executor = ThreadPoolExecutor(max_workers=100)
for _ in range(1000):
    executor.submit(handler)

...

# The executor keeps its worker threads alive, so all the threading.local
# values held by denseEf stay reserved.

Let me reframe my argument: it's not that the threading.local values are truly uncollectible. The memory will be reclaimed if the threads die, but if the threads are long-lived, the footprint is unlikely to shrink, so it behaves like a leak. I don't think that's entirely desirable.
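Both positions are visible in a small stdlib sketch (the stemmer here is just a placeholder object): the threading.local values are bounded at one per worker thread, but each one persists for its worker's lifetime rather than per call.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

local = threading.local()
created_in = set()  # idents of worker threads that built a "stemmer"
lock = threading.Lock()

def handler():
    if not hasattr(local, "stemmer"):
        local.stemmer = object()  # stand-in for a per-thread stemmer
        with lock:
            created_in.add(threading.get_ident())

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = [executor.submit(handler) for _ in range(100)]
    for f in futures:
        f.result()
    # At most one stemmer exists per worker thread, regardless of task count,
    # and each stays alive as long as its worker thread does.
    assert 1 <= len(created_in) <= 4

# Only when the executor shuts down and its threads die are the locals freed.
print(f"stemmers created: {len(created_in)} (workers: 4, tasks: 100)")
```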

@HammadB (Collaborator) commented Dec 9, 2025

We disagree on what a leak is in this context; I don't consider that a leak if done in a reasonable way.

Anyway, discussing this further seems unnecessary, since I already said that acquire/release within the scope of the call is fine here and the object is lightweight.

I simply misread your diff.

@kylediaz (Contributor, Author) commented Dec 13, 2025

Merge activity

  • Dec 13, 8:27 PM UTC: A user started a stack merge that includes this pull request via Graphite.
  • Dec 13, 8:31 PM UTC: Graphite couldn't merge this PR because it had merge conflicts.

@kylediaz kylediaz changed the base branch from kylediaz/_doc_fix_doc_404_due_to_path_case_sensitivity to graphite-base/5993 December 13, 2025 20:30
@kylediaz kylediaz changed the base branch from graphite-base/5993 to main December 13, 2025 20:31
@kylediaz kylediaz merged commit 76b2470 into main Dec 13, 2025
64 checks passed


Development

Successfully merging this pull request may close these issues.

[Bug]: Error if ChromaBm25EmbeddingFunction is used concurrently
