@mart-r mart-r commented Dec 18, 2025

Problem

The filter-setting mechanism in the embedding linker was inefficient, causing significant performance bottlenecks when processing datasets with many small documents, each with its own filter. This was particularly problematic during metric calculation for the COMETA dataset, where filter setup dominated the overall runtime.

Performance Impact

Before: Running with the spacy tokenizer took 14,671 seconds for COMETA

After: The same run with the spacy tokenizer took 205.7 seconds for COMETA

Speedup: ~70x improvement

Changes Made

  1. Added inverted index precomputation (_initialize_filter_structures):
    • Built _cui_idx_to_name_idxs: maps CUI indices to lists of name indices containing them
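    • This flips the lookup direction (CUI to names rather than names to CUI), so finding the names for a CUI becomes a direct O(1) lookup instead of an O(n) scan during filter operations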
    • Cached _has_cuis_all_cached to avoid recomputation
  2. Optimized filter methods:
    • _get_include_filters_1cui: Single CUI include filter using inverted index
    • _get_include_filters_multi_cui: Multi-CUI include filter with NumPy concatenation
    • _get_exclude_filters_1cui / _get_exclude_filters_multi_cui: Corresponding exclude filters
    • Routing methods _get_include_filters and _get_exclude_filters choose appropriate implementation
  3. Refactored _set_filters method:
    • Replaced nested loops and list comprehensions with direct index lookups
    • Simplified logic flow using the new optimized methods
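
As a rough illustration of the inverted-index idea described in the list above, here is a minimal sketch. It is not the code from this PR: the toy data, the function names, and the exclude semantics (a name is dropped if it maps to any excluded CUI) are all assumptions made for the example, and it uses plain Python lists rather than the NumPy concatenation the PR mentions, to stay dependency-free.

```python
from itertools import chain

# Hypothetical toy data: for each name index, the CUI indices it maps to.
name_idx_to_cui_idxs = [
    [0],     # name 0 -> CUI 0
    [0, 1],  # name 1 -> CUI 0 and CUI 1
    [2],     # name 2 -> CUI 2
    [1, 2],  # name 3 -> CUI 1 and CUI 2
]
N_CUIS = 3


def build_inverted_index(name_to_cuis, n_cuis):
    """Precompute, once, the list of name indices for every CUI index."""
    index = [[] for _ in range(n_cuis)]
    for name_idx, cui_idxs in enumerate(name_to_cuis):
        for cui_idx in cui_idxs:
            index[cui_idx].append(name_idx)
    return index


def include_filter(index, cui_idxs, n_names):
    """Boolean mask over names: True if the name maps to any requested CUI."""
    mask = [False] * n_names
    # Each CUI is a direct list lookup; no scan over all names is needed.
    for name_idx in chain.from_iterable(index[c] for c in cui_idxs):
        mask[name_idx] = True
    return mask


def exclude_filter(index, cui_idxs, n_names):
    """Assumed semantics: drop every name that maps to an excluded CUI."""
    return [not hit for hit in include_filter(index, cui_idxs, n_names)]


def include_filter_naive(name_to_cuis, cui_idxs):
    """Baseline that scans every name for each filter, for comparison."""
    wanted = set(cui_idxs)
    return [any(c in wanted for c in cuis) for cuis in name_to_cuis]


cui_idx_to_name_idxs = build_inverted_index(name_idx_to_cui_idxs, N_CUIS)
n_names = len(name_idx_to_cui_idxs)
single = include_filter(cui_idx_to_name_idxs, [0], n_names)
multi = include_filter(cui_idx_to_name_idxs, [1, 2], n_names)
excluded = exclude_filter(cui_idx_to_name_idxs, [0], n_names)
```

The index is built once, so each per-document filter only touches the names that belong to the requested CUIs. The naive baseline is included so the two approaches can be compared on the same input.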

Checks to make sure this doesn't change behaviour

I ran the metrics on both the COMETA and the Linking Challenge datasets before and after. The precision, recall, and F1 are identical, so I'm fairly confident the change hasn't messed anything up.

@tomolopolis tomolopolis left a comment

lgtm

@adam-sutton-1992 adam-sutton-1992 left a comment

Only one change I can see.

texts.append(text)
return self._embed(texts, self.device)

def _initialize_cui_name_mapping(self) -> None:

This isn't called after being defined.

@mart-r mart-r merged commit 24c81cf into main Jan 8, 2026
21 checks passed
@mart-r mart-r deleted the feat/medcat/CU-869bhknfm-faster-filters-for-embedding-linker branch January 8, 2026 12:06

4 participants