feat: Optimize similarity search with vectorized cosine similarity (#634) #648

Merged: ritesh-1918 merged 2 commits into ritesh-1918:gssoc from fennhelloworld:feat/vectorized-cosine-similarity on May 31, 2026

Conversation

fennhelloworld commented May 29, 2026

Summary

Closes #634 — Optimizes the duplicate detection similarity search by replacing the per-ticket loop with vectorized batched cosine similarity.

Problem

DuplicateService.check_duplicate() previously iterated over every stored ticket embedding and called util.cos_sim() individually, resulting in O(n) separate tensor operations and kernel launches. Under load with many cached tickets, this caused significant latency.

Solution

All stored embeddings are now stacked into a single 2D tensor (_embedding_matrix) and compared against the query embedding in one batched matrix operation, then torch.argmax() identifies the best match.
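
The batched comparison described above can be sketched as follows. This is a minimal NumPy illustration, not the PR's code: the PR uses torch and sentence_transformers.util.cos_sim, and the function name best_match, the ticket IDs, and the threshold value here are assumptions made for the example.

```python
import numpy as np


def best_match(query: np.ndarray, matrix: np.ndarray, ids: list[str], threshold: float):
    """Return (ticket_id, score) for the most similar row, or None below threshold."""
    # Normalize rows and the query so a dot product equals cosine similarity.
    matrix_n = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = matrix_n @ query_n      # one matrix-vector product, shape (n,)
    idx = int(np.argmax(scores))     # position of the best-scoring ticket
    score = float(scores[idx])
    return (ids[idx], score) if score >= threshold else None


# Hypothetical stored embeddings (one row per ticket) and a query vector.
stored = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ticket_ids = ["T-1", "T-2", "T-3"]

result = best_match(np.array([2.0, 2.0]), stored, ticket_ids, threshold=0.8)
print(result)  # the third ticket points in the same direction as the query
```

The point of the sketch is that the per-ticket work collapses into a single product plus one argmax, regardless of how many rows the matrix holds.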

Key changes

| File | Change |
| --- | --- |
| backend/services/duplicate_service.py | Vectorized check_duplicate(), added _rebuild_embedding_matrix(), lazy matrix caching |
| backend/services/benchmark_similarity.py | New benchmark script comparing loop vs vectorized performance |

Benchmark results

| Tickets | Loop (ms) | Vectorized (ms) | Speedup |
| --- | --- | --- | --- |
| 10 | 0.70 | 0.07 | 10x |
| 100 | 2.90 | 0.09 | 33x |
| 500 | 14.43 | 0.07 | 196x |
| 1,000 | 29.52 | 0.07 | 394x |
| 5,000 | 144.16 | 0.34 | 421x |

Implementation details

  • Lazy rebuild: The embedding matrix is only rebuilt when _embedding_matrix_dirty is True (after add_ticket()), avoiding redundant computation.
  • Backward compatible: The public API (check_duplicate(), add_ticket(), is_available(), load()) is unchanged — same inputs, same outputs.
  • No new dependencies: Uses existing torch and sentence_transformers.util already in the project.
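
The lazy-rebuild behaviour in the first bullet can be illustrated with a small dirty-flag cache. This is a sketch under assumptions: the class name LazyMatrixCache, the rebuild_count counter, and the use of plain Python lists in place of a stacked torch tensor are all hypothetical and chosen only to keep the example dependency-free.

```python
from __future__ import annotations


class LazyMatrixCache:
    """Hypothetical stand-in for the caching fields described above."""

    def __init__(self) -> None:
        self._tickets: list[tuple[str, list[float]]] = []
        self._matrix: list[list[float]] | None = None
        self._dirty = True
        self.rebuild_count = 0  # instrumentation for this example only

    def add_ticket(self, ticket_id: str, embedding: list[float]) -> None:
        self._tickets.append((ticket_id, embedding))
        self._dirty = True  # the cached matrix no longer reflects _tickets

    def matrix(self) -> list[list[float]]:
        # Rebuild only when something changed since the last build.
        if self._dirty or self._matrix is None:
            self._matrix = [list(emb) for _, emb in self._tickets]
            self._dirty = False
            self.rebuild_count += 1
        return self._matrix


cache = LazyMatrixCache()
cache.add_ticket("T-1", [1.0, 0.0])
cache.matrix()
cache.matrix()                       # served from the cache, no rebuild
cache.add_ticket("T-2", [0.0, 1.0])  # marks the cache dirty again
cache.matrix()
print(cache.rebuild_count)
```

Repeated lookups between writes reuse the same matrix, so the stacking cost is paid once per batch of additions rather than once per query.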

How to test

# Run the benchmark
python backend/services/benchmark_similarity.py

Checklist

Summary by CodeRabbit

  • Performance

    • Improved duplicate detection: uses batched/vectorized similarity checks with caching to significantly boost throughput for large ticket volumes.
  • Chores

    • Added an internal benchmarking tool to measure and compare duplicate-detection performance at scale.

…itesh-1918#634)

Replace per-ticket loop in DuplicateService.check_duplicate() with
vectorized batched cosine similarity computation. Instead of calling
util.cos_sim() individually for each stored embedding (O(n) kernel
launches), all stored embeddings are stacked into a single 2D tensor
and compared against the query in one matrix operation.

Key changes:
- Add _embedding_matrix, _ticket_ids, and _embedding_matrix_dirty
  to DuplicateService for lazy-rebuild caching
- Add _rebuild_embedding_matrix() to stack embeddings into 2D tensor
- Rewrite check_duplicate() to use vectorized util.cos_sim() with
  the stacked matrix and torch.argmax() for best-match selection
- Mark matrix dirty on add_ticket() for correctness
- Add benchmark_similarity.py showing speedup results:
  n=10: 10x, n=100: 33x, n=500: 196x, n=1000: 394x, n=5000: 421x

Closes ritesh-1918#634

vercel Bot commented May 29, 2026

Someone is attempting to deploy a commit to the ritesh Team on Vercel.

A member of the Team first needs to authorize it.

coderabbitai Bot commented May 29, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f24d6cd0-e1b7-4cb6-8209-637a668db2c9

📥 Commits

Reviewing files that changed from the base of the PR and between 35a9990 and 963a2da.

📒 Files selected for processing (2)
  • backend/services/benchmark_similarity.py
  • backend/services/duplicate_service.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/services/duplicate_service.py
  • backend/services/benchmark_similarity.py

📝 Walkthrough

Walkthrough

DuplicateService now caches a stacked embedding matrix and performs batched cosine-similarity lookups (argmax) instead of per-ticket loops. A new benchmark script generates synthetic unit-normalized embeddings and measures loop vs. vectorized timings across multiple ticket counts, reporting average durations and speedup.

Changes

Vectorized Duplicate Detection

| Layer / File(s) | Summary |
| --- | --- |
| DuplicateService vectorized implementation (backend/services/duplicate_service.py) | Adds torch/numpy imports, caches _embedding_matrix and _ticket_ids with _embedding_matrix_dirty, implements _rebuild_embedding_matrix() to stack embeddings, marks cache dirty in add_ticket(), and replaces per-ticket similarity loop in check_duplicate() with batched cosine-similarity and torch.argmax. |
| Benchmark comparison script (backend/services/benchmark_similarity.py) | Adds EMBEDDING_DIM, _generate_synthetic_embeddings to create unit-normalized random embeddings, benchmark_loop (per-embedding loop timing), benchmark_vectorized (batched timing), and main() to run experiments over multiple dataset sizes and print average timings and speedups. |
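
As a rough illustration of why unit-normalized synthetic embeddings are convenient for such a benchmark, the sketch below generates them with NumPy and checks that a plain matrix-vector product already gives the cosine scores. The helper name make_unit_embeddings, the seed, and the sizes are assumptions; the script's own _generate_synthetic_embeddings is not shown on this page and may differ.

```python
import numpy as np


def make_unit_embeddings(n: int, dim: int, seed: int = 0) -> np.ndarray:
    """Random embeddings scaled so every row has L2 norm 1 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    raw = rng.standard_normal((n, dim))
    return raw / np.linalg.norm(raw, axis=1, keepdims=True)


embeddings = make_unit_embeddings(n=100, dim=16)  # sizes are illustrative
query = embeddings[0]

# For unit vectors the denominator of cosine similarity is 1, so the
# matrix-vector product below is already the vector of cosine scores.
dot_scores = embeddings @ query
norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(query)
cosine_scores = dot_scores / norms

print(np.allclose(dot_scores, cosine_scores))
```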

Sequence Diagram(s)

sequenceDiagram
  participant check_duplicate as check_duplicate()
  participant rebuild as _rebuild_embedding_matrix()
  participant torch as torch
  check_duplicate->>rebuild: if _embedding_matrix_dirty or missing -> rebuild cache
  rebuild->>torch: stack per-ticket embeddings into matrix
  check_duplicate->>torch: compute batched cosine similarities (matrix @ query)
  torch->>check_duplicate: similarity scores tensor
  check_duplicate->>check_duplicate: torch.argmax to pick best ticket id and score
  check_duplicate->>check_duplicate: apply threshold -> return duplicate result

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related issues

  • #634 — Vectorize Sentence-Transformers Cosine Similarity Computations: Implements vectorized batched similarity (PyTorch) and adds benchmark validation; aligns with the objective to replace loop-based cosine similarity.
  • #438 — Benchmark for loop vs vectorized cosine-similarity: This PR's benchmark and vectorized duplicate detection directly address that issue's benchmarking and optimization goal.

Poem

🐰 I hopped through tensors in the night,
Stacked the vectors, kept them tight,
One argmax finds what once took hours,
Benchmarks hum — matrix-powered flowers,
Swift as carrots, small delight.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Linked Issues check | ⚠️ Warning | The PR implements vectorization and benchmarking (primary coding objectives from #634) but omits ONNX export, which was listed as a required implementation step. | Complete the ONNX export implementation or document why it was deferred, as issue #634 explicitly requires an ONNX export script as a core technical step. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 77.78%, which is below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately and concisely summarizes the main change: vectorizing cosine similarity computations for improved performance in duplicate detection. |
| Out of Scope Changes check | ✅ Passed | All changes (vectorized similarity, caching, benchmark script) directly support the core objectives of issue #634 and remain within scope. |

coderabbitai Bot left a comment

🧹 Nitpick comments (3)
backend/services/duplicate_service.py (2)

125-125: ⚡ Quick win

Make the optional parameter explicit (float | None).

Ruff flags this as an implicit Optional (RUF013). Line 23 already uses | None syntax, so this is consistent with the file.

♻️ Proposed fix
-    def check_duplicate(self, text: str, threshold: float = None) -> dict:
+    def check_duplicate(self, text: str, threshold: float | None = None) -> dict:
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/duplicate_service.py` at line 125, The function signature
for check_duplicate currently uses the implicit Optional pattern (threshold:
float = None); update the type annotation to be explicit by changing it to
threshold: float | None = None in the check_duplicate method so it matches the
file's use of `| None` and satisfies the RUF013 rule.
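
A small sketch of the explicit-optional pattern this comment asks for. The helper resolve_threshold and the DEFAULT_THRESHOLD value are hypothetical, since the service's real default is not shown on this page; the `from __future__` import is only there so the annotation also parses on interpreters older than 3.10.

```python
from __future__ import annotations

DEFAULT_THRESHOLD = 0.85  # hypothetical default, not taken from the PR


def resolve_threshold(threshold: float | None = None) -> float:
    """Fall back to the default only when the caller passed nothing."""
    # Compare against None explicitly so a caller-supplied 0.0 is respected.
    return DEFAULT_THRESHOLD if threshold is None else threshold


print(resolve_threshold())
print(resolve_threshold(0.0))
```

Spelling the type as `float | None` tells readers and type checkers that None is an accepted argument rather than an accident of the default value.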

96-112: ⚡ Quick win

Fix potential state desync in _rebuild_embedding_matrix() by snapshotting _tickets

DuplicateService._rebuild_embedding_matrix() builds _ticket_ids and the stacked embeddings from two separate passes over self._tickets. add_ticket() appends to self._tickets and sets _embedding_matrix_dirty=True, while check_duplicate() may rebuild the matrix when dirty/stale, so concurrent mutation could desync _ticket_ids vs _embedding_matrix.

In backend/main.py, the call sites for duplicate_service.add_ticket(...) and duplicate_service.check_duplicate(...) are inside async def routes, but the service methods are synchronous and torch ops may release the GIL; if the app is running with multiple threads/workers within a process, this race is still plausible. Snapshotting avoids the mismatch without relying on deployment details.

-        self._ticket_ids = [tid for tid, _, _ in self._tickets]
-        embeddings = [emb for _, emb, _ in self._tickets]
-        self._embedding_matrix = torch.stack(embeddings)
+        tickets = list(self._tickets)  # consistent snapshot
+        self._ticket_ids = [tid for tid, _, _ in tickets]
+        embeddings = [emb for _, emb, _ in tickets]
+        self._embedding_matrix = torch.stack(embeddings)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/duplicate_service.py` around lines 96 - 112,
_rebuild_embedding_matrix currently iterates over self._tickets twice, which can
lead to _ticket_ids vs the stacked _embedding_matrix getting out of sync if
self._tickets is mutated concurrently (e.g., between add_ticket and
check_duplicate); fix by snapshotting tickets at the start of
_rebuild_embedding_matrix (e.g., local_tickets = list(self._tickets)) and then
build _ticket_ids and embeddings from that snapshot before calling torch.stack,
then set _embedding_matrix and _ticket_ids and clear _embedding_matrix_dirty;
this ensures atomic consistency without changing add_ticket or check_duplicate
signatures.
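
The desync described above can be reproduced deterministically without threads. In this sketch the between_passes callback stands in for another thread calling add_ticket() mid-rebuild; both function names and the sample tickets are assumptions made for the illustration.

```python
def rebuild_two_pass(source, between_passes=lambda: None):
    """Two separate passes over the live list (the pattern being flagged)."""
    ids = [tid for tid, _, _ in source]
    between_passes()  # simulates a concurrent add_ticket()
    rows = [emb for _, emb, _ in source]
    return ids, rows


def rebuild_snapshot(source, between_passes=lambda: None):
    """Both passes read one private copy, so they always agree."""
    snapshot = list(source)  # consistent snapshot
    ids = [tid for tid, _, _ in snapshot]
    between_passes()
    rows = [emb for _, emb, _ in snapshot]
    return ids, rows


def fresh_tickets():
    return [("T-1", [1.0, 0.0], "first"), ("T-2", [0.0, 1.0], "second")]


live = fresh_tickets()
ids_a, rows_a = rebuild_two_pass(live, lambda: live.append(("T-3", [1.0, 1.0], "third")))

live = fresh_tickets()
ids_b, rows_b = rebuild_snapshot(live, lambda: live.append(("T-3", [1.0, 1.0], "third")))

print(len(ids_a), len(rows_a))  # lengths disagree
print(len(ids_b), len(rows_b))  # lengths agree
```

The snapshot keeps the ID list and the embedding rows aligned for a single rebuild; it does not by itself make the rest of the service thread-safe.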
backend/services/benchmark_similarity.py (1)

26-45: ⚡ Quick win

Add an untimed warm-up before measuring.

The first timed round absorbs one-time allocation/kernel-init overhead, which can skew the reported averages (most visibly at small n). Since the PR's speedup claims rely on these numbers, a warm-up call makes them more representative.

♻️ Proposed fix
 def benchmark_loop(query: torch.Tensor, stored: list[torch.Tensor], rounds: int = 5) -> float:
     """Old approach: iterate and compute cos_sim one at a time."""
+    for emb in stored:  # warm-up
+        util.cos_sim(query, emb)
     times = []
 def benchmark_vectorized(query: torch.Tensor, matrix: torch.Tensor, rounds: int = 5) -> float:
     """New approach: single batched cos_sim call."""
     query_2d = query.unsqueeze(0)
+    util.cos_sim(query_2d, matrix)  # warm-up
     times = []
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/services/benchmark_similarity.py` around lines 26 - 45, Both
benchmark_loop and benchmark_vectorized should perform an untimed warm-up call
to amortize one-time allocation/kernel-init overhead before starting the timed
rounds; update the functions (benchmark_loop and benchmark_vectorized) to run
the same computation once (e.g., one pass over stored in benchmark_loop and one
util.cos_sim call in benchmark_vectorized) prior to the for _ in range(rounds)
timing loop so the measured rounds exclude initialization costs.
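
The warm-up idea generalizes to any timed callable. Below is a minimal standard-library sketch; the helper name average_ms and its warmup parameter are assumptions and are not part of the benchmark script.

```python
import time
from typing import Callable


def average_ms(fn: Callable[[], object], rounds: int = 5, warmup: int = 1) -> float:
    """Average wall-clock time of fn in milliseconds, excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # untimed: absorbs one-time allocation and initialization cost
    times = []
    for _ in range(rounds):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    return sum(times) / len(times)


calls = []
elapsed = average_ms(lambda: calls.append(sum(range(1000))), rounds=3, warmup=2)
print(len(calls))  # warm-up calls plus timed rounds
```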

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 42974083-3733-43bf-ad65-454075d2fccd

📥 Commits

Reviewing files that changed from the base of the PR and between da8faf2 and 35a9990.

📒 Files selected for processing (2)
  • backend/services/benchmark_similarity.py
  • backend/services/duplicate_service.py

fennhelloworld (Author)

Thanks for the review @coderabbitai! All three nitpick items have been addressed in the latest push:

  1. Explicit Optional type (duplicate_service.py:125): Changed threshold: float = None to threshold: float | None = None
  2. Snapshot _tickets for thread safety (duplicate_service.py:109-111): Added tickets = list(self._tickets) snapshot before iterating in _rebuild_embedding_matrix() to avoid potential state desync
  3. Benchmark warm-up (benchmark_similarity.py): Added untimed warm-up calls in both benchmark_loop and benchmark_vectorized to avoid one-time allocation overhead skewing results

coderabbitai Bot commented May 30, 2026

@fennhelloworld Great work on all three fixes! Each one addresses exactly the right concern:

  1. float | None = None — correct and idiomatic for Python 3.10+.
  2. Thread-safety snapshot — snapshotting _tickets before iterating is the right pattern to guard against concurrent mutations.
  3. Warm-up pass — eliminating first-call allocation bias from the benchmark numbers makes the reported speedups much more credible.

The implementation looks solid. Thanks for the thorough follow-through! 🐇✨


ritesh-1918 changed the base branch from main to gssoc on May 31, 2026 06:05
ritesh-1918 added the labels gssoc (GirlScript Summer of Code), gssoc:approved (GSSoC Approved PR), level:advanced (Advanced level difficulty), quality:exceptional (Exceptional code quality), and type:feature (New feature) on May 31, 2026

ritesh-1918 (Owner)

Superb implementation, @fennhelloworld! I've successfully resolved all conflicts in your PR and queued it for merging into gssoc.

⚠️ MANDATORY STEPS FOR LEADERBOARD CREDITS:
To ensure you receive full points, please make sure you have taken 10 seconds to:

Keep up the outstanding work! Let's build together! 🔥

ritesh-1918 merged commit 50cb86f into ritesh-1918:gssoc on May 31, 2026
2 of 3 checks passed

Development

Successfully merging this pull request may close these issues.

[BOUNTY] [level:critical] Vectorize Sentence-Transformers Cosine Similarity Computations with NumPy and ONNX Runtime
