
Add batched pairwise similarity method for Semantic Dedup #581

Open
wants to merge 6 commits into base: main

Conversation


@praateekmahajan (Collaborator) commented on Mar 7, 2025

Description

Resolves #520

Currently, if a single cluster is large enough, we will likely OOM, since M @ M.T requires N**2 storage. A batched version breaks the computation into smaller batches B and performs M @ B.T. The only thing we need to be careful about is how to zero out the diagonal and keep the upper-triangular part.
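A minimal sketch of the batched idea (illustrative only; the function name and exact masking are assumptions, not the code in this PR). Each iteration computes one N x b slab of the similarity matrix instead of the full N x N, masks out the diagonal and the lower triangle, and reduces the slab immediately:

import torch
import torch.nn.functional as F

def batched_max_pairwise_similarity(embeddings: torch.Tensor, batch_size: int = 1024) -> torch.Tensor:
    """For each item, return its max cosine similarity to any earlier item (sketch, not the PR code)."""
    M = F.normalize(embeddings, dim=1)  # L2-normalize so dot products are cosine similarities
    N = M.shape[0]
    max_sims = torch.empty(N, dtype=M.dtype, device=M.device)
    for start in range(0, N, batch_size):
        end = min(start + batch_size, N)
        B = M[start:end]            # (b, D) batch of rows
        sims = M @ B.T              # (N, b) slab instead of the full (N, N) matrix
        # Keep only the strictly upper-triangular part: column j may only see
        # rows i < j, which also zeroes out the self-similarity on the diagonal.
        rows = torch.arange(N, device=M.device).unsqueeze(1)
        cols = torch.arange(start, end, device=M.device).unsqueeze(0)
        sims = sims.masked_fill(rows >= cols, float("-inf"))
        max_sims[start:end] = sims.max(dim=0).values
    max_sims[0] = -1.0              # the first item has no earlier neighbour
    return max_sims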

Other nits

  1. Always L2-normalize the embedding vectors so that M @ M.T has an absolute maximum value of 1 (see the short sketch after this list).
  2. Renamed _semdedup to pairwise_similarity.
  3. Added tests for the existing function and for the batched approach.
  4. The batched implementation is now the default.
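A short illustration of point 1 (standalone sketch, not the PR's code): once the rows of M are L2-normalized, M @ M.T contains cosine similarities, so no entry exceeds 1 in absolute value and the diagonal is exactly 1.

import torch
import torch.nn.functional as F

M = F.normalize(torch.randn(8, 32), dim=1)   # unit-length rows
sims = M @ M.T                               # pairwise cosine similarities
assert sims.abs().max() <= 1.0 + 1e-6        # bounded by 1 in absolute value
assert torch.allclose(sims.diagonal(), torch.ones(8), atol=1e-5)  # self-similarity is 1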

Usage

# Add snippet demonstrating usage

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Signed-off-by: Praateek <[email protected]>
@sarahyurick (Collaborator) left a comment


Nice PR, added a couple minor comments.

Comment on lines 194 to 195
# Compute pairwise cosine similarity
pairwise_sim_matrix = cluster_reps @ (cluster_reps.T)
Collaborator

Haha I don't think I've used @ before. Would there be any advantage to using torch.mm, torch.matmul, etc.?

@@ -681,6 +681,7 @@
" id_column_type=\"str\",\n",
" embedding_col=\"image_embedding\",\n",
" which_to_keep=\"hard\",\n",
" batched_cosine_similarity=1024,\n",
Collaborator

Has this been tested?

Collaborator Author

Nope, do we always manually run these notebooks for such PRs? That'll be a time sink, but I'm okay to do it if that's the practice.

Collaborator

If it is expected to produce the same results as before, it is okay with me. Sometimes I leave notebooks unchanged (or add changes so that they keep the previous default) if the output is expected to change, so that users won't be confused when their cell outputs differ from the ones on GitHub.

Collaborator

But it sounds like that isn't the case here?

@praateekmahajan changed the title from "Add batched pairwise similarity method" to "Add batched pairwise similarity method for Semantic Dedup" on Mar 7, 2025
Development

Successfully merging this pull request may close these issues.

torch.OutOfMemoryError: CUDA out of memory. while performing peft curation with sdg on default configs