Add batched pairwise similarity method for Semantic Dedup #581
base: main
Conversation
Signed-off-by: Praateek <[email protected]>
Nice PR, added a couple minor comments.
nemo_curator/modules/semantic_dedup/semanticclusterleveldedup.py
nemo_curator/utils/semdedup_utils.py
# Compute pairwise cosine similarity
pairwise_sim_matrix = cluster_reps @ (cluster_reps.T)
Haha, I don't think I've used `@` before. Would there be any advantage to using `torch.mm`, `torch.matmul`, etc.?
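For what it's worth, `@` is just operator sugar: on tensors it dispatches to `torch.matmul` via `__matmul__`, and `torch.mm` is the 2-D-only special case, so for plain matrices all three compute the same product. A quick sketch of the equivalence (shown with NumPy, which uses the same operator protocol; the toy matrices are illustrative only):

```python
import numpy as np

# Toy matrices standing in for cluster embeddings (illustrative only).
a = np.arange(6, dtype=float).reshape(2, 3)
b = np.arange(12, dtype=float).reshape(3, 4)

# "@" invokes __matmul__, i.e. np.matmul; for 2-D arrays this is a
# plain matrix product, identical to a.dot(b).
print(np.array_equal(a @ b, np.matmul(a, b)))  # True
print(np.array_equal(a @ b, a.dot(b)))         # True
```

The main practical difference in PyTorch is that `torch.matmul` (and `@`) also handles batched and broadcast inputs, while `torch.mm` raises on anything but two 2-D tensors.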
@@ -681,6 +681,7 @@
 " id_column_type=\"str\",\n",
 " embedding_col=\"image_embedding\",\n",
 " which_to_keep=\"hard\",\n",
+" batched_cosine_similarity=1024,\n",
Has this been tested?
Nope. Do we always manually run these notebooks for such PRs? That'll be a time sink, but I'm okay to do it if that's the practice.
If it is expected to produce the same results as before, it is okay with me. Sometimes I leave notebooks unchanged (or add changes to have it keep the previous default) if the output is expected to change, so that the user won't be confused when their cell outputs are different than the ones on GitHub.
But it sounds like that isn't the case here?
Description
Resolves #520
Currently, if a single cluster is large enough, we'll likely OOM, since `M @ M.T` requires `N**2` storage. A batched version breaks it up into smaller batches `B` and performs `M @ B.T`. The only thing we need to be careful about is how to zero out the diagonals and get the upper triangular matrix.

Other nits:
- `M @ M.T` results in an absolute max value of 1
- Renamed `_semdedup` to `pairwise_similarity`
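The batching idea can be sketched as follows. This is a minimal NumPy illustration, not the PR's actual implementation (which operates on torch tensors inside the cluster-level dedup): each slab of `batch_size` rows is multiplied against the full matrix, and the diagonal plus everything to its right is masked out, so each row only sees earlier rows. Because the cosine-similarity matrix is symmetric, this gives the same per-item maxima as zeroing the diagonal and reading the upper triangle column-wise, while never materializing more than a `batch_size x N` slab.

```python
import numpy as np

def pairwise_similarity_batched(embs: np.ndarray, batch_size: int) -> np.ndarray:
    """Max cosine similarity of each row to any earlier row, computed in
    row slabs of size batch_size to avoid the full N x N matrix.
    (Hypothetical helper name; the PR's function is pairwise_similarity.)"""
    # L2-normalize so the dot product is cosine similarity.
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    n = embs.shape[0]
    max_sim = np.zeros(n)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        # (end - start) x N slab instead of the full N x N matrix.
        sims = embs[start:end] @ embs.T
        # Keep only pairs (i, j) with j < i: zero the diagonal and above.
        rows = np.arange(start, end)
        sims[np.arange(n)[None, :] >= rows[:, None]] = 0.0
        max_sim[start:end] = sims.max(axis=1)
    return max_sim

embs = np.random.RandomState(0).randn(8, 4)
print(pairwise_similarity_batched(embs, batch_size=3).shape)  # (8,)
```

A torch version is a near-transliteration (`torch.linalg.vector_norm`, slab matmuls on GPU); the masking is the part the PR description flags as needing care, since the slab's "diagonal" is offset by `start` relative to the full matrix.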
Usage
# Add snippet demonstrating usage
Checklist