[FEA] Add Sampling-Based Clustering in SemDedup #538

VibhuJawa · 2025-02-11T19:09:11Z

Description
We should add an option in SemDedup to fit the clustering model on a sample of the data, so that clustering stays within GPU memory limits. Specifically, if sample_for_clustering=True, the system should:

Perform sampling before clustering. The sampling ratio should be configurable, but by default, it should be dynamically inferred at runtime based on available GPU memory to optimize performance.
Use the sampled data to fit a KMeans model.
Apply the fitted KMeans model to cluster all of the data.
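
One way the default sampling ratio could be inferred is sketched below. This is only an illustration of the arithmetic, not a proposed implementation: the function name `infer_sample_ratio`, the `memory_fraction` headroom parameter, and its default of 0.5 are all hypothetical, and the free GPU memory is passed in as a plain number rather than queried from the device.

```python
def infer_sample_ratio(n_rows, embedding_dim, free_gpu_bytes,
                       bytes_per_value=4, memory_fraction=0.5):
    """Return the fraction of rows whose embeddings fit in a share of free GPU memory.

    All names and defaults here are assumptions made for illustration.
    memory_fraction leaves headroom for the working memory KMeans itself needs.
    """
    if n_rows <= 0 or embedding_dim <= 0:
        raise ValueError("n_rows and embedding_dim must be positive")
    if not 0 < memory_fraction <= 1:
        raise ValueError("memory_fraction must be in (0, 1]")

    # Bytes we allow the sampled embeddings to occupy.
    budget_bytes = free_gpu_bytes * memory_fraction
    # One row is embedding_dim values, e.g. 4 bytes each for float32.
    bytes_per_row = embedding_dim * bytes_per_value
    max_rows = int(budget_bytes // bytes_per_row)

    # Never sample more than the full dataset.
    return min(1.0, max_rows / n_rows)


# 1M float32 embeddings of dimension 1024 with 2 GiB of free GPU memory.
ratio = infer_sample_ratio(1_000_000, 1024, 2 * 1024**3)
```

A user-supplied ratio would simply bypass this function, which keeps the ratio configurable as described above.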

Because only the sample has to fit in GPU memory while the model is fitted, this approach should let clustering scale to datasets that are too large to fit all at once.

Proposed Changes
Introduce a sample_for_clustering parameter in ClusteringModel to enable sampling-based clustering.

If sample_for_clustering=True, extract a representative sample from the embeddings dataset before fitting the KMeans model.
Train KMeans on the sampled embeddings.
Use the trained model to predict cluster assignments for the full dataset.
Ensure this functionality is compatible with the current partitioning and memory management strategies.
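
The fit-on-sample, predict-on-all flow described above can be sketched in plain Python. This is a CPU-only stand-in meant to show the control flow, not the GPU KMeans that SemDedup uses. Every name here (`fit_kmeans`, `cluster_with_sampling`, `sample_ratio`) is hypothetical, and the farthest-point initialization is a simple deterministic choice made for the sketch.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def fit_kmeans(points, k, n_iter=20):
    """Minimal Lloyd's algorithm with farthest-point initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        farthest = max(points, key=lambda p: min(squared_distance(p, c)
                                                 for c in centroids))
        centroids.append(list(farthest))
    for _ in range(n_iter):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster is empty
                centroids[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def cluster_with_sampling(embeddings, k, sample_ratio, seed=0):
    """Fit on a random sample, then assign every embedding to a cluster."""
    rng = random.Random(seed)
    n_sample = min(len(embeddings), max(k, int(len(embeddings) * sample_ratio)))
    sample = rng.sample(embeddings, n_sample)      # step 1: sample
    centroids = fit_kmeans(sample, k)              # step 2: fit on the sample
    labels = [nearest(p, centroids) for p in embeddings]  # step 3: predict on all
    return centroids, labels
```

In the real pipeline, step 3 would run partition by partition, so only the centroids and one partition of embeddings need to be in memory at a time.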

Future Direction
Explore the possibility of integrating sampling-based clustering directly within KMeans, eliminating the need for a two-step process.

Related issue:
#520

VibhuJawa added the enhancement (New feature or request) label Feb 11, 2025

praateekmahajan self-assigned this Feb 26, 2025

VibhuJawa mentioned this issue Mar 5, 2025

torch.OutOfMemoryError: CUDA out of memory. while performing peft curation with sdg on default configs #520

Open

sithape2025 added the jira label Mar 7, 2025
