[FEA] Add Sampling-Based Clustering in SemDedup #538

Open
VibhuJawa opened this issue Feb 11, 2025 · 0 comments
Labels: enhancement (New feature or request), jira

Description
We should add an option to SemDedup for sampling-based clustering, so that fitting the clustering model stays within GPU memory constraints. Specifically, when sample_for_clustering=True, the system should:

  1. Sample the embeddings before clustering. The sampling ratio should be configurable; by default, it should be inferred at runtime from the available GPU memory.
  2. Use the sampled data to fit a KMeans model.
  3. Apply the fitted KMeans model to cluster all of the data.

Because KMeans is fit on a sample only, the fitting step no longer needs the full set of embeddings in GPU memory, which should improve scalability on large datasets.
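One way the default sampling ratio could be inferred is sketched below. This is a minimal illustration, not the SemDedup implementation: the function name infer_sample_ratio, the memory_fraction safety margin, and the assumption of 4-byte (float32) embedding values are all hypothetical, and the free GPU memory is passed in as an argument rather than queried from the device.

```python
def infer_sample_ratio(
    n_rows: int,
    embedding_dim: int,
    free_gpu_bytes: int,
    bytes_per_value: int = 4,      # assumption: float32 embeddings
    memory_fraction: float = 0.5,  # assumption: leave headroom for KMeans itself
) -> float:
    """Return the fraction of rows whose embeddings fit in the memory budget."""
    if n_rows <= 0 or embedding_dim <= 0:
        raise ValueError("n_rows and embedding_dim must be positive")
    if not 0.0 < memory_fraction <= 1.0:
        raise ValueError("memory_fraction must be in (0, 1]")

    # Memory we allow the sampled embeddings to occupy.
    budget_bytes = free_gpu_bytes * memory_fraction
    # Memory needed to hold a single embedding row.
    bytes_per_row = embedding_dim * bytes_per_value
    # Largest number of rows that fits in the budget.
    max_rows = int(budget_bytes // bytes_per_row)

    # Never sample more than the whole dataset.
    return min(1.0, max_rows / n_rows)
```

For example, with 8 GiB free, 1024-dimensional float32 embeddings, and 10 million rows, this sketch yields a ratio of roughly 0.105; with 1 million rows everything fits and the ratio is 1.0, so no sampling would be needed.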

Proposed Changes
Introduce a sample_for_clustering parameter in ClusteringModel to enable sampling-based clustering.

  1. If sample_for_clustering=True, extract a representative sample from the embeddings dataset before fitting the KMeans model.
  2. Train KMeans on the sampled embeddings.
  3. Use the trained model to predict cluster assignments for the full dataset.
  4. Ensure this functionality is compatible with the current partitioning and memory management strategies.
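The fit-on-sample, predict-on-all flow in the steps above can be sketched as follows. This is a CPU-only illustration using NumPy and a small hand-written Lloyd's iteration as a stand-in for the GPU KMeans that SemDedup would use; the function names fit_kmeans, assign_clusters, and cluster_with_sampling are hypothetical and not part of ClusteringModel.

```python
import numpy as np


def assign_clusters(embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Label each row with the index of its nearest centroid."""
    # Squared Euclidean distances, shape (n_rows, n_clusters).
    diffs = embeddings[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def fit_kmeans(sample: np.ndarray, n_clusters: int, n_iters: int = 20, seed: int = 0) -> np.ndarray:
    """Fit centroids on the sample with plain Lloyd's iterations."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from distinct sample rows.
    init_idx = rng.choice(len(sample), size=n_clusters, replace=False)
    centroids = sample[init_idx].astype(float)
    for _ in range(n_iters):
        labels = assign_clusters(sample, centroids)
        for k in range(n_clusters):
            members = sample[labels == k]
            if len(members):  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids


def cluster_with_sampling(embeddings: np.ndarray, n_clusters: int, sample_ratio: float, seed: int = 0):
    """Fit on a random sample, then assign every row to a cluster."""
    rng = np.random.default_rng(seed)
    # Need at least n_clusters rows to initialise, at most the full dataset.
    n_sample = min(len(embeddings), max(n_clusters, int(len(embeddings) * sample_ratio)))
    sample_idx = rng.choice(len(embeddings), size=n_sample, replace=False)
    centroids = fit_kmeans(embeddings[sample_idx], n_clusters, seed=seed)
    return assign_clusters(embeddings, centroids), centroids
```

Only the sampled rows are used for fitting; the assignment step touches the full dataset but needs just the centroids, so in a partitioned setting it could be applied one partition at a time.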

Future Direction
Explore integrating sampling-based clustering directly into KMeans, which would eliminate the need for a two-step process.

Related issue: #520
