Description
We should add an option to perform clustering based on sampling in SemDedup, considering GPU memory constraints. Specifically, if sample_for_clustering=True, the system should:
Perform sampling before clustering. The sampling ratio should be configurable, but by default, it should be dynamically inferred at runtime based on available GPU memory to optimize performance.
Use the sampled data to fit a KMeans model.
Apply the fitted KMeans model to cluster all of the data.
Because only the sample has to fit in GPU memory while KMeans is being fitted, this approach lets clustering scale to datasets whose full set of embeddings would not fit at once.
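One possible way to infer the default sampling ratio is sketched below. This is only an illustration under stated assumptions: `infer_sample_ratio`, `free_bytes`, and `memory_fraction` are hypothetical names, the free GPU memory is passed in as a plain number rather than queried from a real device, and the estimate counts only the raw embedding matrix, not KMeans working buffers.

```python
def infer_sample_ratio(n_rows, dim, free_bytes, dtype_bytes=4, memory_fraction=0.5):
    """Return the fraction of rows that can be sampled for the KMeans fit.

    Hypothetical helper: assumes each embedding costs dim * dtype_bytes and
    that only `memory_fraction` of the free GPU memory may be used, leaving
    headroom for centroids and intermediate buffers.
    """
    if n_rows <= 0 or dim <= 0:
        raise ValueError("n_rows and dim must be positive")
    bytes_per_row = dim * dtype_bytes
    budget = int(free_bytes * memory_fraction)
    max_rows = budget // bytes_per_row
    # Never sample more than the full dataset.
    return min(1.0, max_rows / n_rows)


# Example: 10M embeddings of dimension 768 in float32, 8 GiB reported free.
ratio = infer_sample_ratio(n_rows=10_000_000, dim=768, free_bytes=8 * 1024**3)
print(f"sample ratio: {ratio:.4f}")
```

A user-supplied ratio would simply bypass this helper, which keeps the "configurable, but inferred by default" behaviour described above.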
Proposed Changes
Introduce a sample_for_clustering parameter in ClusteringModel to enable sampling-based clustering.
If sample_for_clustering=True, extract a representative sample from the embeddings dataset before fitting the KMeans model.
Train KMeans on the sampled embeddings.
Use the trained model to predict cluster assignments for the full dataset.
Ensure this functionality is compatible with the current partitioning and memory management strategies.
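The fit-on-sample, predict-on-all flow above could look roughly like the following. It is a minimal pure-Python sketch, not the actual ClusteringModel code: `fit_kmeans`, `nearest`, and `cluster_with_sampling` are hypothetical names, and a small Lloyd's-algorithm loop stands in for the GPU KMeans implementation.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def fit_kmeans(points, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; a stand-in for the real GPU KMeans fit."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def cluster_with_sampling(embeddings, k, sample_ratio, seed=0):
    """Fit KMeans on a random sample, then assign every embedding."""
    rng = random.Random(seed)
    n_sample = int(len(embeddings) * sample_ratio)
    n_sample = min(len(embeddings), max(k, n_sample))  # at least k points
    sample = rng.sample(embeddings, n_sample)
    centroids = fit_kmeans(sample, k, seed=seed)
    # Prediction is a per-row nearest-centroid lookup, so it can run
    # partition by partition without holding all data in memory.
    labels = [nearest(p, centroids) for p in embeddings]
    return centroids, labels


# Two well-separated synthetic blobs as toy "embeddings".
rng = random.Random(42)
blob_a = [[rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)] for _ in range(100)]
blob_b = [[10 + rng.uniform(-0.5, 0.5), 10 + rng.uniform(-0.5, 0.5)] for _ in range(100)]
embeddings = blob_a + blob_b
centroids, labels = cluster_with_sampling(embeddings, k=2, sample_ratio=0.5)
```

Because the predict step only needs the centroids, it should fit the existing per-partition processing, though that would need to be confirmed against the current implementation.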
Future Direction
Explore the possibility of integrating sampling-based clustering directly within K-Means, eliminating the need for a two-step process.
Related issue:
#520