This packages provides a simple convenience wrapper around some basic sklearn utilities for clustering. The only function available is eval_clustering()
.
pip install clustutils4r
model
: Clustering model object (untrained)
X
: Numpy array containing preprocessed, normalized, complete dataset features
gt_labels
: Numpy array containing encoded ground-truth labels for X
(often not available)
num_runs
: No. of times to fit a model
best_model_metric
: Metric to use to choose the best model
make_silhoutte_plots
: Whether to make silhouette plots for the best model (default = False
).
embed_data_in_2d
: Whether to compute TSNE embeddings of the X
to plotted alongside silhouette plot or plot the first 2 features (default = False
).
save_dir
: location to store results; directory will be created if it does not exist
save
: set True if you want to save all results in save_dir; defaults to False
show
: display all results; useful in notebooks; defaults to False
import os
import numpy as np
from sklearn.datasets import make_blobs, load_iris, load_digits
from eval_clustering import eval_clustering
## For testing purposes
rng = np.random.RandomState(0)
n_samples=1000
X, y = make_blobs(n_samples=n_samples, centers=5, n_features=2, cluster_std=0.60, random_state=rng)
save_dir = "results"
os.makedirs(save_dir, exist_ok=True)
best_model, grid_search_results = eval_clustering(
X=X, # dataset to cluster
gt_labels=y, # ground-truth labels; often these aren't available so don't pass this argument
num_runs=10, # number of times to fit a model
best_model_metric="FMI", # metric to use to choose the best model
make_silhoutte_plots=True, embed_data_in_2d=False, # whether to make silhouette plots
show=False, # whether to display the plots; this is used in a notebook
save=True, save_dir="results" # whether to save the plots
)