aaanalysis.AAclust.eval
- AAclust.eval(X, list_labels=None, names_datasets=None)
Evaluates the quality of different clustering results.
The following established clustering measures are used:
BIC (Bayesian Information Criterion): Reflects the goodness of fit for the clustering while accounting for the number of clusters and parameters. The BIC value can range from negative infinity to positive infinity. A higher BIC indicates superior clustering quality.
CH (Calinski-Harabasz Index): Represents the ratio of between-cluster dispersion mean to the within-cluster dispersion. The CH value ranges from 0 to positive infinity. A higher CH score suggests better-defined clustering.
SC (Silhouette Coefficient): Evaluates the proximity of each data point in one cluster to the points in the neighboring clusters. The SC score lies between -1 and 1. A value closer to 1 implies better clustering.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to scales and columns to amino acids.
list_labels (array-like, shape (n_datasets, n_samples)) – List of arrays with cluster labels for samples in X obtained by the AAclust.fit() method. Unique label values indicate clusters.
names_datasets (list, optional) – List of dataset names corresponding to list_labels.
- Returns:
df_eval – Evaluation results for each set of clustering labels from list_labels.
- Return type:
pd.DataFrame
Notes
df_eval includes the following columns:
‘names’: Names (string) of evaluated datasets.
‘n_clusters’: Number (integer) of clusters, equal to number of medoids.
‘BIC’: BIC value (float) for clustering (-inf to inf).
‘CH’: CH value (float) for clustering (0 to inf).
‘SC’: SC value (float) for clustering (-1 to 1).
BIC was adapted from this StackExchange discussion and modified to align with the SC and CH scores so that higher values signify better clustering, contrary to conventional BIC implementations, which favor lower values. See [Breimann24a].
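As a rough illustration of how the ‘CH’ and ‘SC’ columns behave, the sketch below scores three hand-made labelings of synthetic data with the scikit-learn functions listed under See also. It is not the AAclust implementation: the data, the labelings, and all variable names are assumptions chosen for illustration, and BIC is left out because the adapted formula is not reproduced on this page.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for X (assumption): three well-separated 2D blobs, 30 samples each
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

# Hand-made labelings (assumption), shaped like list_labels: (n_datasets, n_samples)
labels_true = np.repeat([0, 1, 2], 30)                      # one cluster per blob
labels_merged = np.where(labels_true == 2, 1, labels_true)  # two blobs merged into one cluster
labels_split = labels_true * 2 + (np.arange(90) % 2)        # every blob split arbitrarily in two
list_labels = [labels_merged, labels_true, labels_split]

# One result row per labeling, mirroring the 'n_clusters', 'CH', and 'SC' columns
rows = [
    {
        "n_clusters": len(np.unique(labels)),
        "CH": calinski_harabasz_score(X, labels),
        "SC": silhouette_score(X, labels),
    }
    for labels in list_labels
]
for row in rows:
    print(row)
```

Under these assumptions, the labeling that matches the three blobs scores highest on both measures, while merging or arbitrarily splitting clusters lowers them.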
See also
AAclustPlot.eval(): the respective plotting method.
sklearn.metrics.silhouette_score(): a commonly used clustering quality measure.
sklearn.metrics.calinski_harabasz_score(): a commonly used clustering quality measure.
Examples
Different clustering results can be evaluated and compared using the AAclust().eval() method. We perform five clusterings with n_clusters set to 5, 10, 25, 50, and 100 using a Python list comprehension:

    import aaanalysis as aa
    aa.options["verbose"] = False
    X = aa.load_scales().T
    aac = aa.AAclust()
    list_labels = [aac.fit(X, n_clusters=n).labels_ for n in [5, 10, 25, 50, 100]]
    df_eval = aac.eval(X, list_labels=list_labels)
    aa.display_df(df_eval)

       name   n_clusters  BIC           CH          SC
    1  Set 1  5           -541.290364   100.873353  0.163885
    2  Set 2  10          420.738104    83.914216   0.166582
    3  Set 3  25          754.158910    50.550567   0.146916
    4  Set 4  50          267.863461    34.238676   0.143133
    5  Set 5  100         -1498.892425  23.628475   0.137909

The names of the scale sets can be provided using the names_datasets parameter, whose length must match the number of evaluated cluster sets:

    names = [f"Clustering {i}" for i in range(1, 6)]
    df_eval = aac.eval(X, list_labels=list_labels, names_datasets=names)
    aa.display_df(df_eval)

       name          n_clusters  BIC           CH          SC
    1  Clustering 1  5           -541.290364   100.873353  0.163885
    2  Clustering 2  10          420.738104    83.914216   0.166582
    3  Clustering 3  25          754.158910    50.550567   0.146916
    4  Clustering 4  50          267.863461    34.238676   0.143133
    5  Clustering 5  100         -1498.892425  23.628475   0.137909
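Because higher values are better for all three measures, the preferred clustering under each measure can be read off with a column-wise maximum. The following is a minimal sketch that rebuilds the evaluation table from the values shown above as a plain pandas DataFrame, so it runs without aaanalysis; the variable name best is an assumption for illustration.

```python
import pandas as pd

# Evaluation table copied from the example output above
df_eval = pd.DataFrame({
    "name": [f"Set {i}" for i in range(1, 6)],
    "n_clusters": [5, 10, 25, 50, 100],
    "BIC": [-541.290364, 420.738104, 754.158910, 267.863461, -1498.892425],
    "CH": [100.873353, 83.914216, 50.550567, 34.238676, 23.628475],
    "SC": [0.163885, 0.166582, 0.146916, 0.143133, 0.137909],
})

# For each measure, look up n_clusters of the row with the highest score
best = {
    measure: int(df_eval.loc[df_eval[measure].idxmax(), "n_clusters"])
    for measure in ["BIC", "CH", "SC"]
}
print(best)  # {'BIC': 25, 'CH': 5, 'SC': 10}
```

In this example the three measures favor different numbers of clusters, so it can be worth inspecting them side by side rather than relying on a single score.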