aaanalysis.AAclust.eval

AAclust.eval(X, list_labels=None, names_datasets=None)

Evaluates the quality of different clustering results.

The following established clustering measures are used:

  • BIC (Bayesian Information Criterion): Reflects the goodness of fit of the clustering while accounting for the number of clusters and parameters. The BIC value can range from negative infinity to positive infinity. In this implementation, a higher BIC indicates better clustering, which reverses the conventional reading of BIC (see Notes).

  • CH (Calinski-Harabasz Index): Represents the ratio of the between-cluster dispersion to the within-cluster dispersion. The CH value ranges from 0 to positive infinity. A higher CH score indicates better-separated, more compact clusters.

  • SC (Silhouette Coefficient): Measures how similar each data point is to its own cluster compared with the nearest neighboring cluster. The SC score lies between -1 and 1. A value closer to 1 implies better clustering.
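To make the CH and SC definitions above concrete, the following is a minimal pure-Python sketch that computes both on a toy dataset. It is not the code used by AAclust.eval(). The helper names (calinski_harabasz, silhouette, group_by_label) and the toy points are assumptions made for illustration, and Euclidean distance is assumed throughout.

```python
import math

def centroid(points):
    """Mean of a list of equal-length points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def group_by_label(X, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for point, label in zip(X, labels):
        clusters.setdefault(label, []).append(point)
    return clusters

def calinski_harabasz(X, labels):
    """CH = [B / (k - 1)] / [W / (n - k)], where B and W are the
    between- and within-cluster sums of squared distances."""
    clusters = group_by_label(X, labels)
    n, k = len(X), len(clusters)
    overall = centroid(X)
    between = sum(len(pts) * sq_dist(centroid(pts), overall)
                  for pts in clusters.values())
    within = sum(sq_dist(p, centroid(pts))
                 for pts in clusters.values() for p in pts)
    return (between / (k - 1)) / (within / (n - k))

def silhouette(X, labels):
    """Mean of (b - a) / max(a, b) over all samples."""
    clusters = group_by_label(X, labels)
    scores = []
    for point, label in zip(X, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(point, q) for q in pts) / len(pts)
                for other, pts in clusters.items() if other != label)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Toy data (hypothetical): two well-separated groups of three points each
X_toy = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
good_labels = [0, 0, 0, 1, 1, 1]  # matches the true grouping
bad_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

for name, labels in [("good", good_labels), ("bad", bad_labels)]:
    print(name, calinski_harabasz(X_toy, labels), silhouette(X_toy, labels))
```

Both scores should come out clearly higher for the labeling that matches the true grouping, which is the behavior the "higher is better" reading of CH and SC relies on.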

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to scales and columns to amino acids.

  • list_labels (array-like, shape (n_datasets, n_samples)) – List of arrays with cluster labels for samples in X obtained by the AAclust.fit() method. Unique label values indicate clusters.

  • names_datasets (list, optional) – List of dataset names corresponding to list_labels.
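The shape requirements above can be checked before calling eval(). The sketch below uses a hypothetical helper, summarize_label_sets, which is not part of aaanalysis. It assumes each label set must contain one label per sample, that names_datasets (if given) must contain one name per label set, and that the number of clusters equals the number of unique labels. The default names "Set 1", "Set 2", ... follow the example output shown in the Examples section.

```python
def summarize_label_sets(n_samples, list_labels, names_datasets=None):
    """Hypothetical helper: validate label sets and count clusters per set."""
    if names_datasets is None:
        # Default names modeled on the example output ("Set 1", "Set 2", ...)
        names_datasets = [f"Set {i}" for i in range(1, len(list_labels) + 1)]
    if len(names_datasets) != len(list_labels):
        raise ValueError("names_datasets must contain one name per label set")
    summary = []
    for name, labels in zip(names_datasets, list_labels):
        if len(labels) != n_samples:
            raise ValueError(
                f"{name}: expected {n_samples} labels, got {len(labels)}"
            )
        # Unique label values indicate clusters
        summary.append((name, len(set(labels))))
    return summary

# Two label sets for four samples
print(summarize_label_sets(4, [[0, 0, 1, 1], [0, 1, 2, 2]]))
# [('Set 1', 2), ('Set 2', 3)]
```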

Returns:

df_eval – Evaluation results for each set of clustering labels from list_labels.

Return type:

pd.DataFrame

Notes

df_eval includes the following columns:

  • ‘name’: Name (string) of each evaluated dataset.

  • ‘n_clusters’: Number (integer) of clusters, equal to number of medoids.

  • ‘BIC’: BIC value (float) for clustering (-inf to inf).

  • ‘CH’: CH value (float) for clustering (0 to inf).

  • ‘SC’: SC value (float) for clustering (-1 to 1).

BIC was adapted from this StackExchange discussion and modified to align with the SC and CH scores, so that higher values signify better clustering, unlike conventional BIC implementations, which favor lower values. See [Breimann24a].

Examples

Different clustering results can be evaluated and compared using the AAclust().eval() method. We perform five clusterings with n_clusters set to 5, 10, 25, 50, and 100 using a Python list comprehension:

import aaanalysis as aa
aa.options["verbose"] = False
X = aa.load_scales().T
aac = aa.AAclust()
list_labels = [aac.fit(X, n_clusters=n).labels_ for n in [5, 10, 25, 50, 100]]
df_eval = aac.eval(X, list_labels=list_labels)
aa.display_df(df_eval)
   name   n_clusters           BIC          CH        SC
1  Set 1           5   -541.290364  100.873353  0.163885
2  Set 2          10    420.738104   83.914216  0.166582
3  Set 3          25    754.158910   50.550567  0.146916
4  Set 4          50    267.863461   34.238676  0.143133
5  Set 5         100  -1498.892425   23.628475  0.137909

Names for the scale sets can be provided using the names_datasets parameter. The number of names must match the number of evaluated cluster sets:

names = [f"Clustering {i}" for i in range(1, 6)]
df_eval = aac.eval(X, list_labels=list_labels, names_datasets=names)
aa.display_df(df_eval)
   name          n_clusters           BIC          CH        SC
1  Clustering 1           5   -541.290364  100.873353  0.163885
2  Clustering 2          10    420.738104   83.914216  0.166582
3  Clustering 3          25    754.158910   50.550567  0.146916
4  Clustering 4          50    267.863461   34.238676  0.143133
5  Clustering 5         100  -1498.892425   23.628475  0.137909
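
Since all three measures are oriented so that higher is better, the best clustering per measure can be read off by taking the maximum of each column. The sketch below does this in plain Python, with the values copied from the table above. The rows list is a stand-in for df_eval so that the snippet runs without aaanalysis installed.

```python
# Values copied from the evaluation table above (stand-in for df_eval)
rows = [
    {"name": "Clustering 1", "n_clusters": 5, "BIC": -541.290364, "CH": 100.873353, "SC": 0.163885},
    {"name": "Clustering 2", "n_clusters": 10, "BIC": 420.738104, "CH": 83.914216, "SC": 0.166582},
    {"name": "Clustering 3", "n_clusters": 25, "BIC": 754.158910, "CH": 50.550567, "SC": 0.146916},
    {"name": "Clustering 4", "n_clusters": 50, "BIC": 267.863461, "CH": 34.238676, "SC": 0.143133},
    {"name": "Clustering 5", "n_clusters": 100, "BIC": -1498.892425, "CH": 23.628475, "SC": 0.137909},
]

# For each measure, pick the row with the highest value (higher is better)
best = {
    metric: max(rows, key=lambda row: row[metric])["n_clusters"]
    for metric in ("BIC", "CH", "SC")
}
print(best)  # {'BIC': 25, 'CH': 5, 'SC': 10}
```

In this example the three measures each favor a different number of clusters, so they are better treated as complementary views than as a single ranking. On the actual DataFrame, an expression such as df_eval.loc[df_eval["BIC"].idxmax()] should return the corresponding row.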