aaanalysis.comp_bic_score
- aaanalysis.comp_bic_score(X=None, labels=None)
Compute an adjusted Bayesian Information Criterion (BIC) (-∞, ∞) for assessing clustering quality.
Described in [Breimann24b], this adjusted BIC is computed for a given set of clusters in the dataset X. The BIC is a clustering model selection criterion that balances model complexity against the likelihood of the data. Unlike the traditional BIC, where lower values are better, this adjusted BIC is modified to align with other clustering evaluation measures such as the Silhouette coefficient and the Calinski-Harabasz score: higher values indicate better clustering.

Added in version 1.0.0.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. ‘Rows’ typically correspond to proteins and ‘columns’ to features.
labels (array-like, shape (n_samples,)) – Predicted labels for each sample. Each label corresponds to a cluster.
- Returns:
bic – The adjusted Bayesian Information Criterion value. A higher value indicates a better clustering of the data.
- Return type:
Notes
An epsilon value (1e-10) is used to prevent division by zero in the computation.
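To make the role of such an epsilon concrete, the following is a minimal sketch of a BIC-style score in which higher values are better. It is not the formula implemented in comp_bic_score(). It assumes a textbook hard-assignment model with spherical Gaussian clusters that share one variance, and the function name sketch_bic_higher_is_better and its eps parameter are hypothetical. In this sketch the epsilon guards the logarithm of a zero variance, which is an analogous numerical safeguard rather than the library's exact one.

```python
import numpy as np


def sketch_bic_higher_is_better(X, labels, eps=1e-10):
    """Illustrative BIC-style score (log-likelihood minus penalty).

    Hypothetical helper for illustration only; NOT the aaanalysis formula.
    Assumes spherical Gaussian clusters with one shared variance.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    clusters = np.unique(labels)
    k = len(clusters)

    sse = 0.0      # within-cluster sum of squared distances to the centers
    mixing = 0.0   # sum of n_c * log(n_c / n), the cluster-weight term
    for c in clusters:
        members = X[labels == c]
        n_c = len(members)
        sse += np.sum((members - members.mean(axis=0)) ** 2)
        mixing += n_c * np.log(n_c / n)

    # Shared per-dimension variance; eps keeps log() finite when sse == 0
    var = sse / (n * d) + eps

    # At the estimated variance the exponent term is (about) n * d / 2
    log_lik = mixing - 0.5 * n * d * (np.log(2 * np.pi * var) + 1.0)

    # Free parameters: (k - 1) weights, k * d center coordinates, 1 variance
    n_params = (k - 1) + k * d + 1
    return float(log_lik - 0.5 * n_params * np.log(n))


# Usage: two well-separated groups, scored with true and shuffled labels
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(loc=[-1, -1], scale=0.5, size=(200, 2)),
                    rng.normal(loc=[1, 1], scale=0.5, size=(200, 2))])
labels_demo = np.array([0] * 200 + [1] * 200)
score_true = sketch_bic_higher_is_better(X_demo, labels_demo)
score_shuffled = sketch_bic_higher_is_better(X_demo, rng.permutation(labels_demo))
print(score_true > score_shuffled)
```

Subtracting the complexity penalty from the log-likelihood, rather than the reverse, is what makes larger values preferable in this sketch.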
See also
- The Silhouette coefficient [-1, 1] can be computed by sklearn.metrics.silhouette_score().
- The Calinski-Harabasz score [0, ∞) can be obtained using sklearn.metrics.calinski_harabasz_score().
- Clustering evaluation can be performed using AAclust.eval().
Examples
The Bayesian Information Criterion (BIC) (-∞, ∞) for a given set of clusters in the dataset X can be computed using the comp_bic_score() function. As introduced in [Breimann24a], the BIC was adjusted so that higher values indicate better clustering results:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import aaanalysis as aa

# Generate random 2D data for two distinct groups
group_blue = np.random.normal(loc=[-1, -1], scale=[0.5, 0.5], size=(1000, 2))
group_red = np.random.normal(loc=[1, 1], scale=[0.5, 0.5], size=(1000, 2))

# Combine data into a single dataset and create labels
X = np.vstack([group_blue, group_red])
labels = np.array([0]*1000 + [1]*1000)

# Compute BIC score
bic_score = round(aa.comp_bic_score(X, labels), 3)

# Create a DataFrame for Seaborn
df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
df['Label'] = labels

# Plot using seaborn
aa.plot_settings()
sns.scatterplot(data=df, x='Feature 1', y='Feature 2', hue='Label',
                palette=['tab:blue', 'tab:red'], legend=False)
plt.title(f"BIC = {bic_score} (Perfect Labeling)")
sns.despine()
plt.show()
```
Randomly shuffling the labels of both groups dramatically decreases the bic_score:

```python
# Random labeling (shuffles the label array in place)
np.random.shuffle(labels)
df['Label'] = labels

# Compute BIC score
bic_score = round(aa.comp_bic_score(X, labels), 3)

# Plot using seaborn
sns.scatterplot(data=df, x='Feature 1', y='Feature 2', hue='Label',
                palette=['tab:blue', 'tab:red'], legend=False)
plt.title(f"BIC = {bic_score} (Random Labeling)")
sns.despine()
plt.show()
```
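Because the adjusted BIC is oriented like the Silhouette coefficient and the Calinski-Harabasz score, the same perfect-versus-random comparison can be run with those two scikit-learn measures as a sanity check. The sketch below is independent of aaanalysis. The smaller group size (200 instead of 1000), the fixed seed, and the variable names are assumptions chosen to keep the example quick and reproducible.

```python
import numpy as np
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Seeded generator so the comparison is reproducible
rng = np.random.default_rng(0)

# Two well-separated 2D groups, as in the examples above (smaller for speed)
group_blue = rng.normal(loc=[-1, -1], scale=0.5, size=(200, 2))
group_red = rng.normal(loc=[1, 1], scale=0.5, size=(200, 2))
X = np.vstack([group_blue, group_red])

# Perfect labels and a randomly permuted copy of them
labels_true = np.array([0] * 200 + [1] * 200)
labels_random = rng.permutation(labels_true)

# Both measures follow the convention "higher is better"
for name, labels in [("perfect", labels_true), ("random", labels_random)]:
    sil = silhouette_score(X, labels)
    ch = calinski_harabasz_score(X, labels)
    print(f"{name}: silhouette = {sil:.3f}, Calinski-Harabasz = {ch:.1f}")
```

Both measures should come out clearly higher for the perfect labeling, which is the direction in which the adjusted BIC is meant to move as well.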