comp_bic_score

comp_bic_score(X, labels)

Compute an adjusted Bayesian Information Criterion (BIC) (-∞, ∞) for assessing clustering quality.

Described in [Breimann24b], this adjusted BIC is computed for a given set of clusters in the dataset X. The BIC is a model selection criterion for clustering that balances model complexity against the likelihood of the data. Unlike the traditional BIC, where lower values are better, this adjusted BIC is modified to align with other clustering evaluation measures such as the Silhouette coefficient and the Calinski-Harabasz score, so that higher values indicate better clustering.

Added in version 1.0.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. ‘Rows’ typically correspond to proteins and ‘columns’ to features.
labels (array-like, shape (n_samples,)) – Predicted labels for each sample. Each label corresponds to a cluster.

Returns:

bic – The adjusted Bayesian Information Criterion value. Higher values indicate better clustering quality.

Return type:

float

Notes

An epsilon value (1e-10) is used to prevent division by zero in the computation.
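
To illustrate how a higher-is-better BIC can balance fit against complexity, and where a small epsilon guards the computation, the following is a minimal NumPy sketch. It assumes a penalized log-likelihood with one shared spherical Gaussian variance across clusters (in the style of X-means). The function name bic_sketch and the exact formula are assumptions made for illustration. They are not taken from the aaanalysis source and may differ from what comp_bic_score() computes.

```python
import numpy as np

def bic_sketch(X, labels, eps=1e-10):
    """Illustrative higher-is-better BIC (hypothetical; NOT the aaanalysis implementation)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples, n_features = X.shape
    clusters = np.unique(labels)
    n_clusters = len(clusters)
    sizes = np.array([np.sum(labels == c) for c in clusters])

    # Pooled sum of squared distances of each sample to its cluster centroid
    ssd = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
              for c in clusters)

    # Shared spherical variance; eps avoids log(0) when all points coincide
    variance = ssd / (max(n_samples - n_clusters, 1) * n_features) + eps

    # Log-likelihood summed over clusters (assumed spherical Gaussian model)
    log_lik = np.sum(
        sizes * (np.log(sizes) - np.log(n_samples))
        - 0.5 * sizes * n_features * np.log(2 * np.pi * variance)
        - 0.5 * (sizes - 1) * n_features
    )

    # Complexity penalty: (n_features + 1) free parameters per cluster
    penalty = 0.5 * n_clusters * (n_features + 1) * np.log(n_samples)
    return float(log_lik - penalty)

# Two well-separated groups: correct labels should score higher than shuffled ones
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, size=(200, 2)),
               rng.normal(1, 0.5, size=(200, 2))])
labels = np.array([0] * 200 + [1] * 200)
shuffled = rng.permutation(labels)
print(bic_sketch(X, labels) > bic_sketch(X, shuffled))
```

In this sketch, shuffling the labels inflates the pooled within-cluster variance, which lowers the log-likelihood term while the complexity penalty stays the same, so the score drops.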

See also

The Silhouette coefficient [-1, 1] can be computed by sklearn.metrics.silhouette_score().
The Calinski-Harabasz score [0, ∞) can be obtained using sklearn.metrics.calinski_harabasz_score().
Clustering evaluation can be performed using AAclust.eval().

Examples

The adjusted Bayesian Information Criterion (BIC) (-∞, ∞) for a given set of clusters in the dataset X can be computed using the comp_bic_score() function. As introduced in [Breimann24a], the BIC was adjusted so that higher values indicate better clustering results:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import aaanalysis as aa

# Generate random 2D data for two distinct groups
group_blue = np.random.normal(loc=[-1, -1], scale=[0.5, 0.5], size=(1000, 2))
group_red = np.random.normal(loc=[1, 1], scale=[0.5, 0.5], size=(1000, 2))

# Combine data into a single dataset and create labels
X = np.vstack([group_blue, group_red])
labels = np.array([0]*1000 + [1]*1000)

# Compute BIC score
bic_score = round(aa.comp_bic_score(X, labels), 3)

# Create a DataFrame for Seaborn
df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
df['Label'] = labels

# Plot using seaborn
aa.plot_settings()
sns.scatterplot(data=df, x='Feature 1', y='Feature 2', hue='Label',
                palette=['tab:blue', 'tab:red'], legend=False)

plt.title(f"BIC = {bic_score} (Perfect Labeling)")
sns.despine()
plt.show()

../_images/comp_bic_score_1_output_1_0.png

Randomly shuffling the labels across both groups dramatically decreases the BIC score:

# Random labeling
np.random.shuffle(labels)
df['Label'] = labels

# Compute BIC score
bic_score = round(aa.comp_bic_score(X, labels), 3)


# Plot using seaborn
sns.scatterplot(data=df, x='Feature 1', y='Feature 2', hue='Label',
                palette=['tab:blue', 'tab:red'], legend=False)

plt.title(f"BIC = {bic_score} (Random Labeling)")
sns.despine()
plt.show()

../_images/comp_bic_score_2_output_3_0.png
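
For comparison, the Silhouette coefficient and the Calinski-Harabasz score should react in the same direction on this kind of data. The following is a minimal sketch that uses only scikit-learn and NumPy. The sample size, the seed, and the variable names are assumptions chosen for illustration, and the data are regenerated so that the snippet runs on its own:

```python
import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score

# Regenerate two well-separated 2D groups (seeded for reproducibility)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, size=(200, 2)),
               rng.normal(1, 0.5, size=(200, 2))])
labels = np.array([0] * 200 + [1] * 200)

# Shuffled copy of the labels (same cluster sizes, random assignment)
shuffled = rng.permutation(labels)

# Both measures are higher-is-better, like the adjusted BIC
for name, y in [("Perfect", labels), ("Random", shuffled)]:
    sil = silhouette_score(X, y)
    ch = calinski_harabasz_score(X, y)
    print(f"{name} labeling: Silhouette = {sil:.3f}, Calinski-Harabasz = {ch:.1f}")
```

Both scores should be clearly higher for the perfect labeling than for the random labeling, mirroring the behavior of the adjusted BIC shown above.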