aaanalysis.comp_bic_score

aaanalysis.comp_bic_score(X=None, labels=None)[source]

Compute an adjusted Bayesian Information Criterion (BIC) (-∞, ∞) for assessing clustering quality.

Described in [Breimann24b], this adjusted BIC is computed for a given set of clusters in the dataset X. The BIC is a model selection criterion for clustering that balances model complexity against the likelihood of the data. Unlike the traditional BIC, where lower values are better, this adjusted BIC is modified to align with other clustering evaluation measures such as the Silhouette coefficient and the Calinski-Harabasz score: higher values indicate better clustering.
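To make the idea concrete, the following is a minimal sketch of one way a "higher is better" BIC can be computed for hard cluster assignments, assuming spherical Gaussian clusters that share a single pooled variance. The function name sketch_adjusted_bic and the exact terms are illustrative assumptions; the formula actually used by comp_bic_score() is the one defined in [Breimann24b] and may differ.

```python
import numpy as np


def sketch_adjusted_bic(X, labels, eps=1e-10):
    """Illustrative higher-is-better BIC (assumed formulation, not the library's)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n_samples, n_features = X.shape
    unique_labels = np.unique(labels)
    n_clusters = len(unique_labels)

    # Cluster sizes and pooled within-cluster sum of squared distances
    sizes = np.array([np.sum(labels == k) for k in unique_labels])
    ssq = 0.0
    for k in unique_labels:
        members = X[labels == k]
        ssq += np.sum((members - members.mean(axis=0)) ** 2)

    # Per-dimension pooled variance; eps guards against a zero denominator
    variance = ssq / ((n_samples - n_clusters) * n_features + eps) + eps

    # Log-likelihood of the hard assignment under the spherical Gaussian assumption
    log_lik = np.sum(
        sizes * np.log(sizes / n_samples)
        - 0.5 * sizes * n_features * np.log(2 * np.pi * variance)
        - 0.5 * (sizes - 1) * n_features
    )
    # Complexity penalty: grows with the number of clusters and features
    penalty = 0.5 * n_clusters * (n_features + 1) * np.log(n_samples)
    return float(log_lik - penalty)


rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(-1, 0.5, (500, 2)), rng.normal(1, 0.5, (500, 2))])
labels_demo = np.array([0] * 500 + [1] * 500)

print("true labels:    ", round(sketch_adjusted_bic(X_demo, labels_demo), 3))
print("shuffled labels:", round(sketch_adjusted_bic(X_demo, rng.permutation(labels_demo)), 3))
```

Because the likelihood term enters with a positive sign and the penalty is subtracted, labelings that match the data structure score higher than shuffled ones in this sketch.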

Added in version 1.0.0.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. ‘Rows’ typically correspond to proteins and ‘columns’ to features.

  • labels (array-like, shape (n_samples,)) – Predicted labels for each sample. Each label corresponds to a cluster.
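As a small illustration of the expected shapes, the sketch below converts arbitrary cluster names into integer labels and checks that there is exactly one label per row of X. The helper prepare_labels is hypothetical and not part of aaanalysis, and whether comp_bic_score() requires integer labels is an assumption rather than something stated here.

```python
import numpy as np


def prepare_labels(X, raw_labels):
    """Hypothetical helper: encode cluster names as integers 0..k-1 and validate shapes."""
    X = np.asarray(X)
    raw_labels = np.asarray(raw_labels)
    if X.ndim != 2:
        raise ValueError("X must have shape (n_samples, n_features)")
    if raw_labels.shape != (X.shape[0],):
        raise ValueError("labels must have shape (n_samples,)")
    # return_inverse yields, for each sample, the index of its (sorted) cluster name
    _, labels = np.unique(raw_labels, return_inverse=True)
    return X, labels


X_demo = np.array([[0.1, 0.2], [0.0, 0.3], [2.1, 1.9], [2.0, 2.2]])
raw = ["cluster_a", "cluster_a", "cluster_b", "cluster_b"]
X_demo, labels_demo = prepare_labels(X_demo, raw)
print(labels_demo)  # [0 0 1 1]
```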

Returns:

bic – The adjusted Bayesian Information Criterion value. A higher value indicates better clustering.

Return type:

float

Notes

  • An epsilon value (1e-10) is used to prevent division by zero in the computation.
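The source does not state where the epsilon enters the computation. As a general illustration only, the sketch below shows the pattern with a hypothetical pooled-variance helper: when every sample forms its own cluster, the denominator n_samples - n_clusters is zero, and adding a small epsilon keeps both the variance and its logarithm finite.

```python
import numpy as np

EPSILON = 1e-10  # same magnitude as the value mentioned in the note above


def pooled_variance(sum_squared_dist, n_samples, n_clusters, eps=EPSILON):
    """Hypothetical helper: pooled variance with an epsilon-guarded denominator."""
    return sum_squared_dist / (n_samples - n_clusters + eps) + eps


# Degenerate case: three samples in three singleton clusters.
# Without eps this would be 0.0 / 0 and the later logarithm would be undefined.
variance = pooled_variance(sum_squared_dist=0.0, n_samples=3, n_clusters=3)
print(np.isfinite(variance), np.isfinite(np.log(variance)))
```

For ordinary inputs the epsilon is negligible, so it changes the result only in degenerate cases like the one shown.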

Examples

The adjusted Bayesian Information Criterion (BIC) (-∞, ∞) for a given set of clusters in the dataset X can be computed using the comp_bic_score() function. As introduced in [Breimann24a], the BIC was adjusted so that higher values indicate better clustering results:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import aaanalysis as aa

# Generate random 2D data for two distinct groups
group_blue = np.random.normal(loc=[-1, -1], scale=[0.5, 0.5], size=(1000, 2))
group_red = np.random.normal(loc=[1, 1], scale=[0.5, 0.5], size=(1000, 2))

# Combine data into a single dataset and create labels
X = np.vstack([group_blue, group_red])
labels = np.array([0]*1000 + [1]*1000)

# Compute BIC score
bic_score = round(aa.comp_bic_score(X, labels), 3)

# Create a DataFrame for Seaborn
df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
df['Label'] = labels

# Plot using seaborn
aa.plot_settings()
sns.scatterplot(data=df, x='Feature 1', y='Feature 2', hue='Label',
                palette=['tab:blue', 'tab:red'], legend=False)

plt.title(f"BIC = {bic_score} (Perfect Labeling)")
sns.despine()
plt.show()
../_images/comp_bic_score_1_output_1_0.png

Randomly shuffling the labels across both groups dramatically decreases the BIC score:

# Random labeling
np.random.shuffle(labels)
df['Label'] = labels

# Compute BIC score
bic_score = round(aa.comp_bic_score(X, labels), 3)


# Plot using seaborn
sns.scatterplot(data=df, x='Feature 1', y='Feature 2', hue='Label',
                palette=['tab:blue', 'tab:red'], legend=False)

plt.title(f"BIC = {bic_score} (Random Labeling)")
sns.despine()
plt.show()
../_images/comp_bic_score_2_output_3_0.png