aaanalysis.AAclust.fit

AAclust.fit(X, n_clusters=None, on_center=True, min_th=0.3, merge=True, metric='euclidean', names=None)

Applies the AAclust algorithm to a feature matrix (X).

Introduced in [Breimann24a], AAclust determines the optimal number of clusters, k, without pre-specification. It partitions data (X) into clusters by maximizing the within-cluster Pearson correlation beyond the min_th threshold. Clustering quality is measured either by the minimum pairwise Pearson correlation among all cluster members (on_center=False, measure min_cor_all) or by the minimum Pearson correlation between the cluster center and its members (on_center=True, measure min_cor_center).

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to scales and columns to amino acids.

  • n_clusters (int, optional) – Pre-defined number of clusters. If provided, k is not optimized. Must satisfy 0 < n_clusters < n_samples.

  • min_th (float, default=0.3) – Pearson correlation threshold for clustering optimization (between 0 and 1).

  • on_center (bool, default=True) – If True, min_th is applied to the cluster center. Otherwise, to all cluster members.

  • merge (bool, default=True) – If True, the optional merging step is performed.

  • metric ({'correlation', 'euclidean', 'manhattan', 'cosine'}, default='euclidean') –

    Similarity measure used for optional cluster merging and obtaining medoids:

    • correlation: Pearson correlation (maximum)

    • euclidean: Euclidean distance (minimum)

    • manhattan: Manhattan distance (minimum)

    • cosine: Cosine distance (minimum)

  • names (list of str, optional) – List of sample names. If provided, sets AAclust.medoid_names_ attribute.

Returns:

The fitted instance of the AAclust class, allowing direct attribute access.

Return type:

AAclust

Notes

  • The AAclust algorithm consists of three main steps:

    1. Estimate the lower bound of k.

    2. Refine k (recursively) using the chosen quality measure.

    3. Optionally, merge smaller clusters as directed by the merge metric.

  • AAclust provides two correlation-based quality measures to optimize n_clusters:

    • min_cor_center: Minimum Pearson correlation between the cluster center and all cluster members.

    • min_cor_all: Minimum pairwise Pearson correlation among all cluster members.

  • A representative scale (medoid) closest to each cluster center is selected for redundancy reduction.
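
The medoid selection described in the last note can be sketched in plain Python. This is an illustration under stated assumptions, not the library's implementation: the center is taken as the feature-wise mean of the cluster members, Euclidean distance is used as the similarity measure, and the helper names cluster_center, euclidean, and medoid_index as well as the values are hypothetical.

```python
from math import sqrt


def cluster_center(members):
    """Feature-wise mean of all cluster members (assumed definition of the center)."""
    return [sum(col) / len(members) for col in zip(*members)]


def euclidean(x, y):
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def medoid_index(members):
    """Index of the member closest to the cluster center."""
    center = cluster_center(members)
    distances = [euclidean(center, m) for m in members]
    return distances.index(min(distances))


# Toy cluster: three scales (rows) with four features each (made-up values)
names = ["scale A", "scale B", "scale C"]
cluster = [
    [0.1, 0.4, 0.6, 0.9],
    [0.2, 0.3, 0.7, 0.8],
    [0.0, 0.6, 0.5, 1.0],
]
idx = medoid_index(cluster)
print("Medoid:", names[idx])
```

Keeping only the medoid of each cluster yields one real, measured scale per cluster, whereas the center is an average that usually does not correspond to any input sample.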

Warning

  • All RuntimeWarnings during the AAclust algorithm are caught and bundled into one RuntimeWarning.
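
One general way to bundle warnings like this uses the standard warnings module. The sketch below shows the pattern only; it is not taken from the AAclust source, and run_with_bundled_warnings and noisy_step are hypothetical names.

```python
import warnings


def run_with_bundled_warnings(func, *args, **kwargs):
    """Run func, collect its RuntimeWarnings, and re-emit them as one RuntimeWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, including repeats
        result = func(*args, **kwargs)
    runtime_msgs = []
    for w in caught:
        if issubclass(w.category, RuntimeWarning):
            runtime_msgs.append(str(w.message))
        else:
            warnings.warn(w.message)  # pass other warning types through unchanged
    if runtime_msgs:
        unique = list(dict.fromkeys(runtime_msgs))  # drop duplicates, keep order
        warnings.warn("Bundled RuntimeWarnings: " + "; ".join(unique), RuntimeWarning)
    return result


def noisy_step():
    """Hypothetical computation that raises two RuntimeWarnings."""
    warnings.warn("invalid value encountered", RuntimeWarning)
    warnings.warn("divide by zero encountered", RuntimeWarning)
    return 42


# Capture what the caller would see: a single bundled warning
with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter("always")
    value = run_with_bundled_warnings(noisy_step)
print(value, [str(w.message) for w in seen])
```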

Examples

We load an example scale dataset to showcase the AAclust().fit() method:

import aaanalysis as aa
aa.options["verbose"] = False

# Create test dataset of 25 amino acid scales
df_scales = aa.load_scales().T.sample(25).T
X = df_scales.T

Fitting AAclust runs its three-step algorithm to select an optimized n_clusters (k). The three steps are (1) estimation of the lower bound of k, (2) refinement of k, and (3) an optional cluster merging. Various results are saved as attributes:

# Fit clustering model
aac = aa.AAclust()
aac.fit(X)
# Get output parameters
n_clusters = aac.n_clusters
print("n_clusters: ", n_clusters)
labels = aac.labels_
print("Labels: ", labels)
centers = aac.centers_ # Cluster centers (average scales for each cluster)
labels_centers = aac.labels_centers_
medoids = aac.medoids_ # Representative scale for each cluster
labels_medoids = aac.labels_medoids_
print("Labels of medoids: ", labels_medoids)
is_medoid = aac.is_medoid_
df_scales_medoids = df_scales.T[is_medoid].T
aa.display_df(df_scales_medoids, show_shape=True, n_rows=5)
n_clusters:  4
Labels:  [0 0 0 1 1 3 2 0 2 1 2 0 1 1 0 3 1 1 1 0 1 0 0 2 2]
Labels of medoids:  [0 1 3 2]
DataFrame shape: (20, 4)
AA  ISOY800107  MIYS850101  MIYS990103  EISD860101
A     0.482000    0.360000    0.500000    0.589000
C     0.518000    0.678000    0.029000    0.528000
D     0.637000    0.140000    0.786000    0.191000
E     0.914000    0.162000    0.871000    0.285000
F     0.155000    1.000000    0.057000    0.936000

names can be provided to the AAclust().fit() method to retrieve the names of the medoids:

names = [f"scale {i+1}" for i in range(len(df_scales.T))]
aac.fit(X, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)
Name of medoid scales:
['scale 10', 'scale 15', 'scale 4']

The n_clusters parameter can also be pre-defined:

aac.fit(X, n_clusters=7, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)
Name of medoid scales:
['scale 20', 'scale 15', 'scale 22', 'scale 14', 'scale 6', 'scale 24', 'scale 9']

The second step of the AAclust algorithm (recursive k optimization) can be adjusted using the min_th and on_center parameters:

# Pearson correlation within all cluster members >= 0.5
aac.fit(X, on_center=False, min_th=0.5)
print("n clusters (pairwise correlation): ", aac.n_clusters)
# Pearson correlation between all cluster members and the respective center >= 0.5
aac.fit(X, on_center=True, min_th=0.5)
print("n clusters (center correlation): ", aac.n_clusters)
# The latter is less strict, leading to bigger and thus fewer clusters
n clusters (pairwise correlation):  10
n clusters (center correlation):  5
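
The difference between the two settings can be reproduced on a toy cluster. The sketch below is plain Python rather than the library's implementation: the helper functions pearson, min_cor_all, and min_cor_center are illustrative (the latter two borrow the names of the quality measures), the center is assumed to be the feature-wise mean of the members, and the values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def min_cor_all(members):
    """Minimum pairwise Pearson correlation among all cluster members."""
    return min(pearson(a, b) for a, b in combinations(members, 2))


def min_cor_center(members):
    """Minimum Pearson correlation between the cluster center and each member."""
    center = [sum(col) / len(members) for col in zip(*members)]
    return min(pearson(center, m) for m in members)


# Toy cluster: three samples (rows) with four features each (made-up values)
cluster = [
    [0.1, 0.4, 0.6, 0.9],
    [0.2, 0.3, 0.7, 0.8],
    [0.0, 0.6, 0.5, 1.0],
]
min_th = 0.5
print("min_cor_all:   ", round(min_cor_all(cluster), 3))
print("min_cor_center:", round(min_cor_center(cluster), 3))
print("passes with on_center=False:", min_cor_all(cluster) >= min_th)
print("passes with on_center=True: ", min_cor_center(cluster) >= min_th)
```

For this toy cluster the center-based value is higher than the pairwise one, which is consistent with on_center=True being the less strict criterion.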

The third and optional merging step can be adjusted using the metric parameter and disabled by setting merge=False. The attributes can be directly retrieved since the AAclust.fit() method returns the fitted clustering model:

# Load over 500 scales
X = aa.load_scales().T
n_with_merging_euclidean = aac.fit(X, metric="euclidean").n_clusters
n_without_merging_euclidean = aac.fit(X, merge=False, metric="euclidean").n_clusters
n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
n_without_merging_cosine = aac.fit(X, merge=False, metric="cosine").n_clusters
print("n clusters (merging, euclidean): ", n_with_merging_euclidean)
print("n clusters (no merging, euclidean): ", n_without_merging_euclidean)
print("n clusters (merging, cosine): ", n_with_merging_cosine)
print("n clusters (no merging, cosine): ", n_without_merging_cosine)
n clusters (merging, euclidean):  54
n clusters (no merging, euclidean):  54
n clusters (merging, cosine):  52
n clusters (no merging, cosine):  59
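
For reference, the four measures accepted by the metric parameter can be written out in plain Python. This is a minimal sketch of the standard definitions, not the code AAclust runs internally; the function names and the example vectors are illustrative. Correlation is a similarity (higher is closer), while the other three are distances (lower is closer).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation (similarity: the maximum is the best match)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance (the minimum is the best match)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan distance (the minimum is the best match)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine_distance(x, y):
    """Cosine distance, 1 - cosine similarity (the minimum is the best match)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = sqrt(sum(a * a for a in x))
    norm_y = sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


# Two made-up feature vectors
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 7.0]
scores = {
    "correlation": pearson(x, y),
    "euclidean": euclidean(x, y),
    "manhattan": manhattan(x, y),
    "cosine": cosine_distance(x, y),
}
for name, value in scores.items():
    print(f"{name:12s} {value:.4f}")
```

Because these measures weigh magnitude and direction differently, the same pair of clusters can look close under one metric and distant under another, which is one plausible reason the merged cluster counts above differ between euclidean and cosine.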