AAclust.fit

AAclust.fit(X, n_clusters=None, on_center=True, min_th=0.3, merge=True, metric='euclidean', names=None)

Applies AAclust algorithm to feature matrix (X).

Introduced in [Breimann24a], AAclust determines the optimal number of clusters, k, without pre-specification. It partitions data (X) into clusters by maximizing the within-cluster Pearson correlation beyond the min_th threshold. Clustering quality is measured either by the minimum pairwise Pearson correlation among all cluster members (on_center=False, min_cor_all) or by the minimum Pearson correlation between the cluster center and its members (on_center=True, min_cor_center).
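To make the two quality measures concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helpers pearson, min_cor_all, and min_cor_center are hypothetical names chosen to mirror the measures named above, the cluster center is assumed to be the element-wise mean of the members, and the toy values are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def min_cor_all(members):
    """Minimum pairwise correlation among all members (on_center=False)."""
    return min(
        pearson(a, b)
        for i, a in enumerate(members)
        for b in members[i + 1:]
    )


def min_cor_center(members):
    """Minimum correlation between the mean profile and each member (on_center=True)."""
    center = [sum(col) / len(col) for col in zip(*members)]
    return min(pearson(center, m) for m in members)


# Toy cluster: three "scales" measured on five amino acids (made-up values)
cluster = [
    [0.1, 0.4, 0.5, 0.8, 0.9],
    [0.2, 0.3, 0.6, 0.7, 1.0],
    [0.0, 0.5, 0.4, 0.9, 0.8],
]
min_th = 0.3  # Hypothetical threshold, matching the default of fit()

print("min_cor_all:   ", round(min_cor_all(cluster), 3))
print("min_cor_center:", round(min_cor_center(cluster), 3))
print("center passes min_th:", min_cor_center(cluster) >= min_th)
```

For this toy cluster the center-based value is higher than the pairwise value, which illustrates why on_center=True tends to be the less strict criterion.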

Added in version 0.1.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to scales and columns to amino acids.
n_clusters (int, optional) – Pre-defined number of clusters. If provided, k is not optimized. Must be 0 < n_clusters < n_samples.
min_th (float, default=0.3) – Pearson correlation threshold for clustering optimization (between 0 and 1).
on_center (bool, default=True) – If True, min_th is applied to the cluster center. Otherwise, to all cluster members.
merge (bool, default=True) – If True, the optional merging step is performed.
metric ({'correlation', 'euclidean', 'manhattan', 'cosine'}, default='euclidean') –
Similarity measure used for optional cluster merging and obtaining medoids:
- correlation: Pearson correlation (maximum)
- euclidean: Euclidean distance (minimum)
- manhattan: Manhattan distance (minimum)
- cosine: Cosine distance (minimum)
names (list of str, optional) – List of sample names. If provided, sets the AAclust.medoid_names_ attribute.

Returns:

The fitted instance of the AAclust class, allowing direct attribute access.

Return type:

AAclust

Notes

The AAclust algorithm consists of three main steps:
1. Estimate the lower bound of k.
2. Refine k (recursively) using the chosen quality measure.
3. Optionally, merge smaller clusters as directed by the merge metric.
AAclust provides two correlation-based quality measures to optimize n_clusters:
- min_cor_center: Minimum Pearson correlation between the cluster center and all cluster members.
- min_cor_all: Minimum pairwise Pearson correlation among all cluster members.
A representative scale (medoid) closest to each cluster center is selected for redundancy reduction.
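The medoid selection described above can be sketched as follows. This is an illustrative assumption rather than the library's code: cluster_center and pick_medoid are hypothetical helpers, the center is taken as the element-wise mean, and closeness is measured with Euclidean distance (the default metric).

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def cluster_center(members):
    """Element-wise mean of all member profiles."""
    return [sum(col) / len(col) for col in zip(*members)]


def pick_medoid(members, names):
    """Return the name of the member closest to the cluster center."""
    center = cluster_center(members)
    distances = [dist(center, m) for m in members]
    return names[distances.index(min(distances))]


# Made-up cluster of three scales over four amino acids
members = [
    [0.0, 0.2, 0.9, 1.0],
    [0.1, 0.3, 0.7, 0.9],
    [0.5, 0.4, 0.5, 0.5],
]
names = ["scale A", "scale B", "scale C"]

medoid = pick_medoid(members, names)
print("Medoid:", medoid)
```

Keeping only the medoid of each cluster yields one real, interpretable scale per cluster, whereas the center is an average that does not correspond to any original scale.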

See also

sklearn.metrics.pairwise_distances() is used to compute the distances for merging.
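The choice of metric can change which cluster counts as the closest one. The sketch below reimplements the four options in plain Python purely for illustration; the function names and closest_center are hypothetical, the values are made up, and the actual merging logic of AAclust is not reproduced here.

```python
from math import sqrt


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def cosine(x, y):
    """Cosine distance: 1 minus cosine similarity."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm


def correlation(x, y):
    """Pearson correlation: a similarity, so the best match is the maximum."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def closest_center(profile, centers, metric):
    """Index of the center that best matches `profile` under `metric`."""
    scores = [metric(profile, c) for c in centers]
    best = max if metric is correlation else min
    return scores.index(best(scores))


profile = [1.0, 2.0, 3.0]
centers = [
    [10.0, 20.0, 30.0],  # Same shape as the profile, very different magnitude
    [1.0, 2.5, 2.5],     # Similar magnitude, slightly different shape
]

for metric in (correlation, euclidean, manhattan, cosine):
    print(metric.__name__, "->", closest_center(profile, centers, metric))
```

Magnitude-sensitive metrics (euclidean, manhattan) prefer the second center here, while shape-sensitive ones (correlation, cosine) prefer the first.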

Warning

All RuntimeWarnings during the AAclust algorithm are caught and bundled into one RuntimeWarning.
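One way such bundling can be done with the standard warnings module is sketched below. This is an assumption about the general technique, not the code used by AAclust; run_with_bundled_warnings and noisy_step are hypothetical names.

```python
import warnings


def run_with_bundled_warnings(func, *args, **kwargs):
    """Run func, collect its RuntimeWarnings, and re-issue them as one warning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # Record every warning, no de-duplication
        result = func(*args, **kwargs)

    runtime = [w for w in caught if issubclass(w.category, RuntimeWarning)]
    others = [w for w in caught if not issubclass(w.category, RuntimeWarning)]

    # Pass unrelated warnings through unchanged
    for w in others:
        warnings.warn_explicit(w.message, w.category, w.filename, w.lineno)

    # Emit a single summary for all RuntimeWarnings
    if runtime:
        unique = sorted({str(w.message) for w in runtime})
        summary = f"{len(runtime)} RuntimeWarning(s) caught: " + "; ".join(unique)
        warnings.warn(summary, RuntimeWarning, stacklevel=2)
    return result


def noisy_step():
    """Stand-in for a computation that warns repeatedly."""
    for i in range(3):
        warnings.warn(f"invalid value in step {i}", RuntimeWarning)
    return "done"


# Record what the caller would see
with warnings.catch_warnings(record=True) as seen:
    warnings.simplefilter("always")
    result = run_with_bundled_warnings(noisy_step)

print(result, "with", len(seen), "warning(s) shown to the caller")
```

The caller sees one summary warning instead of one per numerical hiccup, which keeps logs readable during the recursive optimization of k.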

Examples

We load an example scale dataset to showcase the AAclust().fit() method:

import aaanalysis as aa
aa.options["verbose"] = False

# Create test dataset of 25 amino acid scales
df_scales = aa.load_scales().T.sample(25).T
X = df_scales.T

Fitting AAclust runs its three-step algorithm to select an optimized n_clusters (k). The three steps are (1) estimation of the lower bound of k, (2) refinement of k, and (3) an optional cluster merging step. Various results are saved as attributes:

# Fit clustering model
aac = aa.AAclust()
aac.fit(X)
# Get output parameters
n_clusters = aac.n_clusters
print("n_clusters: ", n_clusters)
labels = aac.labels_
print("Labels: ", labels)
centers = aac.centers_ # Cluster centers (average scales for each cluster)
labels_centers = aac.labels_centers_
medoids = aac.medoids_ # Representative scale for each cluster
labels_medoids = aac.labels_medoids_
print("Labels of medoids: ", labels_medoids)
is_medoid = aac.is_medoid_
df_scales_medoids = df_scales.T[is_medoid].T
aa.display_df(df_scales_medoids, show_shape=True, n_rows=5)

n_clusters:  4
Labels:  [0 0 0 1 1 3 2 0 2 1 2 0 1 1 0 3 1 1 1 0 1 0 0 2 2]
Labels of medoids:  [0 1 3 2]
DataFrame shape: (20, 4)

	ISOY800107	MIYS850101	MIYS990103	EISD860101
AA
A	0.482000	0.360000	0.500000	0.589000
C	0.518000	0.678000	0.029000	0.528000
D	0.637000	0.140000	0.786000	0.191000
E	0.914000	0.162000	0.871000	0.285000
F	0.155000	1.000000	0.057000	0.936000

names can be provided to the AAclust().fit() method to retrieve the names of the medoids:

names = [f"scale {i+1}" for i in range(len(df_scales.T))]
aac.fit(X, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)

Name of medoid scales:
['scale 10', 'scale 15', 'scale 4']

The n_clusters parameter can also be pre-defined:

aac.fit(X, n_clusters=7, names=names)
medoid_names = aac.medoid_names_
print("Name of medoid scales:")
print(medoid_names)

Name of medoid scales:
['scale 20', 'scale 15', 'scale 22', 'scale 14', 'scale 6', 'scale 24', 'scale 9']

The second step of the AAclust algorithm (recursive k optimization) can be adjusted using the min_th and on_center parameters:

# Pearson correlation within all cluster members >= 0.5
aac.fit(X, on_center=False, min_th=0.5)
print("n clusters (pairwise correlation): ", aac.n_clusters)
# Pearson correlation between all cluster members and the respective center >= 0.5
aac.fit(X, on_center=True, min_th=0.5)
print("n clusters (center correlation): ", aac.n_clusters)
# The latter is less strict, leading to bigger and thus fewer clusters

n clusters (pairwise correlation):  10
n clusters (center correlation):  5

The third, optional merging step can be adjusted using the metric parameter and disabled by setting merge=False. The attributes can be retrieved directly since the AAclust.fit() method returns the fitted clustering model:

# Load over 500 scales
X = aa.load_scales().T
n_with_merging_euclidean = aac.fit(X, metric="euclidean").n_clusters
n_without_merging_euclidean = aac.fit(X, merge=False, metric="euclidean").n_clusters
n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
n_without_merging_cosine = aac.fit(X, merge=False, metric="cosine").n_clusters
print("n clusters (merging, euclidean): ", n_with_merging_euclidean)
print("n clusters (no merging, euclidean): ", n_without_merging_euclidean)
print("n clusters (merging, cosine): ", n_with_merging_cosine)
print("n clusters (no merging, cosine): ", n_without_merging_cosine)

n clusters (merging, euclidean):  54
n clusters (no merging, euclidean):  54
n clusters (merging, cosine):  52
n clusters (no merging, cosine):  59