aaanalysis.AAclust.fit
- AAclust.fit(X, n_clusters=None, on_center=True, min_th=0.3, merge=True, metric='euclidean', names=None)
Applies the AAclust algorithm to a feature matrix (X).
Introduced in [Breimann24a], AAclust determines the optimal number of clusters, k, without pre-specification. It partitions the data (X) into clusters by maximizing the within-cluster Pearson correlation beyond the min_th threshold. The quality of clustering is based either on the minimum Pearson correlation among all members (on_center=False) or on the minimum correlation between the cluster center and its members (on_center=True), using the min_cor_all or min_cor_center correlation measure, respectively.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to scales and columns to amino acids.
n_clusters (int, optional) – Pre-defined number of clusters. If provided, k is not optimized. Must satisfy 0 < n_clusters < n_samples.
min_th (float, default=0.3) – Pearson correlation threshold for clustering optimization (between 0 and 1).
on_center (bool, default=True) – If True, min_th is applied to the cluster center. Otherwise, to all cluster members.
merge (bool, default=True) – If True, the optional merging step is performed.
metric ({'correlation', 'euclidean', 'manhattan', 'cosine'}, default='euclidean') – Similarity measure used for optional cluster merging and obtaining medoids:
correlation: Pearson correlation (maximum)
euclidean: Euclidean distance (minimum)
manhattan: Manhattan distance (minimum)
cosine: Cosine distance (minimum)
names (list of str, optional) – List of sample names. If provided, sets the AAclust.medoid_names_ attribute.
- Returns:
The fitted instance of the AAclust class, allowing direct attribute access.
- Return type:
AAclust
Notes
The AAclust algorithm consists of three main steps:
Estimate the lower bound of k.
Refine k (recursively) using the chosen quality measure.
Optionally, merge smaller clusters as directed by the merge metric.
AAclust provides two correlation-based quality measures to optimize n_clusters:
min_cor_center: Minimum Pearson correlation between the cluster center and all cluster members.
min_cor_all: Minimum pairwise Pearson correlation among all cluster members.
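The two quality measures can be illustrated with a minimal pure-Python sketch. The helper functions below (pearson, cluster_center, and function versions of min_cor_center and min_cor_all) and the toy values are assumptions made for illustration; they are not taken from the aaanalysis source and may differ from how the library computes these measures.

```python
from itertools import combinations
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [a - mean_u for a in u]
    dv = [b - mean_v for b in v]
    cov = sum(a * b for a, b in zip(du, dv))
    return cov / sqrt(sum(a * a for a in du) * sum(b * b for b in dv))


def cluster_center(members):
    """Element-wise mean of all members (assumed definition of the center)."""
    return [sum(col) / len(members) for col in zip(*members)]


def min_cor_center(members):
    """Minimum correlation between the cluster center and each member."""
    center = cluster_center(members)
    return min(pearson(center, m) for m in members)


def min_cor_all(members):
    """Minimum pairwise correlation among all members."""
    return min(pearson(u, v) for u, v in combinations(members, 2))


# Toy cluster of three "scales" over five positions (made-up values)
cluster = [
    [0.1, 0.4, 0.5, 0.8, 1.0],
    [0.2, 0.1, 0.6, 0.7, 0.9],
    [0.0, 0.5, 0.3, 0.9, 0.6],
]
q_center = min_cor_center(cluster)
q_all = min_cor_all(cluster)
print(f"min_cor_center = {q_center:.3f}, min_cor_all = {q_all:.3f}")

# With a threshold of 0.7, this toy cluster passes the center-based
# check but fails the pairwise check.
min_th = 0.7
print("passes center check:", q_center >= min_th)
print("passes pairwise check:", q_all >= min_th)
```

For this toy cluster the center-based measure is the larger of the two, so the same threshold is easier to satisfy with on_center=True than with on_center=False.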
A representative scale (medoid) closest to each cluster center is selected for redundancy reduction.
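As a rough sketch of medoid selection, the snippet below picks the cluster member with the smallest distance to the cluster center. The function find_medoid, the distance helpers, and the toy data are hypothetical and written in plain Python; the library's own selection (including how it handles the correlation and cosine options) may differ.

```python
from math import sqrt


def euclidean(u, v):
    """Euclidean distance between two equal-length sequences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Manhattan distance between two equal-length sequences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def find_medoid(members, dist=euclidean):
    """Return (index, member) of the member closest to the cluster center."""
    center = [sum(col) / len(members) for col in zip(*members)]
    idx = min(range(len(members)), key=lambda i: dist(members[i], center))
    return idx, members[idx]


# Toy cluster of four "scales" with three values each (made-up values)
cluster = [
    [0.0, 0.0, 1.0],
    [0.2, 0.1, 0.8],
    [0.9, 0.8, 0.1],
    [0.3, 0.2, 0.7],
]
names = ["scale 1", "scale 2", "scale 3", "scale 4"]

idx, medoid = find_medoid(cluster)
idx_manhattan, _ = find_medoid(cluster, dist=manhattan)
print("Medoid (euclidean):", names[idx])
print("Medoid (manhattan):", names[idx_manhattan])
```

Keeping only the medoid of each cluster reduces a set of correlated scales to one representative per cluster, which is the redundancy reduction described above.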
See also
sklearn.metrics.pairwise_distances(): used to compute the distances for merging.
Warning
All RuntimeWarnings during the AAclust algorithm are caught and bundled into one RuntimeWarning.
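One way such bundling could be done with the standard warnings module is sketched below. The wrapper run_and_bundle_warnings and the stand-in function noisy_step are hypothetical names for illustration; this is not the mechanism used in the aaanalysis source, only an example of the general pattern.

```python
import warnings


def run_and_bundle_warnings(func, *args, **kwargs):
    """Run func, collect its RuntimeWarnings, and re-emit them as one."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeats
        result = func(*args, **kwargs)
    runtime_msgs = []
    for w in caught:
        if issubclass(w.category, RuntimeWarning):
            runtime_msgs.append(str(w.message))
        else:
            # Pass other warning categories through unchanged
            warnings.warn_explicit(w.message, w.category, w.filename, w.lineno)
    if runtime_msgs:
        unique = list(dict.fromkeys(runtime_msgs))  # de-duplicate, keep order
        warnings.warn(
            "Bundled RuntimeWarnings: " + "; ".join(unique),
            RuntimeWarning,
            stacklevel=2,
        )
    return result


def noisy_step():
    """Hypothetical stand-in for a computation that warns repeatedly."""
    for i in range(3):
        warnings.warn(f"issue in iteration {i}", RuntimeWarning)
    return "done"


# Record what reaches the caller: a single bundled RuntimeWarning
with warnings.catch_warnings(record=True) as outer:
    warnings.simplefilter("always")
    result = run_and_bundle_warnings(noisy_step)

print(result, "-", len(outer), "warning(s) reached the caller")
```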
Examples
We load an example scale dataset to showcase the AAclust().fit() method:

    import aaanalysis as aa
    aa.options["verbose"] = False

    # Create test dataset of 25 amino acid scales
    df_scales = aa.load_scales().T.sample(25).T
    X = df_scales.T
By fitting AAclust, its three-step algorithm is performed to select an optimized n_clusters (k). The three steps involve (1) an estimation of the lower bound of k, (2) refinement of k, and (3) an optional cluster merging. Various results are saved as attributes:

    # Fit clustering model
    aac = aa.AAclust()
    aac.fit(X)

    # Get output parameters
    n_clusters = aac.n_clusters
    print("n_clusters: ", n_clusters)
    labels = aac.labels_
    print("Labels: ", labels)
    centers = aac.centers_  # Cluster centers (average scales for each cluster)
    labels_centers = aac.labels_centers_
    medoids = aac.medoids_  # Representative scale for each cluster
    labels_medoids = aac.labels_medoids_
    print("Labels of medoids: ", labels_medoids)
    is_medoid = aac.is_medoid_
    df_scales_medoids = df_scales.T[is_medoid].T
    aa.display_df(df_scales_medoids, show_shape=True, n_rows=5)

    n_clusters: 4
    Labels: [0 0 0 1 1 3 2 0 2 1 2 0 1 1 0 3 1 1 1 0 1 0 0 2 2]
    Labels of medoids: [0 1 3 2]
    DataFrame shape: (20, 4)
    AA  ISOY800107  MIYS850101  MIYS990103  EISD860101
    A     0.482000    0.360000    0.500000    0.589000
    C     0.518000    0.678000    0.029000    0.528000
    D     0.637000    0.140000    0.786000    0.191000
    E     0.914000    0.162000    0.871000    0.285000
    F     0.155000    1.000000    0.057000    0.936000

names can be provided to the AAclust().fit() method to retrieve the names of the medoids:

    names = [f"scale {i+1}" for i in range(len(df_scales.T))]
    aac.fit(X, names=names)
    medoid_names = aac.medoid_names_
    print("Name of medoid scales:")
    print(medoid_names)

    Name of medoid scales:
    ['scale 10', 'scale 15', 'scale 4']
The n_clusters parameter can also be pre-defined:

    aac.fit(X, n_clusters=7, names=names)
    medoid_names = aac.medoid_names_
    print("Name of medoid scales:")
    print(medoid_names)

    Name of medoid scales:
    ['scale 20', 'scale 15', 'scale 22', 'scale 14', 'scale 6', 'scale 24', 'scale 9']
The second step of the AAclust algorithm (recursive k optimization) can be adjusted using the min_th and on_center parameters:

    # Pearson correlation within all cluster members >= 0.5
    aac.fit(X, on_center=False, min_th=0.5)
    print("n clusters (pairwise correlation): ", aac.n_clusters)

    # Pearson correlation between all cluster members and the respective center >= 0.5
    aac.fit(X, on_center=True, min_th=0.5)
    print("n clusters (center correlation): ", aac.n_clusters)
    # The latter is less strict, leading to bigger and thus fewer clusters

    n clusters (pairwise correlation): 10
    n clusters (center correlation): 5
The third and optional merging step can be adjusted using the metric parameter and disabled by setting merge=False. The attributes can be retrieved directly since the AAclust.fit() method returns the fitted clustering model:

    # Load over 500 scales
    X = aa.load_scales().T
    n_with_merging_euclidean = aac.fit(X, metric="euclidean").n_clusters
    n_without_merging_euclidean = aac.fit(X, merge=False, metric="euclidean").n_clusters
    n_with_merging_cosine = aac.fit(X, metric="cosine").n_clusters
    n_without_merging_cosine = aac.fit(X, merge=False, metric="cosine").n_clusters
    print("n clusters (merging, euclidean): ", n_with_merging_euclidean)
    print("n clusters (no merging, euclidean): ", n_without_merging_euclidean)
    print("n clusters (merging, cosine): ", n_with_merging_cosine)
    print("n clusters (no merging, cosine): ", n_without_merging_cosine)

    n clusters (merging, euclidean): 54
    n clusters (no merging, euclidean): 54
    n clusters (merging, cosine): 52
    n clusters (no merging, cosine): 59