AAclust

class AAclust(model_class=<class 'sklearn.cluster._kmeans.KMeans'>, model_kwargs=None, verbose=True, random_state=None)

Bases: Wrapper

Amino Acid clustering (AAclust) class: A k-optimized clustering wrapper for selecting redundancy-reduced sets of numerical scales [Breimann24a].

AAclust wraps clustering models that require a pre-defined number of clusters (k, set by n_clusters), such as k-means or other scikit-learn clustering models. It optimizes k using Pearson correlation and then selects, for each cluster, the sample closest to the cluster center as its representative (‘medoid’), yielding a redundancy-reduced sample set.

Added in version 0.1.0.

model

The fitted clustering model object after calling the fit method.

Type:: object

n_clusters

Number of clusters obtained by AAclust.

Type:: int

labels_

Cluster labels in the order of samples in X.

Type:: array-like, shape (n_samples)

centers_

Average scale values corresponding to each cluster.

Type:: array-like, shape (n_clusters, n_features)

labels_centers_

Cluster labels for each cluster center.

Type:: array-like, shape (n_clusters)

medoids_

Representative samples, one for each cluster.

Type:: array-like, shape (n_clusters, n_features)

medoid_ind_

Indices of the medoid samples in X, aligned row-for-row with medoids_, labels_medoids_, and medoid_names_. Always set after fitting.

Type:: array-like, shape (n_clusters)

labels_medoids_

Cluster labels for each medoid.

Type:: array-like, shape (n_clusters)

is_medoid_

Array indicating whether each sample is a medoid (1) or not (0). Same order as labels_.

Type:: array-like, shape (n_samples)

medoid_names_

Names of the medoid samples, aligned with medoid_ind_. None unless names is passed to .fit; use medoid_ind_ for integer positions otherwise.

Type:: list
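How the medoid-related attributes line up can be shown with plain Python lists. This is only an illustration of the alignment described above, not output from AAclust: the toy matrix, the sample names, and the medoid indices are hypothetical values, and the variables merely play the role of the attributes named in the comments.

```python
# Toy feature matrix: 4 samples (rows) x 3 features (columns).
X = [
    [0.1, 0.2, 0.3],
    [0.2, 0.2, 0.4],
    [0.9, 0.8, 0.7],
    [0.8, 0.9, 0.7],
]
names = ["scale_a", "scale_b", "scale_c", "scale_d"]  # hypothetical sample names
labels = [0, 0, 1, 1]   # plays the role of labels_
medoid_ind = [1, 2]     # plays the role of medoid_ind_ (hypothetical values)

# Each derived list is aligned row-for-row with medoid_ind.
medoids = [X[i] for i in medoid_ind]              # role of medoids_
labels_medoids = [labels[i] for i in medoid_ind]  # role of labels_medoids_
medoid_names = [names[i] for i in medoid_ind]     # role of medoid_names_

# The indicator has one entry per sample, in the same order as labels.
medoid_set = set(medoid_ind)
is_medoid = [1 if i in medoid_set else 0 for i in range(len(X))]  # role of is_medoid_
```

Indexing X with the medoid indices is therefore enough to recover the representative rows when no names were passed to fit.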

Parameters:

model_class (Type[ClusterMixin])
model_kwargs (Optional[Dict])
verbose (bool)
random_state (Optional[int])

Methods

`comp_centers`(X, labels)	Computes the center of each cluster based on the given labels.
`comp_correlation`(X, labels[, X_ref, ...])	Computes the Pearson correlation of given data with reference data.
`comp_coverage`(names, names_ref)	Computes the percentage of unique names from `names` that are present in `names_ref`.
`comp_medoids`(X, labels[, metric])	Computes the medoid of each cluster based on the given labels.
`eval`(X, list_labels[, names_datasets])	Evaluates the quality of different clustering results.
`filter_coverage`(X, scale_ids, names_ref[, ...])	Select a redundancy-reduced set of numerical scales with defined subcategory coverage.
`fit`(X[, n_clusters, on_center, min_th, ...])	Applies AAclust algorithm to feature matrix (`X`).
`name_clusters`(X, labels[, names, shorten_names])	Assigns names to clusters based on the frequency of names.
`pre_select_scales`(df_scales[, df_cat, ...])	Pre-select scales by excluding AAontology categories and subcategories.
`select_proteins`(df_seq, X[, n_clusters, ...])	Select a redundancy-reduced set of proteins from a per-protein feature matrix.
`select_scales`(df_scales[, n_clusters, ...])	Select a redundancy-reduced subset of scales directly from a scales DataFrame.
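Two of the one-line summaries in the table above can be made concrete with a short sketch. The helpers below (coverage_percent, most_frequent_name_per_cluster) are hypothetical and follow a literal reading of the comp_coverage and name_clusters summaries; the actual methods may handle edge cases, ties, and name shortening differently.

```python
from collections import Counter


def coverage_percent(names, names_ref):
    """Percentage of unique entries in names that also occur in names_ref."""
    unique = set(names)
    if not unique:
        return 0.0
    return 100.0 * len(unique & set(names_ref)) / len(unique)


def most_frequent_name_per_cluster(labels, names):
    """Return {cluster label: most frequent name among that cluster's samples}."""
    counts = {}
    for label, name in zip(labels, names):
        counts.setdefault(label, Counter())[name] += 1
    return {label: c.most_common(1)[0][0] for label, c in counts.items()}


# Hypothetical subcategory names for six samples in two clusters.
names = ["Hydrophobicity", "Hydrophobicity", "Charge", "Volume", "Volume", "Volume"]
labels = [0, 0, 0, 1, 1, 1]
names_ref = ["Hydrophobicity", "Volume", "Polarity"]

coverage = coverage_percent(names, names_ref)            # 2 of 3 unique names are in names_ref
cluster_names = most_frequent_name_per_cluster(labels, names)
```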

__init__(model_class=<class 'sklearn.cluster._kmeans.KMeans'>, model_kwargs=None, verbose=True, random_state=None)

Parameters:

model_class (Type[ClusterMixin], default=KMeans) – A clustering model class with n_clusters parameter.
model_kwargs (dict, optional) – Keyword arguments to pass to the selected clustering model.
verbose (bool, default=True) – If True, verbose outputs are enabled.
random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent across runs, enabling reproducibility. If None, no seed is set and results may differ between runs.
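The requirement that model_class expose an n_clusters parameter can be checked up front. The sketch below is an assumption about how such a check and the forwarding of model_kwargs might look, not AAclust's internal code; accepts_n_clusters, build_model, and the two toy classes are hypothetical stand-ins that keep the example free of third-party imports.

```python
import inspect


def accepts_n_clusters(model_class):
    """Return True if the class constructor takes an n_clusters argument."""
    return "n_clusters" in inspect.signature(model_class).parameters


def build_model(model_class, n_clusters, model_kwargs=None, random_state=None):
    """Instantiate model_class with n_clusters plus any extra keyword arguments."""
    if not accepts_n_clusters(model_class):
        raise ValueError(f"{model_class.__name__} has no n_clusters parameter")
    kwargs = dict(model_kwargs or {})
    # Forward the seed only to models whose constructor accepts it.
    if random_state is not None and "random_state" in inspect.signature(model_class).parameters:
        kwargs.setdefault("random_state", random_state)
    return model_class(n_clusters=n_clusters, **kwargs)


class ToyKMeans:
    """Stand-in for a k-based clustering model."""

    def __init__(self, n_clusters=8, random_state=None):
        self.n_clusters = n_clusters
        self.random_state = random_state


class ToyDensityModel:
    """Stand-in for a model without n_clusters (k is not pre-defined)."""

    def __init__(self, eps=0.5):
        self.eps = eps


ok_kmeans = accepts_n_clusters(ToyKMeans)
ok_density = accepts_n_clusters(ToyDensityModel)
model = build_model(ToyKMeans, n_clusters=5, random_state=42)
```

A model class without n_clusters, such as the density-style stand-in here, would be rejected before any clustering is attempted.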

Notes

All attributes are set during fitting via the AAclust.fit() method and can be directly accessed.
AAclust is designed primarily for amino acid scales but can be used for any set of numerical indices.

See also

AAclustPlot: the respective plotting class.
Scikit-learn clustering model classes.

Examples

The AAclust clustering wrapper class can utilize any clustering model that uses the n_clusters parameter:

from sklearn.cluster import (KMeans, AgglomerativeClustering, MiniBatchKMeans, SpectralClustering)
import aaanalysis as aa

# AAclust with KMeans (default)
aac = aa.AAclust(model_class=KMeans)
# AAclust with MiniBatchKMeans
aac = aa.AAclust(model_class=MiniBatchKMeans)
# AAclust with SpectralClustering
aac = aa.AAclust(model_class=SpectralClustering)

The hierarchical agglomerative clustering model supports four different linkage methods, which can be passed to AAclust via its model_kwargs parameter:

# AAclust using AgglomerativeClustering with average linkage
aac = aa.AAclust(model_class=AgglomerativeClustering, model_kwargs=dict(linkage='average'))
# Other linkage methods are 'ward', 'complete', and 'single'

AAclust.__init__ also accepts the verbose and random_state parameters:

# Further parameters: verbose toggles logging; random_state seeds stochastic clustering models
aac = aa.AAclust(verbose=False, random_state=42)