AAclust

class AAclust(model_class=<class 'sklearn.cluster._kmeans.KMeans'>, model_kwargs=None, verbose=True, random_state=None)[source]

Bases: Wrapper

Amino Acid clustering (AAclust) class: A k-optimized clustering wrapper for selecting redundancy-reduced sets of numerical scales [Breimann24a].

AAclust wraps clustering models that require a pre-defined number of clusters (k, set by n_clusters), such as k-means or other scikit-learn clustering models. It optimizes k using Pearson correlation and then selects, for each cluster, the sample closest to the cluster center as its representative (‘medoid’), yielding a redundancy-reduced sample set.
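The medoid selection step described above can be illustrated with a minimal, standard-library-only sketch. It is not the AAclust implementation: the helper names (pearson, cluster_center, medoid_index) and the toy data are hypothetical, and it assumes that "closest to the center" means the highest Pearson correlation with the cluster's mean profile.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cluster_center(rows):
    """Feature-wise mean of the given samples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def medoid_index(X, labels, cluster):
    """Index in X of the cluster member most correlated with the cluster center.

    Assumption: similarity to the center is measured by Pearson correlation.
    """
    members = [i for i, lab in enumerate(labels) if lab == cluster]
    center = cluster_center([X[i] for i in members])
    return max(members, key=lambda i: pearson(X[i], center))


# Toy feature matrix (5 samples, 4 features) with hypothetical cluster labels
X = [
    [0.0, 0.5, 1.0, 0.2],
    [0.1, 0.6, 0.9, 0.2],
    [0.0, 0.4, 1.0, 0.4],
    [1.0, 0.2, 0.0, 0.8],
    [0.9, 0.3, 0.1, 0.7],
]
labels = [0, 0, 0, 1, 1]

# One medoid index per cluster, in ascending cluster-label order
medoid_ind = [medoid_index(X, labels, c) for c in sorted(set(labels))]
```

Each entry of medoid_ind points to a real sample in X, which is what distinguishes a medoid from a cluster center (an average that need not coincide with any sample).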

Added in version 0.1.0.

model

The fitted clustering model object after calling the fit method.

Type:

object

n_clusters

Number of clusters obtained by AAclust.

Type:

int

labels_

Cluster labels in the order of samples in X.

Type:

array-like, shape (n_samples)

centers_

Average scale values corresponding to each cluster.

Type:

array-like, shape (n_clusters, n_features)

labels_centers_

Cluster labels for each cluster center.

Type:

array-like, shape (n_clusters)

medoids_

Representative samples, one for each cluster.

Type:

array-like, shape (n_clusters, n_features)

medoid_ind_

Indices of the medoid samples in X, aligned row-for-row with medoids_, labels_medoids_, and medoid_names_. Always set after fitting.

Type:

array-like, shape (n_clusters)

labels_medoids_

Cluster labels for each medoid.

Type:

array-like, shape (n_clusters)

is_medoid_

Array indicating samples being medoids (1) or not (0). Same order as labels_.

Type:

array-like, shape (n_samples)

medoid_names_

Names of the medoid samples, aligned with medoid_ind_. None unless names is passed to .fit; use medoid_ind_ for integer positions otherwise.

Type:

list
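How the medoid-related attributes line up can be sketched with plain Python lists. The values below are made up and merely stand in for fitted attributes; the sketch only illustrates the alignment described above and is not taken from the AAclust source.

```python
# Hypothetical stand-ins for fitted attributes (5 samples, 3 clusters)
labels = [0, 0, 1, 1, 2]                      # like labels_
medoid_ind = [1, 2, 4]                        # like medoid_ind_
names = ["scale_a", "scale_b", "scale_c", "scale_d", "scale_e"]  # optional names

# Cluster label of each medoid, aligned with medoid_ind (like labels_medoids_)
labels_medoids = [labels[i] for i in medoid_ind]

# 1 for medoid samples, 0 otherwise, in the same order as labels (like is_medoid_)
medoid_set = set(medoid_ind)
is_medoid = [1 if i in medoid_set else 0 for i in range(len(labels))]

# Medoid names are only available when names were provided (like medoid_names_)
medoid_names = [names[i] for i in medoid_ind] if names is not None else None
```

When no names are available, the integer positions in medoid_ind remain the way to look up the medoid samples.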

Methods

comp_centers(X, labels)

Computes the center of each cluster based on the given labels.

comp_correlation(X, labels[, X_ref, ...])

Computes the Pearson correlation of given data with reference data.

comp_coverage(names, names_ref)

Computes the percentage of unique names from names that are present in names_ref.

comp_medoids(X, labels[, metric])

Computes the medoid of each cluster based on the given labels.

eval(X, list_labels[, names_datasets])

Evaluates the quality of different clustering results.

filter_coverage(X, scale_ids, names_ref[, ...])

Select a redundancy-reduced set of numerical scales with defined subcategory coverage.

fit(X[, n_clusters, on_center, min_th, ...])

Applies AAclust algorithm to feature matrix (X).

name_clusters(X, labels[, names, shorten_names])

Assigns names to clusters based on the frequency of names.

select_proteins(df_seq, X[, n_clusters, ...])

Select a redundancy-reduced set of proteins from a per-protein feature matrix.

select_scales(df_scales[, n_clusters, ...])

Select a redundancy-reduced subset of scales directly from a scales DataFrame.

__init__(model_class=<class 'sklearn.cluster._kmeans.KMeans'>, model_kwargs=None, verbose=True, random_state=None)[source]
Parameters:
  • model_class (Type[ClusterMixin], default=KMeans) – A clustering model class with n_clusters parameter.

  • model_kwargs (dict, optional) – Keyword arguments to pass to the selected clustering model.

  • verbose (bool, default=True) – If True, verbose outputs are enabled.

  • random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, no seed is set and results of stochastic processes may differ between runs.

Notes

  • All attributes are set during fitting via the AAclust.fit() method and can be directly accessed.

  • AAclust is designed primarily for amino acid scales but can be used for any set of numerical indices.

Examples

The AAclust clustering wrapper class can utilize any clustering model that uses the n_clusters parameter:

from sklearn.cluster import (KMeans, AgglomerativeClustering, MiniBatchKMeans, SpectralClustering)
import aaanalysis as aa

# AAclust with KMeans (default)
aac = aa.AAclust(model_class=KMeans)
# AAclust with MiniBatchKMeans
aac = aa.AAclust(model_class=MiniBatchKMeans)
# AAclust with SpectralClustering
aac = aa.AAclust(model_class=SpectralClustering)

The hierarchical agglomerative clustering model supports four different linkage methods, which can be passed to AAclust via its model_kwargs parameter:

# AAclust using AgglomerativeClustering with average linkage
aac = aa.AAclust(model_class=AgglomerativeClustering, model_kwargs=dict(linkage='average'))
# Other linkage methods are 'ward', 'complete', and 'single'
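The way a wrapper can combine model_class and model_kwargs is sketched below. This is an illustration under stated assumptions, not the AAclust implementation: DummyClusterer is a hypothetical stand-in for a scikit-learn clustering class, build_model is a hypothetical helper, and the sketch assumes the wrapper sets n_clusters itself and forwards all other keyword arguments unchanged.

```python
class DummyClusterer:
    """Hypothetical stand-in for a clustering class with an n_clusters parameter."""

    def __init__(self, n_clusters=2, linkage="ward"):
        self.n_clusters = n_clusters
        self.linkage = linkage


def build_model(model_class, n_clusters, model_kwargs=None):
    """Instantiate model_class with a given k plus user-supplied keyword arguments."""
    kwargs = dict(model_kwargs or {})   # copy so the caller's dict is not modified
    kwargs.pop("n_clusters", None)      # assumption: the wrapper controls k
    return model_class(n_clusters=n_clusters, **kwargs)


# Forward a linkage setting while the wrapper chooses the number of clusters
model = build_model(DummyClusterer, n_clusters=5, model_kwargs=dict(linkage="average"))
```

Keeping k separate from the other keyword arguments lets the same model settings be reused while different values of n_clusters are tried.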

Further parameters: AAclust.__init__ also accepts verbose and random_state, as described in the parameter list above.