aaanalysis.AAclust.name_clusters

static AAclust.name_clusters(X, labels=None, names=None, shorten_names=True)

Assigns names to clusters based on the frequency of names.

Names with higher frequency are prioritized. If a cluster's name is already assigned to another cluster, or the cluster contains only one sample, the cluster is named ‘Unclassified’.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to scales and columns to amino acids.

  • labels (array-like, shape (n_samples,)) – Cluster labels for each sample in X.

  • names (list of str) – List of sample names corresponding to X.

  • shorten_names (bool, default=True) – If True, shortened versions of the names are used.

Returns:

cluster_names – A list of renamed clusters based on names.

Return type:

list of str

Examples

We first create an example dataset of 100 scales and obtain their AAontology subcategory names to showcase automatic cluster naming with the AAclust().name_clusters() method:

import aaanalysis as aa
# Create example dataset comprising 100 scales
df_scales = aa.load_scales().T.sample(100).T
X = df_scales.T
df_cat = aa.load_scales(name="scales_cat")
dict_scale_name = dict(zip(df_cat["scale_id"], df_cat["subcategory"]))
names = [dict_scale_name[s] for s in list(df_scales)]
# Fit AAclust model and obtain clustering labels for 7 clusters
aac = aa.AAclust()
aac.fit(X, n_clusters=7)
labels = aac.labels_

We can now provide the feature matrix X, names, and labels to the AAclust().name_clusters() method:

cluster_names = aac.name_clusters(X, labels=labels, names=names)
print("Name of clusters:\n", list(sorted(set(cluster_names))))
Name of clusters:
 ['Accessible surface area', 'Buried', 'Hydrophobicity', 'Side chain length', 'α-helix', 'α-helix (α-proteins)', 'β-turn']

These names are automatically shortened, which can be disabled by setting shorten_names=False:

cluster_names = aac.name_clusters(X, labels=labels, names=names, shorten_names=False)
print("Longer names:\n", list(sorted(set(cluster_names))))
Longer names:
 ['AA composition', 'Accessible surface area (ASA)', 'Buried', 'Side chain length', 'α-helix', 'β-sheet', 'β-turn']
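
The naming rule described at the top of this page (most frequent name wins, duplicates and single-sample clusters become ‘Unclassified’) can be illustrated with a simplified sketch in plain Python. This is not the AAclust implementation: the function name_clusters_sketch, the order in which clusters pick their names, and the assumption that one name is returned per sample are all hypothetical and chosen only for illustration. Name shortening is not covered.

```python
from collections import Counter


def name_clusters_sketch(labels, names, unclassified="Unclassified"):
    """Hypothetical illustration of frequency-based cluster naming."""
    # Collect the names of all samples belonging to each cluster
    names_per_cluster = {}
    for label, name in zip(labels, names):
        names_per_cluster.setdefault(label, []).append(name)

    # Most frequent name (and its count) within each cluster
    top = {label: Counter(members).most_common(1)[0]
           for label, members in names_per_cluster.items()}

    # Assumption: clusters whose top name occurs more often pick first
    order = sorted(top, key=lambda label: top[label][1], reverse=True)

    used = set()
    cluster_to_name = {}
    for label in order:
        name, _count = top[label]
        is_singleton = len(names_per_cluster[label]) == 1
        if is_singleton or name in used:
            # Name already taken or cluster has a single sample
            cluster_to_name[label] = unclassified
        else:
            cluster_to_name[label] = name
            used.add(name)

    # Assumption: return one cluster name per sample
    return [cluster_to_name[label] for label in labels]


# Toy data (made up for illustration, not taken from AAontology)
labels = [0, 0, 0, 0, 1, 1, 2, 2, 2, 3]
names = ["Hydrophobicity", "Hydrophobicity", "Hydrophobicity", "Buried",
         "Buried", "Buried",
         "Hydrophobicity", "Hydrophobicity", "Accessible surface area",
         "β-turn"]
result = name_clusters_sketch(labels, names)
print(result)
```

In this toy input, cluster 0 receives ‘Hydrophobicity’ and cluster 1 receives ‘Buried’. Cluster 2 would also be named ‘Hydrophobicity’, which is already taken, and cluster 3 contains a single sample, so both become ‘Unclassified’.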