AAclustPlot.medoids

AAclustPlot.medoids(X=None, *, labels, df_scales=None, component_x=1, component_y=2, metric='euclidean', ax=None, figsize=(7, 6), legend=True, dot_size=100, dot_alpha=0.75, palette=None)

Principal Component Analysis (PCA) plot of clustering with medoids highlighted.

Projects X into the chosen PCA components and overlays the cluster medoids (computed via AAclust.comp_medoids()) as distinct markers, coloured by cluster label, so that the representative sample of each cluster is immediately visible.

Added in version 0.1.0.

Parameters:
  • X (array-like, shape (n_samples, n_features), optional) – Feature matrix with the clustered samples in rows and features in columns, used as-is (no transpose). This is the form for proteins and other per-sample matrices — e.g. a CPP feature matrix, pooled protein embeddings, or structural features, one row per protein. For amino acid scales pass df_scales instead. Provide exactly one of X or df_scales.

  • labels (array-like, shape (n_samples,)) – Cluster labels (typically aac.labels_), one per clustered sample — i.e. per row of X or per column of df_scales. If None, no grouping is used.

  • df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – Amino acid scales DataFrame with amino acids in rows and scale IDs in columns, as returned by load_scales(). It is transposed internally to the feature matrix, so you never call .T yourself. Provide exactly one of X or df_scales.

  • component_x (int, default=1) – Index of the PCA component for the x-axis. Must be >= 1.

  • component_y (int, default=2) – Index of the PCA component for the y-axis. Must be >= 1.

  • metric ({'correlation', 'euclidean', 'manhattan', 'cosine'}, default='euclidean') –

    Distance metric used to compute the medoid of each cluster.

    • correlation: Pearson correlation (maximum)

    • euclidean: Euclidean distance (minimum)

    • manhattan: Manhattan distance (minimum)

    • cosine: Cosine distance (minimum)

  • ax (Axes, optional) – Pre-defined Axes object to plot on. If None, a new Axes object is created.

  • figsize (tuple, default=(7, 6)) – Figure dimensions (width, height) in inches.

  • legend (bool, default=True) – Whether to show the legend.

  • dot_size (int, default=100) – Size of the plotted dots.

  • dot_alpha (float or int, default=0.75) – Transparency (alpha) of the plotted dots, between 0 (fully transparent) and 1 (opaque).

  • palette (list or str, optional) – List of colors or name of a colormap used for the cluster labels. If None, a default colormap is used.

Returns:

  • fig (Figure) – Figure object containing the PCA plot.

  • ax (Axes) – PCA plot axes object.

Notes

  • Returned as a (fig, ax) pair (see AAclustPlot for the shared return contract); the PCA-component DataFrame shown here is stored on the instance as df_components_.

  • Pass scales via df_scales (transposed for you) and proteins / embeddings / CPP features via X (a samples-by-features matrix, used as-is) — never transpose manually.

  • Ensure X and labels are in the same order to avoid mislabeling.
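The ordering requirement in the last note can be checked before plotting. The following is a minimal, dependency-free sketch and not part of aaanalysis: the helper align_labels and the names sample_ids and label_by_id are hypothetical, and it assumes each sample has an identifier (for example a protein ID or scale ID) with a known cluster label.

```python
def align_labels(sample_ids, label_by_id):
    """Return cluster labels in the same order as the samples.

    sample_ids  : identifiers in the row order of X (hypothetical input)
    label_by_id : mapping from identifier to cluster label (hypothetical input)
    """
    # Fail early if any sample has no label, instead of silently mislabeling.
    missing = [s for s in sample_ids if s not in label_by_id]
    if missing:
        raise KeyError(f"No cluster label for samples: {missing}")
    # One label per sample, following the row order of X.
    return [label_by_id[s] for s in sample_ids]


# Example: the mapping is stored in a different order than the rows of X.
sample_ids = ["P2", "P1", "P3"]
label_by_id = {"P1": 0, "P2": 1, "P3": 0}
labels_aligned = align_labels(sample_ids, label_by_id)
print(labels_aligned)
```

The resulting list has one entry per row of X and can then be passed as labels.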

Examples

We first create an example dataset for AAclustPlot().medoids(), which visualizes the cluster ‘medoids’ as obtained by the AAclust().comp_medoids() method:

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import aaanalysis as aa
aa.options["verbose"] = False
# Obtain example scale dataset
df_scales = aa.load_scales()
X = df_scales.T
# Fit the AAclust model and retrieve the cluster labels used to compute the medoids
aac = aa.AAclust()
labels = aac.fit(X, n_clusters=5).labels_

All data points are shown in the PCA plot, with the representative sample of each cluster (‘medoid’) highlighted by a larger dot. The scales DataFrame can be passed directly via df_scales (it is transposed internally), so only the cluster labels need to be supplied in addition:

aac_plot = aa.AAclustPlot(model_class=PCA)
aa.plot_settings()
fig, ax = aac_plot.medoids(df_scales=df_scales, labels=labels)
plt.show()
# The DataFrame of the plotted components is stored on the instance
df_components = aac_plot.df_components_
aa.display_df(df_components, n_rows=10, show_shape=True)
../_images/aac_plot_medoids_1_output_3_0.png
DataFrame shape: (586, 2)
    PC1 (33.6%)  PC2 (17.7%)
 1    -0.181292     0.579504
 2     0.823876    -0.591823
 3     0.723627    -0.838029
 4     0.860664    -0.746315
 5     0.645413     0.481089
 6     1.266436    -0.148832
 7    -0.753006     0.412799
 8    -1.074425     0.348078
 9     0.501059     0.261917
10     1.304114    -0.139382

Equivalently, pass a feature matrix directly as X (samples × features, used as-is — no transpose). This is the form for proteins / embeddings / CPP features (one row per protein). Here X is df_scales transposed, so it matches the df_scales call above:

aac_plot.medoids(X, labels=labels)
plt.show()
../_images/aac_plot_medoids_2_output_5_0.png

Select other PCs using the component_x and component_y parameters:

aac_plot.medoids(X, labels=labels, component_x=3, component_y=4)
plt.show()
../_images/aac_plot_medoids_3_output_7_0.png
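The component_x and component_y values are 1-based, while arrays of per-component values are usually 0-based. The sketch below shows one way this mapping and an axis label such as "PC1 (33.6%)" could be produced. It is an illustration only: the helpers component_column and axis_label are hypothetical, the label format is inferred from the column names in the output above, and the variance ratios are assumed to come from something like scikit-learn's PCA.explained_variance_ratio_.

```python
def component_column(component):
    """Map a 1-based component number to a 0-based column index."""
    if not isinstance(component, int) or component < 1:
        raise ValueError("component must be an integer >= 1")
    return component - 1


def axis_label(component, explained_variance_ratio):
    """Build a label like 'PC1 (33.6%)' for the chosen component."""
    idx = component_column(component)
    if idx >= len(explained_variance_ratio):
        raise ValueError("component exceeds the number of fitted components")
    # Ratios are fractions of total variance; show them as percentages.
    return f"PC{component} ({explained_variance_ratio[idx] * 100:.1f}%)"


# Illustrative ratios; only the first two values are taken from the output above.
ratios = [0.336, 0.177, 0.1, 0.05]
print(axis_label(1, ratios))
print(axis_label(3, ratios))
```

Validating the component number up front mirrors the documented "Must be >= 1" constraint and avoids a silent wrap-around from negative indices.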

Medoids can be computed with different distance metrics. To compare them in a single figure, use the ax and legend parameters:

list_metrics = ["correlation", "euclidean", "manhattan", "cosine"]
fig, axes = plt.subplots(4, 1, figsize=(7, 14), sharex=True, sharey=True)
for i, metric in enumerate(list_metrics):
    ax = axes[i]
    # Show the legend only in the first subplot
    aac_plot.medoids(X, labels=labels, ax=ax, legend=(i == 0), metric=metric)
    ax.set_title(metric)
plt.tight_layout()
plt.show()
plt.close()
../_images/aac_plot_medoids_4_output_9_0.png
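To build intuition for why the highlighted medoids can differ between metrics, here is a dependency-free sketch of the textbook medoid definition: the cluster member with the smallest summed distance to all other members. This is an assumption made for illustration and not necessarily how AAclust.comp_medoids() works internally. The function medoid_indices and the toy data are hypothetical, and correlation is treated as the distance 1 - r, so that maximum correlation corresponds to minimum distance.

```python
import math


def _pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def _distance(a, b, metric):
    """Distance between two samples under the named metric."""
    if metric == "euclidean":
        return math.dist(a, b)
    if metric == "manhattan":
        return sum(abs(x - y) for x, y in zip(a, b))
    if metric == "cosine":
        dot = sum(x * y for x, y in zip(a, b))
        return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))
    if metric == "correlation":
        return 1.0 - _pearson(a, b)  # maximum r equals minimum (1 - r)
    raise ValueError(f"Unknown metric: {metric}")


def medoid_indices(X, labels, metric="euclidean"):
    """Return {cluster label: row index of its medoid} (textbook definition)."""
    medoids = {}
    for label in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == label]
        # Medoid: member with the smallest total distance to its cluster.
        medoids[label] = min(
            members,
            key=lambda i: sum(_distance(X[i], X[j], metric) for j in members),
        )
    return medoids


# Toy data: two clusters of three samples with three features each.
X_demo = [[1, 1, 0], [2, 1, 0], [3, 1, 0], [10, 10, 1], [10, 12, 1], [10, 11, 1]]
labels_demo = [0, 0, 0, 1, 1, 1]
for metric in ["correlation", "euclidean", "manhattan", "cosine"]:
    print(metric, medoid_indices(X_demo, labels_demo, metric=metric))
```

The returned indices refer to rows of the input matrix, so they can be compared with the highlighted points in the plots above.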

Adjust the style of the scatter plot using the dot_size and dot_alpha arguments to change the size of the dots and their transparency:

aac_plot = aa.AAclustPlot(model_class=PCA)
aac_plot.medoids(X, labels=labels, dot_size=50, dot_alpha=1)
plt.show()
../_images/aac_plot_medoids_5_output_11_0.png

Adjust the cluster colors with the palette argument by providing either a list of colors or the name of a colormap:

colors = aa.plot_get_clist(n_colors=5)
aac_plot.medoids(X, labels=labels, palette=colors)
plt.show()
aac_plot.medoids(X, labels=labels, palette="viridis")
plt.show()
../_images/aac_plot_medoids_6_output_13_0.png
../_images/aac_plot_medoids_7_output_13_1.png