AAclust.select_proteins

AAclust.select_proteins(df_seq, X, n_clusters=None, on_center=True, min_th=0.3, metric='euclidean', return_data='annotated')

Select a redundancy-reduced set of proteins from a per-protein feature matrix.

Clusters proteins by their numerical representation in X (e.g. a CPP feature matrix, pooled per-protein embeddings, or structural/DSSP-derived features) and selects one representative protein (the medoid) per cluster [Breimann24a]. X must be pre-pooled to one row per protein; pooling per-residue inputs to a per-protein vector is left to the caller. The result is reported on df_seq in the same style as sequence-based redundancy reduction: a cluster label, an is_representative flag, and the dist_to_rep distance of each protein to its representative.
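Because pooling is left to the caller, one common choice is to average per-residue vectors over the sequence. The following is a minimal sketch of that idea, not part of the library: the helper mean_pool and the random per-residue arrays are hypothetical stand-ins for real embeddings or structural features.

```python
import numpy as np

def mean_pool(per_residue_list):
    """Average each (n_residues, n_features) array over its residues.

    Returns an array of shape (n_proteins, n_features).
    """
    return np.vstack([np.asarray(emb, dtype=float).mean(axis=0)
                      for emb in per_residue_list])

# Hypothetical per-residue inputs: three proteins of different length, 4 features each
rng = np.random.default_rng(0)
per_residue = [rng.normal(size=(n_res, 4)) for n_res in (12, 30, 7)]

X = mean_pool(per_residue)
print(X.shape)  # (3, 4)
```

Mean pooling discards positional information; other reductions (for example max pooling) may suit some feature types better.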

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_proteins, n_columns)) – DataFrame containing an entry column with unique protein identifiers, row-aligned to X (row i of df_seq describes protein i in X).

  • X (array-like, shape (n_proteins, n_features)) – Per-protein feature matrix. Rows correspond to proteins and columns to features.

  • n_clusters (int, optional) – Pre-defined number of clusters, which equals the number of selected proteins. If provided, the number of clusters is not optimized. Must satisfy 0 < n_clusters < n_proteins.

  • min_th (float, default=0.3) – Pearson correlation threshold for clustering optimization (between 0 and 1).

  • on_center (bool, default=True) – If True, min_th is applied to the cluster center; otherwise, it is applied to all cluster members.

  • metric ({'correlation', 'euclidean', 'manhattan', 'cosine'}, default='euclidean') –

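    Measure used for obtaining medoids and computing dist_to_rep. Pearson correlation is a similarity (the medoid maximizes it), whereas the other three are distances (the medoid minimizes them):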

    • correlation: Pearson correlation (maximum)

    • euclidean: Euclidean distance (minimum)

    • manhattan: Manhattan distance (minimum)

    • cosine: Cosine distance (minimum)

  • return_data ({'annotated', 'filtered', 'both'}, default='annotated') –

    Controls the returned data:

    • annotated: df_seq with three added columns (one row per protein).

    • filtered: tuple (df_seq_repr, X_repr) with only the representative proteins.

    • both: tuple (df_seq, X_repr) with the annotated df_seq and representative-only X.

Returns:

  • df_seq (pd.DataFrame) – df_seq extended by cluster (cluster label), is_representative (1 for the representative protein of each cluster, else 0), and dist_to_rep (distance to the representative under metric; 0 for representatives). Returned when return_data='annotated'.

  • df_seq_repr, X_repr (pd.DataFrame and array-like) – The representative proteins only (return_data='filtered'), or the annotated df_seq together with the representative-only feature matrix (return_data='both').
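To make the three added columns concrete, the sketch below derives them from a feature matrix and a given set of cluster labels. It is an illustration only: the helper annotate_by_medoid is hypothetical, it uses the textbook medoid (the member with the smallest summed Euclidean distance to the other members), and the library may select medoids differently, for example relative to the cluster center.

```python
import numpy as np
import pandas as pd

def annotate_by_medoid(df_seq, X, labels):
    """Add cluster, is_representative, and dist_to_rep for given labels (Euclidean)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    is_rep = np.zeros(len(X), dtype=int)
    dist_to_rep = np.zeros(len(X), dtype=float)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # Pairwise Euclidean distances between all members of this cluster
        diff = X[idx][:, None, :] - X[idx][None, :, :]
        D = np.sqrt((diff ** 2).sum(axis=-1))
        # Textbook medoid: smallest summed distance to the other members
        m = D.sum(axis=1).argmin()
        is_rep[idx[m]] = 1
        dist_to_rep[idx] = D[m]
    out = df_seq.copy()
    out["cluster"] = labels
    out["is_representative"] = is_rep
    out["dist_to_rep"] = dist_to_rep
    return out

# Toy data (hypothetical): five proteins, two features, two clusters
df_toy = pd.DataFrame({"entry": ["P1", "P2", "P3", "P4", "P5"]})
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 10.0], [10.0, 11.0]])
labels_toy = [0, 0, 0, 1, 1]

df_out = annotate_by_medoid(df_toy, X_toy, labels_toy)
print(df_out)
```

In this toy case P2 becomes the representative of cluster 0 because it lies between P1 and P3, and each representative has dist_to_rep equal to 0.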

Notes

  • Representatives are the cluster medoids, so is_representative sums to the number of clusters.

  • Under metric='correlation', the distance to the representative is 1 - Pearson correlation; it is undefined (NaN) for a protein whose feature vector has zero variance.
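The zero-variance case in the note above follows directly from the definition of Pearson correlation, whose denominator vanishes for a constant vector. A small NumPy sketch (the function correlation_distance is a hypothetical helper, not a library call) shows this:

```python
import numpy as np

def correlation_distance(u, v):
    """Return 1 - Pearson correlation, or NaN if either vector has zero variance."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc = u - u.mean()
    vc = v - v.mean()
    denom = np.sqrt((uc ** 2).sum() * (vc ** 2).sum())
    if denom == 0:
        # Pearson correlation is undefined for a constant vector
        return np.nan
    return 1.0 - (uc * vc).sum() / denom

rep = [1.0, 2.0, 3.0, 4.0]
d_same = correlation_distance(rep, [2.0, 4.0, 6.0, 8.0])  # perfectly correlated: close to 0
d_opp = correlation_distance(rep, [4.0, 3.0, 2.0, 1.0])   # perfectly anti-correlated: close to 2
d_flat = correlation_distance(rep, [5.0, 5.0, 5.0, 5.0])  # zero variance: NaN
print(d_same, d_opp, d_flat)
```

Screening X for constant rows before clustering with metric='correlation' avoids NaN values in dist_to_rep.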

Examples

The AAclust().select_proteins() method reduces redundancy among proteins based on a numerical per-protein feature matrix X (e.g. a CPP feature matrix, pooled embeddings, or structural/DSSP-derived features). It clusters the proteins, keeps one representative (medoid) per cluster, and reports the result on df_seq. This is analogous to sequence-based redundancy reduction, but operates on numerical data.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

# Load a sequence dataset and build one pre-pooled feature vector per protein
# (here: amino acid composition; in practice a CPP matrix or pooled embedding)
df_seq = aa.load_dataset(name="DOM_GSEC", n=25)
aa_letters = "ACDEFGHIKLMNPQRSTVWY"
X = np.array([[seq.count(a) / len(seq) for a in aa_letters]
              for seq in df_seq["sequence"]])
print("df_seq:", df_seq.shape, " X (n_proteins, n_features):", X.shape)
aa.display_df(df_seq, n_rows=10, show_shape=True)
df_seq: (50, 8)  X (n_proteins, n_features): (50, 20)
DataFrame shape: (50, 8)
    entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
 1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
 2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
 3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
 4  P53801  MAPGVARGPTPYWRL...GLFKEENPYARFENN  0      97         119       RWGVCWVNFE  ALIITMSVVGGTLLLGIAICCCC  CCRRKRSRKP
 5  Q8IUW5  MAPRALPGSAVLAAA...EVPATPVKRERSGTE  0      59         81        NDTGNGHPEY  IAYALVPVFFIMGLFGVLICHLL  KKKGYRCTTE
 6  P01135  MVPSAGQLALFALGI...LLKGRTACCHSETVV  0      99         121       AVVAASQKKQ  AITALVVVSIVALAVLIITCVLI  HCCQVRKHCE
 7  O43914  MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK  0      42         64        DCSCSTVSPG  VLAGIVMGDLVLTVLIALAVYFL  GRLVPRGRGA
 8  P05556  MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK  0      729        751       ENPECPTGPD  IIPIVAGVVAGIVLIGLALLLIW  KLLMIIHDRR
 9  P16234  MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL  0      527        549       VAPTLRSELT  VAAAVLVLLVIVIISLIVLVVIW  KQKPRYEIRW
10  P50895  MEPPDAPAQARGAPR...SGGARGGSGGFGDEC  0      549        571       TVSPQTSQAG  VAVMAVAVSVGLLLLVVAVFYCV  RRKGGPCCRQ

By default (return_data='annotated'), select_proteins returns df_seq with three added columns: cluster (cluster label), is_representative (1 for the representative protein of each cluster), and dist_to_rep (distance to that representative; 0 for representatives):

aac = aa.AAclust()
df_annot = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10)
aa.display_df(df_annot, n_rows=10, show_shape=True)
DataFrame shape: (50, 11)
    entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c       cluster  is_representative  dist_to_rep
 1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS  8        1                  0.000000
 2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR  9        1                  0.000000
 3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI  6        0                  0.051741
 4  P53801  MAPGVARGPTPYWRL...GLFKEENPYARFENN  0      97         119       RWGVCWVNFE  ALIITMSVVGGTLLLGIAICCCC  CCRRKRSRKP  4        0                  0.102295
 5  Q8IUW5  MAPRALPGSAVLAAA...EVPATPVKRERSGTE  0      59         81        NDTGNGHPEY  IAYALVPVFFIMGLFGVLICHLL  KKKGYRCTTE  2        0                  0.082909
 6  P01135  MVPSAGQLALFALGI...LLKGRTACCHSETVV  0      99         121       AVVAASQKKQ  AITALVVVSIVALAVLIITCVLI  HCCQVRKHCE  0        1                  0.000000
 7  O43914  MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK  0      42         64        DCSCSTVSPG  VLAGIVMGDLVLTVLIALAVYFL  GRLVPRGRGA  3        1                  0.000000
 8  P05556  MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK  0      729        751       ENPECPTGPD  IIPIVAGVVAGIVLIGLALLLIW  KLLMIIHDRR  4        0                  0.062169
 9  P16234  MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL  0      527        549       VAPTLRSELT  VAAAVLVLLVIVIISLIVLVVIW  KQKPRYEIRW  1        0                  0.050491
10  P50895  MEPPDAPAQARGAPR...SGGARGGSGGFGDEC  0      549        571       TVSPQTSQAG  VAVMAVAVSVGLLLLVVAVFYCV  RRKGGPCCRQ  6        0                  0.070971

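With return_data='filtered', only the representative proteins are returned, together with their feature rows. The result is a redundancy-reduced set ready for downstream analysis. Note that the cluster labels and representatives in this output differ from the annotated output above (compare P53801), which suggests that results can vary between separate calls: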

df_repr, X_repr = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10,
                                      return_data="filtered")
print("representatives:", df_repr.shape, " X_repr:", X_repr.shape)
aa.display_df(df_repr, n_rows=10, show_shape=True)
representatives: (10, 11)  X_repr: (10, 20)
DataFrame shape: (10, 11)
    entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c       cluster  is_representative  dist_to_rep
 1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS  8        1                  0.000000
 2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR  9        1                  0.000000
 3  P53801  MAPGVARGPTPYWRL...GLFKEENPYARFENN  0      97         119       RWGVCWVNFE  ALIITMSVVGGTLLLGIAICCCC  CCRRKRSRKP  7        1                  0.000000
 4  P01135  MVPSAGQLALFALGI...LLKGRTACCHSETVV  0      99         121       AVVAASQKKQ  AITALVVVSIVALAVLIITCVLI  HCCQVRKHCE  4        1                  0.000000
 5  P01732  MALPVTALLLPLALL...VVKSGDKPSLSARYV  0      184        206       TRGLDFACDI  YIWAPLAGTCGVLLLSLVITLYC  NHRNRRRVCK  1        1                  0.000000
 6  Q8VHS2  MKLKRTAYLLFLYLS...EMWIRMPPPALERLI  0      1346       1368      ADDRLLGIFT  AVGSGTLALFFILLLAGVASLIA  SNKRATQGTY  2        1                  0.000000
 7  P09603  MTAPGAAGRCPPTTW...GSPLTQDDRQVELPV  1      496        518       EGSFSPQLQE  SVFHLLVPSVILVLLAVGGLLFY  RWRRRSHQEP  6        1                  0.000000
 8  Q9H4D0  MLPGRLCWVPLLLAL...ARQAQLEWDDSTLPY  1      831        853       SSIQHSSVVP  SIATVVIIISVCMLVFVVAMGVY  RVRIAHQHFI  0        1                  0.000000
 9  D3ZZK3  MAGIFYFILFSFLFG...MRTQMQQMHGRMVPV  1      548        570       RIIGDGANST  VLLVSVSGSVVLVVILIAAFVIS  RRRSKYSQAK  3        1                  0.000000
10  P27930  MLRLYVLVMGVSAFT...TVLWPHHQDFQSYPK  1      347        369       LRTTVKEASS  TFSWGIVLAPLSLAFLVLGGIWM  HRRCKHRTGK  5        1                  0.000000

return_data='both' returns the fully annotated df_seq and the representative-only feature matrix, which keeps the cluster map alongside the reduced set:

df_full, X_repr = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10,
                                      return_data="both")
print("annotated df_seq:", df_full.shape, " X_repr:", X_repr.shape)
annotated df_seq: (50, 11)  X_repr: (10, 20)
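Once an annotated table is available, the reduced set and per-cluster summaries can be derived from it with plain pandas, so the cluster map and the representatives come from the same clustering. The sketch below uses small hypothetical toy data rather than the library output, and it assumes that the rows of X are aligned with the rows of the annotated table, as stated for df_seq and X above.

```python
import numpy as np
import pandas as pd

# Toy stand-in (hypothetical) for an annotated df_seq and its row-aligned X
df_annot_toy = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4", "P5"],
    "cluster": [0, 0, 0, 1, 1],
    "is_representative": [0, 1, 0, 1, 0],
    "dist_to_rep": [1.0, 0.0, 1.0, 0.0, 1.0],
})
X_toy = np.arange(10, dtype=float).reshape(5, 2)

# Boolean mask of representatives, applied to both the table and the matrix
mask = df_annot_toy["is_representative"].to_numpy() == 1
df_repr_toy = df_annot_toy[mask].reset_index(drop=True)
X_repr_toy = X_toy[mask]

# Number of proteins that each representative stands for
cluster_sizes = df_annot_toy.groupby("cluster")["entry"].size()

print(df_repr_toy["entry"].tolist(), X_repr_toy.shape)
print(cluster_sizes.to_dict())
```

Cluster sizes are useful as weights when the reduced set is meant to reflect the composition of the full dataset.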

If n_clusters is omitted, AAclust optimizes the number of clusters automatically. The min_th (correlation threshold) and on_center parameters tune that optimization, exactly as in AAclust().fit():

df_auto = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=None,
                              min_th=0.5, on_center=True)
print("optimized number of representatives:",
      int(df_auto["is_representative"].sum()))
optimized number of representatives: 1

The metric parameter sets the distance measure used to pick representatives and to compute dist_to_rep:

for metric in ["euclidean", "correlation", "manhattan", "cosine"]:
    df_m = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10, metric=metric)
    mean_dist = df_m["dist_to_rep"].mean()
    print(f"{metric:>12}: mean dist_to_rep = {mean_dist:.3f}")
  euclidean: mean dist_to_rep = 0.050
correlation: mean dist_to_rep = 0.140
  manhattan: mean dist_to_rep = 0.187
     cosine: mean dist_to_rep = 0.028
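The mean dist_to_rep values above are not directly comparable across metrics, because each metric measures distance on its own scale. As a rough illustration with plain NumPy (the helper pairwise_metrics and the two toy composition vectors are hypothetical, and the formulas are the standard definitions rather than the library's code), the same pair of vectors yields four different values:

```python
import numpy as np

def pairwise_metrics(u, v):
    """Compute four standard distances between two 1D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc, vc = u - u.mean(), v - v.mean()
    return {
        "euclidean": float(np.sqrt(((u - v) ** 2).sum())),
        "manhattan": float(np.abs(u - v).sum()),
        # Cosine distance: 1 - cosine similarity of the raw vectors
        "cosine": float(1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v))),
        # Correlation distance: 1 - cosine similarity of the mean-centered vectors
        "correlation": float(1.0 - uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc))),
    }

# Two toy composition-like vectors (each sums to 1)
dists = pairwise_metrics([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
for name, value in dists.items():
    print(f"{name:>12}: {value:.3f}")
```

Manhattan distance is never smaller than Euclidean distance for the same pair, and correlation distance ignores the shared offset that cosine distance still sees, so the ranking of proteins can also differ between metrics.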