AAclust.select_proteins

AAclust.select_proteins(df_seq, X, n_clusters=None, on_center=True, min_th=0.3, metric='euclidean', return_data='annotated')

Select a redundancy-reduced set of proteins from a per-protein feature matrix.

Clusters proteins by their numerical representation in X (e.g. a CPP feature matrix, pooled per-protein embeddings, or structural/DSSP-derived features) and selects one representative protein (the medoid) per cluster [Breimann24a]. X must be pre-pooled to one row per protein; pooling per-residue inputs to a per-protein vector is left to the caller. The result is reported on df_seq in the same style as sequence-based redundancy reduction: a cluster label, an is_representative flag, and the dist_to_rep distance of each protein to its representative.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_proteins, n_columns)) – DataFrame containing an entry column with unique protein identifiers, row-aligned to X (row i of df_seq describes protein i in X).
X (array-like, shape (n_proteins, n_features)) – Per-protein feature matrix. Rows correspond to proteins and columns to features.
n_clusters (int, optional) – Pre-defined number of clusters, which equals the number of selected proteins. If provided, the number of clusters is not optimized. Must satisfy 0 < n_clusters < n_proteins.
min_th (float, default=0.3) – Minimum Pearson correlation threshold (between 0 and 1) used when optimizing the number of clusters.
on_center (bool, default=True) – If True, min_th is applied to the cluster center. Otherwise, to all cluster members.
metric ({'correlation', 'euclidean', 'manhattan', 'cosine'}, default='euclidean') –
Similarity or distance measure used for obtaining medoids and computing dist_to_rep:
- correlation: Pearson correlation (maximum)
- euclidean: Euclidean distance (minimum)
- manhattan: Manhattan distance (minimum)
- cosine: Cosine distance (minimum)
return_data ({'annotated', 'filtered', 'both'}, default='annotated') –
Controls the returned data:
- annotated: df_seq with three added columns (one row per protein).
- filtered: tuple (df_seq_repr, X_repr) with only the representative proteins.
- both: tuple (df_seq, X_repr) with the annotated df_seq and representative-only X.

Returns:

df_seq (pd.DataFrame) – df_seq extended by cluster (cluster label), is_representative (1 for the representative protein of each cluster, else 0), and dist_to_rep (distance to the representative under metric; 0 for representatives). Returned when return_data='annotated'.
df_seq_repr, X_repr (pd.DataFrame and array-like) – The representative proteins only (return_data='filtered'), or the annotated df_seq together with the representative-only feature matrix (return_data='both').

Notes

Representatives are the cluster medoids, so is_representative sums to the number of clusters.
Under metric='correlation' the distance to the representative is 1 - Pearson; it is undefined (NaN) for a protein whose feature vector has zero variance.
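
The first note can be illustrated with a minimal NumPy sketch. It assumes that the medoid is the cluster member closest to the cluster mean under Euclidean distance; the helper name annotate_medoids and the toy data are hypothetical, and this is not the library's implementation.

```python
import numpy as np

def annotate_medoids(X, labels):
    """Return (is_representative, dist_to_rep) for given cluster labels.

    Assumption: the medoid is the member closest to the cluster mean
    (Euclidean distance); AAclust's own rule may differ.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    is_rep = np.zeros(len(X), dtype=int)
    dist_to_rep = np.zeros(len(X), dtype=float)
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        center = X[idx].mean(axis=0)
        # Member with the smallest distance to the cluster mean
        medoid = idx[np.argmin(np.linalg.norm(X[idx] - center, axis=1))]
        is_rep[medoid] = 1
        # Distance of every member to its medoid (0 for the medoid itself)
        dist_to_rep[idx] = np.linalg.norm(X[idx] - X[medoid], axis=1)
    return is_rep, dist_to_rep

# Toy data: two well-separated clusters of three points each
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.4, 0.0],
                  [5.0, 5.0], [5.0, 5.2], [5.0, 5.3]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])
is_rep, dist_to_rep = annotate_medoids(X_toy, labels_toy)
print("number of representatives:", is_rep.sum())
```

Each cluster contributes exactly one flagged row, so the flag column sums to the number of distinct cluster labels.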

See also

AAclust.fit(): The underlying clustering used for protein selection.
AAclust.comp_medoids(): The medoid computation defining the representatives.

Examples

The AAclust().select_proteins() method reduces redundancy among proteins based on a numerical per-protein feature matrix X (e.g. a CPP feature matrix, pooled embeddings, or structural/DSSP-derived features). It clusters the proteins, keeps one representative (medoid) per cluster, and reports the result on df_seq. This is analogous to sequence-based redundancy reduction, but for numerical data.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

# Load a sequence dataset and build one pre-pooled feature vector per protein
# (here: amino acid composition; in practice a CPP matrix or pooled embedding)
df_seq = aa.load_dataset(name="DOM_GSEC", n=25)
aa_letters = "ACDEFGHIKLMNPQRSTVWY"
X = np.array([[seq.count(a) / len(seq) for a in aa_letters]
              for seq in df_seq["sequence"]])
print("df_seq:", df_seq.shape, " X (n_proteins, n_features):", X.shape)
aa.display_df(df_seq, n_rows=10, show_shape=True)

df_seq: (50, 8)  X (n_proteins, n_features): (50, 20)
DataFrame shape: (50, 8)

	entry	sequence	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P53801	MAPGVARGPTPYWRL...GLFKEENPYARFENN	97	119	RWGVCWVNFE	ALIITMSVVGGTLLLGIAICCCC	CCRRKRSRKP
5	Q8IUW5	MAPRALPGSAVLAAA...EVPATPVKRERSGTE	59	81	NDTGNGHPEY	IAYALVPVFFIMGLFGVLICHLL	KKKGYRCTTE
6	P01135	MVPSAGQLALFALGI...LLKGRTACCHSETVV	99	121	AVVAASQKKQ	AITALVVVSIVALAVLIITCVLI	HCCQVRKHCE
7	O43914	MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK	42	64	DCSCSTVSPG	VLAGIVMGDLVLTVLIALAVYFL	GRLVPRGRGA
8	P05556	MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK	729	751	ENPECPTGPD	IIPIVAGVVAGIVLIGLALLLIW	KLLMIIHDRR
9	P16234	MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL	527	549	VAPTLRSELT	VAAAVLVLLVIVIISLIVLVVIW	KQKPRYEIRW
10	P50895	MEPPDAPAQARGAPR...SGGARGGSGGFGDEC	549	571	TVSPQTSQAG	VAVMAVAVSVGLLLLVVAVFYCV	RRKGGPCCRQ
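
Amino acid composition already yields one vector per protein. For per-residue inputs (e.g. embeddings of shape (seq_len, n_dims)), pooling is left to the caller. The following is a minimal sketch of mean or max pooling; the helper pool_per_protein is hypothetical and random arrays stand in for real embeddings.

```python
import numpy as np

def pool_per_protein(list_emb, how="mean"):
    """Pool per-residue arrays (seq_len_i, n_dims) into one (n_proteins, n_dims) matrix."""
    pool = {"mean": np.mean, "max": np.max}[how]
    return np.vstack([pool(np.asarray(emb, dtype=float), axis=0) for emb in list_emb])

# Stand-in data: three "proteins" of different lengths, 4 dimensions per residue
rng = np.random.default_rng(0)
list_emb = [rng.normal(size=(n_res, 4)) for n_res in (12, 30, 7)]
X_pooled = pool_per_protein(list_emb)
print(X_pooled.shape)  # (3, 4)
```

The pooled matrix has one row per protein and can be passed as X, provided its row order matches df_seq.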

By default (return_data='annotated'), select_proteins returns df_seq with three added columns: cluster (cluster label), is_representative (1 for the representative protein of each cluster), and dist_to_rep (distance to that representative; 0 for representatives):

aac = aa.AAclust()
df_annot = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10)
aa.display_df(df_annot, n_rows=10, show_shape=True)

DataFrame shape: (50, 11)

	entry	sequence	tmd_start	tmd_stop	jmd_n	tmd	jmd_c	cluster	is_representative	dist_to_rep
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS	8	1	0.000000
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR	9	1	0.000000
3	Q969W9	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI	6	0	0.051741
4	P53801	MAPGVARGPTPYWRL...GLFKEENPYARFENN	97	119	RWGVCWVNFE	ALIITMSVVGGTLLLGIAICCCC	CCRRKRSRKP	4	0	0.102295
5	Q8IUW5	MAPRALPGSAVLAAA...EVPATPVKRERSGTE	59	81	NDTGNGHPEY	IAYALVPVFFIMGLFGVLICHLL	KKKGYRCTTE	2	0	0.082909
6	P01135	MVPSAGQLALFALGI...LLKGRTACCHSETVV	99	121	AVVAASQKKQ	AITALVVVSIVALAVLIITCVLI	HCCQVRKHCE	0	1	0.000000
7	O43914	MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK	42	64	DCSCSTVSPG	VLAGIVMGDLVLTVLIALAVYFL	GRLVPRGRGA	3	1	0.000000
8	P05556	MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK	729	751	ENPECPTGPD	IIPIVAGVVAGIVLIGLALLLIW	KLLMIIHDRR	4	0	0.062169
9	P16234	MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL	527	549	VAPTLRSELT	VAAAVLVLLVIVIISLIVLVVIW	KQKPRYEIRW	1	0	0.050491
10	P50895	MEPPDAPAQARGAPR...SGGARGGSGGFGDEC	549	571	TVSPQTSQAG	VAVMAVAVSVGLLLLVVAVFYCV	RRKGGPCCRQ	6	0	0.070971
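
The three added columns lend themselves to a per-cluster summary with plain pandas. The sketch below uses a small hand-made frame with hypothetical entries and values (not taken from DOM_GSEC) to show one way of listing each cluster's representative, size, and largest member distance.

```python
import pandas as pd

# Hypothetical annotated result with the three added columns
df_toy = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4", "P5"],
    "cluster": [0, 0, 1, 1, 1],
    "is_representative": [1, 0, 0, 1, 0],
    "dist_to_rep": [0.0, 0.12, 0.05, 0.0, 0.20],
})

# Representative entry per cluster (one flagged row per cluster)
rep_by_cluster = (df_toy[df_toy["is_representative"] == 1]
                  .set_index("cluster")["entry"])

# Cluster size and largest distance of any member to its representative
summary = df_toy.groupby("cluster").agg(
    n_members=("entry", "count"),
    max_dist=("dist_to_rep", "max"),
)
summary["representative"] = rep_by_cluster
print(summary)
```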

With return_data='filtered', only the representative proteins are returned together with their feature rows, giving a redundancy-reduced set ready for downstream analysis:

df_repr, X_repr = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10,
                                      return_data="filtered")
print("representatives:", df_repr.shape, " X_repr:", X_repr.shape)
aa.display_df(df_repr, n_rows=10, show_shape=True)

representatives: (10, 11)  X_repr: (10, 20)
DataFrame shape: (10, 11)

	entry	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c	cluster	is_representative
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS	8	1
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	0	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR	9	1
3	P53801	MAPGVARGPTPYWRL...GLFKEENPYARFENN	0	97	119	RWGVCWVNFE	ALIITMSVVGGTLLLGIAICCCC	CCRRKRSRKP	7	1
4	P01135	MVPSAGQLALFALGI...LLKGRTACCHSETVV	0	99	121	AVVAASQKKQ	AITALVVVSIVALAVLIITCVLI	HCCQVRKHCE	4	1
5	P01732	MALPVTALLLPLALL...VVKSGDKPSLSARYV	0	184	206	TRGLDFACDI	YIWAPLAGTCGVLLLSLVITLYC	NHRNRRRVCK	1	1
6	Q8VHS2	MKLKRTAYLLFLYLS...EMWIRMPPPALERLI	0	1346	1368	ADDRLLGIFT	AVGSGTLALFFILLLAGVASLIA	SNKRATQGTY	2	1
7	P09603	MTAPGAAGRCPPTTW...GSPLTQDDRQVELPV	1	496	518	EGSFSPQLQE	SVFHLLVPSVILVLLAVGGLLFY	RWRRRSHQEP	6	1
8	Q9H4D0	MLPGRLCWVPLLLAL...ARQAQLEWDDSTLPY	1	831	853	SSIQHSSVVP	SIATVVIIISVCMLVFVVAMGVY	RVRIAHQHFI	0	1
9	D3ZZK3	MAGIFYFILFSFLFG...MRTQMQQMHGRMVPV	1	548	570	RIIGDGANST	VLLVSVSGSVVLVVILIAAFVIS	RRRSKYSQAK	3	1
10	P27930	MLRLYVLVMGVSAFT...TVLWPHHQDFQSYPK	1	347	369	LRTTVKEASS	TFSWGIVLAPLSLAFLVLGGIWM	HRRCKHRTGK	5	1

With return_data='both', the fully annotated df_seq is returned together with the representative-only feature matrix, so the cluster map stays available alongside the reduced set:

df_full, X_repr = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10,
                                      return_data="both")
print("annotated df_seq:", df_full.shape, " X_repr:", X_repr.shape)

annotated df_seq: (50, 11)  X_repr: (10, 20)
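
Because df_seq and X are row-aligned, a representative-only subset can also be derived from an annotated frame with a boolean mask. The sketch below uses hypothetical toy data and assumes that the original row order is kept; whether select_proteins orders X_repr the same way is not stated here.

```python
import numpy as np
import pandas as pd

# Hypothetical annotated frame and its row-aligned feature matrix
df_toy = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4"],
    "cluster": [0, 0, 1, 1],
    "is_representative": [1, 0, 0, 1],
})
X_toy = np.arange(12, dtype=float).reshape(4, 3)

# Boolean mask over rows; valid because row i of df_toy describes row i of X_toy
mask = df_toy["is_representative"].to_numpy() == 1
df_repr_toy = df_toy[mask].reset_index(drop=True)
X_repr_toy = X_toy[mask]
print(df_repr_toy["entry"].tolist(), X_repr_toy.shape)
```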

If n_clusters is omitted, AAclust optimizes the number of clusters automatically. The min_th (correlation threshold) and on_center parameters tune that optimization, exactly as in AAclust().fit():

df_auto = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=None,
                              min_th=0.5, on_center=True)
print("optimized number of representatives:",
      int(df_auto["is_representative"].sum()))

optimized number of representatives: 1
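
To make the role of min_th and on_center more concrete, the following sketch checks a single cluster against the threshold. It assumes that the cluster center is the mean of its members and that the criterion is the minimum Pearson correlation, either to the center (on_center=True) or between all pairs of members (on_center=False). The helper names are hypothetical and this is not the library's code.

```python
import numpy as np

def min_cor_center(X_cluster):
    """Lowest Pearson correlation between any member and the cluster mean."""
    center = X_cluster.mean(axis=0)
    return min(np.corrcoef(row, center)[0, 1] for row in X_cluster)

def min_cor_all(X_cluster):
    """Lowest pairwise Pearson correlation among all members."""
    return np.corrcoef(X_cluster).min()

def cluster_ok(X_cluster, min_th=0.3, on_center=True):
    """Check one cluster against the threshold (sketch of the criterion)."""
    min_cor = min_cor_center(X_cluster) if on_center else min_cor_all(X_cluster)
    return bool(min_cor >= min_th)

# Toy cluster: the last two rows are uncorrelated with each other,
# but every row correlates with the cluster mean
X_cluster = np.array([[1.0, 2.0, 3.0, 4.0],
                      [1.0, 3.0, 2.0, 4.0],
                      [2.0, 1.0, 4.0, 3.0]])
print("center criterion met:", cluster_ok(X_cluster, min_th=0.3, on_center=True))
print("all-members criterion met:", cluster_ok(X_cluster, min_th=0.3, on_center=False))
```

In this toy cluster the center-based check passes while the all-members check fails, which shows why the two settings can lead to different numbers of clusters.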

The metric parameter sets the distance measure used to pick representatives and to compute dist_to_rep:

for metric in ["euclidean", "correlation", "manhattan", "cosine"]:
    df_m = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10, metric=metric)
    mean_dist = df_m["dist_to_rep"].mean()
    print(f"{metric:>12}: mean dist_to_rep = {mean_dist:.3f}")

   euclidean: mean dist_to_rep = 0.050
 correlation: mean dist_to_rep = 0.140
   manhattan: mean dist_to_rep = 0.187
      cosine: mean dist_to_rep = 0.028
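
For reference, the four measures can be written out for a single pair of feature vectors. This is a minimal NumPy sketch using the standard definitions (cosine distance as 1 - cosine similarity, correlation distance as 1 - Pearson, as stated in the Notes); the function dist is hypothetical and not part of aaanalysis. It also shows why a zero-variance vector gives NaN under the correlation measure.

```python
import numpy as np

def dist(u, v, metric="euclidean"):
    """Distance between two feature vectors under the four listed metrics."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    if metric == "euclidean":
        return float(np.sqrt(np.sum((u - v) ** 2)))
    if metric == "manhattan":
        return float(np.sum(np.abs(u - v)))
    if metric in ("cosine", "correlation"):
        if metric == "correlation":
            # Pearson correlation is the cosine similarity of centered vectors
            u, v = u - u.mean(), v - v.mean()
        denom = np.linalg.norm(u) * np.linalg.norm(v)
        # Zero variance (or a zero vector) makes the ratio undefined
        return float(1 - np.dot(u, v) / denom) if denom > 0 else float("nan")
    raise ValueError(f"Unknown metric: {metric}")

u = [0.1, 0.2, 0.3, 0.4]
v = [0.4, 0.3, 0.2, 0.1]
flat = [0.25, 0.25, 0.25, 0.25]  # zero variance
for metric in ["euclidean", "manhattan", "cosine", "correlation"]:
    print(metric, round(dist(u, v, metric), 3), dist(u, flat, metric))
```

Because the measures live on different scales, mean dist_to_rep values are comparable within one metric but not across metrics.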