AAclust.select_proteins
- AAclust.select_proteins(df_seq, X, n_clusters=None, on_center=True, min_th=0.3, metric='euclidean', return_data='annotated')
Select a redundancy-reduced set of proteins from a per-protein feature matrix.
Clusters proteins by their numerical representation in X (e.g. a CPP feature matrix, pooled per-protein embeddings, or structural/DSSP-derived features) and selects one representative protein (the medoid) per cluster [Breimann24a]. X must be pre-pooled to one row per protein; pooling per-residue inputs to a per-protein vector is left to the caller. The result is reported on df_seq in the same style as sequence-based redundancy reduction: a cluster label, an is_representative flag, and the dist_to_rep distance of each protein to its representative.

Added in version 1.1.0.
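Because pooling is left to the caller, the following is a minimal sketch of one common choice, mean pooling. It assumes the per-residue input is available as a list of NumPy arrays of shape (length_i, n_dims). The helper name pool_per_protein and the toy data are illustrative and not part of AAanalysis.

```python
import numpy as np


def pool_per_protein(residue_embeddings):
    """Mean-pool a list of (length_i, n_dims) arrays into one (n_proteins, n_dims) matrix."""
    return np.vstack([emb.mean(axis=0) for emb in residue_embeddings])


# Toy per-residue embeddings for three proteins of different length (hypothetical data)
rng = np.random.default_rng(0)
residue_embeddings = [rng.normal(size=(length, 8)) for length in (12, 30, 7)]

X = pool_per_protein(residue_embeddings)
print(X.shape)  # (3, 8): one row per protein
```

Other pooling schemes (max pooling, or pooling over a sequence region only) fit the same pattern, as long as the result has exactly one row per protein in the order of df_seq.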
- Parameters:
df_seq (pd.DataFrame, shape (n_proteins, n_columns)) – DataFrame containing an entry column with unique protein identifiers, row-aligned to X (row i of df_seq describes protein i in X).
X (array-like, shape (n_proteins, n_features)) – Per-protein feature matrix. Rows correspond to proteins and columns to features.
n_clusters (int, optional) – Pre-defined number of clusters (selected proteins). If provided, k is not optimized. Must be 0 < n_clusters < n_proteins.
min_th (float, default=0.3) – Pearson correlation threshold for clustering optimization (between 0 and 1).
on_center (bool, default=True) – If True, min_th is applied to the cluster center. Otherwise, to all cluster members.
metric ({'correlation', 'euclidean', 'manhattan', 'cosine'}, default='euclidean') – Similarity measure used for obtaining medoids and computing dist_to_rep:
  - correlation: Pearson correlation (maximum)
  - euclidean: Euclidean distance (minimum)
  - manhattan: Manhattan distance (minimum)
  - cosine: Cosine distance (minimum)
return_data ({'annotated', 'filtered', 'both'}, default='annotated') – Controls the returned data:
  - annotated: df_seq with three added columns (one row per protein).
  - filtered: tuple (df_seq_repr, X_repr) with only the representative proteins.
  - both: tuple (df_seq, X_repr) with the annotated df_seq and representative-only X.
- Returns:
df_seq (pd.DataFrame) – df_seq extended by cluster (cluster label), is_representative (1 for the representative protein of each cluster, else 0), and dist_to_rep (distance to the representative under metric; 0 for representatives). Returned when return_data='annotated'.
df_seq_repr, X_repr (pd.DataFrame and array-like) – The representative proteins only (return_data='filtered'), or the annotated df_seq together with the representative-only feature matrix (return_data='both').
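The relationship between the return modes can be sketched with plain pandas and NumPy. The snippet assumes that the filtered output corresponds to the rows with is_representative == 1 and the matching rows of X. The toy table and its values are made up for illustration, and the library may differ in details such as index handling.

```python
import numpy as np
import pandas as pd

# Hypothetical annotated result for five proteins in two clusters
df_annot = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4", "P5"],
    "cluster": [0, 0, 1, 1, 1],
    "is_representative": [1, 0, 0, 1, 0],
    "dist_to_rep": [0.0, 0.12, 0.08, 0.0, 0.21],
})
X = np.arange(15, dtype=float).reshape(5, 3)  # toy feature matrix, row-aligned to df_annot

# Boolean mask of representatives, applied to both the table and the feature matrix
mask = df_annot["is_representative"].to_numpy() == 1
df_seq_repr = df_annot[mask].reset_index(drop=True)
X_repr = X[mask]

print(df_seq_repr["entry"].tolist())  # ['P1', 'P4']
print(X_repr.shape)                   # (2, 3)
```

Applying the same mask to the table and to X keeps the two outputs row-aligned, which is the property the tuple return values rely on.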
Notes
- Representatives are the cluster medoids, so is_representative sums to the number of clusters.
- Under metric='correlation' the distance to the representative is 1 - Pearson; it is undefined (NaN) for a protein whose feature vector has zero variance.
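Both notes can be illustrated with a small NumPy sketch. It assumes that the medoid is the cluster member closest to the cluster mean and that the correlation distance is 1 - Pearson; the helper names pearson_distance and annotate_cluster are hypothetical, and AAclust.comp_medoids() may implement the selection differently.

```python
import numpy as np


def pearson_distance(u, v):
    """Return 1 - Pearson correlation, or NaN if either vector has zero variance."""
    u_c, v_c = u - u.mean(), v - v.mean()
    denom = np.sqrt((u_c ** 2).sum() * (v_c ** 2).sum())
    if denom == 0:
        return np.nan
    return 1.0 - float((u_c * v_c).sum() / denom)


def annotate_cluster(X_cluster):
    """Pick the member closest to the cluster mean and measure distances to it."""
    center = X_cluster.mean(axis=0)
    dist_to_center = np.array([pearson_distance(x, center) for x in X_cluster])
    i_medoid = int(np.nanargmin(dist_to_center))  # skip undefined distances
    dist_to_rep = np.array([pearson_distance(x, X_cluster[i_medoid]) for x in X_cluster])
    dist_to_rep[i_medoid] = 0.0  # a representative has distance 0 to itself
    return i_medoid, dist_to_rep


# Toy cluster: two correlated profiles and one constant (zero-variance) profile
X_cluster = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 9.0],
    [2.0, 2.0, 2.0, 2.0],
])
i_medoid, dist_to_rep = annotate_cluster(X_cluster)
print(i_medoid, dist_to_rep)  # the constant row (index 2) gets NaN
```

In practice, rows with NaN in dist_to_rep can be located with df["dist_to_rep"].isna() and inspected for constant feature vectors.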
See also
- AAclust.fit(): The underlying clustering used for protein selection.
- AAclust.comp_medoids(): The medoid computation defining the representatives.
Examples
The AAclust().select_proteins() method reduces redundancy among proteins based on a numerical per-protein feature matrix X (e.g. a CPP feature matrix, pooled embeddings, or structural/DSSP-derived features). It clusters the proteins and keeps one representative (medoid) per cluster, reporting the result on df_seq. This is analogous to sequence-based redundancy reduction, but for numerical data.

    import numpy as np
    import aaanalysis as aa
    aa.options["verbose"] = False

    # Load a sequence dataset and build one pre-pooled feature vector per protein
    # (here: amino acid composition; in practice a CPP matrix or pooled embedding)
    df_seq = aa.load_dataset(name="DOM_GSEC", n=25)
    aa_letters = "ACDEFGHIKLMNPQRSTVWY"
    # One row per protein: relative frequency of each of the 20 amino acids
    X = np.array([[seq.count(a) / len(seq) for a in aa_letters] for seq in df_seq["sequence"]])
    print("df_seq:", df_seq.shape, " X (n_proteins, n_features):", X.shape)
    aa.display_df(df_seq, n_rows=10, show_shape=True)
    df_seq: (50, 8)  X (n_proteins, n_features): (50, 20)
    DataFrame shape: (50, 8)

| | entry | sequence | label | tmd_start | tmd_stop | jmd_n | tmd | jmd_c |
|---|---|---|---|---|---|---|---|---|
| 1 | Q14802 | MQKVTLGLLVFLAGF...PGETPPLITPGSAQS | 0 | 37 | 59 | NSPFYYDWHS | LQVGGLICAGVLCAMGIIIVMSA | KCKCKFGQKS |
| 2 | Q86UE4 | MAARSWQDELAQQAE...SPKQIKKKKKARRET | 0 | 50 | 72 | LGLEPKRYPG | WVILVGTGALGLLLLFLLGYGWA | AACAGARKKR |
| 3 | Q969W9 | MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL | 0 | 41 | 63 | FQSMEITELE | FVQIIIIVVVMMVMVVVITCLLS | HYKLSARSFI |
| 4 | P53801 | MAPGVARGPTPYWRL...GLFKEENPYARFENN | 0 | 97 | 119 | RWGVCWVNFE | ALIITMSVVGGTLLLGIAICCCC | CCRRKRSRKP |
| 5 | Q8IUW5 | MAPRALPGSAVLAAA...EVPATPVKRERSGTE | 0 | 59 | 81 | NDTGNGHPEY | IAYALVPVFFIMGLFGVLICHLL | KKKGYRCTTE |
| 6 | P01135 | MVPSAGQLALFALGI...LLKGRTACCHSETVV | 0 | 99 | 121 | AVVAASQKKQ | AITALVVVSIVALAVLIITCVLI | HCCQVRKHCE |
| 7 | O43914 | MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK | 0 | 42 | 64 | DCSCSTVSPG | VLAGIVMGDLVLTVLIALAVYFL | GRLVPRGRGA |
| 8 | P05556 | MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK | 0 | 729 | 751 | ENPECPTGPD | IIPIVAGVVAGIVLIGLALLLIW | KLLMIIHDRR |
| 9 | P16234 | MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL | 0 | 527 | 549 | VAPTLRSELT | VAAAVLVLLVIVIISLIVLVVIW | KQKPRYEIRW |
| 10 | P50895 | MEPPDAPAQARGAPR...SGGARGGSGGFGDEC | 0 | 549 | 571 | TVSPQTSQAG | VAVMAVAVSVGLLLLVVAVFYCV | RRKGGPCCRQ |

By default (return_data='annotated'), select_proteins returns df_seq with three added columns: cluster (cluster label), is_representative (1 for the representative protein of each cluster), and dist_to_rep (distance to that representative; 0 for representatives):

    aac = aa.AAclust()
    df_annot = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10)
    aa.display_df(df_annot, n_rows=10, show_shape=True)
    DataFrame shape: (50, 11)

| | entry | sequence | label | tmd_start | tmd_stop | jmd_n | tmd | jmd_c | cluster | is_representative | dist_to_rep |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Q14802 | MQKVTLGLLVFLAGF...PGETPPLITPGSAQS | 0 | 37 | 59 | NSPFYYDWHS | LQVGGLICAGVLCAMGIIIVMSA | KCKCKFGQKS | 8 | 1 | 0.000000 |
| 2 | Q86UE4 | MAARSWQDELAQQAE...SPKQIKKKKKARRET | 0 | 50 | 72 | LGLEPKRYPG | WVILVGTGALGLLLLFLLGYGWA | AACAGARKKR | 9 | 1 | 0.000000 |
| 3 | Q969W9 | MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL | 0 | 41 | 63 | FQSMEITELE | FVQIIIIVVVMMVMVVVITCLLS | HYKLSARSFI | 6 | 0 | 0.051741 |
| 4 | P53801 | MAPGVARGPTPYWRL...GLFKEENPYARFENN | 0 | 97 | 119 | RWGVCWVNFE | ALIITMSVVGGTLLLGIAICCCC | CCRRKRSRKP | 4 | 0 | 0.102295 |
| 5 | Q8IUW5 | MAPRALPGSAVLAAA...EVPATPVKRERSGTE | 0 | 59 | 81 | NDTGNGHPEY | IAYALVPVFFIMGLFGVLICHLL | KKKGYRCTTE | 2 | 0 | 0.082909 |
| 6 | P01135 | MVPSAGQLALFALGI...LLKGRTACCHSETVV | 0 | 99 | 121 | AVVAASQKKQ | AITALVVVSIVALAVLIITCVLI | HCCQVRKHCE | 0 | 1 | 0.000000 |
| 7 | O43914 | MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK | 0 | 42 | 64 | DCSCSTVSPG | VLAGIVMGDLVLTVLIALAVYFL | GRLVPRGRGA | 3 | 1 | 0.000000 |
| 8 | P05556 | MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK | 0 | 729 | 751 | ENPECPTGPD | IIPIVAGVVAGIVLIGLALLLIW | KLLMIIHDRR | 4 | 0 | 0.062169 |
| 9 | P16234 | MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL | 0 | 527 | 549 | VAPTLRSELT | VAAAVLVLLVIVIISLIVLVVIW | KQKPRYEIRW | 1 | 0 | 0.050491 |
| 10 | P50895 | MEPPDAPAQARGAPR...SGGARGGSGGFGDEC | 0 | 549 | 571 | TVSPQTSQAG | VAVMAVAVSVGLLLLVVAVFYCV | RRKGGPCCRQ | 6 | 0 | 0.070971 |

With return_data='filtered' only the representative proteins are returned, together with their feature rows, giving a redundancy-reduced set ready for downstream analysis:

    df_repr, X_repr = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10, return_data="filtered")
    print("representatives:", df_repr.shape, " X_repr:", X_repr.shape)
    aa.display_df(df_repr, n_rows=10, show_shape=True)
    representatives: (10, 11)  X_repr: (10, 20)
    DataFrame shape: (10, 11)

| | entry | sequence | label | tmd_start | tmd_stop | jmd_n | tmd | jmd_c | cluster | is_representative | dist_to_rep |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Q14802 | MQKVTLGLLVFLAGF...PGETPPLITPGSAQS | 0 | 37 | 59 | NSPFYYDWHS | LQVGGLICAGVLCAMGIIIVMSA | KCKCKFGQKS | 8 | 1 | 0.000000 |
| 2 | Q86UE4 | MAARSWQDELAQQAE...SPKQIKKKKKARRET | 0 | 50 | 72 | LGLEPKRYPG | WVILVGTGALGLLLLFLLGYGWA | AACAGARKKR | 9 | 1 | 0.000000 |
| 3 | P53801 | MAPGVARGPTPYWRL...GLFKEENPYARFENN | 0 | 97 | 119 | RWGVCWVNFE | ALIITMSVVGGTLLLGIAICCCC | CCRRKRSRKP | 7 | 1 | 0.000000 |
| 4 | P01135 | MVPSAGQLALFALGI...LLKGRTACCHSETVV | 0 | 99 | 121 | AVVAASQKKQ | AITALVVVSIVALAVLIITCVLI | HCCQVRKHCE | 4 | 1 | 0.000000 |
| 5 | P01732 | MALPVTALLLPLALL...VVKSGDKPSLSARYV | 0 | 184 | 206 | TRGLDFACDI | YIWAPLAGTCGVLLLSLVITLYC | NHRNRRRVCK | 1 | 1 | 0.000000 |
| 6 | Q8VHS2 | MKLKRTAYLLFLYLS...EMWIRMPPPALERLI | 0 | 1346 | 1368 | ADDRLLGIFT | AVGSGTLALFFILLLAGVASLIA | SNKRATQGTY | 2 | 1 | 0.000000 |
| 7 | P09603 | MTAPGAAGRCPPTTW...GSPLTQDDRQVELPV | 1 | 496 | 518 | EGSFSPQLQE | SVFHLLVPSVILVLLAVGGLLFY | RWRRRSHQEP | 6 | 1 | 0.000000 |
| 8 | Q9H4D0 | MLPGRLCWVPLLLAL...ARQAQLEWDDSTLPY | 1 | 831 | 853 | SSIQHSSVVP | SIATVVIIISVCMLVFVVAMGVY | RVRIAHQHFI | 0 | 1 | 0.000000 |
| 9 | D3ZZK3 | MAGIFYFILFSFLFG...MRTQMQQMHGRMVPV | 1 | 548 | 570 | RIIGDGANST | VLLVSVSGSVVLVVILIAAFVIS | RRRSKYSQAK | 3 | 1 | 0.000000 |
| 10 | P27930 | MLRLYVLVMGVSAFT...TVLWPHHQDFQSYPK | 1 | 347 | 369 | LRTTVKEASS | TFSWGIVLAPLSLAFLVLGGIWM | HRRCKHRTGK | 5 | 1 | 0.000000 |

return_data='both' returns the fully annotated df_seq and the representative-only feature matrix, keeping the cluster map alongside the reduced set:

    df_full, X_repr = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10, return_data="both")
    print("annotated df_seq:", df_full.shape, " X_repr:", X_repr.shape)
    annotated df_seq: (50, 11)  X_repr: (10, 20)
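One use of the annotated table is to look up, for every protein, which entry represents its cluster. The following is a minimal pandas sketch on a made-up annotated table: the column names follow the text above, while the entries, distances, and the added representative column are hypothetical.

```python
import pandas as pd

# Hypothetical annotated table with the three columns added by select_proteins
df_annot = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4", "P5"],
    "cluster": [0, 0, 1, 1, 1],
    "is_representative": [1, 0, 0, 1, 0],
    "dist_to_rep": [0.0, 0.12, 0.08, 0.0, 0.21],
})

# Map each cluster label to the entry of its representative
reps = df_annot[df_annot["is_representative"] == 1]
cluster_to_rep = reps.set_index("cluster")["entry"].to_dict()

# Attach the representative entry to every protein and summarize each cluster
df_annot["representative"] = df_annot["cluster"].map(cluster_to_rep)
summary = df_annot.groupby("representative")["dist_to_rep"].agg(["size", "max"])
print(summary)
```

The summary lists, per representative, how many proteins it stands for and how far the most distant member is, which helps to judge how much variation each representative hides.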
If n_clusters is omitted, AAclust optimizes the number of clusters automatically. The min_th (correlation threshold) and on_center parameters tune that optimization, exactly as in AAclust().fit():

    df_auto = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=None, min_th=0.5, on_center=True)
    print("optimized number of representatives:", int(df_auto["is_representative"].sum()))
    optimized number of representatives: 1
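A result of 1 suggests that a single cluster already met the correlation threshold for these composition vectors. The kind of check that min_th and on_center describe can be sketched as follows. It assumes the criterion is the minimum Pearson correlation either between each member and its cluster center (on_center=True) or between all pairs of members (on_center=False). The function min_correlation and the toy data are hypothetical, not the library's implementation.

```python
import numpy as np


def min_correlation(X, labels, on_center=True):
    """Smallest within-cluster Pearson correlation over all clusters."""
    min_cor = 1.0
    for label in np.unique(labels):
        members = X[labels == label]
        if on_center:
            # Correlation of every member with the cluster mean
            center = members.mean(axis=0)
            cors = [np.corrcoef(x, center)[0, 1] for x in members]
        elif len(members) > 1:
            # Correlation between all distinct pairs of members
            cors = np.corrcoef(members)[np.triu_indices(len(members), k=1)]
        else:
            cors = [1.0]
        min_cor = min(min_cor, float(np.min(cors)))
    return min_cor


# Toy data: eight noisy copies of one profile, all assigned to a single cluster
rng = np.random.default_rng(0)
base = np.array([0.1, 0.3, 0.2, 0.4, 0.6, 0.5])
X = base + rng.normal(scale=0.02, size=(8, base.size))
labels = np.zeros(len(X), dtype=int)

min_th = 0.5
print(min_correlation(X, labels, on_center=True) >= min_th)
print(min_correlation(X, labels, on_center=False) >= min_th)
```

When feature vectors are as similar to each other as in this toy example, one cluster passes the threshold, so raising min_th or choosing a more discriminative X is the way to obtain more representatives.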
The metric parameter sets the distance measure used to pick representatives and to compute dist_to_rep:

    for metric in ["euclidean", "correlation", "manhattan", "cosine"]:
        df_m = aac.select_proteins(df_seq=df_seq, X=X, n_clusters=10, metric=metric)
        mean_dist = df_m["dist_to_rep"].mean()
        print(f"{metric:>12}: mean dist_to_rep = {mean_dist:.3f}")
       euclidean: mean dist_to_rep = 0.050
     correlation: mean dist_to_rep = 0.140
       manhattan: mean dist_to_rep = 0.187
          cosine: mean dist_to_rep = 0.028
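The four metrics live on different scales, so the mean dist_to_rep values are not directly comparable across metrics. The following is a minimal NumPy sketch of the four distances between two feature vectors, using their standard textbook definitions. The distance helper is illustrative; how AAclust computes these internally is not shown in this section.

```python
import numpy as np


def distance(u, v, metric="euclidean"):
    """Distance between two 1D feature vectors under one of four metrics."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    if metric == "euclidean":
        return float(np.sqrt(((u - v) ** 2).sum()))
    if metric == "manhattan":
        return float(np.abs(u - v).sum())
    if metric == "cosine":
        return float(1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    if metric == "correlation":
        return float(1.0 - np.corrcoef(u, v)[0, 1])
    raise ValueError(f"Unknown metric: {metric}")


# Two toy composition-like vectors with opposite trends
u = np.array([0.10, 0.20, 0.30, 0.40])
v = np.array([0.40, 0.30, 0.20, 0.10])
for metric in ["euclidean", "correlation", "manhattan", "cosine"]:
    print(f"{metric:>12}: {distance(u, v, metric):.3f}")
```

Euclidean and Manhattan distances depend on the magnitude of the features, whereas cosine and correlation distances depend only on the shape of the profile, so the choice of metric should follow what "similar proteins" means for the features in X.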