CPP across data representations: sequence, structure, embeddings, annotation
A CPP feature is part × split × value: where in the sequence
(tmd, jmd_n_tmd_n, tmd_c_jmd_c), how the positions are
picked (Segment / Pattern / PeriodicPattern), and what
value is summarized there. The value axis has two interchangeable
sources that share one grammar and one df_feat schema:
a physicochemical amino acid scale (the sequence arm,
CPP.run), and any per-residue numerical tensor in
[0, 1] (the numerical arm, CPP.run_num).
So learned, structural, and annotation-based representations all ride
the exact same interpretable CPP machinery. Below we feed four
representations of the same proteins into CPP and chart each one’s
CPPPlot.feature_map — the same map layout, but a different kind of
feature each time:
sequence — amino acid scales,
structure — AlphaFold/DSSP channels,
embeddings — ESM-2 protein-language-model channels,
annotation — UniProt residue annotations.
(The structure / embedding / annotation arms need the
aaanalysis[pro] / aaanalysis[embed] extras.)
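To make the part × split × value grammar concrete, the following is a minimal, library-independent sketch of how a split selects positions inside a part and how a value is then averaged over them. The helper names (segment_positions, pattern_positions, feature_value) and the toy values are hypothetical, and the exact boundary conventions used by aaanalysis may differ.

```python
def segment_positions(part_len, i_th, n_split):
    """0-based indices of the i_th (1-based) of n_split contiguous segments."""
    bounds = [k * part_len // n_split for k in range(n_split + 1)]
    return list(range(bounds[i_th - 1], bounds[i_th]))


def pattern_positions(part_len, terminus, steps):
    """0-based indices for 1-based steps counted from the N or C terminus."""
    if terminus == "N":
        return sorted(step - 1 for step in steps)
    # Counting from the C terminus: step 1 is the last residue of the part.
    return sorted(part_len - step for step in steps)


def feature_value(values, positions):
    """Average the per-residue values over the selected positions."""
    return sum(values[p] for p in positions) / len(positions)


# Toy per-residue values for a part of length 10 (scale values or a numerical channel).
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

seg = segment_positions(10, i_th=1, n_split=5)       # first fifth of the part
pat_n = pattern_positions(10, "N", steps=[1, 4, 7])  # counted from the N terminus
pat_c = pattern_positions(10, "C", steps=[1, 4])     # counted from the C terminus

print(seg, feature_value(values, seg))
print(pat_n, feature_value(values, pat_n))
print(pat_c, feature_value(values, pat_c))
```

The same three helpers work whether values come from an amino acid scale or from a numerical channel, which is the point of the shared grammar.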
import tempfile
import aaanalysis as aa
import matplotlib.pyplot as plt
aa.options["verbose"] = False
aa.plot_settings()
# Same small protein set for every representation. n_jobs=1 keeps CPP
# deterministic and avoids the macOS process-spawn footgun.
df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
labels = list(df_seq["label"])
def add_importance(df_feat):
    # Model-free importance for the map: each feature's discriminative power
    # (abs_auc), normalized to %. Swap in TreeModel.add_feat_importance for a
    # model-based ranking once you have a per-protein feature matrix.
    df_feat = df_feat.copy()
    df_feat["feat_importance"] = 100 * df_feat["abs_auc"] / df_feat["abs_auc"].sum()
    return df_feat
Arm 1 — sequence: physicochemical scales (CPP.run)
The standard CPP: split the sequence parts and summarize amino acid scale values. The feature categories are the AAontology subcategories.
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_feat_seq = aa.CPP(df_scales=aa.load_scales(), df_parts=df_parts).run(
    labels=labels, n_filter=50, n_jobs=1)
df_feat_seq = add_importance(df_feat_seq)
aa.display_df(df_feat_seq, n_rows=10, show_shape=True)
CPP using the Python kernel fallback: the compiled Cython extension is not available in this install. Output is bit-exact with the Cython path but ~2x slower. Reinstall via pip install --force-reinstall aaanalysis to fetch a prebuilt wheel.
DataFrame shape: (50, 14)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | JMD_N_TMD_N-Seg...,10)-ZIMJ680101 | Polarity | Hydrophobicity | Hydrophobicity | Hydrophobicity ...n et al., 1968) | 0.500000 | 0.361000 | 0.361000 | 0.156000 | 0.150000 | 0.000157 | 1.000000 | 1,2 | 2.273244 |
| 2 | JMD_N_TMD_N-Pat...,12)-PALJ810110 | Conformation | β-sheet | β-sheet | Normalized freq...u et al., 1981) | 0.470000 | 0.233000 | -0.233000 | 0.092000 | 0.095000 | 0.000381 | 1.000000 | 9,13,16,19 | 2.136849 |
| 3 | TMD-Pattern(N,1...,11)-PALJ810110 | Conformation | β-sheet | β-sheet | Normalized freq...u et al., 1981) | 0.470000 | 0.233000 | -0.233000 | 0.092000 | 0.095000 | 0.000381 | 1.000000 | 11,15,18,21 | 2.136849 |
| 4 | TMD_C_JMD_C-Pat...,12)-TANS770105 | Conformation | β-turn (C-term) | β-turn (3rd residue) | Normalized freq...Scheraga, 1977) | 0.470000 | 0.230000 | -0.230000 | 0.061000 | 0.111000 | 0.000381 | 1.000000 | 24,28,32 | 2.136849 |
| 5 | TMD_C_JMD_C-Pat...,15)-AURR980102 | Conformation | Linker (6-14 AA) | α-helix (N-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.465000 | 0.189000 | 0.189000 | 0.054000 | 0.099000 | 0.000440 | 1.000000 | 26,29,33 | 2.114117 |
| 6 | TMD_C_JMD_C-Pat...,14)-AURR980102 | Conformation | Linker (6-14 AA) | α-helix (N-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.465000 | 0.189000 | 0.189000 | 0.054000 | 0.099000 | 0.000440 | 1.000000 | 27,30,34 | 2.114117 |
| 7 | TMD-Pattern(C,4,8)-CHOP780201 | Conformation | α-helix | α-helix | Normalized freq...-Fasman, 1978b) | 0.460000 | 0.279000 | 0.279000 | 0.095000 | 0.176000 | 0.000507 | 0.816292 | 23,27 | 2.091384 |
| 8 | TMD_C_JMD_C-Pat...4,8)-CHOP780201 | Conformation | α-helix | α-helix | Normalized freq...-Fasman, 1978b) | 0.460000 | 0.279000 | 0.279000 | 0.095000 | 0.176000 | 0.000507 | 1.000000 | 24,28 | 2.091384 |
| 9 | TMD-Pattern(C,4,8)-RACS820114 | Shape | Graph (2. eigenvalue) | Side chain angle (Theta) | Value of theta(...Scheraga, 1982) | 0.460000 | 0.201000 | -0.201000 | 0.044000 | 0.133000 | 0.000507 | 1.000000 | 23,27 | 2.091384 |
| 10 | TMD_C_JMD_C-Pat...4,8)-RACS820114 | Shape | Graph (2. eigenvalue) | Side chain angle (Theta) | Value of theta(...Scheraga, 1982) | 0.460000 | 0.201000 | -0.201000 | 0.044000 | 0.133000 | 0.000507 | 0.979550 | 24,28 | 2.091384 |
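As a quick sanity check on the normalization in add_importance, the sketch below takes (abs_auc, feat_importance) pairs from rows 1, 2, 5 and 7 of the table above and recovers the implied sum of abs_auc over all 50 features. Because feat_importance is defined as 100 * abs_auc / sum(abs_auc), every row should imply the same total, up to display rounding.

```python
# (abs_auc, feat_importance) pairs copied from rows 1, 2, 5 and 7 of the table above.
rows = [
    (0.500, 2.273244),
    (0.470, 2.136849),
    (0.465, 2.114117),
    (0.460, 2.091384),
]

# Invert the normalization: total = 100 * abs_auc / feat_importance.
totals = [100 * abs_auc / importance for abs_auc, importance in rows]
spread = max(totals) - min(totals)

print([round(t, 3) for t in totals])
print(f"spread across rows: {spread:.6f}")
```

All four rows imply a total of roughly 22, so the mean abs_auc over the 50 selected features is about 0.44. The top-ranked features are therefore only slightly above the average, which is why their importance values sit so close together.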
aa.CPPPlot(jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_seq)
plt.tight_layout()
plt.show()
Arm 2 — structure: AlphaFold / DSSP (CPP.run_num)
Fetch AlphaFold structures, derive per-residue DSSP channels
(secondary structure, relative accessible surface area, backbone angles)
with StructurePreprocessor.encode_dssp, and run the identical
CPP algorithm via CPP.run_num. The features are now structural
channels. (encode_dssp drops any entry without a structure; we run
CPP on the structured subset.)
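The phi_psi_sincos channel name suggests that backbone angles enter CPP as sine and cosine components rather than as raw degrees. The sketch below illustrates why such an encoding is useful: it is continuous across the ±180° wrap-around, and it maps naturally into [0, 1]. The rescaling (x + 1) / 2 and the helper names are assumptions for illustration; whether encode_dssp uses exactly this mapping is not stated in this notebook.

```python
import math


def angle_to_channels(angle_deg):
    """Encode an angle as (sin, cos), each rescaled from [-1, 1] to [0, 1]."""
    rad = math.radians(angle_deg)
    return ((math.sin(rad) + 1) / 2, (math.cos(rad) + 1) / 2)


def channel_distance(a_deg, b_deg):
    """Euclidean distance between two encoded angles."""
    return math.dist(angle_to_channels(a_deg), angle_to_channels(b_deg))


# 179 and -179 degrees are only 2 degrees apart on the circle,
# although their raw values differ by 358.
raw_gap = abs(179.0 - (-179.0))
encoded_gap = channel_distance(179.0, -179.0)

# 0 and 180 degrees are opposite points, so their encoded distance is maximal.
opposite_gap = channel_distance(0.0, 180.0)

print(raw_gap, round(encoded_gap, 4), round(opposite_gap, 4))
```

Averaging raw angles over a segment would mix values near +180° and -180° into a meaningless value near 0°, whereas averaging the sine and cosine channels does not have this problem.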
STRUCT_FEATURES = ["ss3", "rasa", "phi_psi_sincos"]
stp = aa.StructurePreprocessor()
pdb_folder = tempfile.mkdtemp()
stp.fetch_alphafold(df_seq=df_seq, out_folder=pdb_folder, on_failure="nan")
dict_str = stp.encode_dssp(df_seq=df_seq, pdb_folder=pdb_folder, features=STRUCT_FEATURES, on_failure="drop")
df_seq_str = df_seq[df_seq["entry"].isin(dict_str)].reset_index(drop=True)
df_scales_str = stp.build_scales(df_seq=df_seq_str, dict_num=dict_str, features=STRUCT_FEATURES)
df_cat_str = stp.build_cat(features=STRUCT_FEATURES)
df_parts_str, dict_num_parts_str = aa.NumericalFeature.get_parts(df_seq=df_seq_str, dict_num=dict_str)
df_feat_str = aa.CPP(df_parts=df_parts_str, df_scales=df_scales_str, df_cat=df_cat_str).run_num(
    dict_num_parts=dict_num_parts_str, labels=list(df_seq_str["label"]), n_filter=50, n_jobs=1)
df_feat_str = add_importance(df_feat_str)
aa.display_df(df_feat_str, n_rows=10, show_shape=True)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file
warnings.warn(err)
/var/folders/sv/65tlch_10198qgmpwcp6408r0000gn/T/ipykernel_36514/3891309445.py:8: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq + dict_num). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
df_scales_str = stp.build_scales(df_seq=df_seq_str, dict_num=dict_str, features=STRUCT_FEATURES)
DataFrame shape: (27, 14)
/Users/stephanbreimann/Programming/1Packages/aaanalysis-cpp-representations/aaanalysis/feature_engineering/_backend/cpp_run.py:112: RuntimeWarning: 'n_filter' (50) should be <= the number of features the filter could deliver (27); returning fewer features than requested. Inspect df_feat.attrs['last_filter_stats'] and consider a larger 'df_scales' / 'n_jmd' / less strict 'max_overlap'/'max_cor'.
warnings.warn(
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | JMD_N_TMD_N-Seg...nt(1,2)-psi_sin | Structure | Backbone dihedral (sin/cos) | psi_sin | Structure/Backb...edral (sin/cos) | 0.420000 | 0.183000 | 0.183000 | 0.185000 | 0.192000 | 0.001499 | 0.593554 | 1,2,3,4,5,6,7,8,9,10 | 6.441718 |
| 2 | JMD_N_TMD_N-Pat...ern(N,2,5)-rasa | Structure | Relative ASA (Tien) | rasa | Structure/Relative ASA (Tien) | 0.350000 | 0.238000 | 0.238000 | 0.164000 | 0.159000 | 0.008151 | 0.084942 | 2,5 | 5.368098 |
| 3 | JMD_N_TMD_N-Segment(1,5)-rasa | Structure | Relative ASA (Tien) | rasa | Structure/Relative ASA (Tien) | 0.340000 | 0.162000 | 0.162000 | 0.193000 | 0.132000 | 0.010165 | 0.087509 | 1,2,3,4 | 5.214724 |
| 4 | JMD_N_TMD_N-Per...+3/3,1)-ss_coil | Structure | Secondary structure (3-state) | ss_coil | Structure/Secon...cture (3-state) | 0.335000 | 0.126000 | -0.126000 | 0.038000 | 0.102000 | 0.011330 | 0.083084 | 2,5,8,11,14,17,20 | 5.138037 |
| 5 | JMD_N_TMD_N-Per...+3/3,1)-ss_coil | Structure | Secondary structure (3-state) | ss_coil | Structure/Secon...cture (3-state) | 0.335000 | 0.126000 | -0.126000 | 0.038000 | 0.102000 | 0.011330 | 0.081574 | 1,4,7,10,13,16,19 | 5.138037 |
| 6 | JMD_N_TMD_N-Pat...5,8,11)-psi_cos | Structure | Backbone dihedral (sin/cos) | psi_cos | Structure/Backb...edral (sin/cos) | 0.330000 | 0.211000 | -0.211000 | 0.193000 | 0.145000 | 0.012611 | 0.076831 | 1,5,8,11 | 5.061350 |
| 7 | JMD_N_TMD_N-Per...+4/3,4)-psi_cos | Structure | Backbone dihedral (sin/cos) | psi_cos | Structure/Backb...edral (sin/cos) | 0.330000 | 0.120000 | -0.120000 | 0.107000 | 0.098000 | 0.012611 | 0.079270 | 3,6,10,13,17 | 5.061350 |
| 8 | JMD_N_TMD_N-Pat...(N,2,6,10)-rasa | Structure | Relative ASA (Tien) | rasa | Structure/Relative ASA (Tien) | 0.320000 | 0.098000 | 0.098000 | 0.171000 | 0.105000 | 0.015564 | 0.071669 | 2,6,10 | 4.907975 |
| 9 | JMD_N_TMD_N-Pat...,1,4,7,10)-rasa | Structure | Relative ASA (Tien) | rasa | Structure/Relative ASA (Tien) | 0.300000 | 0.100000 | 0.100000 | 0.129000 | 0.104000 | 0.023342 | 0.088033 | 1,4,7,10 | 4.601227 |
| 10 | JMD_N_TMD_N-Seg...t(6,15)-phi_sin | Structure | Backbone dihedral (sin/cos) | phi_sin | Structure/Backb...edral (sin/cos) | 0.280000 | 0.091000 | 0.091000 | 0.194000 | 0.027000 | 0.034294 | 0.116071 | 7,8 | 4.294479 |
# feature_map needs the structure's own DSSP categories (df_cat_str)
aa.CPPPlot(df_scales=df_scales_str, df_cat=df_cat_str,
           jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_str)
plt.tight_layout()
plt.show()
Arm 3 — embeddings: PLM channels (CPP.run_num)
Fetch per-residue ESM-2 embeddings, normalize them to [0, 1] with
EmbeddingPreprocessor.encode, name the D channels as
pseudo-scales, then run CPP. The features are now learned embedding
channels.
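The normalization step can be pictured as a per-channel min-max rescaling of an L × D embedding matrix (L residues, D channels). The sketch below is a plain-Python illustration under that assumption; the function name is hypothetical, and the actual method and scope (per protein or across the dataset) used by EmbeddingPreprocessor.encode are not specified here.

```python
def minmax_per_channel(embedding):
    """Rescale each column (channel) of an L x D matrix to [0, 1]."""
    columns = list(zip(*embedding))  # D tuples, each of length L
    scaled_columns = []
    for col in columns:
        lo, hi = min(col), max(col)
        span = hi - lo
        # A constant channel carries no signal; map it to 0.0 to avoid dividing by zero.
        scaled_columns.append([(v - lo) / span if span else 0.0 for v in col])
    # Transpose back to L rows of D values.
    return [list(row) for row in zip(*scaled_columns)]


# Toy embedding: 3 residues, 3 channels (the middle channel is constant).
raw = [
    [-2.0, 0.5, 7.0],
    [0.0, 0.5, 9.0],
    [2.0, 0.5, 8.0],
]
scaled = minmax_per_channel(raw)

for row in scaled:
    print(row)
```

After rescaling, every channel lives on the same [0, 1] range as a normalized amino acid scale, so segment and pattern averages are comparable across channels.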
ep = aa.EmbeddingPreprocessor()
emb = ep.fetch_embeddings(df_seq, mode="residue", model="esm2_t6_8M")
dict_emb = ep.encode(df_seq=df_seq, embeddings=emb) # raw -> [0, 1] per channel
df_scales_emb = ep.build_scales(df_seq=df_seq, dict_num=dict_emb) # name the D channels
df_cat_emb = ep.build_cat(df_scales=df_scales_emb, random_state=42)
df_parts_emb, dict_num_parts_emb = aa.NumericalFeature.get_parts(df_seq=df_seq, dict_num=dict_emb)
df_feat_emb = aa.CPP(df_parts=df_parts_emb, df_scales=df_scales_emb, df_cat=df_cat_emb).run_num(
    dict_num_parts=dict_num_parts_emb, labels=labels, n_filter=50, n_jobs=1)
df_feat_emb = add_importance(df_feat_emb)
aa.display_df(df_feat_emb, n_rows=10, show_shape=True)
/var/folders/sv/65tlch_10198qgmpwcp6408r0000gn/T/ipykernel_36514/333399537.py:4: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
df_scales_emb = ep.build_scales(df_seq=df_seq, dict_num=dict_emb) # name the D channels
DataFrame shape: (50, 14)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | JMD_N_TMD_N-Pat...5,8,11)-dim_232 | Embeddings | Embeddings_cat14_subcat206 | dim_232 | | 0.500000 | 0.109000 | 0.109000 | 0.025000 | 0.038000 | 0.000157 | 0.497542 | 2,5,8,11 | 2.151926 |
| 2 | JMD_N_TMD_N-Pat...,12,15)-dim_232 | Embeddings | Embeddings_cat14_subcat206 | dim_232 | | 0.500000 | 0.101000 | 0.101000 | 0.033000 | 0.039000 | 0.000157 | 1.000000 | 6,9,13 | 2.151926 |
| 3 | TMD_C_JMD_C-Pat...C,2,5,9)-dim_10 | Embeddings | Embeddings_cat0_subcat15 | dim_10 | | 0.500000 | 0.099000 | -0.099000 | 0.044000 | 0.033000 | 0.000157 | 0.226155 | 32,36,39 | 2.151926 |
| 4 | JMD_N_TMD_N-Pat...,10,14)-dim_301 | Embeddings | Embeddings_cat51_subcat267 | dim_301 | | 0.500000 | 0.080000 | 0.080000 | 0.021000 | 0.033000 | 0.000157 | 0.355387 | 7,11 | 2.151926 |
| 5 | JMD_N_TMD_N-Pat...N,9,13)-dim_301 | Embeddings | Embeddings_cat51_subcat267 | dim_301 | | 0.500000 | 0.080000 | 0.080000 | 0.021000 | 0.033000 | 0.000157 | 0.276412 | 9,13 | 2.151926 |
| 6 | JMD_N_TMD_N-Pat...,1,5,8)-dim_204 | Embeddings | Embeddings_cat23_subcat179 | dim_204 | | 0.500000 | 0.075000 | -0.075000 | 0.024000 | 0.032000 | 0.000157 | 0.310964 | 1,5,8 | 2.151926 |
| 7 | JMD_N_TMD_N-Pat...,2,5,8)-dim_146 | Embeddings | Embeddings_cat28_subcat132 | dim_146 | | 0.490000 | 0.087000 | -0.087000 | 0.035000 | 0.026000 | 0.000212 | 0.240070 | 2,5,8 | 2.108887 |
| 8 | JMD_N_TMD_N-Seg...nt(2,7)-dim_232 | Embeddings | Embeddings_cat14_subcat206 | dim_232 | | 0.490000 | 0.082000 | 0.082000 | 0.024000 | 0.041000 | 0.000212 | 0.280081 | 3,4,5 | 2.108887 |
| 9 | TMD_C_JMD_C-Seg...(11,13)-dim_107 | Embeddings | Embeddings_cat18_subcat96 | dim_107 | | 0.480000 | 0.126000 | -0.126000 | 0.034000 | 0.056000 | 0.000285 | 0.167269 | 36 | 2.065849 |
| 10 | TMD_C_JMD_C-Pat...n(C,2,5)-dim_75 | Embeddings | Embeddings_cat6_subcat65 | dim_75 | | 0.480000 | 0.095000 | 0.095000 | 0.030000 | 0.037000 | 0.000285 | 0.215060 | 36,39 | 2.065849 |
# feature_map needs the embedding's own pseudo-categories (df_cat_emb)
aa.CPPPlot(df_scales=df_scales_emb, df_cat=df_cat_emb,
           jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_emb)
plt.tight_layout()
plt.show()
Arm 4 — annotation: UniProt residue features (CPP.run_num)
Fetch curated UniProt residue annotations (phosphorylation,
glycosylation, disulfide bonds, binding/active sites, …), turn them into
per-residue [0, 1] channels with AnnotationPreprocessor.encode,
and run CPP. The features are now biological annotations. (Entries
without the requested annotations are dropped.)
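Conceptually, turning residue annotations into [0, 1] channels amounts to one binary track per annotation type, set to 1.0 at annotated residues. The sketch below shows that idea in plain Python; the function name, the (feature, position) input format and the toy annotations are hypothetical and do not reflect the actual data structures of AnnotationPreprocessor.

```python
def encode_annotations(seq_len, annotations, features):
    """Build one binary channel per feature: 1.0 at annotated residues, else 0.0."""
    channels = {name: [0.0] * seq_len for name in features}
    for name, position in annotations:
        # Positions are treated as 1-based. Features that were not requested
        # and positions outside the sequence are ignored.
        if name in channels and 1 <= position <= seq_len:
            channels[name][position - 1] = 1.0
    return channels


features = ["phospho", "disulfide"]
annotations = [
    ("phospho", 3),
    ("disulfide", 1),
    ("disulfide", 6),
    ("glyco_n", 4),  # not requested, so it is skipped
]
channels = encode_annotations(8, annotations, features)

for name, track in channels.items():
    print(name, track)
```

Channels built this way are sparse: most residues are 0.0, so a segment or pattern average is non-zero only for the few proteins that carry an annotation in that window.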
ANNOT_FEATURES = ["phospho", "glyco_n", "disulfide", "binding", "act_site", "mod_res_other"]
ap = aa.AnnotationPreprocessor()
df_annot = ap.fetch_uniprot(df_seq=df_seq, features=ANNOT_FEATURES)
dict_ann = ap.encode(df_seq=df_seq, df_annot=df_annot, features=ANNOT_FEATURES, on_mismatch="drop")
df_seq_ann = df_seq[df_seq["entry"].isin(dict_ann)].reset_index(drop=True)
df_scales_ann = ap.build_scales(df_seq=df_seq_ann, dict_num=dict_ann, features=ANNOT_FEATURES)
df_cat_ann = ap.build_cat(features=ANNOT_FEATURES)
df_parts_ann, dict_num_parts_ann = aa.NumericalFeature.get_parts(df_seq=df_seq_ann, dict_num=dict_ann)
df_feat_ann = aa.CPP(df_parts=df_parts_ann, df_scales=df_scales_ann, df_cat=df_cat_ann).run_num(
    dict_num_parts=dict_num_parts_ann, labels=list(df_seq_ann["label"]), n_filter=50, n_jobs=1)
df_feat_ann = add_importance(df_feat_ann)
aa.display_df(df_feat_ann, n_rows=10, show_shape=True)
/var/folders/sv/65tlch_10198qgmpwcp6408r0000gn/T/ipykernel_36514/4093506119.py:7: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq + dict_num). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
df_scales_ann = ap.build_scales(df_seq=df_seq_ann, dict_num=dict_ann, features=ANNOT_FEATURES)
DataFrame shape: (18, 14)
/Users/stephanbreimann/Programming/1Packages/aaanalysis-cpp-representations/aaanalysis/feature_engineering/_backend/cpp_run.py:112: RuntimeWarning: 'n_filter' (50) should be <= the number of features the filter could deliver (18); returning fewer features than requested. Inspect df_feat.attrs['last_filter_stats'] and consider a larger 'df_scales' / 'n_jmd' / less strict 'max_overlap'/'max_cor'.
warnings.warn(
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | JMD_N_TMD_N-Seg...3,12)-disulfide | PTMs | PTM_disulfide | disulfide | PTMs/PTM_disulfide | 0.100000 | 0.100000 | -0.100000 | 0.000000 | 0.200000 | 0.449692 | 1.000000 | 4,5 | 11.764706 |
| 2 | JMD_N_TMD_N-Seg...(1,3)-disulfide | PTMs | PTM_disulfide | disulfide | PTMs/PTM_disulfide | 0.100000 | 0.029000 | -0.029000 | 0.000000 | 0.057000 | 0.449692 | 1.000000 | 1,2,3,4,5,6 | 11.764706 |
| 3 | TMD-Segment(6,14)-binding | Functional sites | FUNC_binding | binding | Functional sites/FUNC_binding | 0.050000 | 0.100000 | -0.100000 | 0.000000 | 0.300000 | 0.705457 | 1.000000 | 18 | 5.882353 |
| 4 | JMD_N_TMD_N-Pat...(C,1,4)-binding | Functional sites | FUNC_binding | binding | Functional sites/FUNC_binding | 0.050000 | 0.050000 | -0.050000 | 0.000000 | 0.150000 | 0.705457 | 1.000000 | 17,20 | 5.882353 |
| 5 | TMD-Pattern(C,12,15)-binding | Functional sites | FUNC_binding | binding | Functional sites/FUNC_binding | 0.050000 | 0.050000 | -0.050000 | 0.000000 | 0.150000 | 0.705457 | 0.992989 | 16,19 | 5.882353 |
| 6 | TMD-Pattern(C,1,4)-phospho | PTMs | PTM_phosphorylation | phospho | PTMs/PTM_phosphorylation | 0.050000 | 0.050000 | 0.050000 | 0.150000 | 0.000000 | 0.705457 | 1.000000 | 27,30 | 5.882353 |
| 7 | TMD_C_JMD_C-Pat...,11,14)-phospho | PTMs | PTM_phosphorylation | phospho | PTMs/PTM_phosphorylation | 0.050000 | 0.050000 | 0.050000 | 0.150000 | 0.000000 | 0.705457 | 1.000000 | 31,34 | 5.882353 |
| 8 | JMD_N_TMD_N-Pat...,4,7)-disulfide | PTMs | PTM_disulfide | disulfide | PTMs/PTM_disulfide | 0.050000 | 0.033000 | -0.033000 | 0.000000 | 0.100000 | 0.705457 | 1.000000 | 1,4,7 | 5.882353 |
| 9 | JMD_N_TMD_N-Pat...,5,8)-disulfide | PTMs | PTM_disulfide | disulfide | PTMs/PTM_disulfide | 0.050000 | 0.033000 | -0.033000 | 0.000000 | 0.100000 | 0.705457 | 1.000000 | 1,5,8 | 5.882353 |
| 10 | JMD_N_TMD_N-Pat...,5,9)-disulfide | PTMs | PTM_disulfide | disulfide | PTMs/PTM_disulfide | 0.050000 | 0.033000 | -0.033000 | 0.000000 | 0.100000 | 0.705457 | 1.000000 | 2,5,9 | 5.882353 |
# feature_map needs the annotation's own categories (df_cat_ann)
aa.CPPPlot(df_scales=df_scales_ann, df_cat=df_cat_ann,
           jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_ann)
plt.tight_layout()
plt.show()
Same machinery, different features
All four maps share the same layout and df_feat schema; only the value source changed:
Sequence → interpretable AAontology categories (physicochemical signature).
Structure → DSSP channels (secondary structure / accessibility / geometry).
Embeddings → learned PLM channels (what the language model encodes).
Annotation → curated UniProt residue features (known biology).
Because they all produce the same df_feat, any of them feeds the
same downstream tools — CPPPlot, TreeModel,
AAclust.select_proteins, ShapModel. Pick the representation that
matches the question; CPP turns each into an interpretable,
position-aware feature set.
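Because the schema is shared, the four result tables can also be stacked and compared with ordinary pandas operations. The sketch below uses toy stand-ins rather than the real df_feat tables: the abs_auc values are the top three per arm from the outputs above, while the feature names and the representation column are placeholders introduced for illustration.

```python
import pandas as pd

# Top three abs_auc values per arm, copied from the tables above.
top_abs_auc = {
    "sequence": [0.50, 0.47, 0.47],
    "structure": [0.42, 0.35, 0.34],
    "embeddings": [0.50, 0.50, 0.50],
    "annotation": [0.10, 0.10, 0.05],
}

# One small frame per representation, tagged with where it came from.
frames = [
    pd.DataFrame({
        "feature": [f"{name}_feat_{i}" for i in range(len(values))],  # placeholder names
        "abs_auc": values,
    }).assign(representation=name)
    for name, values in top_abs_auc.items()
]

# Stack the frames and summarize the strongest feature per representation.
df_all = pd.concat(frames, ignore_index=True)
best = df_all.groupby("representation")["abs_auc"].max()

print(best)
```

With the real tables, the same pattern would apply to df_feat_seq, df_feat_str, df_feat_emb and df_feat_ann. Note that the structure and annotation arms were run on subsets of the proteins, so their abs_auc values are not strictly comparable to the other two.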