CPP across data representations: sequence, structure, embeddings, annotation

A CPP feature is part × split × value: where in the sequence (tmd, jmd_n_tmd_n, tmd_c_jmd_c), how the positions are picked (Segment / Pattern / PeriodicPattern), and what value is summarized there. The value axis has two interchangeable sources that share one grammar and one df_feat schema:

a physicochemical amino acid scale (the sequence arm, run()), and
any per-residue numerical tensor in [0, 1] (the numerical arm, run_num()).

You will learn

Tool: CPP (run / run_num)
Input: df_parts (sequence arm) or a per-residue tensor in [0, 1] (numerical arm)
Output: df_feat (one schema across every representation)
Best used for: applying CPP to scales, structure, embeddings, or annotation through one grammar
Related protocol: P5: Engineer features
Related API: CPP, NumericalFeature

So learned, structural, and annotation-based representations all ride the exact same interpretable CPP machinery. Below we feed four representations of the same proteins into CPP and chart each one’s feature_map() — the same map layout, but a different kind of feature each time:

sequence — amino acid scales,
structure — AlphaFold/DSSP channels,
embeddings — ESM-2 protein-language-model channels,
annotation — UniProt residue annotations.

(The structure / embedding / annotation arms need the aaanalysis[pro] / aaanalysis[embed] extras.)

import tempfile
import aaanalysis as aa
import matplotlib.pyplot as plt
aa.options["verbose"] = False
aa.plot_settings()

# Same small protein set for every representation. n_jobs=1 keeps CPP
# deterministic and avoids the macOS process-spawn footgun.
df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
labels = list(df_seq["label"])

def add_importance(df_feat):
    # Model-free importance for the map: each feature's discriminative power
    # (abs_auc), normalized to %. Swap in TreeModel.add_feat_importance for a
    # model-based ranking once you have a per-protein feature matrix.
    df_feat = df_feat.copy()
    df_feat["feat_importance"] = 100 * df_feat["abs_auc"] / df_feat["abs_auc"].sum()
    return df_feat

Arm 1 — sequence: physicochemical scales (`run()`)

The standard CPP: split the sequence parts and summarize amino acid scale values. The feature categories are the AAontology subcategories.

sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_feat_seq = aa.CPP(df_scales=aa.load_scales(), df_parts=df_parts).run(
    labels=labels, n_filter=50, n_jobs=1)
df_feat_seq = add_importance(df_feat_seq)
aa.display_df(df_feat_seq, n_rows=10, show_shape=True)

[94mCPP using the Python kernel fallback — the compiled Cython extension is not available in this install. Output is bit-exact with the Cython path but ~2x slower. Reinstall via pip install --force-reinstall aaanalysis to fetch a prebuilt wheel.[0m
DataFrame shape: (50, 14)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions	feat_importance
1	JMD_N_TMD_N-Seg...,10)-ZIMJ680101	Polarity	Hydrophobicity	Hydrophobicity	Hydrophobicity ...n et al., 1968)	0.500000	0.361000	0.361000	0.156000	0.150000	0.000157	1.000000	1,2	2.273244
2	JMD_N_TMD_N-Pat...,12)-PALJ810110	Conformation	β-sheet	β-sheet	Normalized freq...u et al., 1981)	0.470000	0.233000	-0.233000	0.092000	0.095000	0.000381	1.000000	9,13,16,19	2.136849
3	TMD-Pattern(N,1...,11)-PALJ810110	Conformation	β-sheet	β-sheet	Normalized freq...u et al., 1981)	0.470000	0.233000	-0.233000	0.092000	0.095000	0.000381	1.000000	11,15,18,21	2.136849
4	TMD_C_JMD_C-Pat...,12)-TANS770105	Conformation	β-turn (C-term)	β-turn (3rd residue)	Normalized freq...Scheraga, 1977)	0.470000	0.230000	-0.230000	0.061000	0.111000	0.000381	1.000000	24,28,32	2.136849
5	TMD_C_JMD_C-Pat...,15)-AURR980102	Conformation	Linker (6-14 AA)	α-helix (N-terminal, outside)	Normalized posi...ora-Rose, 1998)	0.465000	0.189000	0.189000	0.054000	0.099000	0.000440	1.000000	26,29,33	2.114117
6	TMD_C_JMD_C-Pat...,14)-AURR980102	Conformation	Linker (6-14 AA)	α-helix (N-terminal, outside)	Normalized posi...ora-Rose, 1998)	0.465000	0.189000	0.189000	0.054000	0.099000	0.000440	1.000000	27,30,34	2.114117
7	TMD-Pattern(C,4,8)-CHOP780201	Conformation	α-helix	α-helix	Normalized freq...-Fasman, 1978b)	0.460000	0.279000	0.279000	0.095000	0.176000	0.000507	0.816292	23,27	2.091384
8	TMD_C_JMD_C-Pat...4,8)-CHOP780201	Conformation	α-helix	α-helix	Normalized freq...-Fasman, 1978b)	0.460000	0.279000	0.279000	0.095000	0.176000	0.000507	1.000000	24,28	2.091384
9	TMD-Pattern(C,4,8)-RACS820114	Shape	Graph (2. eigenvalue)	Side chain angle (Theta)	Value of theta(...Scheraga, 1982)	0.460000	0.201000	-0.201000	0.044000	0.133000	0.000507	1.000000	23,27	2.091384
10	TMD_C_JMD_C-Pat...4,8)-RACS820114	Shape	Graph (2. eigenvalue)	Side chain angle (Theta)	Value of theta(...Scheraga, 1982)	0.460000	0.201000	-0.201000	0.044000	0.133000	0.000507	0.979550	24,28	2.091384

aa.CPPPlot(jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_seq)
plt.tight_layout()
plt.show()

../_images/tutorial3d_data_representations_1_output_6_0.png

Arm 2 — structure: AlphaFold / DSSP (`run_num()`)

Fetch AlphaFold structures, derive per-residue DSSP channels (secondary structure, relative accessible surface area, backbone angles) with encode_dssp(), and run the identical CPP algorithm via run_num(). The features are now structural channels. (encode_dssp drops any entry without a structure; we run CPP on the structured subset.)

STRUCT_FEATURES = ["ss3", "rasa", "phi_psi_sincos"]
stp = aa.StructurePreprocessor()
pdb_folder = tempfile.mkdtemp()
stp.fetch_alphafold(df_seq=df_seq, out_folder=pdb_folder, on_failure="nan")
dict_str = stp.encode_dssp(df_seq=df_seq, pdb_folder=pdb_folder, features=STRUCT_FEATURES, on_failure="drop")

df_seq_str = df_seq[df_seq["entry"].isin(dict_str)].reset_index(drop=True)
df_scales_str = stp.build_scales(df_seq=df_seq_str, dict_num=dict_str, features=STRUCT_FEATURES)
df_cat_str = stp.build_cat(features=STRUCT_FEATURES)
df_parts_str, dict_num_parts_str = aa.NumericalFeature.get_parts(df_seq=df_seq_str, dict_num=dict_str)
df_feat_str = aa.CPP(df_parts=df_parts_str, df_scales=df_scales_str, df_cat=df_cat_str).run_num(
    dict_num_parts=dict_num_parts_str, labels=list(df_seq_str["label"]), n_filter=50, n_jobs=1)
df_feat_str = add_importance(df_feat_str)
aa.display_df(df_feat_str, n_rows=10, show_shape=True)

/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/Users/stephanbreimann/Programming/1Packages/aaanalysis/.venv/lib/python3.13/site-packages/Bio/PDB/DSSP.py:199: UserWarning: parse error at line 1: This file does not seem to be an mmCIF file

  warnings.warn(err)
/var/folders/sv/65tlch_10198qgmpwcp6408r0000gn/T/ipykernel_36514/3891309445.py:8: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq + dict_num). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
  df_scales_str = stp.build_scales(df_seq=df_seq_str, dict_num=dict_str, features=STRUCT_FEATURES)

DataFrame shape: (27, 14)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions	feat_importance
1	JMD_N_TMD_N-Seg...nt(1,2)-psi_sin	Structure	Backbone dihedral (sin/cos)	psi_sin	Structure/Backb...edral (sin/cos)	0.420000	0.183000	0.183000	0.185000	0.192000	0.001499	0.593554	1,2,3,4,5,6,7,8,9,10	6.441718
2	JMD_N_TMD_N-Pat...ern(N,2,5)-rasa	Structure	Relative ASA (Tien)	rasa	Structure/Relative ASA (Tien)	0.350000	0.238000	0.238000	0.164000	0.159000	0.008151	0.084942	2,5	5.368098
3	JMD_N_TMD_N-Segment(1,5)-rasa	Structure	Relative ASA (Tien)	rasa	Structure/Relative ASA (Tien)	0.340000	0.162000	0.162000	0.193000	0.132000	0.010165	0.087509	1,2,3,4	5.214724
4	JMD_N_TMD_N-Per...+3/3,1)-ss_coil	Structure	Secondary structure (3-state)	ss_coil	Structure/Secon...cture (3-state)	0.335000	0.126000	-0.126000	0.038000	0.102000	0.011330	0.083084	2,5,8,11,14,17,20	5.138037
5	JMD_N_TMD_N-Per...+3/3,1)-ss_coil	Structure	Secondary structure (3-state)	ss_coil	Structure/Secon...cture (3-state)	0.335000	0.126000	-0.126000	0.038000	0.102000	0.011330	0.081574	1,4,7,10,13,16,19	5.138037
6	JMD_N_TMD_N-Pat...5,8,11)-psi_cos	Structure	Backbone dihedral (sin/cos)	psi_cos	Structure/Backb...edral (sin/cos)	0.330000	0.211000	-0.211000	0.193000	0.145000	0.012611	0.076831	1,5,8,11	5.061350
7	JMD_N_TMD_N-Per...+4/3,4)-psi_cos	Structure	Backbone dihedral (sin/cos)	psi_cos	Structure/Backb...edral (sin/cos)	0.330000	0.120000	-0.120000	0.107000	0.098000	0.012611	0.079270	3,6,10,13,17	5.061350
8	JMD_N_TMD_N-Pat...(N,2,6,10)-rasa	Structure	Relative ASA (Tien)	rasa	Structure/Relative ASA (Tien)	0.320000	0.098000	0.098000	0.171000	0.105000	0.015564	0.071669	2,6,10	4.907975
9	JMD_N_TMD_N-Pat...,1,4,7,10)-rasa	Structure	Relative ASA (Tien)	rasa	Structure/Relative ASA (Tien)	0.300000	0.100000	0.100000	0.129000	0.104000	0.023342	0.088033	1,4,7,10	4.601227
10	JMD_N_TMD_N-Seg...t(6,15)-phi_sin	Structure	Backbone dihedral (sin/cos)	phi_sin	Structure/Backb...edral (sin/cos)	0.280000	0.091000	0.091000	0.194000	0.027000	0.034294	0.116071	7,8	4.294479

# feature_map needs the structure's own DSSP categories (df_cat_str)
aa.CPPPlot(df_scales=df_scales_str, df_cat=df_cat_str,
           jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_str)
plt.tight_layout()
plt.show()

../_images/tutorial3d_data_representations_2_output_9_0.png

Arm 3 — embeddings: PLM channels (`run_num()`)

Fetch per-residue ESM-2 embeddings, normalize them to [0, 1] with encode(), name the D channels as pseudo-scales, then run CPP. The features are now learned embedding channels.

ep = aa.EmbeddingPreprocessor()
emb = ep.fetch_embeddings(df_seq, mode="residue", model="esm2_t6_8M")
dict_emb = ep.encode(df_seq=df_seq, embeddings=emb)              # raw -> [0, 1] per channel
df_scales_emb = ep.build_scales(df_seq=df_seq, dict_num=dict_emb)  # name the D channels
df_cat_emb = ep.build_cat(df_scales=df_scales_emb, random_state=42)
df_parts_emb, dict_num_parts_emb = aa.NumericalFeature.get_parts(df_seq=df_seq, dict_num=dict_emb)
df_feat_emb = aa.CPP(df_parts=df_parts_emb, df_scales=df_scales_emb, df_cat=df_cat_emb).run_num(
    dict_num_parts=dict_num_parts_emb, labels=labels, n_filter=50, n_jobs=1)
df_feat_emb = add_importance(df_feat_emb)
aa.display_df(df_feat_emb, n_rows=10, show_shape=True)

Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.

Loading weights:   0%|          | 0/102 [00:00<?, ?it/s]

/var/folders/sv/65tlch_10198qgmpwcp6408r0000gn/T/ipykernel_36514/333399537.py:4: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
  df_scales_emb = ep.build_scales(df_seq=df_seq, dict_num=dict_emb)  # name the D channels

DataFrame shape: (50, 14)

	feature	category	subcategory	scale_name	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions	feat_importance
1	JMD_N_TMD_N-Pat...5,8,11)-dim_232	Embeddings	Embeddings_cat14_subcat206	dim_232	0.500000	0.109000	0.109000	0.025000	0.038000	0.000157	0.497542	2,5,8,11	2.151926
2	JMD_N_TMD_N-Pat...,12,15)-dim_232	Embeddings	Embeddings_cat14_subcat206	dim_232	0.500000	0.101000	0.101000	0.033000	0.039000	0.000157	1.000000	6,9,13	2.151926
3	TMD_C_JMD_C-Pat...C,2,5,9)-dim_10	Embeddings	Embeddings_cat0_subcat15	dim_10	0.500000	0.099000	-0.099000	0.044000	0.033000	0.000157	0.226155	32,36,39	2.151926
4	JMD_N_TMD_N-Pat...,10,14)-dim_301	Embeddings	Embeddings_cat51_subcat267	dim_301	0.500000	0.080000	0.080000	0.021000	0.033000	0.000157	0.355387	7,11	2.151926
5	JMD_N_TMD_N-Pat...N,9,13)-dim_301	Embeddings	Embeddings_cat51_subcat267	dim_301	0.500000	0.080000	0.080000	0.021000	0.033000	0.000157	0.276412	9,13	2.151926
6	JMD_N_TMD_N-Pat...,1,5,8)-dim_204	Embeddings	Embeddings_cat23_subcat179	dim_204	0.500000	0.075000	-0.075000	0.024000	0.032000	0.000157	0.310964	1,5,8	2.151926
7	JMD_N_TMD_N-Pat...,2,5,8)-dim_146	Embeddings	Embeddings_cat28_subcat132	dim_146	0.490000	0.087000	-0.087000	0.035000	0.026000	0.000212	0.240070	2,5,8	2.108887
8	JMD_N_TMD_N-Seg...nt(2,7)-dim_232	Embeddings	Embeddings_cat14_subcat206	dim_232	0.490000	0.082000	0.082000	0.024000	0.041000	0.000212	0.280081	3,4,5	2.108887
9	TMD_C_JMD_C-Seg...(11,13)-dim_107	Embeddings	Embeddings_cat18_subcat96	dim_107	0.480000	0.126000	-0.126000	0.034000	0.056000	0.000285	0.167269	36	2.065849
10	TMD_C_JMD_C-Pat...n(C,2,5)-dim_75	Embeddings	Embeddings_cat6_subcat65	dim_75	0.480000	0.095000	0.095000	0.030000	0.037000	0.000285	0.215060	36,39	2.065849

# feature_map needs the embedding's own pseudo-categories (df_cat_emb)
aa.CPPPlot(df_scales=df_scales_emb, df_cat=df_cat_emb,
           jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_emb)
plt.tight_layout()
plt.show()

../_images/tutorial3d_data_representations_3_output_12_0.png

Arm 4 — annotation: UniProt residue features (`run_num()`)

Fetch curated UniProt residue annotations (phosphorylation, glycosylation, disulfide bonds, binding/active sites, …), turn them into per-residue [0, 1] channels with encode(), and run CPP. The features are now biological annotations. (Entries without the requested annotations are dropped.)

ANNOT_FEATURES = ["phospho", "glyco_n", "disulfide", "binding", "act_site", "mod_res_other"]
ap = aa.AnnotationPreprocessor()
df_annot = ap.fetch_uniprot(df_seq=df_seq, features=ANNOT_FEATURES)
dict_ann = ap.encode(df_seq=df_seq, df_annot=df_annot, features=ANNOT_FEATURES, on_mismatch="drop")

df_seq_ann = df_seq[df_seq["entry"].isin(dict_ann)].reset_index(drop=True)
df_scales_ann = ap.build_scales(df_seq=df_seq_ann, dict_num=dict_ann, features=ANNOT_FEATURES)
df_cat_ann = ap.build_cat(features=ANNOT_FEATURES)
df_parts_ann, dict_num_parts_ann = aa.NumericalFeature.get_parts(df_seq=df_seq_ann, dict_num=dict_ann)
df_feat_ann = aa.CPP(df_parts=df_parts_ann, df_scales=df_scales_ann, df_cat=df_cat_ann).run_num(
    dict_num_parts=dict_num_parts_ann, labels=list(df_seq_ann["label"]), n_filter=50, n_jobs=1)
df_feat_ann = add_importance(df_feat_ann)
aa.display_df(df_feat_ann, n_rows=10, show_shape=True)

/var/folders/sv/65tlch_10198qgmpwcp6408r0000gn/T/ipykernel_36514/4093506119.py:7: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq + dict_num). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
  df_scales_ann = ap.build_scales(df_seq=df_seq_ann, dict_num=dict_ann, features=ANNOT_FEATURES)

DataFrame shape: (18, 14)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions	feat_importance
1	JMD_N_TMD_N-Seg...3,12)-disulfide	PTMs	PTM_disulfide	disulfide	PTMs/PTM_disulfide	0.100000	0.100000	-0.100000	0.000000	0.200000	0.449692	1.000000	4,5	11.764706
2	JMD_N_TMD_N-Seg...(1,3)-disulfide	PTMs	PTM_disulfide	disulfide	PTMs/PTM_disulfide	0.100000	0.029000	-0.029000	0.000000	0.057000	0.449692	1.000000	1,2,3,4,5,6	11.764706
3	TMD-Segment(6,14)-binding	Functional sites	FUNC_binding	binding	Functional sites/FUNC_binding	0.050000	0.100000	-0.100000	0.000000	0.300000	0.705457	1.000000	18	5.882353
4	JMD_N_TMD_N-Pat...(C,1,4)-binding	Functional sites	FUNC_binding	binding	Functional sites/FUNC_binding	0.050000	0.050000	-0.050000	0.000000	0.150000	0.705457	1.000000	17,20	5.882353
5	TMD-Pattern(C,12,15)-binding	Functional sites	FUNC_binding	binding	Functional sites/FUNC_binding	0.050000	0.050000	-0.050000	0.000000	0.150000	0.705457	0.992989	16,19	5.882353
6	TMD-Pattern(C,1,4)-phospho	PTMs	PTM_phosphorylation	phospho	PTMs/PTM_phosphorylation	0.050000	0.050000	0.050000	0.150000	0.000000	0.705457	1.000000	27,30	5.882353
7	TMD_C_JMD_C-Pat...,11,14)-phospho	PTMs	PTM_phosphorylation	phospho	PTMs/PTM_phosphorylation	0.050000	0.050000	0.050000	0.150000	0.000000	0.705457	1.000000	31,34	5.882353
8	JMD_N_TMD_N-Pat...,4,7)-disulfide	PTMs	PTM_disulfide	disulfide	PTMs/PTM_disulfide	0.050000	0.033000	-0.033000	0.000000	0.100000	0.705457	1.000000	1,4,7	5.882353
9	JMD_N_TMD_N-Pat...,5,8)-disulfide	PTMs	PTM_disulfide	disulfide	PTMs/PTM_disulfide	0.050000	0.033000	-0.033000	0.000000	0.100000	0.705457	1.000000	1,5,8	5.882353
10	JMD_N_TMD_N-Pat...,5,9)-disulfide	PTMs	PTM_disulfide	disulfide	PTMs/PTM_disulfide	0.050000	0.033000	-0.033000	0.000000	0.100000	0.705457	1.000000	2,5,9	5.882353

# feature_map needs the annotation's own categories (df_cat_ann)
aa.CPPPlot(df_scales=df_scales_ann, df_cat=df_cat_ann,
           jmd_n_len=10, jmd_c_len=10).feature_map(df_feat=df_feat_ann)
plt.tight_layout()
plt.show()

../_images/tutorial3d_data_representations_4_output_15_0.png

Same machinery, different features

All four maps share the same layout and ``df_feat`` schema — only the value source changed:

Sequence → interpretable AAontology categories (physicochemical signature).
Structure → DSSP channels (secondary structure / accessibility / geometry).
Embeddings → learned PLM channels (what the language model encodes).
Annotation → curated UniProt residue features (known biology).

Because they all produce the same df_feat, any of them feeds the same downstream tools — CPPPlot, TreeModel, select_proteins(), ShapModel. Pick the representation that matches the question; CPP turns each into an interpretable, position-aware feature set.

CPP across data representations: sequence, structure, embeddings, annotation

Arm 1 — sequence: physicochemical scales (run())

Arm 2 — structure: AlphaFold / DSSP (run_num())

Arm 3 — embeddings: PLM channels (run_num())

Arm 4 — annotation: UniProt residue features (run_num())

Same machinery, different features

Arm 1 — sequence: physicochemical scales (`run()`)

Arm 2 — structure: AlphaFold / DSSP (`run_num()`)

Arm 3 — embeddings: PLM channels (`run_num()`)

Arm 4 — annotation: UniProt residue features (`run_num()`)