A minimal CPP analysis
The shortest complete loop in AAanalysis: load a dataset, run Comparative Physicochemical Profiling (CPP), and read out the physicochemical signature that distinguishes the two groups.
We use the domain-level γ-secretase dataset (DOM_GSEC: substrates
vs. non-substrates), so this is the domain level row of the
Prediction
tasks
table — the unit of comparison is the TMD part-set and the reference is
built from labeled A-vs-B groups. For the broader tour (machine
learning, SHAP feature impact, the comparison harness) see the Quick
start tutorial.
import matplotlib.pyplot as plt
import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False
aa.options["random_state"] = 42
1. Load a domain-level dataset
load_dataset returns a sequence table (df_seq) with the TMD
bounds; load_scales returns the amino-acid scales CPP profiles over.
The binary label column is the A-vs-B grouping CPP contrasts.
df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
labels = df_seq["label"].to_list()
df_scales = aa.load_scales()
aa.display_df(df=df_seq, n_rows=10, show_shape=True)
DataFrame shape: (100, 8)
| entry | sequence | label | tmd_start | tmd_stop | jmd_n | tmd | jmd_c | |
|---|---|---|---|---|---|---|---|---|
| 1 | Q14802 | MQKVTLGLLVFLAGF...PGETPPLITPGSAQS | 0 | 37 | 59 | NSPFYYDWHS | LQVGGLICAGVLCAMGIIIVMSA | KCKCKFGQKS |
| 2 | Q86UE4 | MAARSWQDELAQQAE...SPKQIKKKKKARRET | 0 | 50 | 72 | LGLEPKRYPG | WVILVGTGALGLLLLFLLGYGWA | AACAGARKKR |
| 3 | Q969W9 | MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL | 0 | 41 | 63 | FQSMEITELE | FVQIIIIVVVMMVMVVVITCLLS | HYKLSARSFI |
| 4 | P53801 | MAPGVARGPTPYWRL...GLFKEENPYARFENN | 0 | 97 | 119 | RWGVCWVNFE | ALIITMSVVGGTLLLGIAICCCC | CCRRKRSRKP |
| 5 | Q8IUW5 | MAPRALPGSAVLAAA...EVPATPVKRERSGTE | 0 | 59 | 81 | NDTGNGHPEY | IAYALVPVFFIMGLFGVLICHLL | KKKGYRCTTE |
| 6 | P01135 | MVPSAGQLALFALGI...LLKGRTACCHSETVV | 0 | 99 | 121 | AVVAASQKKQ | AITALVVVSIVALAVLIITCVLI | HCCQVRKHCE |
| 7 | O43914 | MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK | 0 | 42 | 64 | DCSCSTVSPG | VLAGIVMGDLVLTVLIALAVYFL | GRLVPRGRGA |
| 8 | P05556 | MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK | 0 | 729 | 751 | ENPECPTGPD | IIPIVAGVVAGIVLIGLALLLIW | KLLMIIHDRR |
| 9 | P16234 | MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL | 0 | 527 | 549 | VAPTLRSELT | VAAAVLVLLVIVIISLIVLVVIW | KQKPRYEIRW |
| 10 | P50895 | MEPPDAPAQARGAPR...SGGARGGSGGFGDEC | 0 | 549 | 571 | TVSPQTSQAG | VAVMAVAVSVGLLLLVVAVFYCV | RRKGGPCCRQ |
2. Reduce scale redundancy with AAclust
Hundreds of amino-acid scales are highly correlated. AAclust
clusters them and keeps one representative (medoid) per cluster, so
CPP profiles a compact, non-redundant scale set. Here we fix
n_clusters=50 for a quick, deterministic run; AAclust can also
choose the cluster count automatically.
aac = aa.AAclust()
X = np.array(df_scales).T
scales = aac.fit(X, names=list(df_scales), n_clusters=50).medoid_names_
df_scales = df_scales[scales]
3. Run CPP
SequenceFeature builds the parts (the TMD and its juxtamembrane
flanks) that CPP profiles. CPP.run then creates all Part × Split ×
Scale features, contrasts the two groups, and returns the ranked,
non-redundant feature table df_feat.
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
cpp = aa.CPP(df_scales=df_scales, df_parts=df_parts)
df_feat = cpp.run(labels=labels)
aa.display_df(df=df_feat, n_rows=10, show_shape=True)
DataFrame shape: (100, 13)
| feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_C_JMD_C-Seg...4,5)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.373000 | 0.220000 | 0.220000 | 0.124000 | 0.137000 | 0.000000 | 0.000000 | 33,34,35,36 |
| 2 | TMD_C_JMD_C-Seg...6,9)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.341000 | 0.264000 | 0.264000 | 0.187000 | 0.172000 | 0.000000 | 0.000002 | 32,33 |
| 3 | TMD-Segment(11,12)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.337000 | 0.319000 | -0.319000 | 0.175000 | 0.256000 | 0.000000 | 0.000003 | 27,28 |
| 4 | TMD_C_JMD_C-Seg...2,2)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.337000 | 0.106000 | 0.106000 | 0.071000 | 0.082000 | 0.000000 | 0.000002 | 31,32,33,34,35,36,37,38,39,40 |
| 5 | TMD_C_JMD_C-Seg...5,7)-FAUJ880103 | ASA/Volume | Volume | Volume | Normalized van ...e et al., 1988) | 0.334000 | 0.174000 | 0.174000 | 0.109000 | 0.139000 | 0.000000 | 0.000002 | 32,33,34 |
| 6 | TMD_C_JMD_C-Pat...4,8)-CHOC760104 | ASA/Volume | Buried | Buried | Proportion of r...(Chothia, 1976) | 0.326000 | 0.309000 | -0.309000 | 0.165000 | 0.274000 | 0.000000 | 0.000002 | 33,37 |
| 7 | TMD_C_JMD_C-Pat...,12)-PALJ810102 | Conformation | α-helix | α-helix | Normalized freq...u et al., 1981) | 0.325000 | 0.159000 | 0.159000 | 0.119000 | 0.136000 | 0.000000 | 0.000002 | 24,28,32 |
| 8 | TMD_C_JMD_C-Seg...,15)-MITS020101 | Polarity | Amphiphilicity | Amphiphilicity | Amphiphilicity ...u et al., 2002) | 0.324000 | 0.237000 | 0.237000 | 0.174000 | 0.171000 | 0.000000 | 0.000002 | 33 |
| 9 | TMD_C_JMD_C-Pat...4,8)-GUYH850101 | Composition | MPs (anchor) | Partition energy | Partition energy (Guy, 1985) | 0.323000 | 0.242000 | 0.242000 | 0.135000 | 0.219000 | 0.000000 | 0.000002 | 33,37 |
| 10 | TMD_C_JMD_C-Seg...,10)-MITS020101 | Polarity | Amphiphilicity | Amphiphilicity | Amphiphilicity ...u et al., 2002) | 0.323000 | 0.235000 | 0.235000 | 0.174000 | 0.171000 | 0.000000 | 0.000002 | 33,34 |
4. Rank and read out the signature
A TreeModel scores how much each feature contributes to telling the
two groups apart (group-level feature importance). CPPPlot.ranking
then shows the top features — each an interpretable, residue-grounded
Part × Split × Scale combination — which together form the
physicochemical signature of the substrates.
X = sf.feature_matrix(df_parts=df_parts, features=df_feat["feature"])
tm = aa.TreeModel()
tm.fit(X, labels=labels)
df_feat = tm.add_feat_importance(df_feat=df_feat, sort=True)
cpp_plot = aa.CPPPlot()
aa.plot_settings(short_ticks=True, weight_bold=False)
cpp_plot.ranking(df_feat=df_feat, n_top=10)
plt.tight_layout()
plt.show()
5. Map the full signature
CPPPlot.ranking reads out the top features one at a time;
CPPPlot.feature_map charts the whole signature in a single figure
— every selected Part × Split × Scale feature placed by scale
subcategory (y-axis) and residue position (x-axis), colored by the group
mean difference and marked by the same feature importance. It is the
most complete single-figure read-out of a CPP analysis, and the canvas
the Quick start tutorial later overlays SHAP feature impact onto
(shap_plot=True).
aa.plot_settings(font_scale=0.65, weight_bold=False)
cpp_plot.feature_map(df_feat=df_feat)
plt.show()
That is the whole loop: data → CPP → signature. To pick the right setup for your task — residue, domain, or protein level — see the Prediction tasks page, and the Protocols for end-to-end workflows.