CPP: Identification of physicochemical signatures

Comparative Physicochemical Profiling (CPP) is a sequence-based algorithm for interpretable feature engineering. It is the centerpiece of AAanalysis, introduced in [Breimann25].

You will learn

Tool: CPP
Input: df_parts, labels (+ split_kws and scales)
Output: df_feat (the ranked, non-redundant physicochemical signature)
Best used for: identifying the physicochemical signature that separates two sequence groups
Related protocol: P1: CPP signature
Related API: CPP, CPPPlot

The aim of the CPP algorithm is to identify a set of unique, non-redundant features that are most discriminant between the test and reference group of sequences. We call this feature set and its visual representation the physicochemical signature of the test group, which can be interpreted at group and sample level with single-residue resolution.
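
The selection idea can be pictured with a minimal sketch. It is not the CPP implementation: it assumes, for illustration only, that redundancy is judged purely by positional overlap, and the names `greedy_filter`, `max_overlap`, and the `feat_*` records are hypothetical. Features are visited from most to least discriminant and kept only if they overlap little with those already kept.

```python
def greedy_filter(features, max_overlap=0.5, n_filter=3):
    """Keep top-scoring features whose positions overlap little with kept ones.

    features: list of (name, score, positions) tuples (hypothetical layout).
    """
    kept = []
    # Visit candidates from the most to the least discriminant
    for name, score, positions in sorted(features, key=lambda f: f[1], reverse=True):
        pos = set(positions)
        # Assumption: a candidate is redundant if it shares too many residues
        redundant = any(
            len(pos & set(k[2])) / min(len(pos), len(k[2])) > max_overlap
            for k in kept
        )
        if not redundant:
            kept.append((name, score, positions))
        if len(kept) == n_filter:
            break
    return [name for name, _, _ in kept]


# Hypothetical candidates: (name, discriminative score, residue positions)
features = [
    ("feat_a", 0.39, [27, 28, 29, 30]),
    ("feat_b", 0.38, [27, 28, 29]),
    ("feat_c", 0.36, [32, 33]),
    ("feat_d", 0.35, [24, 28, 32, 35]),
]
selected = greedy_filter(features)
print(selected)
```

Here `feat_b` is dropped because all of its positions are already covered by the higher-scoring `feat_a`. A real filter would likely also compare the scales themselves, which this sketch leaves out.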

We will demonstrate this in three steps:

Feature Creation
CPP Analysis (Group Level)
CPP Analysis (“Sample Level”)

Feature Creation

To create a CPP object, you just need to provide a valid df_parts DataFrame:

import aaanalysis as aa
aa.options["verbose"] = False
aa.options["random_state"] = 42

# Load example dataset
df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()

# Create Parts
df_parts = sf.get_df_parts(df_seq=df_seq)

# Create CPP object and run with default splits and scales
cpp = aa.CPP(df_parts=df_parts)
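
To give an intuition for what "splits and scales" contribute, the following sketch computes one feature value by hand. It is an assumption-laden illustration, not the library code: it assumes a segment split picks the i-th of n roughly equal pieces of a part and that the feature value is the mean scale value over the selected residues. The helper `segment`, the sequence `toy_part`, and the values in `toy_scale` are hypothetical.

```python
from statistics import mean


def segment(seq, i_th, n_split):
    """Return the i_th (1-based) of n_split roughly equal segments of seq."""
    start = (i_th - 1) * len(seq) // n_split
    stop = i_th * len(seq) // n_split
    return seq[start:stop]


# Hypothetical amino acid scale (values in [0, 1]) and sequence part
toy_scale = {"A": 0.5, "L": 0.9, "G": 0.1, "K": 0.0}
toy_part = "ALLGKKAL"

# Second of two segments, then the average scale value over its residues
selected = segment(toy_part, i_th=2, n_split=2)
feature_value = mean(toy_scale[aa_letter] for aa_letter in selected)
print(selected, feature_value)
```

Each combination of part, split, and scale yields one such value per sequence, which is why the number of candidate features grows quickly with the number of splits and scales.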

A non-redundant set of physicochemical features is obtained through the CPP.run() method:

# Run CPP algorithm to obtain 100 features (default)
df_feat = cpp.run(labels=labels, n_filter=100)
aa.display_df(df=df_feat, n_rows=8, show_shape=True)

DataFrame shape: (100, 13)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	positions
1	TMD_C_JMD_C-Seg...2,3)-QIAN880106	Conformation	α-helix	α-helix (middle)	Weights for alp...ejnowski, 1988)	0.387000	0.118000	0.118000	0.068000	0.080000	27,28,29,30,31,32,33
2	TMD_C_JMD_C-Pat...,14)-CRAJ730103	Conformation	β-turn	β-turn	Normalized freq...d et al., 1973)	0.377000	0.285000	-0.285000	0.164000	0.177000	27,31
3	TMD_C_JMD_C-Seg...6,9)-FAUJ880104	Shape	Side chain length	Steric parameter	STERIMOL length...e et al., 1988)	0.367000	0.263000	0.263000	0.161000	0.168000	32,33
4	TMD_C_JMD_C-Seg...6,9)-ONEK900101	Others	Unclassified (Others)	ΔG values in peptides	Delta G values ...-DeGrado, 1990)	0.366000	0.111000	0.111000	0.070000	0.114000	32,33
5	TMD_C_JMD_C-Pat...,15)-QIAN880107	Conformation	α-helix	α-helix (middle)	Weights for alp...ejnowski, 1988)	0.363000	0.162000	0.162000	0.091000	0.118000	24,28,32,35
6	TMD_C_JMD_C-Seg...3,4)-HUTJ700103	Energy	Entropy	Entropy	Entropy of form...Hutchens, 1970)	0.360000	0.187000	0.187000	0.115000	0.128000	31,32,33,34,35
7	TMD_C_JMD_C-Seg...2,3)-WOLS870103	Others	PC 4	Principal Component 3 (Wold)	Principal prope...d et al., 1987)	0.359000	0.159000	-0.159000	0.090000	0.130000	27,28,29,30,31,32,33
8	TMD_C_JMD_C-Pat...,12)-CRAJ730103	Conformation	β-turn	β-turn	Normalized freq...d et al., 1973)	0.352000	0.227000	-0.227000	0.150000	0.170000	24,28,32
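
The abs_auc and mean_dif columns can be read with the help of a small sketch. It assumes that abs_auc is the absolute value of an AUC shifted by 0.5 (so 0 means no separation and 0.5 means perfect separation) and that mean_dif is the mean of the test group minus the mean of the reference group. Both are assumptions about the columns, and `adjusted_auc` and the feature values are hypothetical.

```python
from statistics import mean


def adjusted_auc(test_vals, ref_vals):
    """Fraction of (test, ref) pairs with test > ref (ties count half), minus 0.5."""
    wins = 0.0
    for t in test_vals:
        for r in ref_vals:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(test_vals) * len(ref_vals)) - 0.5


# Hypothetical feature values for one feature in both groups
test_vals = [0.62, 0.70, 0.55, 0.66]
ref_vals = [0.40, 0.52, 0.58, 0.45]

auc_adj = adjusted_auc(test_vals, ref_vals)
mean_dif = mean(test_vals) - mean(ref_vals)
print(abs(auc_adj), abs(mean_dif), mean_dif)
```

Under these assumptions, the sign of mean_dif tells the direction of the effect, while the absolute columns are convenient for ranking.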

CPP Analysis (Group Level)

The CPPPlot class provides methods for visualizing anything from a single feature to the complete feature set. For further analysis, we first need to add the group-level feature importance using the TreeModel class:

# Create feature matrix
sf = aa.SequenceFeature()
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

# Fit TreeModel
tm = aa.TreeModel()
tm = tm.fit(X, labels=labels)

# Add feature importance, ranked most-important-first via sort=True
df_feat = tm.add_feat_importance(df_feat=df_feat, sort=True)

The top 15 features can be visualized using the CPPPlot.ranking() method:

import matplotlib.pyplot as plt

# CPP ranking
cpp_plot = aa.CPPPlot()
aa.plot_settings(weight_bold=False, short_ticks=True)
cpp_plot.ranking(df_feat=df_feat)
plt.tight_layout()
plt.show()

../_images/tutorial3c_cpp_1_output_10_0.png

The difference of feature values between the test and the reference group can be displayed for any selected feature using the CPPPlot.feature() method:

# df_feat is already sorted by feat_importance (sort=True above), so feat_rank=1 is the top feature
# Show feature value distribution for the best feature
aa.plot_settings()
cpp_plot.feature(feature=df_feat, feat_rank=1, df_seq=df_seq, labels=labels)
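# Use positional access (.iloc) so the title matches the first row even if sorting changed the index labels
plt.title(f"{df_feat['feature'].iloc[0]} ({df_feat['subcategory'].iloc[0]})")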
plt.tight_layout()
plt.show()

../_images/tutorial3c_cpp_2_output_12_0.png

The feature value difference can be either positive (the test group has higher values, shown in red) or negative (the test group has lower values, shown in blue):

# feat_rank=2 selects the second-best feature from the ranked df_feat
aa.plot_settings()
cpp_plot.feature(feature=df_feat, feat_rank=2, df_seq=df_seq, labels=labels)
# Positional access (.iloc) again, here for the second row
plt.title(f"{df_feat['feature'].iloc[1]} ({df_feat['subcategory'].iloc[1]})")
plt.tight_layout()
plt.show()

../_images/tutorial3c_cpp_3_output_14_0.png
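
The sign convention described above can be written down as a tiny helper. This is only a reading aid: the function name `direction` is hypothetical, while the two example values are the mean_dif entries of the first two rows of the df_feat table shown earlier.

```python
def direction(mean_dif):
    """Translate the sign of mean_dif into the plot's color convention."""
    if mean_dif > 0:
        return "higher in test group (red)"
    if mean_dif < 0:
        return "lower in test group (blue)"
    return "no difference"


# mean_dif values taken from rows 1 and 2 of the df_feat table above
for value in (0.118, -0.285):
    print(value, "->", direction(value))
```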

To visualize the importance of all features at single-residue resolution, the cumulative feature importance per residue position can be shown using the CPPPlot.profile() method:

# Plot CPP profile
aa.plot_settings(font_scale=0.9)
cpp_plot.profile(df_feat=df_feat)
plt.tight_layout()
plt.show()

../_images/tutorial3c_cpp_4_output_16_0.png
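
How a per-residue profile can arise from a feature table is sketched below. This is a simplified illustration rather than the CPPPlot.profile() implementation: it assumes that each feature's importance is shared equally among the residue positions it covers, and the importance values are hypothetical. The position strings follow the comma-separated format of the positions column.

```python
from collections import defaultdict

# Hypothetical (positions, feat_importance) pairs in the df_feat format
rows = [("27,31", 4.0), ("32,33", 2.0), ("31,32,33,34", 4.0)]

profile = defaultdict(float)
for positions, importance in rows:
    residues = [int(p) for p in positions.split(",")]
    # Assumption: importance is shared equally among the covered residues
    share = importance / len(residues)
    for residue in residues:
        profile[residue] += share

# Cumulative importance per residue position, in sequence order
for residue in sorted(profile):
    print(residue, profile[residue])
```

With equal sharing, the profile sums to the total importance of all features, so residues covered by several important features stand out as peaks.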

The complete feature landscape can be charted using the CPPPlot.feature_map() method. This CPP feature map shows the feature value difference and feature importance per residue and scale subcategory, which are described and discussed in our AAontology (AAontology Usage Principles, [Breimann24b]):

# Plot CPP feature map
cpp_plot = aa.CPPPlot()
aa.plot_settings(font_scale=0.65, weight_bold=False)
cpp_plot.feature_map(df_feat=df_feat)
plt.show()

../_images/tutorial3c_cpp_5_output_18_0.png

CPP Analysis (“Sample Level”)

You can pass the sequence parts of an individual sample to the plotting methods to map the group-level CPP features onto that sample:

# Get sequence parts for APP (entry P05067)
seq_kws = sf.get_seq_kws(df_seq=df_seq, df_parts=df_parts, sample="P05067")

Provide the parts as tmd_seq, jmd_n_seq, and jmd_c_seq parameters to the CPPPlot.profile() method:

# Plot CPP profile ("sample level")
aa.plot_settings(font_scale=0.9)
cpp_plot.profile(df_feat=df_feat, **seq_kws)
plt.tight_layout()
plt.show()

../_images/tutorial3c_cpp_6_output_22_0.png
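
The keyword arguments unpacked via **seq_kws can be pictured as a plain dictionary with the three part sequences. The sketch below builds such a dictionary by hand for a made-up sequence. It assumes 1-based inclusive TMD boundaries and a flanking length of 10 residues per JMD; the helper `split_parts`, the sequence, and the boundaries are hypothetical and not taken from the dataset.

```python
def split_parts(sequence, tmd_start, tmd_stop, jmd_len=10):
    """Slice a sequence into JMD-N, TMD, and JMD-C (1-based inclusive TMD bounds)."""
    tmd = sequence[tmd_start - 1:tmd_stop]
    jmd_n = sequence[max(0, tmd_start - 1 - jmd_len):tmd_start - 1]
    jmd_c = sequence[tmd_stop:tmd_stop + jmd_len]
    return {"jmd_n_seq": jmd_n, "tmd_seq": tmd, "jmd_c_seq": jmd_c}


# Hypothetical sequence: 10 flanking + 20 TMD + 10 flanking + 4 extra residues
toy_seq = "MKTAYIAKQR" + "LLLLLAAAAAGGGGGVVVVV" + "KRKRKRKRKR" + "EEEE"
toy_seq_kws = split_parts(toy_seq, tmd_start=11, tmd_stop=30)
print(toy_seq_kws)
```

A dictionary of this shape could then be unpacked into a plotting call in the same way as seq_kws, provided the part lengths match those used to create the features.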

Or to the feature_map() method:

# Plot CPP feature map ("sample level")
cpp_plot = aa.CPPPlot()
aa.plot_settings(font_scale=0.65, weight_bold=False)
cpp_plot.feature_map(df_feat=df_feat, **seq_kws)
plt.show()

../_images/tutorial3c_cpp_7_output_24_0.png

Note that these plots still show the group-level results, merely mapped onto the sequence of a specific sample. To obtain the sample-specific feature value difference and feature impact, see the Explainable AI Tutorial.

More details on the CPP and CPPPlot classes are given in the Feature Engineering API. To delve deeper into the feature concept, see the CPP Usage Principles section.