CPP

class CPP(df_parts, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None, bootstrap=False, bootstrap_kws=None)

Bases: Tool

Comparative Physicochemical Profiling (CPP) class to create and filter features that are most discriminant between two sets of sequences [Breimann25].

CPP aims at identifying a set of non-redundant features that are most discriminant between the test and reference group of sequences.

Added in version 0.1.0.

last_filter_stats_

Filter-funnel counts from the most recent run() / run_num() / run_composit() (n_candidates, n_after_prefilter, n_after_redundancy, n_final); None before the first call.

Type: dict
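A minimal sketch of how the funnel counts might be read, assuming last_filter_stats_ is a plain dict with the four keys listed above. The helper funnel_retention is hypothetical (not part of the library), and the counts are invented placeholders rather than output from a real run:

```python
def funnel_retention(stats):
    """Return the fraction of candidate features that survive each filter stage."""
    order = ["n_candidates", "n_after_prefilter", "n_after_redundancy", "n_final"]
    n_start = stats[order[0]]
    if n_start <= 0:
        raise ValueError("n_candidates must be positive")
    return {key: stats[key] / n_start for key in order}


# Placeholder counts for illustration only (not from a real CPP run)
example_stats = {"n_candidates": 1000, "n_after_prefilter": 50,
                 "n_after_redundancy": 20, "n_final": 10}
retention = funnel_retention(example_stats)
for stage, frac in retention.items():
    print(f"{stage}: {frac:.1%}")
```

Since the attribute is None before the first run, a real caller would check for that before computing ratios.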

Notes

Parameters ending in _kws (e.g. split_kws, bootstrap_kws) bundle related keyword arguments into one dict; see the keyword-dict parameters overview.

Parameters:

df_parts (DataFrame)
split_kws (Optional[dict])
df_scales (Optional[DataFrame])
df_cat (Optional[DataFrame])
accept_gaps (bool)
verbose (bool)
random_state (Optional[int])
bootstrap (bool)
bootstrap_kws (Optional[dict])

Methods

`eval`(list_df_feat, labels[, label_test, ...])	Evaluate the quality of different sets of identified Comparative Physicochemical Profiling (CPP) features.
`run`(labels[, label_test, label_ref, ...])	Perform Comparative Physicochemical Profiling (CPP) algorithm: creation and two-step filtering of interpretable sequence-based features.
`run_composit`(labels[, composition, k, ...])	Composition-mode CPP: build a `df_feat` of composition features (a special, non-positional feature type) instead of positional Part-Split-Scale features.
`run_num`(dict_num_parts, labels[, ...])	Numerical-mode Comparative Physicochemical Profiling (CPP): same algorithm as `run()`, but per-residue values come from a pre-sliced numerical tensor (dict_num_parts) instead of an AA→scale lookup.
`simplify`(df_feat, labels[, strategy, ...])	Simplify a feature set by swapping scales for more interpretable correlated ones.

__init__(df_parts, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None, bootstrap=False, bootstrap_kws=None)

Parameters:

df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts.
split_kws (dict, optional) – Dictionary holding one parameter dictionary per chosen split_type. Default from SequenceFeature.get_split_kws(). If a sequence part in df_parts is too short for the requested splits (e.g. a free peptide with no flanking context), the split lengths are auto-capped to the shortest part (Segment n_split_max is capped; Pattern / PeriodicPattern split types that cannot fit are dropped) and one UserWarning is emitted; the capped split_kws is stored as self.split_kws. For parts long enough for the requested splits this is a no-op.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].
accept_gaps (bool, default=False) – Whether to accept missing values (gaps) by omitting them from computations (if True). Combined with SequencePreprocessor.pad_parts(), this enables analyzing short, variable-length sequences at a uniform, finer n_split_max than the shortest real sequence allows (a padded part is longer, so more splits fit).
verbose (bool, default=True) – If True, verbose outputs are enabled.
random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, stochastic processes are left unseeded, so results can differ between runs. Also seeds the bootstrap resampling (bootstrap=True).
bootstrap (bool, default=False) – Whether to add bootstrap stability annotation to the selection. False (default) runs the single-pass selection, so run() / run_num() / run_composit() behave exactly as before (output byte-identical) and bootstrap_kws is ignored. True wraps the ordinary run: the data is resampled bootstrap_kws['rounds'] times and re-selected each round to score how often each feature is selected, then the ordinary full-data selection is returned with a selection_frequency column (0 to 1) added. The selected features are exactly those of a normal run (n_filter is the selection criterion): bootstrapping annotates their stability, it does not change which features are selected.
bootstrap_kws (dict, optional) –
Bootstrap configuration (only used when bootstrap=True). A dict with any subset of these keys; unset keys keep their tuned default, and None uses all defaults:
- 'rounds' (int, default 20): number of resampling rounds (>=1). More rounds give a more precise selection_frequency estimate at a roughly linear cost; ~20 to ~50 is typically enough.
- 'resample' ({‘both’, ‘reference’, ‘test’}, default 'reference'): which class group is resampled each round. 'reference' fixes the test group and resamples only the reference group (isolating the dominant source of selection instability); 'both' resamples both; 'test' resamples only the test group.
- 'frac' (float, default 0.8): per-group resample size as a fraction of the group’s samples (0<frac<=1), drawn with replacement each round. 0.8 is the conventional sub-sample size; with n_filter as the final cut the exact fraction only modestly affects the result.
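The "unset keys keep their default" behaviour described for bootstrap_kws can be sketched as a merge over the documented defaults followed by range checks. This is an illustrative stand-in, not the library's code: resolve_bootstrap_kws and the two constants are hypothetical names, and rejecting unknown keys is an assumption.

```python
# Documented defaults for bootstrap_kws
DEFAULT_BOOTSTRAP_KWS = {"rounds": 20, "resample": "reference", "frac": 0.8}
VALID_RESAMPLE = {"both", "reference", "test"}


def resolve_bootstrap_kws(bootstrap_kws=None):
    """Merge user settings over the defaults and validate the documented ranges."""
    kws = dict(DEFAULT_BOOTSTRAP_KWS)
    if bootstrap_kws is not None:
        unknown = set(bootstrap_kws) - set(DEFAULT_BOOTSTRAP_KWS)
        if unknown:  # assumption: unknown keys are treated as an error
            raise ValueError(f"Unknown bootstrap_kws keys: {sorted(unknown)}")
        kws.update(bootstrap_kws)
    if not isinstance(kws["rounds"], int) or kws["rounds"] < 1:
        raise ValueError("'rounds' should be an int >= 1")
    if kws["resample"] not in VALID_RESAMPLE:
        raise ValueError(f"'resample' should be one of {sorted(VALID_RESAMPLE)}")
    if not 0 < kws["frac"] <= 1:
        raise ValueError("'frac' should satisfy 0 < frac <= 1")
    return kws


print(resolve_bootstrap_kws())                # all defaults
print(resolve_bootstrap_kws({"rounds": 50}))  # override one key, keep the rest
```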

Notes

All scales from df_scales must be contained in df_cat.
Stability annotation (bootstrap=True) is a cross-cutting wrapper, configured once on the object and applied uniformly by run(), run_num(), and run_composit(). It is a thin wrapper: it re-runs the ordinary selection on bootstrap_kws['rounds'] resamples of the data to score how often each feature is selected, then returns the ordinary full-data selection with a per-feature selection_frequency (0 to 1) added. The selected feature list is exactly a normal run (n_filter is the criterion); the annotation flags which of those features are reproducible under resampling vs sample-specific. It is a trust / interpretability aid, not a change to the list or to predictive accuracy. To keep any downstream cross-validation leakage-safe, run CPP (bootstrapped or not) inside each training fold, never on the full dataset before splitting.
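The wrapper idea can be illustrated with a toy, pure-Python sketch. It is not the CPP implementation: select_top is a stand-in that ranks features by absolute mean difference (CPP's real filtering is more involved), the data are synthetic, and all function names are hypothetical. What it shows is the pattern described above: the full-data selection is returned unchanged and only annotated with a re-selection frequency.

```python
import random
from statistics import mean


def select_top(test_group, ref_group, n_filter):
    """Toy selection: rank features by absolute mean difference, keep the top n_filter."""
    n_feat = len(test_group[0])
    scores = [abs(mean(row[j] for row in test_group) - mean(row[j] for row in ref_group))
              for j in range(n_feat)]
    ranked = sorted(range(n_feat), key=lambda j: scores[j], reverse=True)
    return ranked[:n_filter]


def selection_frequency(test_group, ref_group, n_filter, rounds=20, frac=0.8, seed=0):
    """Annotate the full-data selection with how often each feature is re-selected."""
    rng = random.Random(seed)
    selected = select_top(test_group, ref_group, n_filter)  # full-data selection, kept as is
    counts = {j: 0 for j in selected}
    n_draw = max(1, round(frac * len(ref_group)))
    for _ in range(rounds):
        # resample='reference': draw the reference group with replacement, test group fixed
        ref_boot = rng.choices(ref_group, k=n_draw)
        for j in select_top(test_group, ref_boot, n_filter):
            if j in counts:
                counts[j] += 1
    return {j: counts[j] / rounds for j in selected}


# Synthetic toy data: feature 0 separates the groups, features 1 to 4 are noise
data_rng = random.Random(1)
test_group = [[10 + data_rng.uniform(-1, 1)] + [data_rng.uniform(-1, 1) for _ in range(4)]
              for _ in range(20)]
ref_group = [[data_rng.uniform(-1, 1) for _ in range(5)] for _ in range(20)]
freq = selection_frequency(test_group, ref_group, n_filter=2)
print(freq)
```

In this toy setting the separating feature is re-selected in every round, while the second slot is filled by whichever noise feature happens to rank highest, so its frequency is typically lower.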
Choosing the settings. bootstrap=True uses the tuned defaults in bootstrap_kws (rounds=20, frac=0.8, resample='reference'); pass a dict to override any of them. rounds controls how precisely selection_frequency is estimated (~20 is a practical sweet spot, ~50 converges the estimate) at a roughly linear cost; frac=0.8 is the conventional sub-sample size and a robust default; resample='reference' resamples only the (usually larger, noisier) reference group.
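As a rough guide to what "precision" means here, the following sketch treats each round as an independent yes/no trial, so the standard error of an estimated frequency p is sqrt(p * (1 - p) / rounds). This binomial view is a simplifying assumption made for illustration, not a statement about how the library computes anything; frequency_standard_error is a hypothetical helper.

```python
from math import sqrt


def frequency_standard_error(p, rounds):
    """Binomial standard error of a selection frequency estimated from `rounds` trials."""
    return sqrt(p * (1 - p) / rounds)


# Worst case p = 0.5 for the two round counts mentioned in the text
for rounds in (20, 50):
    print(rounds, round(frequency_standard_error(0.5, rounds), 3))
# 20 0.112
# 50 0.071
```

Under this approximation, going from 20 to 50 rounds shrinks the worst-case uncertainty from about 0.11 to about 0.07, at 2.5 times the cost.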
Splits auto-cap to the shortest part. A sequence part of length L can carry a Segment with at most n_split_max = L pieces, a Pattern only if len_max <= L, and a PeriodicPattern only if its first step <= L. When df_parts contains a part too short for the requested split_kws (typically free peptides / short domains with no flanking context), CPP caps the Segment n_split_max and drops the Pattern / PeriodicPattern split types that cannot fit (Segment is always kept), emits one UserWarning, and stores the capped split_kws as self.split_kws so both run() and run_num() use it. This never raises; for parts long enough for the requested splits it is a no-op and the output is unchanged.
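The capping rule can be sketched as follows. This is an illustration under assumptions, not the library's code: cap_split_kws is a hypothetical helper, the split_kws layout (keys n_split_min, n_split_max, len_max, steps) is assumed to match what SequenceFeature.get_split_kws() returns, and lowering n_split_min alongside n_split_max is an extra assumption not stated in the text.

```python
import copy
import warnings


def cap_split_kws(split_kws, min_len):
    """Return a copy of split_kws that fits a part of length min_len."""
    capped = copy.deepcopy(split_kws)
    changed = False
    seg = capped.get("Segment")
    if seg is not None and seg["n_split_max"] > min_len:
        seg["n_split_max"] = min_len  # Segment is kept, only capped
        if "n_split_min" in seg:      # assumption: keep n_split_min <= n_split_max
            seg["n_split_min"] = min(seg["n_split_min"], min_len)
        changed = True
    pat = capped.get("Pattern")
    if pat is not None and pat["len_max"] > min_len:
        del capped["Pattern"]         # Pattern fits only if len_max <= L
        changed = True
    per = capped.get("PeriodicPattern")
    if per is not None and per["steps"][0] > min_len:
        del capped["PeriodicPattern"]  # PeriodicPattern fits only if first step <= L
        changed = True
    if changed:
        warnings.warn(f"split_kws capped to the shortest part (length {min_len})", UserWarning)
    return capped


# Assumed layout of a split_kws dict, used here for illustration
example_split_kws = {
    "Segment": {"n_split_min": 1, "n_split_max": 15},
    "Pattern": {"steps": [3, 4], "n_min": 2, "n_max": 4, "len_max": 15},
    "PeriodicPattern": {"steps": [3, 4]},
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    capped = cap_split_kws(example_split_kws, min_len=8)
print(capped)
```

For a shortest part of length 8, the Segment limit drops to 8, Pattern is removed (len_max 15 exceeds 8), PeriodicPattern stays (first step 3 fits), and a single warning is recorded.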
CPP is intrinsically binary (one test group vs one reference group). For multi-class or regression tasks, do not change CPP: transform the target into binary contrasts with the SequenceFeature.get_labels_* helpers and loop run() (or run_num()) over them. Use SequenceFeature.get_labels_ovr() / SequenceFeature.get_labels_ovo() for multi-class and SequenceFeature.get_labels_quantile() / SequenceFeature.get_labels_tiered() for regression. The row-dropping helpers (ovo / tiered) return the row-matched df_parts / dict_num_parts per contrast, ready to drop straight into a new CPP. See the P8: Prediction protocol for the end-to-end workflow.
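To make the one-vs-rest idea concrete, here is a plain-Python stand-in for the kind of contrast that SequenceFeature.get_labels_ovr() is described as producing. The helper ovr_contrasts is hypothetical and does not reproduce the library function's signature or return type; the encoding 1 = test, 0 = reference is an assumption that can be adapted through label_test / label_ref.

```python
def ovr_contrasts(y):
    """Yield (class_label, binary_labels): 1 for the class, 0 for all other classes."""
    for cls in sorted(set(y)):
        yield cls, [1 if label == cls else 0 for label in y]


y = ["A", "B", "C", "A", "B"]
contrasts = dict(ovr_contrasts(y))
for cls, labels in contrasts.items():
    # Each binary vector would drive one CPP run, e.g. cpp.run(labels=labels)
    print(cls, labels)
# A [1, 0, 0, 1, 0]
# B [0, 1, 0, 0, 1]
# C [0, 0, 1, 0, 0]
```

One-vs-rest keeps every row in every contrast, so the same df_parts can be reused across the loop; the row-dropping schemes (ovo / tiered) need the row-matched inputs mentioned above.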

See also

CPPPlot: the respective plotting class.
SequenceFeature for definition of sequence Parts.
SequenceFeature.get_split_kws() for definition of Splits keyword arguments.
load_scales() for definition of amino acid Scales and their categories.
SequenceFeature.get_labels_* (e.g. SequenceFeature.get_labels_ovr()): build multi-class / regression label contrasts to drive CPP.

Examples

To create a CPP object, you just need to provide a valid df_parts DataFrame:

import aaanalysis as aa
df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
# Create CPP object
cpp = aa.CPP(df_parts=df_parts)

You can adjust Parts, Splits, and Scales as follows:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd_jmd"])
split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=1)
df_scales = aa.load_scales()
scales = list(df_scales)[0:10]
# Create CPP object for Segments over the complete TMD-JMD with the first 10 scales
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales[scales])

The CPP constructor also accepts df_cat (the scale-category table used by the redundancy filter and plotting; loaded automatically when omitted), accept_gaps (tolerate gap symbols in the sequence parts), random_state (reproducibility), and verbose (progress messages):

labels = df_seq["label"].to_list()
df_scales = aa.load_scales(top_explain_n=20)
df_cat = aa.load_scales(name="scales_cat", top_explain_n=20)
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales, df_cat=df_cat,
             accept_gaps=False, random_state=0, verbose=False)
df_feat = cpp.run(labels=labels, n_filter=10)
aa.display_df(df_feat, n_rows=5, show_shape=True)

DataFrame shape: (10, 13)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	positions
1	TMD_JMD-Segment...,15)-FAUJ880104	Shape	Side chain length	Steric parameter	STERIMOL length...e et al., 1988)	0.382000	0.264000	0.264000	0.156000	0.156000	33,34
2	TMD_JMD-Segment...8,9)-HUTJ700103	Energy	Entropy	Entropy	Entropy of form...Hutchens, 1970)	0.378000	0.212000	0.212000	0.124000	0.135000	32,33,34,35
3	TMD_JMD-Pattern...,14)-CRAJ730103	Conformation	β-turn	β-turn	Normalized freq...d et al., 1973)	0.376000	0.281000	-0.281000	0.159000	0.180000	27,31
4	TMD_JMD-Segment...,13)-FAUJ880104	Shape	Side chain length	Steric parameter	STERIMOL length...e et al., 1988)	0.364000	0.275000	0.275000	0.172000	0.177000	31,32,33
5	TMD_JMD-Pattern...,14)-QIAN880107	Conformation	α-helix	α-helix (middle)	Weights for alp...ejnowski, 1988)	0.359000	0.199000	0.199000	0.112000	0.145000	27,31

Turn on bootstrap=True to annotate the selection with a stability score. The bootstrap is configured by one bootstrap_kws dict (parallel to split_kws) with keys rounds / resample / frac: each of rounds rounds resamples the data (resample chooses which class group is resampled: "reference" fixes the test group, or "both" / "test"; frac is the per-group draw size) and re-selects features, scoring how often each is selected. The ordinary run is then returned with a selection_frequency column (0 to 1) added. The selected features are exactly those of a normal run (n_filter stays the selection criterion), and selection_frequency flags which of them are reproducible under resampling. The mode applies to run(), run_num(), and run_composit() alike.

Why these defaults? bootstrap=True uses bootstrap_kws=dict(rounds=20, resample="reference", frac=0.8) — a good starting point on the bundled DOM_GSEC data: ~20 rounds settles the selection_frequency estimate at a roughly linear cost (raise toward 50 for a touch more precision), 0.8 is the conventional resampling sub-sample size, and "reference" resamples only the usually larger, noisier reference group. Read selection_frequency to trust the reproducible features (near 1.0) over the sample-specific ones (near 0); the default bootstrap=False runs the ordinary single-pass selection.

cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales, df_cat=df_cat, random_state=0, verbose=False,
             bootstrap=True, bootstrap_kws=dict(rounds=20, resample="reference", frac=0.8))
df_feat = cpp.run(labels=labels, n_filter=10)
aa.display_df(df_feat, n_rows=5, show_shape=True)

DataFrame shape: (10, 14)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	positions	selection_frequency
1	TMD_JMD-Segment...,15)-FAUJ880104	Shape	Side chain length	Steric parameter	STERIMOL length...e et al., 1988)	0.382000	0.264000	0.264000	0.156000	0.156000	33,34	0.500000
2	TMD_JMD-Segment...8,9)-HUTJ700103	Energy	Entropy	Entropy	Entropy of form...Hutchens, 1970)	0.378000	0.212000	0.212000	0.124000	0.135000	32,33,34,35	0.450000
3	TMD_JMD-Pattern...,14)-CRAJ730103	Conformation	β-turn	β-turn	Normalized freq...d et al., 1973)	0.376000	0.281000	-0.281000	0.159000	0.180000	27,31	0.750000
4	TMD_JMD-Segment...,13)-FAUJ880104	Shape	Side chain length	Steric parameter	STERIMOL length...e et al., 1988)	0.364000	0.275000	0.275000	0.172000	0.177000	31,32,33	0.300000
5	TMD_JMD-Pattern...,14)-QIAN880107	Conformation	α-helix	α-helix (middle)	Weights for alp...ejnowski, 1988)	0.359000	0.199000	0.199000	0.112000	0.145000	27,31	0.150000

# Rank the selected features by how reproducible they are (selection_frequency near 1.0 = robust,
# near 0 = sample-specific) — the same top-n_filter features, now with a trust signal:
df_stable = df_feat[["feature", "abs_auc", "selection_frequency"]].sort_values(
    "selection_frequency", ascending=False)
aa.display_df(df_stable, n_rows=10, show_shape=True)

DataFrame shape: (10, 3)

	feature	abs_auc	selection_frequency
3	TMD_JMD-Pattern...,14)-CRAJ730103	0.376000	0.750000
1	TMD_JMD-Segment...,15)-FAUJ880104	0.382000	0.500000
2	TMD_JMD-Segment...8,9)-HUTJ700103	0.378000	0.450000
4	TMD_JMD-Segment...,13)-FAUJ880104	0.364000	0.300000
7	TMD_JMD-Pattern...,15)-RADA880107	0.354000	0.300000
9	TMD_JMD-Segment...5,5)-JANJ780101	0.353000	0.200000
5	TMD_JMD-Pattern...,14)-QIAN880107	0.359000	0.150000
6	TMD_JMD-Segment...,15)-LINS030101	0.354000	0.100000
8	TMD_JMD-Segment...8,9)-RADA880107	0.353000	0.100000
10	TMD_JMD-Pattern...,12)-JANJ780101	0.348000	0.050000
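One possible follow-up is to keep only the features above a stability cutoff. The sketch below copies the (index, abs_auc, selection_frequency) values from the table above into plain tuples so it runs without the library; the cutoff MIN_FREQ = 0.5 is a hypothetical choice for illustration, not a recommended threshold.

```python
# (feature index, abs_auc, selection_frequency) copied from the table above
rows = [
    (3, 0.376, 0.75), (1, 0.382, 0.50), (2, 0.378, 0.45), (4, 0.364, 0.30),
    (7, 0.354, 0.30), (9, 0.353, 0.20), (5, 0.359, 0.15), (6, 0.354, 0.10),
    (8, 0.353, 0.10), (10, 0.348, 0.05),
]

MIN_FREQ = 0.5  # hypothetical cutoff, chosen only for this illustration
stable = [idx for idx, _, freq in rows if freq >= MIN_FREQ]
print(stable)
# [3, 1]
```

On the DataFrame itself, the same filter is ordinary pandas boolean indexing: df_feat[df_feat["selection_frequency"] >= 0.5].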