CPP

class CPP(df_parts, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)

Bases: Tool

Comparative Physicochemical Profiling (CPP) class to create and filter features that are most discriminant between two sets of sequences [Breimann25].

CPP aims at identifying a set of non-redundant features that are most discriminant between the test and reference group of sequences.

Added in version 0.1.0.

df_parts

DataFrame with sequence Parts.

split_kws

Nested dictionary defining Splits, with one parameter dictionary for each chosen split_type.

df_scales

DataFrame with amino acid Scales.

df_cat

DataFrame with categories for physicochemical amino acid Scales.

Methods

eval(list_df_feat, labels[, label_test, ...])

Evaluate the quality of different sets of identified Comparative Physicochemical Profiling (CPP) features.

run(labels[, label_test, label_ref, ...])

Perform Comparative Physicochemical Profiling (CPP) algorithm: creation and two-step filtering of interpretable sequence-based features.

run_num(dict_num_parts, labels[, ...])

Numerical-mode Comparative Physicochemical Profiling (CPP): same algorithm as run(), but per-residue values come from a pre-sliced numerical tensor (dict_num_parts) instead of an AA→scale lookup.

simplify(df_feat, labels[, strategy, ...])

Simplify a feature set by swapping scales for more interpretable correlated ones.

__init__(df_parts, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)
Parameters:
  • df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts.

  • split_kws (dict, optional) – Dictionary with parameter dictionary for each chosen split_type. Default from SequenceFeature.get_split_kws().

  • df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].

  • df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].

  • accept_gaps (bool, default=False) – If True, missing values (gaps) are accepted and omitted from computations.

  • verbose (bool, default=True) – If True, verbose outputs are enabled.

  • random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, stochastic processes will be truly random.
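The effect of accept_gaps can be illustrated with a minimal, library-independent sketch. It assumes that a feature value is the average of scale values over the residues of a sequence part, and that "-" is the gap symbol. The function mean_scale_value, the constant GAP, and the values in TOY_SCALE are hypothetical and are not part of aaanalysis.

```python
from statistics import mean

# Toy scale with made-up values (NOT taken from load_scales())
TOY_SCALE = {"A": 0.2, "L": 0.9, "G": 0.0, "K": 0.5}
GAP = "-"  # assumed gap symbol


def mean_scale_value(part, scale, accept_gaps=False):
    """Average the scale values over the residues of a sequence part.

    With accept_gaps=False a gap raises ValueError; with accept_gaps=True
    gap positions are skipped before averaging.
    """
    values = []
    for letter in part:
        if letter == GAP:
            if not accept_gaps:
                raise ValueError(f"gap found in part {part!r}")
            continue  # omit the gap from the computation
        values.append(scale[letter])
    if not values:
        raise ValueError("part contains no residues to average")
    return mean(values)


# The gap is skipped, so only A, L and K contribute to the average
value_with_gap = mean_scale_value("AL-K", TOY_SCALE, accept_gaps=True)
```

Under these assumptions, a part with a gap yields the same value as the gap-free part, while the default setting rejects it outright.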

Notes

See also

  • CPPPlot: the respective plotting class.

  • SequenceFeature for definition of sequence Parts.

  • SequenceFeature.get_split_kws() for definition of Splits keyword arguments.

  • load_scales() for definition of amino acid Scales and their categories.

  • SequenceFeature.get_labels_* (e.g. SequenceFeature.get_labels_ovr()): build multi-class / regression label contrasts to drive CPP.

Examples

To create a CPP object, you just need to provide a valid df_parts DataFrame:

import aaanalysis as aa
df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
# Create CPP object
cpp = aa.CPP(df_parts=df_parts)

You can adjust Parts, Splits, and Scales as follows:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd_jmd"])
split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=1)
df_scales = aa.load_scales()
scales = list(df_scales)[0:10]
# Create CPP object for Segments over the complete TMD-JMD with the first 10 scales
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales[scales])
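To make the Segment split more concrete, the following sketch divides a sequence part into n_split contiguous segments and returns the i_th one. With n_split=1 the single segment spans the complete part, which matches the n_split_max=1 setting used above. The helper get_segment and its integer-division boundary rule are assumptions made for illustration; the boundaries computed by aaanalysis may differ.

```python
def get_segment(seq, i_th, n_split):
    """Return the i_th (1-based) of n_split contiguous segments of seq.

    Boundaries use integer division, so segment lengths differ by at most
    one residue and all segments together cover seq exactly once.
    """
    if n_split < 1 or n_split > len(seq):
        raise ValueError("n_split must be between 1 and len(seq)")
    if not 1 <= i_th <= n_split:
        raise ValueError("i_th must be between 1 and n_split")
    start = (i_th - 1) * len(seq) // n_split
    stop = i_th * len(seq) // n_split
    return seq[start:stop]


part = "ABCDEFGHIJ"  # placeholder letters standing in for a sequence part
whole = get_segment(part, i_th=1, n_split=1)   # the complete part
second = get_segment(part, i_th=2, n_split=5)  # second of five segments
```

Raising n_split produces shorter, more position-specific segments, which is why the number of candidate features grows with n_split_max.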

The CPP constructor also accepts df_cat (the scale-category table used by the redundancy filter and plotting; loaded automatically when omitted), accept_gaps (tolerate gap symbols in the sequence parts), random_state (reproducibility), and verbose (progress messages):

labels = df_seq["label"].to_list()
df_scales = aa.load_scales(top_explain_n=20)
df_cat = aa.load_scales(name="scales_cat", top_explain_n=20)
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales, df_cat=df_cat,
             accept_gaps=False, random_state=0, verbose=False)
df_feat = cpp.run(labels=labels, n_filter=10)
aa.display_df(df_feat, n_rows=5, show_shape=True)
DataFrame shape: (10, 13)

|   | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions |
| 1 | TMD_JMD-Segment...,15)-FAUJ880104 | Shape | Side chain length | Steric parameter | STERIMOL length...e et al., 1988) | 0.382000 | 0.264000 | 0.264000 | 0.156000 | 0.156000 | 0.000000 | 0.000000 | 33,34 |
| 2 | TMD_JMD-Segment...8,9)-HUTJ700103 | Energy | Entropy | Entropy | Entropy of form...Hutchens, 1970) | 0.378000 | 0.212000 | 0.212000 | 0.124000 | 0.135000 | 0.000000 | 0.000000 | 32,33,34,35 |
| 3 | TMD_JMD-Pattern...,14)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.376000 | 0.281000 | -0.281000 | 0.159000 | 0.180000 | 0.000000 | 0.000000 | 27,31 |
| 4 | TMD_JMD-Segment...,13)-FAUJ880104 | Shape | Side chain length | Steric parameter | STERIMOL length...e et al., 1988) | 0.364000 | 0.275000 | 0.275000 | 0.172000 | 0.177000 | 0.000000 | 0.000000 | 31,32,33 |
| 5 | TMD_JMD-Pattern...,14)-QIAN880107 | Conformation | α-helix | α-helix (middle) | Weights for alp...ejnowski, 1988) | 0.359000 | 0.199000 | 0.199000 | 0.112000 | 0.145000 | 0.000000 | 0.000000 | 27,31 |
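The columns mean_dif, abs_mean_dif and abs_auc summarize how well a single feature separates the test group from the reference group. The sketch below shows one plausible way to obtain such values in plain Python. It assumes that mean_dif is the test mean minus the reference mean and that abs_auc is the absolute value of the AUC shifted by 0.5; both definitions are inferred from the column names and value ranges, not taken from the library code. The functions mean_dif and adjusted_auc and the input values are hypothetical.

```python
from statistics import mean


def mean_dif(test, ref):
    """Difference of group means (test minus reference)."""
    return mean(test) - mean(ref)


def adjusted_auc(test, ref):
    """AUC shifted to the range [-0.5, 0.5].

    The AUC is computed as the fraction of (test, ref) pairs in which the
    test value is larger, counting ties as half a win.
    """
    wins = 0.0
    for t in test:
        for r in ref:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(test) * len(ref)) - 0.5


# Made-up feature values for three test and three reference sequences
test_values = [0.8, 0.9, 0.7]
ref_values = [0.1, 0.2, 0.75]

dif = mean_dif(test_values, ref_values)
auc = adjusted_auc(test_values, ref_values)
abs_mean_dif, abs_auc = abs(dif), abs(auc)
```

A positive mean_dif means the feature value is higher in the test group, while the absolute variants allow ranking features regardless of direction, as in row 3 of the table where mean_dif is negative but abs_mean_dif is not.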