aaanalysis.CPP

class aaanalysis.CPP(df_parts=None, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)

Bases: Tool

Comparative Physicochemical Profiling (CPP) class to create and filter features that are most discriminant between two sets of sequences [Breimann25a].

CPP aims to identify a set of non-redundant features that are most discriminant between the test group and the reference group of sequences.

Added in version 0.1.0.

df_parts

DataFrame with sequence Parts.

split_kws

Nested dictionary defining Splits with parameter dictionary for each chosen split_type.

df_scales

DataFrame with amino acid Scales.

df_cat

DataFrame with categories for physicochemical amino acid Scales.

Methods

eval([list_df_feat, labels, label_test, ...])

Evaluate the quality of different sets of identified CPP features.

run([labels, label_test, label_ref, ...])

Perform Comparative Physicochemical Profiling (CPP) algorithm: creation and two-step filtering of interpretable sequence-based features.

run_num([dict_num_parts, labels, ...])

Numerical-mode CPP: same algorithm as run(), but per-residue values come from a pre-sliced numerical tensor (dict_num_parts) instead of an AA→scale lookup.

__init__(df_parts=None, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)
Parameters:
  • df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts.

  • split_kws (dict, optional) – Dictionary with parameter dictionary for each chosen split_type. Default from SequenceFeature.get_split_kws().

  • df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].

  • df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].

  • accept_gaps (bool, default=False) – If True, missing values (gaps) are accepted and omitted from computations.

  • verbose (bool, default=True) – If True, verbose outputs are enabled.

  • random_state (int, optional) – The seed used by the random number generator. If a positive integer, stochastic processes give the same results on every run, enabling reproducibility. If None, no seed is fixed and results of stochastic processes may differ between runs.

Notes

  • All scales from df_scales must be contained in df_cat.
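The constraint above can be checked before the object is constructed. The following is a minimal sketch and not part of aaanalysis: find_uncategorized_scales is a hypothetical helper, and the scale identifiers are made up for illustration. With real data, the first argument would be list(df_scales) (the column names of df_scales) and the second would be the column of df_cat that holds the scale identifiers; the name of that column is not stated here and should be checked against the loaded DataFrame.

```python
def find_uncategorized_scales(scale_ids, categorized_ids):
    """Return the scale ids (in their original order) that are missing from categorized_ids."""
    known = set(categorized_ids)
    return [scale_id for scale_id in scale_ids if scale_id not in known]


# Hypothetical identifiers, used only to illustrate the check
scale_ids = ["SCALE_A", "SCALE_B", "SCALE_C"]
categorized_ids = ["SCALE_A", "SCALE_C", "SCALE_D"]

missing = find_uncategorized_scales(scale_ids, categorized_ids)
if missing:
    print(f"Scales without a category entry: {missing}")
```

An empty result means every scale has a category entry; extra entries in the category table (such as SCALE_D here) do not violate the constraint.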

See also

  • CPPPlot: the respective plotting class.

  • SequenceFeature for definition of sequence Parts.

  • SequenceFeature.get_split_kws() for definition of Splits keyword arguments.

  • load_scales() for definition of amino acid Scales and their categories.

Examples

To create a CPP object, you just need to provide a valid df_parts DataFrame:

import aaanalysis as aa
df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
# Create CPP object
cpp = aa.CPP(df_parts=df_parts)

You can adjust Parts, Splits, and Scales as follows:

df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd_jmd"])
split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=1)
df_scales = aa.load_scales()
scales = list(df_scales)[0:10]
# Create CPP object for Segments over the complete TMD-JMD with the first 10 scales
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales[scales])
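The split_kws argument used above is described as a nested dictionary with one parameter dictionary per chosen split type. The sketch below illustrates that shape with a plain dictionary and a small structural check. Both check_split_kws and example_split_kws are hypothetical and not part of aaanalysis; the example reuses only the "Segment" and n_split_max names from the call above, and the dictionary actually returned by get_split_kws() may contain additional keys.

```python
def check_split_kws(split_kws):
    """Return True if split_kws maps split-type names to parameter dictionaries.

    Raises ValueError otherwise. Only the nesting is checked, not the
    parameter names or values.
    """
    if not isinstance(split_kws, dict):
        raise ValueError("split_kws must be a dictionary")
    for split_type, params in split_kws.items():
        if not isinstance(split_type, str):
            raise ValueError(f"split type {split_type!r} must be a string")
        if not isinstance(params, dict):
            raise ValueError(f"parameters for {split_type!r} must be a dictionary")
    return True


# Illustrative value only: one split type with one parameter
example_split_kws = {"Segment": {"n_split_max": 1}}
check_split_kws(example_split_kws)
```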