aaanalysis.CPP
- class aaanalysis.CPP(df_parts=None, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)
Bases: Tool
Comparative Physicochemical Profiling (CPP) class to create and filter the features that are most discriminant between two sets of sequences [Breimann25a].
CPP aims to identify a set of non-redundant features that are most discriminant between the test and reference groups of sequences.
Added in version 0.1.0.
- df_parts
DataFrame with sequence Parts.
- split_kws
Nested dictionary defining Splits with parameter dictionary for each chosen split_type.
- df_scales
DataFrame with amino acid Scales.
- df_cat
DataFrame with categories for physicochemical amino acid Scales.
Methods
eval([list_df_feat, labels, label_test, ...]) – Evaluate the quality of different sets of identified CPP features.
run([labels, label_test, label_ref, ...]) – Perform the Comparative Physicochemical Profiling (CPP) algorithm: creation and two-step filtering of interpretable sequence-based features.
run_num([dict_num_parts, labels, ...]) – Numerical-mode CPP: same algorithm as run(), but per-residue values come from a pre-sliced numerical tensor (dict_num_parts) instead of an AA→scale lookup.
- __init__(df_parts=None, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)
- Parameters:
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts.
split_kws (dict, optional) – Nested dictionary with a parameter dictionary for each chosen split_type. Default from SequenceFeature.get_split_kws().
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].
accept_gaps (bool, default=False) – If True, missing values are accepted and omitted from computations.
verbose (bool, default=True) – If True, verbose outputs are enabled.
random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, stochastic processes are not seeded, so results may differ between runs.
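The documented behavior of random_state can be illustrated with a minimal sketch that uses only Python's standard library. It does not show how CPP seeds its internals; draw_sample is a hypothetical helper and the scale names are made up. It only mimics the contract stated above: a fixed integer seed gives repeatable results, while None leaves the generator unseeded.

```python
import random


def draw_sample(items, k, random_state=None):
    """Draw k items without replacement (hypothetical helper, not part of aaanalysis).

    random_state=None seeds the generator from system entropy, so results
    may differ between calls; an integer seed makes them repeatable.
    """
    rng = random.Random(random_state)
    return rng.sample(items, k)


scale_ids = [f"scale_{i}" for i in range(20)]  # made-up scale names

first = draw_sample(scale_ids, k=5, random_state=42)
second = draw_sample(scale_ids, k=5, random_state=42)
print(first == second)  # same seed, same selection
```

Using a local generator instead of the module-level random functions keeps the seed from affecting unrelated code.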
Notes
All scales from df_scales must be contained in df_cat.
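The constraint in this note can be checked before a CPP object is constructed. The following is a minimal sketch that works on plain lists of scale names; find_missing_scales and the example names are hypothetical and not part of aaanalysis.

```python
def find_missing_scales(scale_names, cat_scale_names):
    """Return scale names that lack a category entry, in input order.

    Hypothetical helper: both arguments are plain iterables of scale names.
    """
    known = set(cat_scale_names)
    return [name for name in scale_names if name not in known]


# Made-up names for illustration only
scale_names = ["scale_a", "scale_b", "scale_c"]
cat_scale_names = ["scale_a", "scale_c", "scale_d"]

missing = find_missing_scales(scale_names, cat_scale_names)
if missing:
    print(f"Scales without a category entry: {missing}")
```

With DataFrames, list(df_scales) yields the scale names, as in the Examples on this page. Which column or index of df_cat holds the scale names is not stated here and should be checked against the loaded table.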
See also
CPPPlot: the respective plotting class.
SequenceFeature for definition of sequence Parts.
SequenceFeature.get_split_kws() for definition of Splits keyword arguments.
load_scales() for definition of amino acid Scales and their categories.
Examples
To create a CPP object, you just need to provide a valid df_parts DataFrame:

    import aaanalysis as aa

    df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)

    # Create CPP object
    cpp = aa.CPP(df_parts=df_parts)
You can adjust Parts, Splits, and Scales as follows:
    df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd_jmd"])
    split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=1)
    df_scales = aa.load_scales()
    scales = list(df_scales)[0:10]

    # Create CPP object for Segments over the complete TMD-JMD with the first 10 scales
    cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales[scales])
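The split_kws argument is described on this page as a nested dictionary with one parameter dictionary per split type. A minimal sketch of that shape, and of restricting it to selected split types, could look as follows. Only the "Segment" key and n_split_max appear on this page; the "Pattern" entry, the other parameter names, and all values are illustrative assumptions. select_split_kws is a hypothetical helper and not the library's get_split_kws().

```python
def select_split_kws(all_split_kws, split_types):
    """Keep only the chosen split types (hypothetical helper).

    Raises ValueError for unknown split types so typos fail early.
    """
    unknown = [t for t in split_types if t not in all_split_kws]
    if unknown:
        raise ValueError(f"Unknown split types: {unknown}")
    # Copy the inner dicts so callers can modify the result safely
    return {t: dict(all_split_kws[t]) for t in split_types}


# Illustrative nested dictionary: outer key = split type, inner dict = its parameters.
# "Pattern" and every parameter except n_split_max are assumptions.
all_split_kws = {
    "Segment": {"n_split_min": 1, "n_split_max": 15},
    "Pattern": {"steps": [3, 4], "n_min": 2, "n_max": 4},
}

split_kws = select_split_kws(all_split_kws, ["Segment"])
split_kws["Segment"]["n_split_max"] = 1  # mirrors n_split_max=1 in the example above
print(split_kws)
```

Copying the inner dictionaries means that adjusting one parameter for a single run leaves the shared defaults unchanged.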