CPP
- class CPP(df_parts, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)
Bases: Tool
Comparative Physicochemical Profiling (CPP) class to create and filter features that are most discriminant between two sets of sequences [Breimann25].
CPP aims at identifying a set of non-redundant features that are most discriminant between the test and reference group of sequences.
Added in version 0.1.0.
- df_parts
DataFrame with sequence Parts.
- split_kws
Nested dictionary defining Splits with parameter dictionary for each chosen split_type.
- df_scales
DataFrame with amino acid Scales.
- df_cat
DataFrame with categories for physicochemical amino acid Scales.
Methods
- eval(list_df_feat, labels[, label_test, ...])
Evaluate the quality of different sets of identified Comparative Physicochemical Profiling (CPP) features.
- run(labels[, label_test, label_ref, ...])
Perform Comparative Physicochemical Profiling (CPP) algorithm: creation and two-step filtering of interpretable sequence-based features.
- run_num(dict_num_parts, labels[, ...])
Numerical-mode Comparative Physicochemical Profiling (CPP): same algorithm as run(), but per-residue values come from a pre-sliced numerical tensor (dict_num_parts) instead of an AA→scale lookup.
- simplify(df_feat, labels[, strategy, ...])
Simplify a feature set by swapping scales for more interpretable correlated ones.
- __init__(df_parts, split_kws=None, df_scales=None, df_cat=None, accept_gaps=False, verbose=True, random_state=None)
- Parameters:
  - df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts.
  - split_kws (dict, optional) – Nested dictionary with a parameter dictionary for each chosen split_type. Default from SequenceFeature.get_split_kws().
  - df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
  - df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].
  - accept_gaps (bool, default=False) – If True, missing values (gaps) are accepted and omitted from computations.
  - verbose (bool, default=True) – If True, verbose outputs are enabled.
  - random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, stochastic processes are not seeded and results can differ between runs.
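The requirement that df_cat cover every scale in df_scales can be checked before constructing the object. The following is a minimal sketch in plain Python, assuming the scale identifiers have already been extracted into two lists (for example from the columns of df_scales and from the scale-ID column of df_cat). The helper name `missing_scales` and the toy identifiers are hypothetical and not part of the library.

```python
def missing_scales(scale_ids, category_scale_ids):
    """Return the scale IDs that have no entry in the category table (order preserved)."""
    known = set(category_scale_ids)
    return [s for s in scale_ids if s not in known]


# Toy identifiers standing in for the scale IDs of df_scales and df_cat (assumption)
scale_ids = ["FAUJ880104", "HUTJ700103", "CRAJ730103"]
category_scale_ids = ["FAUJ880104", "HUTJ700103", "CRAJ730103", "QIAN880107"]

missing = missing_scales(scale_ids, category_scale_ids)
if missing:
    # Fail early with the offending scale IDs instead of a late lookup error
    raise ValueError(f"Scales missing from the category table: {missing}")
```

Extra entries in the category table are harmless in this sketch; only scales without a category are reported.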
Notes
- All scales from df_scales must be contained in df_cat.
- CPP is intrinsically binary (one test group vs one reference group). For multi-class or regression tasks, do not change CPP: transform the target into binary contrasts with the SequenceFeature.get_labels_* helpers and loop run() (or run_num()) over them. Use SequenceFeature.get_labels_ovr() / SequenceFeature.get_labels_ovo() for multi-class and SequenceFeature.get_labels_quantile() / SequenceFeature.get_labels_tiered() for regression. The row-dropping helpers (ovo/tiered) return the row-matched df_parts / dict_num_parts per contrast, ready to pass straight into a new CPP. See the P8: Prediction protocol for the end-to-end workflow.
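To illustrate what a binary contrast means here, the sketch below builds one-vs-rest and one-vs-one labels in plain Python. It does not call the SequenceFeature.get_labels_* helpers, whose exact signatures are not shown in this section; the functions `labels_ovr` and `labels_ovo` are hypothetical stand-ins, and the encoding (1 for the test group, 0 for the reference group) is an assumption.

```python
def labels_ovr(y, target):
    """One-vs-rest: 1 for samples of `target`, 0 for all others (no rows dropped)."""
    return [1 if label == target else 0 for label in y]


def labels_ovo(y, test_class, ref_class):
    """One-vs-one: keep only rows of the two classes.

    Returns the kept row indices and the binary labels for those rows.
    """
    keep = [i for i, label in enumerate(y) if label in (test_class, ref_class)]
    labels = [1 if y[i] == test_class else 0 for i in keep]
    return keep, labels


# Toy multi-class target (assumption, not library data)
y = ["A", "B", "C", "A", "C"]

# One binary label vector per class; each could drive a separate run() call
contrasts_ovr = {c: labels_ovr(y, c) for c in sorted(set(y))}

# One-vs-one drops rows of other classes, so the sequence parts must be
# subset with the same indices before building a new CPP object
keep, labels_ab = labels_ovo(y, "A", "B")
```

In the one-vs-one case, the `keep` indices play the role of the row matching described above: the parts table passed to a new CPP object has to contain exactly those rows, in the same order as the labels.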
See also
- CPPPlot: the respective plotting class.
- SequenceFeature for definition of sequence Parts.
- SequenceFeature.get_split_kws() for definition of Splits keyword arguments.
- load_scales() for definition of amino acid Scales and their categories.
- SequenceFeature.get_labels_* (e.g. SequenceFeature.get_labels_ovr()): build multi-class / regression label contrasts to drive CPP.
Examples
To create a CPP object, you just need to provide a valid df_parts DataFrame:

    import aaanalysis as aa
    df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    # Create CPP object
    cpp = aa.CPP(df_parts=df_parts)
You can adjust Parts, Splits, and Scales as follows:

    df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd_jmd"])
    split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=1)
    df_scales = aa.load_scales()
    scales = list(df_scales)[0:10]
    # Create CPP object for Segments over the complete TMD-JMD with 10 first scales
    cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales[scales])
The CPP constructor also accepts df_cat (the scale-category table used by the redundancy filter and plotting; loaded automatically when omitted), accept_gaps (tolerate gap symbols in the sequence parts), random_state (reproducibility), and verbose (progress messages):

    labels = df_seq["label"].to_list()
    df_scales = aa.load_scales(top_explain_n=20)
    df_cat = aa.load_scales(name="scales_cat", top_explain_n=20)
    cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales, df_cat=df_cat,
                 accept_gaps=False, random_state=0, verbose=False)
    df_feat = cpp.run(labels=labels, n_filter=10)
    aa.display_df(df_feat, n_rows=5, show_shape=True)
DataFrame shape: (10, 13)

|   | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_JMD-Segment...,15)-FAUJ880104 | Shape | Side chain length | Steric parameter | STERIMOL length...e et al., 1988) | 0.382000 | 0.264000 | 0.264000 | 0.156000 | 0.156000 | 0.000000 | 0.000000 | 33,34 |
| 2 | TMD_JMD-Segment...8,9)-HUTJ700103 | Energy | Entropy | Entropy | Entropy of form...Hutchens, 1970) | 0.378000 | 0.212000 | 0.212000 | 0.124000 | 0.135000 | 0.000000 | 0.000000 | 32,33,34,35 |
| 3 | TMD_JMD-Pattern...,14)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.376000 | 0.281000 | -0.281000 | 0.159000 | 0.180000 | 0.000000 | 0.000000 | 27,31 |
| 4 | TMD_JMD-Segment...,13)-FAUJ880104 | Shape | Side chain length | Steric parameter | STERIMOL length...e et al., 1988) | 0.364000 | 0.275000 | 0.275000 | 0.172000 | 0.177000 | 0.000000 | 0.000000 | 31,32,33 |
| 5 | TMD_JMD-Pattern...,14)-QIAN880107 | Conformation | α-helix | α-helix (middle) | Weights for alp...ejnowski, 1988) | 0.359000 | 0.199000 | 0.199000 | 0.112000 | 0.145000 | 0.000000 | 0.000000 | 27,31 |
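To make the abs_auc, mean_dif, and abs_mean_dif columns easier to read, the sketch below computes comparable statistics for one feature from toy values. It assumes that the AUC is reported relative to 0.5 (so it ranges from -0.5 to 0.5 before taking the absolute value) and that mean_dif is the test-group mean minus the reference-group mean; both are assumptions about the column semantics, and the functions are illustrative rather than the library's implementation.

```python
def adjusted_auc(test_vals, ref_vals):
    """AUC minus 0.5, from pairwise comparisons (ties count half)."""
    wins = 0.0
    for t in test_vals:
        for r in ref_vals:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(test_vals) * len(ref_vals)) - 0.5


def mean_difference(test_vals, ref_vals):
    """Mean of the test group minus mean of the reference group."""
    return sum(test_vals) / len(test_vals) - sum(ref_vals) / len(ref_vals)


# Toy feature values for the two groups (assumption, not library output)
test_vals = [0.8, 0.7, 0.9, 0.6]
ref_vals = [0.5, 0.4, 0.7, 0.3]

auc_adj = adjusted_auc(test_vals, ref_vals)
dif = mean_difference(test_vals, ref_vals)

# Absolute values drop the direction; the signed mean difference keeps it
abs_auc = abs(auc_adj)
abs_mean_dif = abs(dif)
```

Under these assumptions, a negative mean_dif next to a positive abs_mean_dif (as in row 3 of the table) indicates a feature whose values are lower in the test group than in the reference group.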