aaanalysis.SequenceFeature
- class aaanalysis.SequenceFeature(verbose=True)
Bases: object
Utility feature engineering class using sequences to create CPP feature components (Parts, Splits, and Scales) and data structures [Breimann25a].
The three feature components are the primary input for the aaanalysis.CPP class and define Comparative Physicochemical Profiling (CPP) features.
Added in version 0.1.0.
Notes
- Feature Components:
Part: A continuous subset of a sequence, such as a protein domain.
Split: Continuous or discontinuous subset of a Part, either segment or pattern.
Scale: A physicochemical scale, i.e., a set of numerical values (typically [0-1]) assigned to amino acids.
- Main Parts:
We define three main parts from which every other part can be derived:
TMD (target middle domain): Protein domain of interest with varying length.
JMD-N (juxta middle domain N-terminal): Protein domain or sequence region directly N-terminally next to the TMD, typically set to a fixed length (10 by default).
JMD-C (juxta middle domain C-terminal): Protein domain or sequence region directly C-terminally next to the TMD, typically set to a fixed length (10 by default).
- Feature: Part + Split + Scale
Physicochemical property (expressed as numerical scale) present at distinct amino acid positions within a protein sequence. The positions are obtained by splitting sequence parts into segments or patterns.
- Feature value: Realization of a Feature
For a given sequence, a feature value is the average of a physicochemical scale over all amino acids obtained by splitting a sequence part.
- Valid sequence parts:
tmd: Target Middle Domain (TMD).
tmd_e: TMD extended N- and C-terminally by a number of residues, defined by the ext_len configuration option.
tmd_n: N-terminal half of the TMD.
tmd_c: C-terminal half of the TMD.
jmd_n: N-terminal Juxta Middle Domain (JMD).
jmd_c: C-terminal JMD.
ext_c: Extended C-terminal region.
ext_n: Extended N-terminal region.
tmd_jmd: Combination of JMD-N, TMD, and JMD-C.
jmd_n_tmd_n: Combination of JMD-N and N-terminal half of TMD.
tmd_c_jmd_c: Combination of C-terminal half of TMD and JMD-C.
ext_n_tmd_n: Extended N-terminal region and N-terminal half of TMD.
tmd_c_ext_c: C-terminal half of TMD and extended C-terminal region.
- Default parts:
The following three parts are provided by default:
tmd, jmd_n_tmd_n, tmd_c_jmd_c.
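The notes above can be illustrated with a minimal, library-independent sketch. It slices a toy sequence into the three main parts, derives a few of the combined parts, and averages a scale over one segment to obtain a feature value. Everything in it is an assumption made for illustration: the helper names (make_parts, segment, feature_value), the 0-based end-exclusive TMD indices, the handling of odd-length TMDs, the equal-length segment rule, and the three-letter TOY_SCALE are hypothetical and do not reproduce how aaanalysis implements these steps. The ext_* and tmd_e parts are left out.

```python
from statistics import mean

# Hypothetical toy scale with values in [0, 1]; not a real physicochemical scale.
TOY_SCALE = {"A": 0.5, "G": 0.0, "L": 1.0}


def make_parts(seq, tmd_start, tmd_stop, jmd_len=10):
    """Slice a sequence into main parts and some derived parts.

    tmd_start and tmd_stop are 0-based, end-exclusive indices
    (an assumption of this sketch, not the library's convention).
    """
    jmd_n = seq[max(0, tmd_start - jmd_len):tmd_start]
    tmd = seq[tmd_start:tmd_stop]
    jmd_c = seq[tmd_stop:tmd_stop + jmd_len]
    half = len(tmd) // 2  # assumption: the extra residue of an odd TMD goes to tmd_c
    tmd_n, tmd_c = tmd[:half], tmd[half:]
    return {
        "tmd": tmd,
        "jmd_n": jmd_n,
        "jmd_c": jmd_c,
        "tmd_n": tmd_n,
        "tmd_c": tmd_c,
        "tmd_jmd": jmd_n + tmd + jmd_c,
        "jmd_n_tmd_n": jmd_n + tmd_n,
        "tmd_c_jmd_c": tmd_c + jmd_c,
    }


def segment(part, i_th, n_split):
    """Return the i-th (1-based) of n_split contiguous segments of a part."""
    start = (i_th - 1) * len(part) // n_split
    stop = i_th * len(part) // n_split
    return part[start:stop]


def feature_value(residues, scale):
    """Average the scale values over the selected residues."""
    return mean(scale[aa] for aa in residues)


# Toy example: JMD length 3, TMD spanning indices 3 to 10.
parts = make_parts("GGGLLLLAAAAGGG", tmd_start=3, tmd_stop=11, jmd_len=3)
first_half = segment(parts["tmd"], i_th=1, n_split=2)
value = feature_value(first_half, TOY_SCALE)
```

Here the feature corresponds to the combination "tmd part + first of two segments + toy scale", and value is its realization for this one sequence. In practice the class methods listed below (for example get_df_parts and feature_matrix) are the intended way to build parts and feature values.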
- Parameters:
verbose (bool)
Methods
feature_matrix([features, df_parts, ...]): Create feature matrix for given feature ids and sequence parts.
get_df_feat([features, df_parts, labels, ...]): Create feature DataFrame for given features.
get_df_parts([df_seq, list_parts, ...]): Create DataFrame with selected sequence parts.
get_df_pos([df_feat, col_val, col_cat, ...]): Create DataFrame of aggregated (mean or sum) feature values per residue position and scale.
get_feature_names([features, df_cat, start, ...]): Convert feature ids (PART-SPLIT-SCALE) into feature names (scale name [positions]).
get_feature_positions([features, start, ...]): Create a list of corresponding positions or amino acids for the given features.
get_features([list_parts, all_parts, ...]): Create list of all feature ids for given Parts, Splits, and Scales.
get_split_kws([split_types, n_split_min, ...]): Create dictionary with kwargs for three split types: