aaanalysis.SequenceFeature

class aaanalysis.SequenceFeature(verbose=True)

Bases: object

Utility feature engineering class that uses sequences to create CPP feature components (Parts, Splits, and Scales) and the associated data structures [Breimann25a].

The three feature components are the primary input for the aaanalysis.CPP class and define Comparative Physicochemical Profiling (CPP) features.

Added in version 0.1.0.

Notes

Feature Components:
  • Part: A continuous subset of a sequence, such as a protein domain.

  • Split: A continuous or discontinuous subset of a Part, either a segment or a pattern.

  • Scale: A physicochemical scale, i.e., a set of numerical values (typically in the range [0, 1]) assigned to amino acids.

Main Parts:

We define three main parts from which all other parts can be derived:

  • TMD (target middle domain): Protein domain of interest with varying length.

  • JMD-N (juxta middle domain N-terminal): Protein domain or sequence region directly N-terminally next to the TMD, typically set to a fixed length (10 by default).

  • JMD-C (juxta middle domain C-terminal): Protein domain or sequence region directly C-terminally next to the TMD, typically set to a fixed length (10 by default).

Feature: Part + Split + Scale

Physicochemical property (expressed as numerical scale) present at distinct amino acid positions within a protein sequence. The positions are obtained by splitting sequence parts into segments or patterns.
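As a minimal sketch of how a Split selects positions within a Part, the plain-Python helpers below cut a part into contiguous segments or pick residues at chosen positions. The names get_segment and get_pattern are hypothetical and not part of aaanalysis; the border rounding and the 1-based positions counted from the N-terminus are assumptions made for illustration, so the library's own splitting rules may differ.

```python
def get_segment(part, i_th, n_split):
    """Return the i-th (1-based) of n_split contiguous segments of a part.

    Assumption: segment borders are placed by integer division of equal fractions.
    """
    if not 1 <= i_th <= n_split <= len(part):
        raise ValueError("need 1 <= i_th <= n_split <= len(part)")
    start = (i_th - 1) * len(part) // n_split
    stop = i_th * len(part) // n_split
    return part[start:stop]


def get_pattern(part, positions):
    """Return the residues at the given 1-based positions (from the N-terminus)."""
    if any(p < 1 or p > len(part) for p in positions):
        raise ValueError("positions must lie within the part")
    return "".join(part[p - 1] for p in positions)


tmd = "LLVAIGAVLLGG"                               # made-up 12-residue part
segment = get_segment(tmd, i_th=2, n_split=3)      # continuous subset
pattern = get_pattern(tmd, positions=[1, 4, 8])    # discontinuous subset
print(segment, pattern)  # IGAV LAV
```

A segment is always contiguous, whereas a pattern may skip residues; both return the amino acids over which a scale is later evaluated.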

Feature value: Realization of a Feature

For a given sequence, a feature value is the average of a physicochemical scale over all amino acids obtained by splitting a sequence part.
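The averaging step can be illustrated with a short sketch. The scale values below are invented for the example and do not come from any published scale, and feature_value is a hypothetical helper, not the aaanalysis implementation.

```python
from statistics import mean

# Hypothetical scale: made-up values in [0, 1] for a handful of amino acids.
toy_scale = {"A": 0.5, "G": 0.25, "I": 1.0, "L": 0.75, "V": 0.5}


def feature_value(residues, scale):
    """Average the scale values of the residues selected by a split."""
    if not residues:
        raise ValueError("the split selected no residues")
    return mean(scale[aa] for aa in residues)


# Residues of one made-up segment of a sequence part.
value = feature_value("IGAV", toy_scale)
print(value)  # 0.5625
```

Here the four residues map to 1.0, 0.25, 0.5, and 0.5, so the feature value is 2.25 / 4 = 0.5625.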

Valid sequence parts:
  • tmd: Target Middle Domain (TMD).

  • tmd_e: TMD extended N- and C-terminally by a number of residues, defined by the ext_len configuration option.

  • tmd_n: N-terminal half of the TMD.

  • tmd_c: C-terminal half of the TMD.

  • jmd_n: N-terminal Juxta Middle Domain (JMD).

  • jmd_c: C-terminal JMD.

  • ext_c: Extended C-terminal region.

  • ext_n: Extended N-terminal region.

  • tmd_jmd: Combination of JMD-N, TMD, and JMD-C.

  • jmd_n_tmd_n: Combination of JMD-N and N-terminal half of TMD.

  • tmd_c_jmd_c: Combination of C-terminal half of TMD and JMD-C.

  • ext_n_tmd_n: Extended N-terminal region and N-terminal half of TMD.

  • tmd_c_ext_c: C-terminal half of TMD and extended C-terminal region.

Default parts:

The following three parts are provided by default: tmd, jmd_n_tmd_n, tmd_c_jmd_c.
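To make the relationship between the parts concrete, the sketch below derives several of them from a single sequence. The function derive_parts is hypothetical and not the library's method; it assumes 1-based, inclusive TMD borders and, for an odd-length TMD, assigns the middle residue to tmd_n. Both choices are assumptions made for illustration.

```python
def derive_parts(seq, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Derive a few named parts from a sequence.

    Assumptions (illustrative only): tmd_start and tmd_stop are 1-based and
    inclusive, and an odd-length TMD gives its middle residue to tmd_n.
    """
    if not 1 <= tmd_start <= tmd_stop <= len(seq):
        raise ValueError("TMD borders must lie within the sequence")
    tmd = seq[tmd_start - 1:tmd_stop]
    jmd_n = seq[max(0, tmd_start - 1 - jmd_n_len):tmd_start - 1]
    jmd_c = seq[tmd_stop:tmd_stop + jmd_c_len]
    half = (len(tmd) + 1) // 2
    tmd_n, tmd_c = tmd[:half], tmd[half:]
    return {
        "tmd": tmd,
        "jmd_n": jmd_n,
        "jmd_c": jmd_c,
        "tmd_n": tmd_n,
        "tmd_c": tmd_c,
        "tmd_jmd": jmd_n + tmd + jmd_c,
        "jmd_n_tmd_n": jmd_n + tmd_n,
        "tmd_c_jmd_c": tmd_c + jmd_c,
    }


seq = "MKTAYIAKQR" + "LLVAIGAVLLGG" + "KRRSEEDQNP"  # made-up 32-residue sequence
parts = derive_parts(seq, tmd_start=11, tmd_stop=22)
for name in ("tmd", "jmd_n_tmd_n", "tmd_c_jmd_c"):   # the three default parts
    print(name, parts[name])
```

With these inputs the three default parts are LLVAIGAVLLGG, MKTAYIAKQRLLVAIG, and AVLLGGKRRSEEDQNP.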

Parameters:

verbose (bool)

Methods

feature_matrix([features, df_parts, ...])

Create feature matrix for given feature ids and sequence parts.

get_df_feat([features, df_parts, labels, ...])

Create feature DataFrame for given features.

get_df_parts([df_seq, list_parts, ...])

Create DataFrame with selected sequence parts.

get_df_pos([df_feat, col_val, col_cat, ...])

Create DataFrame of aggregated (mean or sum) feature values per residue position and scale.

get_feature_names([features, df_cat, start, ...])

Convert feature ids (PART-SPLIT-SCALE) into feature names (scale name [positions]).

get_feature_positions([features, start, ...])

Create a list of corresponding positions or amino acids for the given features.

get_features([list_parts, all_parts, ...])

Create list of all feature ids for given Parts, Splits, and Scales.

get_split_kws([split_types, n_split_min, ...])

Create dictionary with kwargs for three split types.

__init__(verbose=True)
Parameters:

verbose (bool, default=True) – If True, verbose outputs are enabled.