aaanalysis.SequenceFeature.get_df_feat

SequenceFeature.get_df_feat(features=None, df_parts=None, labels=None, label_test=1, label_ref=0, df_scales=None, df_cat=None, start=1, tmd_len=20, jmd_c_len=10, jmd_n_len=10, accept_gaps=False, parametric=False, n_jobs=1)

Create feature DataFrame for given features.

Depending on the provided labels, the DataFrame is created for one of the three following cases:

  1. Group vs group comparison

  2. Sample vs group comparison

  3. Sample vs sample comparison

  • For the group vs group comparison, the general feature position will be provided.

  • For sample vs group or sample vs sample comparison, the amino acid segments and patterns for the respective sample from the test dataset (label = 1) will be given.
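
The three cases can be pictured with a small helper that inspects how many samples carry each label. This is only an illustrative sketch: the function name comparison_case is hypothetical, and the rule it applies (one test sample means a sample-level comparison) is an assumption, not taken from the library.

```python
from collections import Counter


def comparison_case(labels, label_test=1, label_ref=0):
    """Guess the comparison case from label counts (illustrative assumption only)."""
    counts = Counter(labels)
    # Only the two declared label values are allowed
    unexpected = set(counts) - {label_test, label_ref}
    if unexpected:
        raise ValueError(f"unexpected label values: {sorted(unexpected, key=repr)}")
    n_test, n_ref = counts[label_test], counts[label_ref]
    if n_test == 0 or n_ref == 0:
        raise ValueError("both the test and the reference label must occur")
    # Assumption: a single test sample signals a sample-level comparison
    if n_test == 1 and n_ref == 1:
        return "sample vs sample"
    if n_test == 1:
        return "sample vs group"
    return "group vs group"


print(comparison_case([1, 1, 0, 0]))
print(comparison_case([1, 0, 0, 0]))
print(comparison_case([1, 0]))
```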

Parameters:
  • features (array-like, shape (n_features,)) – Ids of features for which df_feat should be created.

  • df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts. Must cover all parts in features.

  • labels (array-like, shape (n_samples,)) – Class labels for samples in df_parts. Should contain only two different integer label values, representing test and reference group (typically, 1 and 0).

  • label_test (int, default=1) – Class label of test group in labels.

  • label_ref (int, default=0) – Class label of reference group in labels.

  • df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].

  • df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].

  • start (int, default=1) – Position label of first residue position (starting at N-terminus).

  • tmd_len (int, default=20) – Length of TMD (>0).

  • jmd_n_len (int, default=10) – Length of JMD-N (>=0).

  • jmd_c_len (int, default=10) – Length of JMD-C (>=0).

  • parametric (bool, default=False) – Whether to compute p-values with a parametric test (t-test, if True) or a non-parametric test (Mann-Whitney U test, if False).

  • accept_gaps (bool, default=False) – Whether to accept missing values (gaps). If True, they are omitted from computations.

  • n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

df_feat – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

Return type:

pd.DataFrame, shape (n_features, n_feature_info)

Notes

  • Use parallel processing only for a high number of features (more than roughly 1000 features per core).

  • For sample vs group or sample vs sample comparison, df_parts must comprise jmd_n, tmd, and jmd_c sequence parts as well as all parts in features.
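
The rule of thumb on parallel processing can be turned into a small heuristic for choosing n_jobs. The helper below is a hypothetical sketch, not part of aaanalysis: the name suggest_n_jobs and the parameters features_per_core and max_cores are assumptions introduced for illustration.

```python
import os


def suggest_n_jobs(n_features, features_per_core=1000, max_cores=None):
    """Suggest a core count so that each core handles about features_per_core features."""
    if max_cores is None:
        # os.cpu_count() may return None on some platforms
        max_cores = os.cpu_count() or 1
    wanted = n_features // features_per_core
    # Never go below one core or above the available cores
    return max(1, min(wanted, max_cores))


print(suggest_n_jobs(500, max_cores=8))
print(suggest_n_jobs(5000, max_cores=8))
```

The returned value could then be passed as n_jobs to get_df_feat().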

See also

  • The CPP.run() method for creating and filtering CPP features for discriminating between two groups of sequences.

Examples

To demonstrate the SequenceFeature().get_df_feat() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC")
features = df_feat["feature"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df_feat, n_rows=5)
  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std
1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.970400 | 1.438918
2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.000000 | 0.000000
3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 1.554800 | 2.109848
4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 3.111200 | 3.109955
5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 0.000000 | 0.000000

To retrieve the feature DataFrame, provide features, df_parts, and the labels of the corresponding samples in the sequence DataFrame:

# Mean difference values are higher because here negative samples (instead of unlabeled ones in Breimann25a) are used as a reference dataset
df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels)
aa.display_df(df_feat, n_rows=5)
  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.335000 | 0.168254 | 0.168254 | 0.106692 | 0.124924 | 0.000000 | 0.000000 | 31,32,33,34,35
2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.333000 | 0.150698 | 0.150698 | 0.098774 | 0.119888 | 0.000000 | 0.000000 | 31,32,33,34,35
3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.330000 | 0.246867 | 0.246867 | 0.161683 | 0.197489 | 0.000000 | 0.000000 | 32,33
4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.327000 | 0.162229 | 0.162229 | 0.106865 | 0.135247 | 0.000000 | 0.000000 | 31,32,33,34,35
5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.322000 | 0.184252 | 0.184252 | 0.114758 | 0.164757 | 0.000000 | 0.000000 | 32,33

You can set which label values denote the test and reference groups using label_test and label_ref. Swapping them flips the sign of mean_dif:

df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, label_test=0, label_ref=1)
# Mean difference values display opposite signs because they represent the computed difference between the mean of the test group and the mean of the reference group
aa.display_df(df_feat, n_rows=5, col_to_show="mean_dif")
  mean_dif
1 -0.168254
2 -0.150698
3 -0.246867
4 -0.162229
5 -0.184252
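
The sign flip follows directly from the definition stated above (mean of the test group minus mean of the reference group). A minimal standalone sketch with made-up values illustrates it; the helper mean_dif and the toy data are hypothetical and not part of aaanalysis.

```python
from statistics import mean


def mean_dif(values, labels, label_test=1, label_ref=0):
    """Mean of the test group minus mean of the reference group."""
    test = [v for v, lab in zip(values, labels) if lab == label_test]
    ref = [v for v, lab in zip(values, labels) if lab == label_ref]
    return mean(test) - mean(ref)


# Toy feature values for four samples (illustrative only)
values = [0.8, 0.6, 0.3, 0.1]
labels = [1, 1, 0, 0]

forward = mean_dif(values, labels)
swapped = mean_dif(values, labels, label_test=0, label_ref=1)
print(forward, swapped)  # same magnitude, opposite sign
```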

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

# Shift positions by 10 residues
df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels,
                         start=11)
aa.display_df(df_feat, n_rows=5, col_to_show="positions")
  positions
1 41,42,43,44,45
2 41,42,43,44,45
3 42,43
4 41,42,43,44,45
5 42,43
# Increase TMD length from 20 to 50
df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels,
                         tmd_len=50)
aa.display_df(df_feat, n_rows=5, col_to_show="positions")
  positions
1 53,54,55,56,57,58,59,60,61
2 53,54,55,56,57,58,59,60,61
3 55,56,57,58
4 53,54,55,56,57,58,59,60,61
5 55,56,57,58
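
The effect of start, tmd_len, jmd_n_len, and jmd_c_len on position labels can be sketched for the three base parts. This assumes the parts are laid out consecutively as JMD-N, TMD, JMD-C from the N-terminus, which agrees with the outputs above (for example, the shift from 31 to 41 with start=11). The helper part_positions is hypothetical, and the sketch does not reproduce how segments or patterns inside a part are assigned.

```python
def part_positions(start=1, jmd_n_len=10, tmd_len=20, jmd_c_len=10):
    """Position labels per base part, assuming a JMD-N, TMD, JMD-C layout."""
    jmd_n_stop = start + jmd_n_len
    tmd_stop = jmd_n_stop + tmd_len
    jmd_c_stop = tmd_stop + jmd_c_len
    return {
        "jmd_n": list(range(start, jmd_n_stop)),
        "tmd": list(range(jmd_n_stop, tmd_stop)),
        "jmd_c": list(range(tmd_stop, jmd_c_stop)),
    }


default = part_positions()
shifted = part_positions(start=11)
print(default["jmd_c"][:5])  # first JMD-C positions with default settings
print(shifted["jmd_c"][:5])  # same positions shifted by 10
```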

A t-test can be used instead of the Mann-Whitney U test by setting parametric=True:

df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, parametric=True)
aa.display_df(df_feat, n_rows=5, col_to_show="p_val_ttest_indep")
  p_val_ttest_indep
1 0.000000
2 0.000000
3 0.000000
4 0.000000
5 0.000000
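
For intuition about the non-parametric option, the following is a rough standard-library sketch of a Mann-Whitney U test for a single feature. It uses the normal approximation without tie or continuity correction, so its p-values are approximate and need not match the values reported in df_feat. The function name and toy data are hypothetical.

```python
import math


def mann_whitney_u(test, ref):
    """Return (U, approximate two-sided p-value) for two samples.

    U counts the (test, ref) pairs with test > ref; ties count as 0.5.
    """
    u = sum(1.0 if t > r else 0.5 if t == r else 0.0 for t in test for r in ref)
    n1, n2 = len(test), len(ref)
    mu = n1 * n2 / 2                                  # expected U under the null
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # std of U, no tie correction
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided normal tail
    return u, p


u, p = mann_whitney_u([4, 5, 6], [1, 2, 3])
print(u, round(p, 3))
```

In practice, a tested implementation such as scipy.stats.mannwhitneyu (or scipy.stats.ttest_ind for the parametric case) is preferable to this sketch.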