aaanalysis.SequenceFeature.get_df_feat
- SequenceFeature.get_df_feat(features=None, df_parts=None, labels=None, label_test=1, label_ref=0, df_scales=None, df_cat=None, start=1, tmd_len=20, jmd_c_len=10, jmd_n_len=10, accept_gaps=False, parametric=False, n_jobs=1)
Create feature DataFrame for given features.
Depending on the provided labels, the DataFrame is created for one of the following three cases:
Group vs group comparison
Sample vs group comparison
Sample vs sample comparison
For the group vs group comparison, the general feature position will be provided.
For sample vs group or sample vs sample comparison, the amino acid segments and patterns for the respective sample from the test dataset (label = 1) will be given.
- Parameters:
features (array-like, shape (n_features,)) – Ids of features for which df_feat should be created.
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts. Must cover all parts in features.
labels (array-like, shape (n_samples,)) – Class labels for samples in df_parts. Should contain only two different integer label values, representing test and reference group (typically, 1 and 0).
label_test (int, default=1) – Class label of test group in labels.
label_ref (int, default=0) – Class label of reference group in labels.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].
start (int, default=1) – Position label of first residue position (starting at N-terminus).
tmd_len (int, default=20) – Length of TMD (>0).
jmd_n_len (int, default=10) – Length of JMD-N (>=0).
jmd_c_len (int, default=10) – Length of JMD-C (>=0).
parametric (bool, default=False) – Whether to use a parametric (t-test) or non-parametric (Mann-Whitney U test) test for p-value computation.
accept_gaps (bool, default=False) – Whether to accept missing values by omitting them from computations (if True).
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
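The position parameters (start, jmd_n_len, tmd_len, jmd_c_len) can be pictured as labelling one contiguous JMD-N, TMD, JMD-C stretch. The following is a minimal sketch of that layout and not the library's implementation: the helper part_positions is hypothetical, and it assumes that the three parts are numbered consecutively from start at the N-terminus.

```python
def part_positions(start=1, jmd_n_len=10, tmd_len=20, jmd_c_len=10):
    """Return position labels for JMD-N, TMD, and JMD-C (hypothetical helper)."""
    if tmd_len <= 0:
        raise ValueError("tmd_len must be > 0")
    if jmd_n_len < 0 or jmd_c_len < 0:
        raise ValueError("jmd_n_len and jmd_c_len must be >= 0")
    # Assumption: the parts are contiguous and numbered from the N-terminus.
    tmd_start = start + jmd_n_len
    jmd_c_start = tmd_start + tmd_len
    return {
        "jmd_n": list(range(start, tmd_start)),
        "tmd": list(range(tmd_start, jmd_c_start)),
        "jmd_c": list(range(jmd_c_start, jmd_c_start + jmd_c_len)),
    }


layout = part_positions()
print(layout["tmd"][0], layout["tmd"][-1])  # 11 30

shifted = part_positions(start=11)
print(shifted["jmd_c"][0])  # 41
```

Under these assumptions, the defaults place the TMD at positions 11 to 30 and the JMD-C at positions 31 to 40, and raising start by 10 shifts every label by 10.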
- Returns:
df_feat – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
- Return type:
pd.DataFrame, shape (n_features, n_feature_info)
Notes
Use parallel processing only for a high number of features (>~1000 features per core).
For sample vs group or sample vs sample comparison, df_parts must comprise jmd_n, tmd, and jmd_c sequence parts as well as all parts in features.
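The rule of thumb on parallel processing can be turned into a small heuristic for picking n_jobs. This is only a sketch of one possible reading of the note: suggest_n_jobs and min_features_per_core are hypothetical names, and the library's own automatic optimization (n_jobs=None) may work differently.

```python
import os


def suggest_n_jobs(n_features, n_cores=None, min_features_per_core=1000):
    """Suggest a core count so that each core gets roughly >= 1000 features."""
    if n_cores is None:
        n_cores = os.cpu_count() or 1
    # Number of cores that can each receive a full batch of features
    n_full_batches = n_features // min_features_per_core
    # Never go below one core or above the cores available
    return max(1, min(n_cores, n_full_batches))


print(suggest_n_jobs(500, n_cores=8))    # 1
print(suggest_n_jobs(5000, n_cores=8))   # 5
print(suggest_n_jobs(50000, n_cores=8))  # 8
```

The returned value could then be passed as n_jobs; for small feature sets the heuristic stays at a single core, in line with the note above.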
See also
The CPP.run() method for creating and filtering CPP features for discriminating between two groups of sequences.
Examples
To demonstrate the SequenceFeature().get_df_feat() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False
    # Load example sequences and their class labels
    df_seq = aa.load_dataset(name="DOM_GSEC")
    labels = df_seq["label"].to_list()
    # Load the precomputed features for the same dataset
    df_feat = aa.load_features(name="DOM_GSEC")
    features = df_feat["feature"].to_list()
    # Split sequences into parts
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    aa.display_df(df_feat, n_rows=5)

       feature                            category      subcategory        scale_name                     scale_description                  abs_auc   abs_mean_dif  mean_dif  std_test  std_ref   p_val_mann_whitney  p_val_fdr_bh  positions       feat_importance  feat_importance_std
    1  TMD_C_JMD_C-Seg...3,4)-KLEP840101  Energy        Charge             Charge                         Net charge (Kle...n et al., 1984)  0.244000  0.103666      0.103666  0.106692  0.110506  0.000000            0.000000      31,32,33,34,35  0.970400         1.438918
    2  TMD_C_JMD_C-Seg...3,4)-FINA910104  Conformation  α-helix (C-cap)    α-helix termination            Helix terminati...n et al., 1991)  0.243000  0.085064      0.085064  0.098774  0.096946  0.000000            0.000000      31,32,33,34,35  0.000000         0.000000
    3  TMD_C_JMD_C-Seg...6,9)-LEVM760105  Shape         Side chain length  Side chain length              Radius of gyrat... (Levitt, 1976)  0.233000  0.137044      0.137044  0.161683  0.176964  0.000000            0.000001      32,33           1.554800         2.109848
    4  TMD_C_JMD_C-Seg...3,4)-HUTJ700102  Energy        Entropy            Entropy                        Absolute entrop...Hutchens, 1970)  0.229000  0.098224      0.098224  0.106865  0.124608  0.000000            0.000001      31,32,33,34,35  3.111200         3.109955
    5  TMD_C_JMD_C-Seg...6,9)-RADA880106  ASA/Volume    Volume             Accessible surface area (ASA)  Accessible surf...olfenden, 1988)  0.223000  0.095071      0.095071  0.114758  0.132829  0.000000            0.000002      32,33           0.000000         0.000000

features, df_parts, and the labels of the respective samples of the sequence DataFrame must be provided to retrieve the feature DataFrame:

    # Mean difference values are higher because here negative samples (instead of unlabeled ones in Breimann25a) are used as a reference dataset
    df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels)
    aa.display_df(df_feat, n_rows=5)

       feature                            category      subcategory        scale_name                     scale_description                  abs_auc   abs_mean_dif  mean_dif  std_test  std_ref   p_val_mann_whitney  p_val_fdr_bh  positions
    1  TMD_C_JMD_C-Seg...3,4)-KLEP840101  Energy        Charge             Charge                         Net charge (Kle...n et al., 1984)  0.335000  0.168254      0.168254  0.106692  0.124924  0.000000            0.000000      31,32,33,34,35
    2  TMD_C_JMD_C-Seg...3,4)-FINA910104  Conformation  α-helix (C-cap)    α-helix termination            Helix terminati...n et al., 1991)  0.333000  0.150698      0.150698  0.098774  0.119888  0.000000            0.000000      31,32,33,34,35
    3  TMD_C_JMD_C-Seg...6,9)-LEVM760105  Shape         Side chain length  Side chain length              Radius of gyrat... (Levitt, 1976)  0.330000  0.246867      0.246867  0.161683  0.197489  0.000000            0.000000      32,33
    4  TMD_C_JMD_C-Seg...3,4)-HUTJ700102  Energy        Entropy            Entropy                        Absolute entrop...Hutchens, 1970)  0.327000  0.162229      0.162229  0.106865  0.135247  0.000000            0.000000      31,32,33,34,35
    5  TMD_C_JMD_C-Seg...6,9)-RADA880106  ASA/Volume    Volume             Accessible surface area (ASA)  Accessible surf...olfenden, 1988)  0.322000  0.184252      0.184252  0.114758  0.164757  0.000000            0.000000      32,33

You can adjust the provided labels of the test and reference group using label_test and label_ref, which will alter the sign in mean_dif:

    df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, label_test=0, label_ref=1)
    # Mean difference values display opposite signs because they represent the computed difference between the mean of the test group and the mean of the reference group
    aa.display_df(df_feat, n_rows=5, col_to_show="mean_dif")

       mean_dif
    1  -0.168254
    2  -0.150698
    3  -0.246867
    4  -0.162229
    5  -0.184252

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

    # Shift positions by 10 residues
    df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, start=11)
    aa.display_df(df_feat, n_rows=5, col_to_show="positions")

       positions
    1  41,42,43,44,45
    2  41,42,43,44,45
    3  42,43
    4  41,42,43,44,45
    5  42,43

    # Increase TMD length from 20 to 50
    df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, tmd_len=50)
    aa.display_df(df_feat, n_rows=5, col_to_show="positions")

       positions
    1  53,54,55,56,57,58,59,60,61
    2  53,54,55,56,57,58,59,60,61
    3  55,56,57,58
    4  53,54,55,56,57,58,59,60,61
    5  55,56,57,58

A t-test can be used instead of the Mann-Whitney U test by setting parametric=True:

    df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, parametric=True)
    aa.display_df(df_feat, n_rows=5, col_to_show="p_val_ttest_indep")

       p_val_ttest_indep
    1  0.000000
    2  0.000000
    3  0.000000
    4  0.000000
    5  0.000000
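
The sign flip in mean_dif seen when swapping label_test and label_ref follows directly from its definition as the mean of the test group minus the mean of the reference group. The sketch below illustrates this on toy values; the mean_dif function and the per-sample feature values are made up for illustration and do not reproduce how the library computes feature values.

```python
from statistics import mean


def mean_dif(values, labels, label_test=1, label_ref=0):
    """Mean of the test group minus mean of the reference group (illustrative)."""
    test = [v for v, lab in zip(values, labels) if lab == label_test]
    ref = [v for v, lab in zip(values, labels) if lab == label_ref]
    if not test or not ref:
        raise ValueError("labels must contain both label_test and label_ref")
    return mean(test) - mean(ref)


# Toy per-sample feature values (hypothetical) with two class labels
values = [0.75, 0.5, 0.25, 0.0]
labels = [1, 1, 0, 0]

d = mean_dif(values, labels)
d_swapped = mean_dif(values, labels, label_test=0, label_ref=1)
print(d, d_swapped)  # 0.5 -0.5
```

Swapping the two labels only exchanges which group is subtracted from which, so the magnitude is unchanged and only the sign differs.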