aaanalysis.SequenceFeature.get_df_pos
- static SequenceFeature.get_df_pos(df_feat=None, col_val='mean_dif', col_cat='category', start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10, list_parts=None, normalize=False)
Create DataFrame of aggregated (mean or sum) feature values per residue position and scale.
- Parameters:
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
col_val ({'abs_auc', 'abs_mean_dif', 'mean_dif', 'std_test', 'std_ref'}, default='mean_dif') – Column name in df_feat containing numerical values to average. If feature importance and impact are provided as {'feat_importance', 'feat_impact'} columns, their sum of values is computed.
col_cat ({'category', 'subcategory', 'scale_name'}, default='category') – Column name in df_feat for categorizing the numerical values during aggregation.
start (int, default=1) – Position label of first residue position (starting at N-terminus).
tmd_len (int, default=20) – Length of TMD (>0).
jmd_n_len (int, default=10) – Length of JMD-N (>=0).
jmd_c_len (int, default=10) – Length of JMD-C (>=0).
list_parts (str or list of str, optional) – Specific sequence parts to consider for numerical value aggregation.
normalize (bool, default=False) – If True, normalizes aggregated numerical values to a total of 100%.
- Returns:
df_pos – DataFrame with aggregated numerical values per position.
- Return type:
pd.DataFrame, shape (n_categories, n_positions)
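The aggregation described above can be pictured with a minimal, library-independent sketch. It is not the implementation used by aaanalysis: the helper name `aggregate_per_position`, the miniature feature list, and the zero fill for uncovered positions are assumptions made for illustration (the zero fill mirrors the `0.000000` entries in the example output further down). It also assumes that the `positions` strings use the same labels as the output columns.

```python
from collections import defaultdict

# Hypothetical miniature of df_feat: (category, mean_dif, positions) per feature.
features = [
    ("Energy", 0.14, "31,32,33"),
    ("Energy", 0.10, "32,33"),
    ("Shape", 0.24, "32,33"),
]


def aggregate_per_position(features, start=1, n_positions=40):
    """Average feature values per (category, position); uncovered positions stay 0.0."""
    collected = defaultdict(list)
    for category, value, positions in features:
        # Each feature contributes its value to every position it covers.
        for pos in positions.split(","):
            collected[(category, int(pos))].append(value)
    categories = sorted({category for category, _, _ in features})
    labels = range(start, start + n_positions)
    return {
        category: [
            sum(collected[(category, pos)]) / len(collected[(category, pos)])
            if (category, pos) in collected
            else 0.0
            for pos in labels
        ]
        for category in categories
    }


table = aggregate_per_position(features)
# Index 31 corresponds to position label 32 (labels start at 1).
print(round(table["Energy"][31], 3))
```

Each row of the resulting mapping corresponds to one category and each list entry to one residue position, which is the same layout as the returned `df_pos`.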
Notes
Length parameters (tmd_len, jmd_n_len, jmd_c_len) must match the feature ids in df_feat.
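As a rough illustration of how the length parameters relate to the columns of `df_pos`, the following sketch computes the position labels implied by `start`, `tmd_len`, `jmd_n_len`, and `jmd_c_len`. The helper `position_labels` is hypothetical and not part of aaanalysis. It only reproduces the arithmetic visible in the examples below: 40 columns by default, 70 columns for `tmd_len=50`, and labels beginning at 11 for `start=11`.

```python
def position_labels(start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """Return position labels spanning JMD-N, TMD, and JMD-C (N- to C-terminus)."""
    # Same constraints as documented for the parameters above.
    if tmd_len <= 0:
        raise ValueError("tmd_len must be > 0")
    if jmd_n_len < 0 or jmd_c_len < 0:
        raise ValueError("jmd_n_len and jmd_c_len must be >= 0")
    total = jmd_n_len + tmd_len + jmd_c_len
    return list(range(start, start + total))


print(len(position_labels()))            # 40
print(position_labels(start=11)[:5])     # [11, 12, 13, 14, 15]
print(len(position_labels(tmd_len=50)))  # 70
```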
Examples
To demonstrate the SequenceFeature().get_df_pos() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False
    df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(100)
    features = df_feat["feature"].to_list()
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    df_feat = sf.get_df_feat(features=features, labels=labels, df_parts=df_parts)
    aa.display_df(df_feat, n_rows=5)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.301000 | 0.140000 | 0.140000 | 0.111692 | 0.110793 | 0.001116 | 0.004650 | 31,32,33,34,35 |
| 2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.295000 | 0.129490 | 0.129490 | 0.111228 | 0.125451 | 0.001413 | 0.005048 | 31,32,33,34,35 |
| 3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.335000 | 0.245149 | 0.245149 | 0.176567 | 0.182470 | 0.000289 | 0.004133 | 32,33 |
| 4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.306000 | 0.155710 | 0.155710 | 0.104963 | 0.136006 | 0.000921 | 0.004605 | 31,32,33,34,35 |
| 5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.342000 | 0.180850 | 0.180850 | 0.138541 | 0.145353 | 0.000211 | 0.005267 | 32,33 |

df_feat must be provided to create df_pos, which contains one aggregated numerical value (mean_dif column by default) per residue position for each level of the selected scale classification (category by default):

    df_pos = sf.get_df_pos(df_feat=df_feat)
    aa.display_df(df_pos, n_rows=5, show_shape=True)
DataFrame shape: (6, 40)
| | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ASA/Volume | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.059101 | -0.040276 | -0.040276 | -0.059101 | -0.040276 | 0.000000 | 0.004078 | 0.000000 | 0.000000 | 0.004078 | 0.000000 | 0.000000 | 0.086081 | 0.000000 | 0.000000 | 0.086081 | 0.000000 | 0.000000 | 0.086081 | 0.000000 | 0.141517 | 0.089894 | 0.093706 | 0.093706 | 0.042851 | 0.145993 | 0.070463 | 0.047510 | 0.055056 | 0.055056 | -0.016557 | 0.018667 | 0.000000 | 0.000000 | 0.187233 |
| Conformation | 0.026359 | 0.000000 | 0.002911 | -0.012762 | 0.000000 | 0.002911 | -0.051884 | 0.026359 | -0.020537 | 0.026359 | -0.035033 | -0.065563 | 0.002911 | -0.112500 | -0.039593 | 0.177800 | -0.056972 | -0.019602 | 0.051167 | -0.009796 | -0.112065 | 0.051167 | 0.000000 | 0.015927 | -0.054268 | 0.060008 | -0.005440 | 0.021499 | 0.043311 | 0.026814 | 0.071730 | 0.100947 | 0.130135 | 0.142818 | 0.140699 | 0.144639 | 0.110996 | 0.103118 | 0.103118 | 0.103118 |
| Energy | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.107934 | 0.000000 | 0.000000 | -0.011404 | 0.000000 | 0.000000 | 0.010029 | -0.107934 | -0.072600 | 0.010029 | -0.107934 | -0.066433 | -0.063350 | -0.045898 | -0.027333 | -0.021444 | 0.023545 | 0.010077 | 0.070978 | 0.055025 | 0.015219 | 0.058613 | 0.045465 | 0.051147 | -0.068855 | 0.121590 | 0.121590 | 0.121590 |
| Polarity | 0.000000 | 0.000000 | 0.000000 | -0.122285 | 0.000000 | -0.079610 | -0.100947 | -0.079610 | -0.079610 | -0.079610 | -0.122285 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -0.140781 | -0.127236 | -0.127236 | -0.134008 | -0.133926 | -0.127236 | -0.067446 | -0.101554 | -0.107563 | -0.012088 | -0.012088 | -0.140616 | 0.000000 | 0.000000 | 0.000000 |
| Shape | 0.096837 | -0.082142 | 0.000000 | 0.046300 | 0.020332 | -0.056744 | -0.022184 | 0.025681 | -0.031325 | 0.062349 | 0.007347 | -0.125000 | 0.000000 | -0.082142 | -0.125000 | 0.000000 | -0.082142 | 0.000000 | 0.083999 | -0.082142 | 0.000000 | 0.083999 | 0.000000 | 0.000000 | 0.000000 | 0.083999 | -0.052823 | -0.052823 | -0.052823 | -0.052823 | 0.004098 | 0.034592 | 0.073556 | 0.085189 | 0.050950 | 0.190450 | 0.000000 | 0.153668 | 0.000000 | 0.000000 |

You can change the considered numerical and categorical columns using the col_val and col_cat parameters:

    df_pos = sf.get_df_pos(df_feat=df_feat, col_val="abs_auc", col_cat="subcategory")
    aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (35, 40)
| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| Accessible surface area (ASA) | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Amphiphilicity | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Amphiphilicity (α-helix) | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Backbone-dynamics (-CH) | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Buried | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

    # Shift positions by 10 residues
    df_pos = sf.get_df_pos(df_feat=df_feat, start=11)
    aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (6, 40)
| | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|
| ASA/Volume | 0.000000 | 0.004078 | 0.000000 | 0.000000 | 0.004078 |
| Conformation | -0.035033 | -0.065563 | 0.002911 | -0.112500 | -0.039593 |
| Energy | 0.000000 | 0.000000 | -0.107934 | 0.000000 | 0.000000 |
| Polarity | -0.122285 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Shape | 0.007347 | -0.125000 | 0.000000 | -0.082142 | -0.125000 |

    # Increase TMD length from 20 to 50
    df_pos = sf.get_df_pos(df_feat=df_feat, tmd_len=50)
    aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (6, 70)
| | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| ASA/Volume | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Conformation | 0.026359 | 0.000000 | 0.002911 | -0.012762 | 0.000000 |
| Energy | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| Polarity | 0.000000 | 0.000000 | 0.000000 | -0.122285 | 0.000000 |
| Shape | 0.096837 | -0.082142 | 0.000000 | 0.046300 | 0.020332 |

You can select specific sequence parts and normalize the results using the list_parts and normalize parameters:

    df_pos = sf.get_df_pos(df_feat=df_feat, list_parts=["jmd_c"], normalize=True)
    aa.display_df(df_pos)
| | jmd_c |
|---|---|
| ASA/Volume | 0.476542 |
| Conformation | 0.973787 |
| Energy | 0.501020 |
| Polarity | -0.480913 |
| Shape | 0.501140 |
| Structure-Activity | -0.117262 |