aaanalysis.SequenceFeature.get_df_pos

static SequenceFeature.get_df_pos(df_feat=None, col_val='mean_dif', col_cat='category', start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10, list_parts=None, normalize=False)

Create DataFrame of aggregated (mean or sum) feature values per residue position and scale.

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • col_val ({'abs_auc', 'abs_mean_dif', 'mean_dif', 'std_test', 'std_ref'}, default='mean_dif') – Column name in df_feat containing the numerical values to aggregate. Values are averaged by default; if feature importance or impact is provided in the {'feat_importance', 'feat_impact'} columns, those values are summed instead.

  • col_cat ({'category', 'subcategory', 'scale_name'}, default='category') – Column name in df_feat for categorizing the numerical values during aggregation.

  • start (int, default=1) – Position label of first residue position (starting at N-terminus).

  • tmd_len (int, default=20) – Length of TMD (>0).

  • jmd_n_len (int, default=10) – Length of JMD-N (>=0).

  • jmd_c_len (int, default=10) – Length of JMD-C (>=0).

  • list_parts (str or list of str, optional) – Specific sequence parts to consider for numerical value aggregation.

  • normalize (bool, default=False) – If True, normalizes aggregated numerical values to a total of 100%.

Returns:

df_pos – DataFrame with aggregated numerical values per position.

Return type:

pd.DataFrame, shape (n_categories, n_positions)

Notes

  • The length parameters (tmd_len, jmd_n_len, jmd_c_len) must match the feature IDs in df_feat.
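The relationship between the length parameters and the resulting position labels can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper names position_labels and part_ranges are hypothetical, and the sketch assumes the parts are laid out from the N-terminus in the order JMD-N, TMD, JMD-C.

```python
def position_labels(start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """Return the position labels spanning JMD-N, TMD, and JMD-C (hypothetical helper)."""
    if tmd_len <= 0:
        raise ValueError("tmd_len must be > 0")
    if jmd_n_len < 0 or jmd_c_len < 0:
        raise ValueError("jmd_n_len and jmd_c_len must be >= 0")
    n_positions = jmd_n_len + tmd_len + jmd_c_len
    # Labels are consecutive integers beginning at `start`
    return list(range(start, start + n_positions))


def part_ranges(start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """Map each basic part to its (first, last) position label (hypothetical helper).

    Assumes the order JMD-N, TMD, JMD-C; a part of length 0 yields last < first.
    """
    tmd_first = start + jmd_n_len
    jmd_c_first = tmd_first + tmd_len
    return {
        "jmd_n": (start, tmd_first - 1),
        "tmd": (tmd_first, jmd_c_first - 1),
        "jmd_c": (jmd_c_first, jmd_c_first + jmd_c_len - 1),
    }


labels = position_labels()
print(len(labels), labels[0], labels[-1])
print(part_ranges())
```

Under these assumptions, the defaults give 40 positions labeled 1 to 40, with JMD-C covering positions 31 to 40.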

Examples

To demonstrate the SequenceFeature().get_df_pos() method, we load the DOM_GSEC example dataset together with its features (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)
features = df_feat["feature"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_feat = sf.get_df_feat(features=features, labels=labels, df_parts=df_parts)
aa.display_df(df_feat, n_rows=5)
  feature category subcategory scale_name scale_description abs_auc abs_mean_dif mean_dif std_test std_ref p_val_mann_whitney p_val_fdr_bh positions
1 TMD_C_JMD_C-Seg...3,4)-KLEP840101 Energy Charge Charge Net charge (Kle...n et al., 1984) 0.301000 0.140000 0.140000 0.111692 0.110793 0.001116 0.004650 31,32,33,34,35
2 TMD_C_JMD_C-Seg...3,4)-FINA910104 Conformation α-helix (C-cap) α-helix termination Helix terminati...n et al., 1991) 0.295000 0.129490 0.129490 0.111228 0.125451 0.001413 0.005048 31,32,33,34,35
3 TMD_C_JMD_C-Seg...6,9)-LEVM760105 Shape Side chain length Side chain length Radius of gyrat... (Levitt, 1976) 0.335000 0.245149 0.245149 0.176567 0.182470 0.000289 0.004133 32,33
4 TMD_C_JMD_C-Seg...3,4)-HUTJ700102 Energy Entropy Entropy Absolute entrop...Hutchens, 1970) 0.306000 0.155710 0.155710 0.104963 0.136006 0.000921 0.004605 31,32,33,34,35
5 TMD_C_JMD_C-Seg...6,9)-RADA880106 ASA/Volume Volume Accessible surface area (ASA) Accessible surf...olfenden, 1988) 0.342000 0.180850 0.180850 0.138541 0.145353 0.000211 0.005267 32,33

df_feat must be provided to create df_pos, which contains one aggregated numerical value (from the mean_dif column by default) per residue position and scale classification level (category by default):

df_pos = sf.get_df_pos(df_feat=df_feat)
aa.display_df(df_pos, n_rows=5, show_shape=True)
DataFrame shape: (6, 40)
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
ASA/Volume 0.000000 0.000000 0.000000 0.000000 0.000000 -0.059101 -0.040276 -0.040276 -0.059101 -0.040276 0.000000 0.004078 0.000000 0.000000 0.004078 0.000000 0.000000 0.086081 0.000000 0.000000 0.086081 0.000000 0.000000 0.086081 0.000000 0.141517 0.089894 0.093706 0.093706 0.042851 0.145993 0.070463 0.047510 0.055056 0.055056 -0.016557 0.018667 0.000000 0.000000 0.187233
Conformation 0.026359 0.000000 0.002911 -0.012762 0.000000 0.002911 -0.051884 0.026359 -0.020537 0.026359 -0.035033 -0.065563 0.002911 -0.112500 -0.039593 0.177800 -0.056972 -0.019602 0.051167 -0.009796 -0.112065 0.051167 0.000000 0.015927 -0.054268 0.060008 -0.005440 0.021499 0.043311 0.026814 0.071730 0.100947 0.130135 0.142818 0.140699 0.144639 0.110996 0.103118 0.103118 0.103118
Energy 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.107934 0.000000 0.000000 -0.011404 0.000000 0.000000 0.010029 -0.107934 -0.072600 0.010029 -0.107934 -0.066433 -0.063350 -0.045898 -0.027333 -0.021444 0.023545 0.010077 0.070978 0.055025 0.015219 0.058613 0.045465 0.051147 -0.068855 0.121590 0.121590 0.121590
Polarity 0.000000 0.000000 0.000000 -0.122285 0.000000 -0.079610 -0.100947 -0.079610 -0.079610 -0.079610 -0.122285 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.140781 -0.127236 -0.127236 -0.134008 -0.133926 -0.127236 -0.067446 -0.101554 -0.107563 -0.012088 -0.012088 -0.140616 0.000000 0.000000 0.000000
Shape 0.096837 -0.082142 0.000000 0.046300 0.020332 -0.056744 -0.022184 0.025681 -0.031325 0.062349 0.007347 -0.125000 0.000000 -0.082142 -0.125000 0.000000 -0.082142 0.000000 0.083999 -0.082142 0.000000 0.083999 0.000000 0.000000 0.000000 0.083999 -0.052823 -0.052823 -0.052823 -0.052823 0.004098 0.034592 0.073556 0.085189 0.050950 0.190450 0.000000 0.153668 0.000000 0.000000
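To make the aggregation concrete, the following is a minimal, dependency-free sketch of averaging feature values per category and position. It is not the library's implementation: the function aggregate_per_position and the toy rows are hypothetical, and the sketch assumes that each feature contributes its value to every position listed in its positions string, that values are averaged over all features covering a cell, and that uncovered cells are reported as 0.

```python
from collections import defaultdict


def aggregate_per_position(rows, labels, col_val="mean_dif", col_cat="category"):
    """Average col_val per (category, position); uncovered cells stay 0.0.

    Hypothetical helper for illustration; rows are plain dicts, not a DataFrame.
    """
    collected = defaultdict(list)
    for row in rows:
        # 'positions' is a comma-separated string such as "31,32,33"
        for pos in (int(p) for p in row["positions"].split(",")):
            collected[(row[col_cat], pos)].append(row[col_val])
    categories = sorted({row[col_cat] for row in rows})
    return {
        cat: [
            sum(collected[(cat, pos)]) / len(collected[(cat, pos)])
            if (cat, pos) in collected
            else 0.0
            for pos in labels
        ]
        for cat in categories
    }


# Toy feature rows (made-up values, not taken from DOM_GSEC)
toy_rows = [
    {"category": "Energy", "mean_dif": 0.25, "positions": "3,4"},
    {"category": "Energy", "mean_dif": 0.75, "positions": "4,5"},
    {"category": "Shape", "mean_dif": -0.5, "positions": "1,2"},
]
toy_pos = aggregate_per_position(toy_rows, labels=[1, 2, 3, 4, 5])
for cat, values in toy_pos.items():
    print(cat, values)
```

In this toy case, position 4 is covered by two Energy features, so its cell holds their mean (0.5), while positions covered by a single feature keep that feature's value.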

You can change the considered numerical and categorical columns using the col_val and col_cat parameters:

df_pos = sf.get_df_pos(df_feat=df_feat, col_val="abs_auc", col_cat="subcategory")
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (35, 40)
  1 2 3 4 5
Accessible surface area (ASA) 0.000000 0.000000 0.000000 0.000000 0.000000
Amphiphilicity 0.000000 0.000000 0.000000 0.000000 0.000000
Amphiphilicity (α-helix) 0.000000 0.000000 0.000000 0.000000 0.000000
Backbone-dynamics (-CH) 0.000000 0.000000 0.000000 0.000000 0.000000
Buried 0.000000 0.000000 0.000000 0.000000 0.000000

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

# Shift positions by 10 residues
df_pos = sf.get_df_pos(df_feat=df_feat, start=11)
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (6, 40)
  11 12 13 14 15
ASA/Volume 0.000000 0.004078 0.000000 0.000000 0.004078
Conformation -0.035033 -0.065563 0.002911 -0.112500 -0.039593
Energy 0.000000 0.000000 -0.107934 0.000000 0.000000
Polarity -0.122285 0.000000 0.000000 0.000000 0.000000
Shape 0.007347 -0.125000 0.000000 -0.082142 -0.125000
# Increase TMD length from 20 to 50
df_pos = sf.get_df_pos(df_feat=df_feat, tmd_len=50)
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (6, 70)
  1 2 3 4 5
ASA/Volume 0.000000 0.000000 0.000000 0.000000 0.000000
Conformation 0.026359 0.000000 0.002911 -0.012762 0.000000
Energy 0.000000 0.000000 0.000000 0.000000 0.000000
Polarity 0.000000 0.000000 0.000000 -0.122285 0.000000
Shape 0.096837 -0.082142 0.000000 0.046300 0.020332

You can select specific sequence parts and normalize the results using the list_parts and normalize parameters:

df_pos = sf.get_df_pos(df_feat=df_feat, list_parts=["jmd_c"], normalize=True)
aa.display_df(df_pos)
  jmd_c
ASA/Volume 0.476542
Conformation 0.973787
Energy 0.501020
Polarity -0.480913
Shape 0.501140
Structure-Activity -0.117262