SequenceFeature.get_df_pos

static SequenceFeature.get_df_pos(df_feat, col_val='mean_dif', col_cat='category', start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10, list_parts=None, normalize=False)

Create a DataFrame of aggregated (mean or sum) feature values per residue position and scale category.

Projects the per-feature statistics from a df_feat DataFrame (typically the output of CPP.run()) onto individual residue positions by spreading each feature’s value across every position its Split covers and then aggregating by scale category. The resulting position-by-category matrix is the direct input for CPPPlot position plots.
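
The projection described above can be pictured with a small pure-Python sketch. It is only an illustration under stated assumptions, not the library's implementation: the helper name project_to_positions, the toy values, and the choice to report uncovered positions as 0.0 are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def project_to_positions(rows, start=1, n_positions=40, agg="mean"):
    """Spread each feature value over its positions, then aggregate per category.

    rows: iterable of (category, value, positions), where positions is a
    comma-separated string such as "31,32,33" (mirroring df_feat["positions"]).
    Returns {category: [aggregated value for each position label]}.
    """
    labels = range(start, start + n_positions)
    # category -> position label -> values of all features covering that position
    collected = defaultdict(lambda: defaultdict(list))
    for category, value, positions in rows:
        for pos in (int(p) for p in positions.split(",")):
            collected[category][pos].append(value)
    reduce = mean if agg == "mean" else sum
    # Assumption: positions not covered by any feature are reported as 0.0
    return {
        cat: [reduce(by_pos[p]) if p in by_pos else 0.0 for p in labels]
        for cat, by_pos in sorted(collected.items())
    }


# Toy rows (hypothetical values, not taken from DOM_GSEC)
rows = [
    ("Energy", 0.5, "31,32,33,34,35"),
    ("Energy", 0.25, "31,32,33,34,35"),
    ("Shape", 0.125, "32,33"),
]
toy_pos = project_to_positions(rows)
print(toy_pos["Energy"][30])  # position 31: mean of 0.5 and 0.25 -> 0.375
print(toy_pos["Shape"][31])   # position 32: single feature -> 0.125
```

Passing agg="sum" instead mimics the summed aggregation that the col_val description mentions for importance and impact columns.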

Added in version 0.1.0.

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • col_val ({'abs_auc', 'abs_mean_dif', 'mean_dif', 'std_test', 'std_ref'}, default='mean_dif') – Column name in df_feat containing the numerical values to aggregate; these values are averaged. If feature importance or impact is provided as a 'feat_importance' or 'feat_impact' column, the values are summed instead.

  • col_cat ({'category', 'subcategory', 'scale_name'}, default='category') – Column name in df_feat for categorizing the numerical values during aggregation.

  • start (int, default=1) – Position label of first residue position (starting at N-terminus).

  • tmd_len (int, default=20) – Length of target middle domain (TMD) (>0).

  • jmd_n_len (int, default=10) – Length of JMD-N (>=0).

  • jmd_c_len (int, default=10) – Length of JMD-C (>=0).

  • list_parts (str or list of str, optional) – Specific sequence parts to consider for numerical value aggregation.

  • normalize (bool, default=False) – If True, normalizes aggregated numerical values to a total of 100%.

Returns:

df_pos – DataFrame with aggregated numerical values per position.

Return type:

pd.DataFrame, shape (n_categories, n_positions)

Notes

  • Length parameters (tmd_len, jmd_n_len, jmd_c_len) must match the feature ids in df_feat.
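
A rough sanity check for this note could look like the following sketch. It only tests whether the positions listed in df_feat fall inside the labelled range implied by the length parameters; the helper positions_fit is hypothetical and is not part of the library, which may validate the feature ids differently.

```python
def positions_fit(positions_column, start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """Return True if every listed position lies inside the labelled range."""
    first = start
    last = start + jmd_n_len + tmd_len + jmd_c_len - 1
    for entry in positions_column:
        # Each entry is a comma-separated string such as "31,32,33"
        for pos in (int(p) for p in entry.split(",")):
            if not first <= pos <= last:
                return False
    return True


# Position strings as they appear in the df_feat rows of the Examples section
example_positions = ["31,32,33,34,35", "32,33"]
print(positions_fit(example_positions))              # labels 1..40 -> True
print(positions_fit(example_positions, tmd_len=10))  # labels 1..30 -> False
```

Passing this check is necessary but not sufficient: it says nothing about whether the parts named in the feature ids have the expected lengths.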

Examples

To demonstrate the SequenceFeature().get_df_pos() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)
features = df_feat["feature"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_feat = sf.get_df_feat(features=features, labels=labels, df_parts=df_parts)
aa.display_df(df_feat, n_rows=5)
  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.301000 | 0.140000 | 0.140000 | 0.111692 | 0.110793 | 0.001116 | 0.004650 | 31,32,33,34,35
2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.295000 | 0.129490 | 0.129490 | 0.111228 | 0.125451 | 0.001413 | 0.005048 | 31,32,33,34,35
3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.335000 | 0.245149 | 0.245149 | 0.176567 | 0.182470 | 0.000289 | 0.004133 | 32,33
4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.306000 | 0.155710 | 0.155710 | 0.104963 | 0.136006 | 0.000921 | 0.004605 | 31,32,33,34,35
5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.342000 | 0.180850 | 0.180850 | 0.138541 | 0.145353 | 0.000211 | 0.005267 | 32,33

Only df_feat is required to create df_pos, which contains one aggregated numerical value (from the mean_dif column by default) per residue position and scale category level (category by default):

df_pos = sf.get_df_pos(df_feat=df_feat)
aa.display_df(df_pos, n_rows=5, show_shape=True)
DataFrame shape: (6, 40)
  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
ASA/Volume 0.000000 0.000000 0.000000 0.000000 0.000000 -0.059101 -0.040276 -0.040276 -0.059101 -0.040276 0.000000 0.004078 0.000000 0.000000 0.004078 0.000000 0.000000 0.086081 0.000000 0.000000 0.086081 0.000000 0.000000 0.086081 0.000000 0.141517 0.089894 0.093706 0.093706 0.042851 0.145993 0.070463 0.047510 0.055056 0.055056 -0.016557 0.018667 0.000000 0.000000 0.187233
Conformation 0.026359 0.000000 0.002911 -0.012762 0.000000 0.002911 -0.051884 0.026359 -0.020537 0.026359 -0.035033 -0.065563 0.002911 -0.112500 -0.039593 0.177800 -0.056972 -0.019602 0.051167 -0.009796 -0.112065 0.051167 0.000000 0.015927 -0.054268 0.060008 -0.005440 0.021499 0.043311 0.026814 0.071730 0.100947 0.130135 0.142818 0.140699 0.144639 0.110996 0.103118 0.103118 0.103118
Energy 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.107934 0.000000 0.000000 -0.011404 0.000000 0.000000 0.010029 -0.107934 -0.072600 0.010029 -0.107934 -0.066433 -0.063350 -0.045898 -0.027333 -0.021444 0.023545 0.010077 0.070978 0.055025 0.015219 0.058613 0.045465 0.051147 -0.068855 0.121590 0.121590 0.121590
Polarity 0.000000 0.000000 0.000000 -0.122285 0.000000 -0.079610 -0.100947 -0.079610 -0.079610 -0.079610 -0.122285 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.140781 -0.127236 -0.127236 -0.134008 -0.133926 -0.127236 -0.067446 -0.101554 -0.107563 -0.012088 -0.012088 -0.140616 0.000000 0.000000 0.000000
Shape 0.096837 -0.082142 0.000000 0.046300 0.020332 -0.056744 -0.022184 0.025681 -0.031325 0.062349 0.007347 -0.125000 0.000000 -0.082142 -0.125000 0.000000 -0.082142 0.000000 0.083999 -0.082142 0.000000 0.083999 0.000000 0.000000 0.000000 0.083999 -0.052823 -0.052823 -0.052823 -0.052823 0.004098 0.034592 0.073556 0.085189 0.050950 0.190450 0.000000 0.153668 0.000000 0.000000

You can change the considered numerical and categorical columns using the col_val and col_cat parameters:

df_pos = sf.get_df_pos(df_feat=df_feat, col_val="abs_auc", col_cat="subcategory")
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (35, 40)
  1 2 3 4 5
Accessible surface area (ASA) 0.000000 0.000000 0.000000 0.000000 0.000000
Amphiphilicity 0.000000 0.000000 0.000000 0.000000 0.000000
Amphiphilicity (α-helix) 0.000000 0.000000 0.000000 0.000000 0.000000
Backbone-dynamics (-CH) 0.000000 0.000000 0.000000 0.000000 0.000000
Buried 0.000000 0.000000 0.000000 0.000000 0.000000

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

# Label the first residue position as 11 instead of 1
df_pos = sf.get_df_pos(df_feat=df_feat, start=11)
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (6, 40)
  11 12 13 14 15
ASA/Volume 0.000000 0.004078 0.000000 0.000000 0.004078
Conformation -0.035033 -0.065563 0.002911 -0.112500 -0.039593
Energy 0.000000 0.000000 -0.107934 0.000000 0.000000
Polarity -0.122285 0.000000 0.000000 0.000000 0.000000
Shape 0.007347 -0.125000 0.000000 -0.082142 -0.125000
# Increase TMD length from 20 to 50
df_pos = sf.get_df_pos(df_feat=df_feat, tmd_len=50)
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)
DataFrame shape: (6, 70)
  1 2 3 4 5
ASA/Volume 0.000000 0.000000 0.000000 0.000000 0.000000
Conformation 0.026359 0.000000 0.002911 -0.012762 0.000000
Energy 0.000000 0.000000 0.000000 0.000000 0.000000
Polarity 0.000000 0.000000 0.000000 -0.122285 0.000000
Shape 0.096837 -0.082142 0.000000 0.046300 0.020332
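
How these four parameters determine the column labels can be sketched as follows. This is a minimal illustration, assuming the parts are laid out as JMD-N, TMD, JMD-C from the N-terminus and labelled consecutively from start; the helper split_position_labels is hypothetical and not part of the library.

```python
def split_position_labels(start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """Return the position labels of each part, ordered from the N-terminus."""
    labels = list(range(start, start + jmd_n_len + tmd_len + jmd_c_len))
    return {
        "jmd_n": labels[:jmd_n_len],
        "tmd": labels[jmd_n_len:jmd_n_len + tmd_len],
        "jmd_c": labels[jmd_n_len + tmd_len:],
    }


default = split_position_labels()
shifted = split_position_labels(start=11)
longer = split_position_labels(tmd_len=50)

print(default["tmd"][0], default["tmd"][-1])  # 11 30
print(shifted["jmd_n"][:5])                   # [11, 12, 13, 14, 15]
print(sum(len(v) for v in longer.values()))   # 70
```

The totals agree with the shapes shown above: 40 columns for the default lengths and 70 columns for tmd_len=50.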

You can select specific sequence parts and normalize the results using the list_parts and normalize parameters:

df_pos = sf.get_df_pos(df_feat=df_feat, list_parts=["jmd_c"], normalize=True)
aa.display_df(df_pos)
  jmd_c
ASA/Volume 0.476542
Conformation 0.973787
Energy 0.501020
Polarity -0.480913
Shape 0.501140
Structure-Activity -0.117262