SequenceFeature.get_df_pos

static SequenceFeature.get_df_pos(df_feat, col_val='mean_dif', col_cat='category', start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10, list_parts=None, normalize=False)

Create a DataFrame of feature values aggregated (mean or sum) per residue position and scale category.

Projects the per-feature statistics from a df_feat DataFrame (typically the output of CPP.run()) onto individual residue positions by spreading each feature’s value across every position its Split covers and then aggregating by scale category. The resulting position-by-category matrix is the direct input for CPPPlot position plots.
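The projection described above can be illustrated with a small, library-independent sketch. It is not the aaanalysis implementation: the helper name project_to_positions, the tuple-based input, and the choice to fill uncovered positions with 0.0 are assumptions made for illustration. The three toy rows reuse category, mean_dif, and positions values from the example df_feat shown in the Examples section.

```python
from collections import defaultdict


def project_to_positions(features, n_pos=40, start=1, agg="mean"):
    """Spread each feature value over its positions, then aggregate per category.

    Hypothetical helper (not part of aaanalysis). `features` is a list of
    (category, value, positions) tuples mirroring the col_cat, col_val, and
    'positions' columns of df_feat; `positions` is a comma-separated string.
    """
    labels = list(range(start, start + n_pos))
    # collected[category][position] -> values of all features covering that position
    collected = defaultdict(lambda: defaultdict(list))
    for category, value, positions in features:
        for pos in (int(p) for p in positions.split(",")):
            collected[category][pos].append(value)

    table = {}
    for category, by_pos in collected.items():
        row = []
        for pos in labels:
            vals = by_pos.get(pos, [])
            if not vals:
                row.append(0.0)  # assumption: uncovered positions are reported as 0
            elif agg == "sum":
                row.append(sum(vals))
            else:
                row.append(sum(vals) / len(vals))
        table[category] = row
    return labels, table


# Toy input: (category, mean_dif, positions) taken from the example df_feat
rows = [
    ("Energy", 0.140000, "31,32,33,34,35"),
    ("Energy", 0.155710, "31,32,33,34,35"),
    ("Shape", 0.245149, "32,33"),
]
labels, table = project_to_positions(rows)
energy_at_31 = table["Energy"][labels.index(31)]
shape_at_32 = table["Shape"][labels.index(32)]
print(round(energy_at_31, 6), shape_at_32)
```

Passing agg="sum" mirrors the documented behavior for the feat_importance and feat_impact columns, where values are summed rather than averaged.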

Added in version 0.1.0.

Parameters:

df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
col_val ({'abs_auc', 'abs_mean_dif', 'mean_dif', 'std_test', 'std_ref'}, default='mean_dif') – Column name in df_feat containing the numerical values to average. If feature importance or feature impact is provided via the 'feat_importance' or 'feat_impact' column, the values are summed instead of averaged.
col_cat ({'category', 'subcategory', 'scale_name'}, default='category') – Column name in df_feat for categorizing the numerical values during aggregation.
start (int, default=1) – Position label of first residue position (starting at N-terminus).
tmd_len (int, default=20) – Length of target middle domain (TMD) (>0).
jmd_n_len (int, default=10) – Length of JMD-N (>=0).
jmd_c_len (int, default=10) – Length of JMD-C (>=0).
list_parts (str or list of str, optional) – Specific sequence parts to consider for numerical value aggregation.
normalize (bool, default=False) – If True, normalizes aggregated numerical values to a total of 100%.

Returns:

df_pos – DataFrame with aggregated numerical values per position.

Return type:

pd.DataFrame, shape (n_categories, n_positions)

Notes

The length parameters (tmd_len, jmd_n_len, jmd_c_len) must match the feature ids in df_feat.
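As a rough illustration of this constraint, the sketch below derives the position labels from the length parameters and checks whether a feature's positions fall inside them. The helpers position_labels and positions_fit are hypothetical and not part of aaanalysis; the assumption that the number of columns equals jmd_n_len + tmd_len + jmd_c_len is consistent with the shapes shown in the Examples section, (6, 40) for the defaults and (6, 70) for tmd_len=50.

```python
def position_labels(start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """Return the residue position labels spanned by JMD-N, TMD, and JMD-C.

    Hypothetical helper: assumes one column per residue, labeled consecutively
    from `start`.
    """
    n_pos = jmd_n_len + tmd_len + jmd_c_len
    return list(range(start, start + n_pos))


def positions_fit(positions, labels):
    """Check that every position of a feature (comma-separated string) is a valid label."""
    valid = set(labels)
    return all(int(p) in valid for p in positions.split(","))


default_labels = position_labels()            # 40 labels
shifted_labels = position_labels(start=11)    # same length, labels shifted by 10
long_tmd_labels = position_labels(tmd_len=50)  # 70 labels

# Positions of the first example feature in df_feat
ok = positions_fit("31,32,33,34,35", default_labels)
# With shorter parts, the same feature no longer fits the layout
too_short = positions_fit("31,32,33,34,35", position_labels(tmd_len=10, jmd_c_len=0))
print(len(default_labels), shifted_labels[0], len(long_tmd_labels), ok, too_short)
```

A check of this kind only catches positions that fall outside the layout; it cannot tell whether lengths that happen to fit are the ones the features were actually created with.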

Examples

To demonstrate the SequenceFeature().get_df_pos() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False
# Load example sequences and their labels
df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
# Load the precomputed features and keep the first 100
df_feat = aa.load_features(name="DOM_GSEC").head(100)
features = df_feat["feature"].to_list()
# Build sequence parts and recompute the feature statistics
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_feat = sf.get_df_feat(features=features, labels=labels, df_parts=df_parts)
aa.display_df(df_feat, n_rows=5)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions
1	TMD_C_JMD_C-Seg...3,4)-KLEP840101	Energy	Charge	Charge	Net charge (Kle...n et al., 1984)	0.301000	0.140000	0.140000	0.111692	0.110793	0.001116	0.004628	31,32,33,34,35
2	TMD_C_JMD_C-Seg...3,4)-FINA910104	Conformation	α-helix (C-cap)	α-helix termination	Helix terminati...n et al., 1991)	0.295000	0.129490	0.129490	0.111228	0.125451	0.001413	0.004628	31,32,33,34,35
3	TMD_C_JMD_C-Seg...6,9)-LEVM760105	Shape	Side chain length	Side chain length	Radius of gyrat... (Levitt, 1976)	0.335000	0.245149	0.245149	0.176567	0.182470	0.000289	0.003457	32,33
4	TMD_C_JMD_C-Seg...3,4)-HUTJ700102	Energy	Entropy	Entropy	Absolute entrop...Hutchens, 1970)	0.306000	0.155710	0.155710	0.104963	0.136006	0.000921	0.004393	31,32,33,34,35
5	TMD_C_JMD_C-Seg...6,9)-RADA880106	ASA/Volume	Volume	Accessible surface area (ASA)	Accessible surf...olfenden, 1988)	0.342000	0.180850	0.180850	0.138541	0.145353	0.000211	0.003457	32,33

df_feat must be provided to create df_pos, which contains one aggregated numerical value (from the mean_dif column by default) per residue position and scale category level (category by default):

df_pos = sf.get_df_pos(df_feat=df_feat)
aa.display_df(df_pos, n_rows=5, show_shape=True)

DataFrame shape: (6, 40)

	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20	21	22	23	24	25	26	27	28	29	30	31	32	33	34	35	36	37	38	39	40
ASA/Volume	0.000000	0.000000	0.000000	0.000000	0.000000	-0.059101	-0.040276	-0.040276	-0.059101	-0.040276	0.000000	0.004078	0.000000	0.000000	0.004078	0.000000	0.000000	0.086081	0.000000	0.000000	0.086081	0.000000	0.000000	0.086081	0.000000	0.141517	0.089894	0.093706	0.093706	0.042851	0.145993	0.070463	0.047510	0.055056	0.055056	-0.016557	0.018667	0.000000	0.000000	0.187233
Conformation	0.026359	0.000000	0.002911	-0.012762	0.000000	0.002911	-0.051884	0.026359	-0.020537	0.026359	-0.035033	-0.065563	0.002911	-0.112500	-0.039593	0.177800	-0.056972	-0.019602	0.051167	-0.009796	-0.112065	0.051167	0.000000	0.015927	-0.054268	0.060008	-0.005440	0.021499	0.043311	0.026814	0.071730	0.100947	0.130135	0.142818	0.140699	0.144639	0.110996	0.103118	0.103118	0.103118
Energy	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-0.107934	0.000000	0.000000	-0.011404	0.000000	0.000000	0.010029	-0.107934	-0.072600	0.010029	-0.107934	-0.066433	-0.063350	-0.045898	-0.027333	-0.021444	0.023545	0.010077	0.070978	0.055025	0.015219	0.058613	0.045465	0.051147	-0.068855	0.121590	0.121590	0.121590
Polarity	0.000000	0.000000	0.000000	-0.122285	0.000000	-0.079610	-0.100947	-0.079610	-0.079610	-0.079610	-0.122285	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	-0.140781	-0.127236	-0.127236	-0.134008	-0.133926	-0.127236	-0.067446	-0.101554	-0.107563	-0.012088	-0.012088	-0.140616	0.000000	0.000000	0.000000
Shape	0.096837	-0.082142	0.000000	0.046300	0.020332	-0.056744	-0.022184	0.025681	-0.031325	0.062349	0.007347	-0.125000	0.000000	-0.082142	-0.125000	0.000000	-0.082142	0.000000	0.083999	-0.082142	0.000000	0.083999	0.000000	0.000000	0.000000	0.083999	-0.052823	-0.052823	-0.052823	-0.052823	0.004098	0.034592	0.073556	0.085189	0.050950	0.190450	0.000000	0.153668	0.000000	0.000000

You can change the considered numerical and categorical columns using the col_val and col_cat parameters:

df_pos = sf.get_df_pos(df_feat=df_feat, col_val="abs_auc", col_cat="subcategory")
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)

DataFrame shape: (35, 40)

	1	2	3	4	5
Accessible surface area (ASA)	0.000000	0.000000	0.000000	0.000000	0.000000
Amphiphilicity	0.000000	0.000000	0.000000	0.000000	0.000000
Amphiphilicity (α-helix)	0.000000	0.000000	0.000000	0.000000	0.000000
Backbone-dynamics (-CH)	0.000000	0.000000	0.000000	0.000000	0.000000
Buried	0.000000	0.000000	0.000000	0.000000	0.000000

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

# Shift positions by 10 residues
df_pos = sf.get_df_pos(df_feat=df_feat, start=11)
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)

DataFrame shape: (6, 40)

	11	12	13	14	15
ASA/Volume	0.000000	0.004078	0.000000	0.000000	0.004078
Conformation	-0.035033	-0.065563	0.002911	-0.112500	-0.039593
Energy	0.000000	0.000000	-0.107934	0.000000	0.000000
Polarity	-0.122285	0.000000	0.000000	0.000000	0.000000
Shape	0.007347	-0.125000	0.000000	-0.082142	-0.125000

# Increase TMD length from 20 to 50
df_pos = sf.get_df_pos(df_feat=df_feat, tmd_len=50)
aa.display_df(df_pos, n_rows=5, show_shape=True, n_cols=5)

DataFrame shape: (6, 70)

	1	2	3	4	5
ASA/Volume	0.000000	0.000000	0.000000	0.000000	0.000000
Conformation	0.026359	0.000000	0.002911	-0.012762	0.000000
Energy	0.000000	0.000000	0.000000	0.000000	0.000000
Polarity	0.000000	0.000000	0.000000	-0.122285	0.000000
Shape	0.096837	-0.082142	0.000000	0.046300	0.020332

You can select specific parts and normalize the results using the list_parts and normalize parameters:

df_pos = sf.get_df_pos(df_feat=df_feat, list_parts=["jmd_c"], normalize=True)
aa.display_df(df_pos)

	jmd_c
ASA/Volume	0.476542
Conformation	0.973787
Energy	0.501020
Polarity	-0.480913
Shape	0.501140
Structure-Activity	-0.117262

The JMD flank lengths are set with jmd_n_len and jmd_c_len. They determine how residue positions map onto the JMD-N, TMD, JMD-C layout and must match the geometry encoded in the feature ids. The call below passes the default values explicitly:

df_pos = sf.get_df_pos(df_feat=df_feat, jmd_n_len=10, jmd_c_len=10)
aa.display_df(df_pos, n_rows=10, show_shape=True, n_cols=5)

DataFrame shape: (6, 40)

	1	2	3	4	5
ASA/Volume	0.000000	0.000000	0.000000	0.000000	0.000000
Conformation	0.026359	0.000000	0.002911	-0.012762	0.000000
Energy	0.000000	0.000000	0.000000	0.000000	0.000000
Polarity	0.000000	0.000000	0.000000	-0.122285	0.000000
Shape	0.096837	-0.082142	0.000000	0.046300	0.020332
Structure-Activity	0.067900	0.067900	0.067900	0.067900	-0.077275