SequenceFeature.get_df_feat

SequenceFeature.get_df_feat(features, df_parts, labels, label_test=1, label_ref=0, df_scales=None, df_cat=None, start=1, tmd_len=20, jmd_c_len=10, jmd_n_len=10, accept_gaps=False, parametric=False, n_jobs=1)

Create feature DataFrame for given features.

Depending on the provided labels, the DataFrame is created for one of the three following cases:

Group vs group comparison

Sample vs group comparison

Sample vs sample comparison

For the group vs group comparison, the general feature position will be provided.
For sample vs group or sample vs sample comparison, the amino acid segments and patterns for the respective sample from the test dataset (label = 1) will be given.
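As a rough illustration of how the labels could determine the comparison case, the sketch below counts test and reference samples. It assumes (this is not stated in the documentation above) that exactly one sample per label marks the "sample" side of a comparison and that more than one marks a group. comparison_case is a hypothetical helper and not part of aaanalysis.

```python
from collections import Counter

def comparison_case(labels, label_test=1, label_ref=0):
    """Hypothetical helper: guess the comparison case from label counts.

    Assumption: a group has more than one sample, a 'sample' has exactly one.
    """
    counts = Counter(labels)
    unexpected = set(counts) - {label_test, label_ref}
    if unexpected:
        raise ValueError(f"Unexpected label values: {sorted(unexpected)}")
    n_test, n_ref = counts[label_test], counts[label_ref]
    if n_test == 0 or n_ref == 0:
        raise ValueError("Both test and reference labels must be present")
    if n_test > 1 and n_ref > 1:
        return "group vs group"
    if n_test == 1 and n_ref > 1:
        return "sample vs group"
    if n_test == 1 and n_ref == 1:
        return "sample vs sample"
    raise ValueError("Unsupported label combination under this sketch's assumption")

print(comparison_case([1, 1, 1, 0, 0, 0]))
print(comparison_case([1, 0, 0, 0]))
print(comparison_case([1, 0]))
```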

Added in version 0.1.0.

Parameters:

features (array-like, shape (n_features,) or pd.DataFrame) – Ids of features ('PART-SPLIT-SCALE') for which df_feat should be created. Alternatively, a df_feat DataFrame, in which case its 'feature' column is used.
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – DataFrame with sequence parts. Must cover all parts in features.
labels (array-like, shape (n_samples,)) – Class labels for samples in df_parts. Should contain only two different integer label values, representing test and reference group (typically, 1 and 0).
label_test (int, default=1) – Class label of test group in labels.
label_ref (int, default=0) – Class label of reference group in labels.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
df_cat (pd.DataFrame, shape (n_scales, n_scales_info), optional) – DataFrame of categories for physicochemical scales. Must contain all scales from df_scales. Default from load_scales() with name='scales_cat', unless specified in options['df_cat'].
start (int, default=1) – Position label of first residue position (starting at N-terminus).
tmd_len (int, default=20) – Length of target middle domain (TMD) (>0).
jmd_n_len (int, default=10) – Length of JMD-N (>=0).
jmd_c_len (int, default=10) – Length of JMD-C (>=0).
parametric (bool, default=False) – Whether to use a parametric test (t-test) or a non-parametric test (Mann-Whitney U test) for p-value computation.
accept_gaps (bool, default=False) – Whether to accept missing values (gaps); if True, they are omitted from computations.
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

df_feat – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

Return type:

pd.DataFrame, shape (n_features, n_feature_info)

Notes

Use parallel processing only for a high number of features (more than roughly 1000 features per core).
For sample vs group or sample vs sample comparison, df_parts must comprise jmd_n, tmd, and jmd_c sequence parts as well as all parts in features.
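The rule of thumb for parallel processing can be turned into a small heuristic for choosing n_jobs. The sketch below is only one possible reading of that note: suggest_n_jobs and its min_features_per_core argument are hypothetical names, not part of aaanalysis, and the threshold of 1000 is taken from the note above.

```python
import os

def suggest_n_jobs(n_features, min_features_per_core=1000):
    """Hypothetical helper: cap workers so each handles >= min_features_per_core."""
    n_cores = os.cpu_count() or 1
    return max(1, min(n_cores, n_features // min_features_per_core))

print(suggest_n_jobs(150))     # small feature set: stays single-core
print(suggest_n_jobs(50_000))  # large feature set: up to all available cores
```

The returned value could then be passed as n_jobs to get_df_feat.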

See also

The CPP.run() method for creating and filtering Comparative Physicochemical Profiling (CPP) features for discriminating between two groups of sequences.

Examples

To demonstrate the SequenceFeature().get_df_feat() method, we load the DOM_GSEC example dataset including its respective features (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC")
features = df_feat["feature"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
aa.display_df(df_feat, n_rows=5)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_fdr_bh	positions	feat_importance	feat_importance_std
1	TMD_C_JMD_C-Seg...3,4)-KLEP840101	Energy	Charge	Charge	Net charge (Kle...n et al., 1984)	0.244000	0.103666	0.103666	0.106692	0.110506	0.000000	31,32,33,34,35	0.970400	1.438918
2	TMD_C_JMD_C-Seg...3,4)-FINA910104	Conformation	α-helix (C-cap)	α-helix termination	Helix terminati...n et al., 1991)	0.243000	0.085064	0.085064	0.098774	0.096946	0.000000	31,32,33,34,35	0.000000	0.000000
3	TMD_C_JMD_C-Seg...6,9)-LEVM760105	Shape	Side chain length	Side chain length	Radius of gyrat... (Levitt, 1976)	0.233000	0.137044	0.137044	0.161683	0.176964	0.000001	32,33	1.554800	2.109848
4	TMD_C_JMD_C-Seg...3,4)-HUTJ700102	Energy	Entropy	Entropy	Absolute entrop...Hutchens, 1970)	0.229000	0.098224	0.098224	0.106865	0.124608	0.000001	31,32,33,34,35	3.111200	3.109955
5	TMD_C_JMD_C-Seg...6,9)-RADA880106	ASA/Volume	Volume	Accessible surface area (ASA)	Accessible surf...olfenden, 1988)	0.223000	0.095071	0.095071	0.114758	0.132829	0.000002	32,33	0.000000	0.000000

features, df_parts, and the labels of the respective samples of the sequence DataFrame must be provided to retrieve the feature DataFrame:

# Mean difference values are higher than in the loaded df_feat because negative samples
# (instead of the unlabeled ones used in Breimann25) serve as the reference dataset here
df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels)
aa.display_df(df_feat, n_rows=5)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	positions
1	TMD_C_JMD_C-Seg...3,4)-KLEP840101	Energy	Charge	Charge	Net charge (Kle...n et al., 1984)	0.335000	0.168254	0.168254	0.106692	0.124924	31,32,33,34,35
2	TMD_C_JMD_C-Seg...3,4)-FINA910104	Conformation	α-helix (C-cap)	α-helix termination	Helix terminati...n et al., 1991)	0.333000	0.150698	0.150698	0.098774	0.119888	31,32,33,34,35
3	TMD_C_JMD_C-Seg...6,9)-LEVM760105	Shape	Side chain length	Side chain length	Radius of gyrat... (Levitt, 1976)	0.330000	0.246867	0.246867	0.161683	0.197489	32,33
4	TMD_C_JMD_C-Seg...3,4)-HUTJ700102	Energy	Entropy	Entropy	Absolute entrop...Hutchens, 1970)	0.327000	0.162229	0.162229	0.106865	0.135247	31,32,33,34,35
5	TMD_C_JMD_C-Seg...6,9)-RADA880106	ASA/Volume	Volume	Accessible surface area (ASA)	Accessible surf...olfenden, 1988)	0.322000	0.184252	0.184252	0.114758	0.164757	32,33

You can adjust the provided labels of the test and reference group using label_test and label_ref, which will alter the sign in mean_dif:

df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, label_test=0, label_ref=1)
# mean_dif flips its sign because it is the mean of the test group minus the mean of the reference group
aa.display_df(df_feat, n_rows=5, col_to_show="mean_dif")

	mean_dif
1	-0.168254
2	-0.150698
3	-0.246867
4	-0.162229
5	-0.184252
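The sign flip follows directly from the definition given in the comment above (mean of the test group minus mean of the reference group). A minimal sketch with toy values, where mean_dif is an illustrative stand-in and not the library's implementation:

```python
from statistics import mean

def mean_dif(values, labels, label_test=1, label_ref=0):
    """Mean of the test group minus mean of the reference group."""
    test = [v for v, l in zip(values, labels) if l == label_test]
    ref = [v for v, l in zip(values, labels) if l == label_ref]
    return mean(test) - mean(ref)

values = [0.6, 0.8, 0.7, 0.4, 0.5, 0.3]  # toy feature values, not from DOM_GSEC
labels = [1, 1, 1, 0, 0, 0]

d_default = mean_dif(values, labels)
d_swapped = mean_dif(values, labels, label_test=0, label_ref=1)
print(round(d_default, 6), round(d_swapped, 6))
```

Swapping label_test and label_ref exchanges the two group means, so the magnitude stays the same and only the sign changes.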

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

# Shift positions by 10 residues
df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels,
                         start=11)
aa.display_df(df_feat, n_rows=5, col_to_show="positions")

	positions
1	41,42,43,44,45
2	41,42,43,44,45
3	42,43
4	41,42,43,44,45
5	42,43

# Increase TMD length from 20 to 50
df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels,
                         tmd_len=50)
aa.display_df(df_feat, n_rows=5, col_to_show="positions")

	positions
1	53,54,55,56,57,58,59,60,61
2	53,54,55,56,57,58,59,60,61
3	55,56,57,58
4	53,54,55,56,57,58,59,60,61
5	55,56,57,58
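The position outputs above can be reproduced with simple integer arithmetic. The sketch below is an illustration under several assumptions that are not confirmed by this page: the truncated split labels 'Seg...3,4)' and 'Seg...6,9)' are read as the 3rd of 4 and the 6th of 9 segments, TMD_C is taken to be the C-terminal half of the TMD, and segment borders are obtained by flooring. Both function names are hypothetical.

```python
def segment_positions(i_th, n_split, part_start, part_len):
    """Positions of the i_th of n_split segments within a part (1-based i_th).

    Assumption: segment borders are floor((i - 1) * L / n) and floor(i * L / n),
    which reproduces the outputs shown above but is not taken from the library.
    """
    lo = ((i_th - 1) * part_len) // n_split
    hi = (i_th * part_len) // n_split
    return list(range(part_start + lo, part_start + hi))

def tmd_c_jmd_c(start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """First position and length of the C-terminal TMD half plus JMD-C.

    Assumption: TMD_C is the second half of the TMD (only even tmd_len is covered).
    """
    tmd_c_len = tmd_len // 2
    part_start = start + jmd_n_len + (tmd_len - tmd_c_len)
    return part_start, tmd_c_len + jmd_c_len

# Default layout, shifted start, and longer TMD, as in the three calls above
for kwargs in ({}, {"start": 11}, {"tmd_len": 50}):
    part_start, part_len = tmd_c_jmd_c(**kwargs)
    print(kwargs, segment_positions(3, 4, part_start, part_len),
          segment_positions(6, 9, part_start, part_len))
```

Under these assumptions, changing start shifts every position by a constant, while changing tmd_len alters both where the part begins and how many residues each segment covers.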

A t-test can be used instead of the Mann-Whitney U test by setting parametric=True:

df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels, parametric=True)
aa.display_df(df_feat, n_rows=5, col_to_show="p_val_ttest_indep")

	p_val_ttest_indep
1	0.000000
2	0.000000
3	0.000000
4	0.000000
5	0.000000
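To make the difference between the two tests concrete, the following sketch applies both to toy data using SciPy. Whether get_df_feat calls these exact SciPy functions internally is an assumption; the data are randomly generated and unrelated to DOM_GSEC.

```python
import random
from scipy import stats

random.seed(0)
# Toy feature values for a test and a reference group (not from DOM_GSEC)
test = [random.gauss(0.6, 0.1) for _ in range(50)]
ref = [random.gauss(0.4, 0.1) for _ in range(50)]

# Parametric: independent two-sample t-test
p_ttest = stats.ttest_ind(test, ref).pvalue
# Non-parametric: two-sided Mann-Whitney U test
p_mwu = stats.mannwhitneyu(test, ref, alternative="two-sided").pvalue

print(f"t-test p-value: {p_ttest:.2e}")
print(f"Mann-Whitney U p-value: {p_mwu:.2e}")
```

The t-test compares group means and assumes approximately normal values, whereas the Mann-Whitney U test is rank-based and makes no such assumption.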

Further parameters can be set explicitly. SequenceFeature.get_df_feat also accepts:

df_scales – DataFrame of scales with letters typically representing amino acids.
df_cat – DataFrame of categories for physicochemical scales.
accept_gaps – Whether to accept missing values (gaps); if True, they are omitted from computations.
n_jobs – Number of CPU cores (>=1) used for multiprocessing.

# Further parameters demonstrated explicitly: the scale sources (df_scales, df_cat),
# gap handling (accept_gaps), the JMD flank lengths (jmd_n_len, jmd_c_len), and CPU cores (n_jobs)
df_scales = aa.load_scales()
df_cat = aa.load_scales(name="scales_cat")
df_feat = sf.get_df_feat(features=features, df_parts=df_parts, labels=labels,
                         df_scales=df_scales, df_cat=df_cat,
                         accept_gaps=False, jmd_n_len=10, jmd_c_len=10, n_jobs=1)
aa.display_df(df_feat, n_rows=10, show_shape=True)

DataFrame shape: (150, 13)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	positions
1	TMD_C_JMD_C-Seg...3,4)-KLEP840101	Energy	Charge	Charge	Net charge (Kle...n et al., 1984)	0.335000	0.168254	0.168254	0.106692	0.124924	31,32,33,34,35
2	TMD_C_JMD_C-Seg...3,4)-FINA910104	Conformation	α-helix (C-cap)	α-helix termination	Helix terminati...n et al., 1991)	0.333000	0.150698	0.150698	0.098774	0.119888	31,32,33,34,35
3	TMD_C_JMD_C-Seg...6,9)-LEVM760105	Shape	Side chain length	Side chain length	Radius of gyrat... (Levitt, 1976)	0.330000	0.246867	0.246867	0.161683	0.197489	32,33
4	TMD_C_JMD_C-Seg...3,4)-HUTJ700102	Energy	Entropy	Entropy	Absolute entrop...Hutchens, 1970)	0.327000	0.162229	0.162229	0.106865	0.135247	31,32,33,34,35
5	TMD_C_JMD_C-Seg...6,9)-RADA880106	ASA/Volume	Volume	Accessible surface area (ASA)	Accessible surf...olfenden, 1988)	0.322000	0.184252	0.184252	0.114758	0.164757	32,33
6	TMD_C_JMD_C-Seg...2,3)-KLEP840101	Energy	Charge	Charge	Net charge (Kle...n et al., 1984)	0.308000	0.092405	0.092405	0.064895	0.077319	27,28,29,30,31,32,33
7	TMD_C_JMD_C-Seg...4,5)-FAUJ880109	Energy	Isoelectric point	Number hydrogen bond donors	Number of hydro...e et al., 1988)	0.306000	0.230159	0.230159	0.174609	0.196434	33,34,35,36
8	TMD_C_JMD_C-Seg...3,4)-JANJ780101	ASA/Volume	Accessible surface area (ASA)	ASA (folded protein)	Average accessi...n et al., 1978)	0.304000	0.214790	0.214790	0.166309	0.191301	31,32,33,34,35
9	TMD_C_JMD_C-Seg...,10)-WILM950103	Polarity	Hydrophobicity (interface)	Hydrophobicity (interface)	Hydrophobicity ...e et al., 1995)	0.315000	0.246952	-0.246952	0.168603	0.241970	33,34
10	TMD_C_JMD_C-Seg...6,9)-AURR980110	Conformation	α-helix	α-helix (middle)	Normalized posi...ora-Rose, 1998)	0.343000	0.234114	0.234114	0.160819	0.172073	32,33