aaanalysis.CPP.run
- CPP.run(labels=None, label_test=1, label_ref=0, n_filter=100, n_pre_filter=None, pct_pre_filter=5, max_std_test=0.2, max_overlap=0.5, max_cor=0.5, check_cat=True, parametric=False, start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10, n_jobs=None, vectorized=True, n_batches=None, n_sample_batches=None, return_stats=False)
Perform Comparative Physicochemical Profiling (CPP) algorithm: creation and two-step filtering of interpretable sequence-based features.
The aim of the CPP algorithm is to identify a set of unique, non-redundant features that are most discriminant between the test and reference group of sequences. See [Breimann25a] for details on the algorithm.
Changed in version 1.1.0: Added the return_stats parameter, returning the filter-funnel statistics alongside df_feat.
Changed in version 1.1.0: Added the n_sample_batches parameter for sample-axis batching (memory bounded by batch size, not n).
- Parameters:
labels (array-like, shape (n_samples,)) – Class labels for samples in sequence DataFrame (typically, test=1, reference=0).
label_test (int, default=1) – Class label of test group in labels.
label_ref (int, default=0) – Class label of reference group in labels.
n_filter (int, default=100) – Number of features to be filtered/selected by CPP algorithm.
n_pre_filter (int, optional) – Number of features to be pre-filtered by CPP algorithm. If None, a percentage of all features is used.
pct_pre_filter (int, default=5) – Percentage of all features that should remain after the pre-filtering step.
max_std_test (float, default=0.2) – Maximum standard deviation [>0-<1] within the test group used as threshold for pre-filtering.
max_overlap (float, default=0.5) – Maximum positional overlap [0-1] of features used as threshold for filtering.
max_cor (float, default=0.5) – Maximum Pearson correlation [0-1] of feature scales used as threshold for filtering.
check_cat (bool, default=True) – Whether to check for redundancy within scale categories during filtering.
parametric (bool, default=False) – Whether to use parametric (T-test) or non-parametric (Mann-Whitney U test) test for p-value computation.
start (int, default=1) – Position label of first residue position (starting at N-terminus).
tmd_len (int, default=20) – Length of TMD (>0).
jmd_n_len (int, default=10) – Length of JMD-N (>=0).
jmd_c_len (int, default=10) – Length of JMD-C (>=0).
n_jobs (int, None, or -1, default=None) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
Warning
On Python 3.14 + macOS, calling this with n_jobs > 1 (or -1 / None) from a script that lacks an if __name__ == "__main__": guard (or from a bare REPL / heredoc) can trigger a recursive process spawn (FileNotFoundError / EOFError / cannot pickle '_thread.RLock'). Guard your entry point, or run serially with n_jobs=1. See also CPPGrid (default backend="threads"), which sidesteps this.
vectorized (bool, default=True) – Whether to apply sequence splitting and the Mann-Whitney U test in ‘vectorized’ mode (True), improving speed but increasing memory consumption.
n_batches (int, None, default=None) – Number of batches (>=2) used for batch processing. If None, single-processing is used, which is faster but more memory-intensive. Increasing n_batches (up to the maximum number of scales in df_scales) reduces memory consumption but slows down processing.
n_sample_batches (int, None, default=None) – Number of sample-axis batches (>=2, up to the number of samples) for sample-batched processing. If None, sample-batching is disabled. Bounds peak memory by the batch size rather than the full sample count n, so it is the option for very large n. Mutually exclusive with n_batches (which batches over scales).
- Returns:
df_feat – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
- Return type:
pd.DataFrame, shape (n_features, n_feature_info)
Notes
Pre-filtering can be adjusted by the following parameters: {‘n_pre_filter’, ‘pct_pre_filter’, ‘max_std_test’}.
Filtering can be adjusted by the following parameters: {‘n_filter’, ‘max_overlap’, ‘max_cor’, ‘check_cat’}.
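The pre-filtering parameters above can be illustrated with a toy sketch. It is not the library's implementation: the helper names are hypothetical, rounding the percentage down is an assumption, and ranking the surviving features by abs_mean_dif is likewise an assumption made only for illustration.

```python
import math


def resolve_n_pre_filter(n_features, n_pre_filter=None, pct_pre_filter=5):
    """Return how many features remain after pre-filtering (illustrative only)."""
    if n_pre_filter is not None:
        # An explicit count takes precedence over the percentage
        return min(n_pre_filter, n_features)
    # Assumption: the percentage is rounded down, keeping at least one feature
    return max(1, math.floor(n_features * pct_pre_filter / 100))


def pre_filter(features, max_std_test=0.2, n_keep=2):
    """Drop features above the std threshold, then keep the top n_keep.

    Assumption: features are ranked by 'abs_mean_dif' (descending).
    """
    kept = [f for f in features if f["std_test"] <= max_std_test]
    kept.sort(key=lambda f: f["abs_mean_dif"], reverse=True)
    return kept[:n_keep]


# Toy feature records (made-up values, not CPP output)
toy = [
    {"feature": "A", "abs_mean_dif": 0.30, "std_test": 0.10},
    {"feature": "B", "abs_mean_dif": 0.25, "std_test": 0.35},
    {"feature": "C", "abs_mean_dif": 0.10, "std_test": 0.05},
]
kept = pre_filter(toy, max_std_test=0.2, n_keep=2)
print(resolve_n_pre_filter(700))        # 5% of 700 candidate features
print([f["feature"] for f in kept])     # "B" is dropped by the std threshold
```

Under these assumptions, raising pct_pre_filter or max_std_test lets more candidates reach the final filtering step, which is why the number of returned features can grow when both are relaxed.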
Cost scales as O(n_scales x n_parts x n_splits) (the candidate feature count), so larger scale sets or wider split_kws are proportionally slower. Budget a sweep accordingly, or use CPPGrid (which runs CPP once per n_filter group and slices the rest).
Classifier head tracks the metric when training a downstream model on df_feat: in practice, SVM tends to be best for AP (ranking), logistic regression for balanced accuracy, and random forest for MCC at a fixed threshold (detection). Pick the head to match the objective you report.
For large datasets (due to long sequences or a high number of samples) or memory-limited systems, memory consumption can be reduced by:
Disabling vectorized mode (vectorized=False)
Reducing n_jobs (down to n_jobs=1)
Using batch processing (n_batches>=2, with higher values reducing memory usage)
While this helps to prevent crashes, it may slow down processing.
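The memory-related options can be collected in a small helper before calling run(). The sketch below is a hypothetical convenience function, not part of aaanalysis. It only encodes the constraints stated in the parameter descriptions (batch counts >=2, n_batches and n_sample_batches mutually exclusive); whether the remaining options can be freely combined is not stated here and is left to the caller.

```python
def low_memory_run_kwargs(vectorized=False, n_jobs=1,
                          n_batches=None, n_sample_batches=None):
    """Build keyword arguments for a memory-conscious CPP.run() call.

    Hypothetical helper: it validates only what the parameter
    documentation states and returns a plain dict.
    """
    if n_batches is not None and n_sample_batches is not None:
        raise ValueError("'n_batches' and 'n_sample_batches' are mutually exclusive")
    for name, value in (("n_batches", n_batches),
                        ("n_sample_batches", n_sample_batches)):
        if value is not None and value < 2:
            raise ValueError(f"'{name}' should be >= 2 or None")
    kwargs = {"vectorized": vectorized, "n_jobs": n_jobs}
    # Only pass a batching option if it was requested
    if n_batches is not None:
        kwargs["n_batches"] = n_batches
    if n_sample_batches is not None:
        kwargs["n_sample_batches"] = n_sample_batches
    return kwargs


run_kwargs = low_memory_run_kwargs(n_batches=4)
print(run_kwargs)
# Intended use (requires aaanalysis and a CPP object named cpp):
# df_feat = cpp.run(labels=labels, **run_kwargs)
```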
df_feat contains the following 13 columns, including the unique feature id (1), scale information (2-5), statistical results for filtering and ranking (6-12), and feature positions (13):
‘feature’: Feature ID (PART-SPLIT-SCALE)
‘category’: Scale category
‘subcategory’: Subcategory of the scale
‘scale_name’: Name of the scale
‘scale_description’: Description of the scale
‘abs_auc’: Absolute adjusted AUC (area under the curve) [0 to 0.5]
‘abs_mean_dif’: Absolute mean difference between test and reference group [0 to 1]
‘mean_dif’: Mean difference between test and reference group [-1 to 1]
‘std_test’: Standard deviation in test group
‘std_ref’: Standard deviation in reference group
‘p_val’: p-value of the non-parametric (mann_whitney) or parametric (ttest_indep) test; the column name carries the test as a suffix (‘p_val_mann_whitney’ in the example outputs below)
‘p_val_fdr_bh’: Benjamini-Hochberg FDR corrected p-values
‘positions’: Feature positions for default settings
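How the values in ‘positions’ follow from start, tmd_len, jmd_n_len, and jmd_c_len can be sketched in plain Python. The function names are hypothetical and this is not the library's code. The segment-boundary rule (integer floor division) is an assumption, chosen because it reproduces the positions shown in the example outputs on this page.

```python
def part_positions(start=1, tmd_len=20, jmd_n_len=10, jmd_c_len=10):
    """Position labels of JMD-N, TMD, and JMD-C, counted from the N-terminus."""
    tmd_start = start + jmd_n_len
    jmd_c_start = tmd_start + tmd_len
    jmd_n = list(range(start, tmd_start))
    tmd = list(range(tmd_start, jmd_c_start))
    jmd_c = list(range(jmd_c_start, jmd_c_start + jmd_c_len))
    return {"jmd_n": jmd_n, "tmd": tmd, "jmd_c": jmd_c,
            "tmd_jmd": jmd_n + tmd + jmd_c}


def segment_positions(positions, i_th, n_split):
    """Positions of the i-th of n_split segments (1-based).

    Assumption: boundaries are floor(k * len / n_split).
    """
    n = len(positions)
    lo = (i_th - 1) * n // n_split
    hi = i_th * n // n_split
    return positions[lo:hi]


default = part_positions()["tmd_jmd"]
shifted = part_positions(start=11)["tmd_jmd"]
long_tmd = part_positions(tmd_len=50)["tmd_jmd"]
print(segment_positions(default, 5, 5))   # last fifth with default settings
print(segment_positions(shifted, 5, 5))   # same segment with start=11
print(segment_positions(long_tmd, 4, 5))  # fourth fifth with tmd_len=50
```

Changing start only relabels positions, while changing the length parameters also changes how many residues fall into each segment.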
See also
comp_auc_adjusted() for details on abs_auc.
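As a rough intuition for abs_auc, the sketch below computes an AUC by pairwise comparison and shifts it by 0.5, so that 0 means no separation between the groups. It assumes that "adjusted" means "AUC minus 0.5"; the function name is hypothetical, and comp_auc_adjusted() remains the reference for the library's definition.

```python
def auc_adjusted(test_values, ref_values):
    """AUC of test vs. reference values, shifted to the range [-0.5, 0.5].

    Assumption: adjusted AUC = AUC - 0.5 (ties count as half a win).
    """
    wins = 0.0
    for t in test_values:
        for r in ref_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    auc = wins / (len(test_values) * len(ref_values))
    return auc - 0.5


# Toy feature values (made up): the test group is consistently higher
test_group = [0.8, 0.9, 0.7]
ref_group = [0.1, 0.2, 0.3]
adj = auc_adjusted(test_group, ref_group)
print(adj, abs(adj))
```

The sign indicates which group has the higher values, and the absolute value is what a ranking by discriminative strength would use.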
Examples
To demonstrate the CPP().run() method, we load the DOM_GSEC example dataset (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False
    df_seq = aa.load_dataset(name="DOM_GSEC")
    labels = df_seq["label"].to_list()
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)

You just need to provide df_parts to the CPP object and run the algorithm with its respective labels using the CPP().run() method:

    cpp = aa.CPP(df_parts=df_parts)
    # Create >500,000 features and filter them down to 100 features
    df_feat = cpp.run(labels=labels)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (100, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_C_JMD_C-Seg...2,3)-QIAN880106 | Conformation | α-helix | α-helix (middle) | Weights for alp...ejnowski, 1988) | 0.387000 | 0.118000 | 0.118000 | 0.068000 | 0.080000 | 0.000000 | 0.000000 | 27,28,29,30,31,32,33
    2  | TMD_C_JMD_C-Pat...,14)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.377000 | 0.285000 | -0.285000 | 0.164000 | 0.177000 | 0.000000 | 0.000000 | 27,31
    3  | TMD_C_JMD_C-Seg...6,9)-FAUJ880104 | Shape | Side chain length | Steric parameter | STERIMOL length...e et al., 1988) | 0.367000 | 0.263000 | 0.263000 | 0.161000 | 0.168000 | 0.000000 | 0.000000 | 32,33
    4  | TMD_C_JMD_C-Seg...6,9)-ONEK900101 | Others | Unclassified (Others) | ΔG values in peptides | Delta G values ...-DeGrado, 1990) | 0.366000 | 0.111000 | 0.111000 | 0.070000 | 0.114000 | 0.000000 | 0.000000 | 32,33
    5  | TMD_C_JMD_C-Pat...,15)-QIAN880107 | Conformation | α-helix | α-helix (middle) | Weights for alp...ejnowski, 1988) | 0.363000 | 0.162000 | 0.162000 | 0.091000 | 0.118000 | 0.000000 | 0.000000 | 24,28,32,35
    6  | TMD_C_JMD_C-Seg...3,4)-HUTJ700103 | Energy | Entropy | Entropy | Entropy of form...Hutchens, 1970) | 0.360000 | 0.187000 | 0.187000 | 0.115000 | 0.128000 | 0.000000 | 0.000000 | 31,32,33,34,35
    7  | TMD_C_JMD_C-Seg...2,3)-WOLS870103 | Others | PC 4 | Principal Component 3 (Wold) | Principal prope...d et al., 1987) | 0.359000 | 0.159000 | -0.159000 | 0.090000 | 0.130000 | 0.000000 | 0.000000 | 27,28,29,30,31,32,33
    8  | TMD_C_JMD_C-Pat...,12)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.352000 | 0.227000 | -0.227000 | 0.150000 | 0.170000 | 0.000000 | 0.000000 | 24,28,32
    9  | TMD_C_JMD_C-Seg...6,9)-MUNV940102 | Energy | Free energy (folding) | Free energy (α-helix) | Free energy in ...-Serrano, 1994) | 0.350000 | 0.129000 | -0.129000 | 0.079000 | 0.124000 | 0.000000 | 0.000000 | 32,33
    10 | TMD_C_JMD_C-Seg...3,4)-WOLS870103 | Others | PC 4 | Principal Component 3 (Wold) | Principal prope...d et al., 1987) | 0.341000 | 0.214000 | -0.214000 | 0.128000 | 0.177000 | 0.000000 | 0.000000 | 31,32,33,34,35

Adjust Parts, Splits, and Scales as follows:

    df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd_jmd"])
    split_kws = sf.get_split_kws(split_types=["Segment"], n_split_min=1, n_split_max=5)
    # Load one of the provided top scale datasets
    df_scales = aa.load_scales(top60_n=38)
    # Create ~700 features and filter them down to 19 features
    cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales)
    df_feat = cpp.run(labels=labels)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (19, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_JMD-Segment...4,5)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.316000 | 0.137000 | -0.137000 | 0.102000 | 0.108000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    2  | TMD_JMD-Segment...4,4)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.312000 | 0.099000 | 0.099000 | 0.069000 | 0.095000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40
    3  | TMD_JMD-Segment...4,5)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.297000 | 0.086000 | 0.086000 | 0.077000 | 0.068000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    4  | TMD_JMD-Segment...5,5)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.295000 | 0.141000 | 0.141000 | 0.115000 | 0.130000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    5  | TMD_JMD-Segment...5,5)-JANJ780102 | ASA/Volume | Buried | Buried | Percentage of b...n et al., 1978) | 0.291000 | 0.130000 | -0.130000 | 0.099000 | 0.124000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    6  | TMD_JMD-Segment...5,5)-ZIMJ680103 | Polarity | Hydrophilicity | Polarity (hydrophilicity) | Polarity (Zimme...n et al., 1968) | 0.289000 | 0.178000 | 0.178000 | 0.159000 | 0.163000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    7  | TMD_JMD-Segment...4,5)-FUKS010106 | Composition | Membrane proteins (MPs) | Proteins of mesophiles (INT) | Interior compos...ishikawa, 2001) | 0.277000 | 0.123000 | 0.123000 | 0.104000 | 0.127000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    8  | TMD_JMD-Segment...4,4)-WOLR790101 | Polarity | Hydrophobicity (surrounding) | Hydration potential | Hydrophobicity ...n et al., 1979) | 0.267000 | 0.105000 | -0.105000 | 0.100000 | 0.113000 | 0.000000 | 0.000001 | 31,32,33,34,35,36,37,38,39,40
    9  | TMD_JMD-Segment...2,2)-CEDJ970105 | Composition | AA composition | Nuclear proteins | Composition of ...o et al., 1997) | 0.263000 | 0.062000 | 0.062000 | 0.062000 | 0.069000 | 0.000000 | 0.000001 | 21,22,23,24,25,...,36,37,38,39,40
    10 | TMD_JMD-Segment...5,5)-MITS020101 | Polarity | Amphiphilicity | Amphiphilicity | Amphiphilicity ...u et al., 2002) | 0.262000 | 0.073000 | 0.073000 | 0.071000 | 0.086000 | 0.000000 | 0.000001 | 33,34,35,36,37,38,39,40

The maximum number of final features can be adjusted using the n_filter (default=100) parameter. The actual number of features may be less, depending on: (a) the initial number of features generated (defined by the part-split-scale combinations), and (b) the strictness of both pre-filtering and filtering criteria.

    # Create ~700 features and filter them down to 10 features
    df_feat = cpp.run(labels=labels, n_filter=10)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (10, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_JMD-Segment...4,5)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.316000 | 0.137000 | -0.137000 | 0.102000 | 0.108000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    2  | TMD_JMD-Segment...4,4)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.312000 | 0.099000 | 0.099000 | 0.069000 | 0.095000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40
    3  | TMD_JMD-Segment...4,5)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.297000 | 0.086000 | 0.086000 | 0.077000 | 0.068000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    4  | TMD_JMD-Segment...5,5)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.295000 | 0.141000 | 0.141000 | 0.115000 | 0.130000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    5  | TMD_JMD-Segment...5,5)-JANJ780102 | ASA/Volume | Buried | Buried | Percentage of b...n et al., 1978) | 0.291000 | 0.130000 | -0.130000 | 0.099000 | 0.124000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    6  | TMD_JMD-Segment...5,5)-ZIMJ680103 | Polarity | Hydrophilicity | Polarity (hydrophilicity) | Polarity (Zimme...n et al., 1968) | 0.289000 | 0.178000 | 0.178000 | 0.159000 | 0.163000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    7  | TMD_JMD-Segment...4,5)-FUKS010106 | Composition | Membrane proteins (MPs) | Proteins of mesophiles (INT) | Interior compos...ishikawa, 2001) | 0.277000 | 0.123000 | 0.123000 | 0.104000 | 0.127000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    8  | TMD_JMD-Segment...4,4)-WOLR790101 | Polarity | Hydrophobicity (surrounding) | Hydration potential | Hydrophobicity ...n et al., 1979) | 0.267000 | 0.105000 | -0.105000 | 0.100000 | 0.113000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40
    9  | TMD_JMD-Segment...5,5)-MIYS990104 | Composition | MPs (anchor) | Partition energy | Optimized relat...Jernigan, 1999) | 0.243000 | 0.103000 | 0.103000 | 0.095000 | 0.126000 | 0.000002 | 0.000004 | 33,34,35,36,37,38,39,40
    10 | TMD_JMD-Segment...4,5)-ANDN920101 | Structure-Activity | Backbone-dynamics (-CH) | α-CH chemical s...kbone-dynamics) | alpha-CH chemic...n et al., 1992) | 0.229000 | 0.102000 | -0.102000 | 0.097000 | 0.125000 | 0.000009 | 0.000012 | 25,26,27,28,29,30,31,32

In the initial CPP pre-filtering step, you can either set the number of retained features using n_pre_filter or define a percentage of initial features with pct_pre_filter (default: 5%). Additionally, adjust the maximum standard deviation allowed in the test dataset for each feature via max_std_test:

    # Pre-filtering by allowing 50% with 0.5 maximum std in the test set
    # Create ~700 features and filter them down to 26 features
    df_feat = cpp.run(labels=labels, pct_pre_filter=50, max_std_test=0.5)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (26, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_JMD-Segment...4,5)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.316000 | 0.137000 | -0.137000 | 0.102000 | 0.108000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    2  | TMD_JMD-Segment...4,4)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.312000 | 0.099000 | 0.099000 | 0.069000 | 0.095000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40
    3  | TMD_JMD-Segment...2,2)-ONEK900101 | Others | Unclassified (Others) | ΔG values in peptides | Delta G values ...-DeGrado, 1990) | 0.310000 | 0.041000 | 0.041000 | 0.028000 | 0.044000 | 0.000000 | 0.000000 | 21,22,23,24,25,...,36,37,38,39,40
    4  | TMD_JMD-Segment...4,5)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.297000 | 0.086000 | 0.086000 | 0.077000 | 0.068000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    5  | TMD_JMD-Segment...5,5)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.295000 | 0.141000 | 0.141000 | 0.115000 | 0.130000 | 0.000000 | 0.000001 | 33,34,35,36,37,38,39,40
    6  | TMD_JMD-Segment...5,5)-JANJ780102 | ASA/Volume | Buried | Buried | Percentage of b...n et al., 1978) | 0.291000 | 0.130000 | -0.130000 | 0.099000 | 0.124000 | 0.000000 | 0.000001 | 33,34,35,36,37,38,39,40
    7  | TMD_JMD-Segment...5,5)-ZIMJ680103 | Polarity | Hydrophilicity | Polarity (hydrophilicity) | Polarity (Zimme...n et al., 1968) | 0.289000 | 0.178000 | 0.178000 | 0.159000 | 0.163000 | 0.000000 | 0.000001 | 33,34,35,36,37,38,39,40
    8  | TMD_JMD-Segment...4,5)-FUKS010106 | Composition | Membrane proteins (MPs) | Proteins of mesophiles (INT) | Interior compos...ishikawa, 2001) | 0.277000 | 0.123000 | 0.123000 | 0.104000 | 0.127000 | 0.000000 | 0.000001 | 25,26,27,28,29,30,31,32
    9  | TMD_JMD-Segment...3,4)-WOLR790101 | Polarity | Hydrophobicity (surrounding) | Hydration potential | Hydrophobicity ...n et al., 1979) | 0.274000 | 0.052000 | 0.052000 | 0.034000 | 0.060000 | 0.000000 | 0.000001 | 21,22,23,24,25,26,27,28,29,30
    10 | TMD_JMD-Segment...1,2)-WEBA780101 | Others | Mutability | RF value | RF value in hig...er-Lacey, 1978) | 0.268000 | 0.042000 | 0.042000 | 0.039000 | 0.046000 | 0.000000 | 0.000002 | 1,2,3,4,5,6,7,8...,16,17,18,19,20

For the final CPP filtering step, you can use the following three parameters: max_overlap setting the allowed maximum positional overlap of similar features (the higher, the less strict), max_cor defining the allowed maximum Pearson correlation for scales of similar features (the higher, the less strict), and check_cat setting whether redundancy of scale categories should be considered or not (setting it to False will result in stricter filtering since features across all categories are compared):

    # Disable filtering by setting max_overlap and max_cor to 1
    # Create ~700 features and filter them down to 100 features
    df_feat = cpp.run(labels=labels, max_overlap=1, max_cor=1)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (100, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_JMD-Segment...4,5)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.316000 | 0.137000 | -0.137000 | 0.102000 | 0.108000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    2  | TMD_JMD-Segment...4,4)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.312000 | 0.099000 | 0.099000 | 0.069000 | 0.095000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40
    3  | TMD_JMD-Segment...3,3)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.304000 | 0.069000 | 0.069000 | 0.051000 | 0.073000 | 0.000000 | 0.000000 | 27,28,29,30,31,...,36,37,38,39,40
    4  | TMD_JMD-Segment...4,5)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.297000 | 0.086000 | 0.086000 | 0.077000 | 0.068000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    5  | TMD_JMD-Segment...5,5)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.295000 | 0.141000 | 0.141000 | 0.115000 | 0.130000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    6  | TMD_JMD-Segment...2,2)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.292000 | 0.058000 | 0.058000 | 0.045000 | 0.054000 | 0.000000 | 0.000000 | 21,22,23,24,25,...,36,37,38,39,40
    7  | TMD_JMD-Segment...5,5)-JANJ780102 | ASA/Volume | Buried | Buried | Percentage of b...n et al., 1978) | 0.291000 | 0.130000 | -0.130000 | 0.099000 | 0.124000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    8  | TMD_JMD-Segment...4,4)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.291000 | 0.127000 | 0.127000 | 0.097000 | 0.121000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40
    9  | TMD_JMD-Segment...5,5)-ZIMJ680103 | Polarity | Hydrophilicity | Polarity (hydrophilicity) | Polarity (Zimme...n et al., 1968) | 0.289000 | 0.178000 | 0.178000 | 0.159000 | 0.163000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    10 | TMD_JMD-Segment...4,4)-ZIMJ680103 | Polarity | Hydrophilicity | Polarity (hydrophilicity) | Polarity (Zimme...n et al., 1968) | 0.288000 | 0.164000 | 0.164000 | 0.135000 | 0.145000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40

    # Perform stricter filtering by setting check_cat=False
    # Create ~700 features and filter them down to 11 features
    df_feat = cpp.run(labels=labels, check_cat=False)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (11, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_JMD-Segment...4,5)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.316000 | 0.137000 | -0.137000 | 0.102000 | 0.108000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    2  | TMD_JMD-Segment...4,4)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.312000 | 0.099000 | 0.099000 | 0.069000 | 0.095000 | 0.000000 | 0.000000 | 31,32,33,34,35,36,37,38,39,40
    3  | TMD_JMD-Segment...4,5)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.297000 | 0.086000 | 0.086000 | 0.077000 | 0.068000 | 0.000000 | 0.000000 | 25,26,27,28,29,30,31,32
    4  | TMD_JMD-Segment...5,5)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.295000 | 0.141000 | 0.141000 | 0.115000 | 0.130000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    5  | TMD_JMD-Segment...5,5)-JANJ780102 | ASA/Volume | Buried | Buried | Percentage of b...n et al., 1978) | 0.291000 | 0.130000 | -0.130000 | 0.099000 | 0.124000 | 0.000000 | 0.000000 | 33,34,35,36,37,38,39,40
    6  | TMD_JMD-Segment...2,2)-CEDJ970105 | Composition | AA composition | Nuclear proteins | Composition of ...o et al., 1997) | 0.263000 | 0.062000 | 0.062000 | 0.062000 | 0.069000 | 0.000000 | 0.000001 | 21,22,23,24,25,...,36,37,38,39,40
    7  | TMD_JMD-Segment...5,5)-MITS020101 | Polarity | Amphiphilicity | Amphiphilicity | Amphiphilicity ...u et al., 2002) | 0.262000 | 0.073000 | 0.073000 | 0.071000 | 0.086000 | 0.000000 | 0.000001 | 33,34,35,36,37,38,39,40
    8  | TMD_JMD-Segment...1,2)-SIMZ760101 | Polarity | Hydrophobicity | Transfer free e...TFE) to outside | Transfer free e...-Charton (1982) | 0.259000 | 0.064000 | -0.064000 | 0.069000 | 0.072000 | 0.000001 | 0.000002 | 1,2,3,4,5,6,7,8...,16,17,18,19,20
    9  | TMD_JMD-Segment...4,5)-ANDN920101 | Structure-Activity | Backbone-dynamics (-CH) | α-CH chemical s...kbone-dynamics) | alpha-CH chemic...n et al., 1992) | 0.229000 | 0.102000 | -0.102000 | 0.097000 | 0.125000 | 0.000009 | 0.000017 | 25,26,27,28,29,30,31,32
    10 | TMD_JMD-Segment...4,4)-YUTK870103 | Energy | Free energy (unfolding) | Free energy (unfolding) | Activation Gibb...i et al., 1987) | 0.201000 | 0.084000 | -0.084000 | 0.115000 | 0.118000 | 0.000103 | 0.000143 | 31,32,33,34,35,36,37,38,39,40

The residue positions can be adjusted using the start, tmd_len, jmd_n_len, and jmd_c_len parameters:

    # Shift positions by 10 residues
    df_feat = cpp.run(labels=labels, start=11)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (19, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_JMD-Segment...4,5)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.316000 | 0.137000 | -0.137000 | 0.102000 | 0.108000 | 0.000000 | 0.000000 | 35,36,37,38,39,40,41,42
    2  | TMD_JMD-Segment...4,4)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.312000 | 0.099000 | 0.099000 | 0.069000 | 0.095000 | 0.000000 | 0.000000 | 41,42,43,44,45,46,47,48,49,50
    3  | TMD_JMD-Segment...4,5)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.297000 | 0.086000 | 0.086000 | 0.077000 | 0.068000 | 0.000000 | 0.000000 | 35,36,37,38,39,40,41,42
    4  | TMD_JMD-Segment...5,5)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.295000 | 0.141000 | 0.141000 | 0.115000 | 0.130000 | 0.000000 | 0.000000 | 43,44,45,46,47,48,49,50
    5  | TMD_JMD-Segment...5,5)-JANJ780102 | ASA/Volume | Buried | Buried | Percentage of b...n et al., 1978) | 0.291000 | 0.130000 | -0.130000 | 0.099000 | 0.124000 | 0.000000 | 0.000000 | 43,44,45,46,47,48,49,50
    6  | TMD_JMD-Segment...5,5)-ZIMJ680103 | Polarity | Hydrophilicity | Polarity (hydrophilicity) | Polarity (Zimme...n et al., 1968) | 0.289000 | 0.178000 | 0.178000 | 0.159000 | 0.163000 | 0.000000 | 0.000000 | 43,44,45,46,47,48,49,50
    7  | TMD_JMD-Segment...4,5)-FUKS010106 | Composition | Membrane proteins (MPs) | Proteins of mesophiles (INT) | Interior compos...ishikawa, 2001) | 0.277000 | 0.123000 | 0.123000 | 0.104000 | 0.127000 | 0.000000 | 0.000000 | 35,36,37,38,39,40,41,42
    8  | TMD_JMD-Segment...4,4)-WOLR790101 | Polarity | Hydrophobicity (surrounding) | Hydration potential | Hydrophobicity ...n et al., 1979) | 0.267000 | 0.105000 | -0.105000 | 0.100000 | 0.113000 | 0.000000 | 0.000001 | 41,42,43,44,45,46,47,48,49,50
    9  | TMD_JMD-Segment...2,2)-CEDJ970105 | Composition | AA composition | Nuclear proteins | Composition of ...o et al., 1997) | 0.263000 | 0.062000 | 0.062000 | 0.062000 | 0.069000 | 0.000000 | 0.000001 | 31,32,33,34,35,...,46,47,48,49,50
    10 | TMD_JMD-Segment...5,5)-MITS020101 | Polarity | Amphiphilicity | Amphiphilicity | Amphiphilicity ...u et al., 2002) | 0.262000 | 0.073000 | 0.073000 | 0.071000 | 0.086000 | 0.000000 | 0.000001 | 43,44,45,46,47,48,49,50

    # Increase TMD length from 20 to 50
    df_feat = cpp.run(labels=labels, tmd_len=50)
    aa.display_df(df_feat, n_rows=10, show_shape=True)
    DataFrame shape: (19, 13)
       | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
    1  | TMD_JMD-Segment...4,5)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.316000 | 0.137000 | -0.137000 | 0.102000 | 0.108000 | 0.000000 | 0.000000 | 43,44,45,46,47,...,52,53,54,55,56
    2  | TMD_JMD-Segment...4,4)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.312000 | 0.099000 | 0.099000 | 0.069000 | 0.095000 | 0.000000 | 0.000000 | 53,54,55,56,57,...,66,67,68,69,70
    3  | TMD_JMD-Segment...4,5)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.297000 | 0.086000 | 0.086000 | 0.077000 | 0.068000 | 0.000000 | 0.000000 | 43,44,45,46,47,...,52,53,54,55,56
    4  | TMD_JMD-Segment...5,5)-LINS030104 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Total median ac...s et al., 2003) | 0.295000 | 0.141000 | 0.141000 | 0.115000 | 0.130000 | 0.000000 | 0.000000 | 57,58,59,60,61,...,66,67,68,69,70
    5  | TMD_JMD-Segment...5,5)-JANJ780102 | ASA/Volume | Buried | Buried | Percentage of b...n et al., 1978) | 0.291000 | 0.130000 | -0.130000 | 0.099000 | 0.124000 | 0.000000 | 0.000000 | 57,58,59,60,61,...,66,67,68,69,70
    6  | TMD_JMD-Segment...5,5)-ZIMJ680103 | Polarity | Hydrophilicity | Polarity (hydrophilicity) | Polarity (Zimme...n et al., 1968) | 0.289000 | 0.178000 | 0.178000 | 0.159000 | 0.163000 | 0.000000 | 0.000000 | 57,58,59,60,61,...,66,67,68,69,70
    7  | TMD_JMD-Segment...4,5)-FUKS010106 | Composition | Membrane proteins (MPs) | Proteins of mesophiles (INT) | Interior compos...ishikawa, 2001) | 0.277000 | 0.123000 | 0.123000 | 0.104000 | 0.127000 | 0.000000 | 0.000000 | 43,44,45,46,47,...,52,53,54,55,56
    8  | TMD_JMD-Segment...4,4)-WOLR790101 | Polarity | Hydrophobicity (surrounding) | Hydration potential | Hydrophobicity ...n et al., 1979) | 0.267000 | 0.105000 | -0.105000 | 0.100000 | 0.113000 | 0.000000 | 0.000001 | 53,54,55,56,57,...,66,67,68,69,70
    9  | TMD_JMD-Segment...2,2)-CEDJ970105 | Composition | AA composition | Nuclear proteins | Composition of ...o et al., 1997) | 0.263000 | 0.062000 | 0.062000 | 0.062000 | 0.069000 | 0.000000 | 0.000001 | 36,37,38,39,40,...,66,67,68,69,70
    10 | TMD_JMD-Segment...5,5)-MITS020101 | Polarity | Amphiphilicity | Amphiphilicity | Amphiphilicity ...u et al., 2002) | 0.262000 | 0.073000 | 0.073000 | 0.071000 | 0.086000 | 0.000000 | 0.000001 | 57,58,59,60,61,...,66,67,68,69,70

Multiprocessing is controlled by the n_jobs parameter; with n_jobs=None, the number of cores is optimized automatically. However, multiprocessing is only recommended for more than ~1000 features per core due to potential process management overhead (with the ~700 features created here, the serial run is faster):

    import time
    # Run without multiprocessing
    time_start = time.time()
    df_feat = cpp.run(labels=labels, n_jobs=1)
    time_no_mp = round(time.time() - time_start, 2)
    print(f"Time without multiprocessing: {time_no_mp} seconds")
    # Run with multiprocessing
    time_start = time.time()
    df_feat = cpp.run(labels=labels, n_jobs=None)
    time_mp = round(time.time() - time_start, 2)
    print(f"Time with multiprocessing: {time_mp} seconds")

    Time without multiprocessing: 0.09 seconds
    Time with multiprocessing: 2.55 seconds
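The rule of thumb of roughly 1000 features per core can be turned into a small pre-check before choosing n_jobs. The helper below is hypothetical and not part of aaanalysis; the threshold is taken from the recommendation above, and the candidate feature count is assumed to be known or estimated by the caller.

```python
import os


def suggest_n_jobs(n_features, n_cores=None, min_features_per_core=1000):
    """Suggest a value for n_jobs from the candidate feature count.

    Hypothetical helper: uses at most one core per `min_features_per_core`
    features, and never fewer than one core.
    """
    if n_cores is None:
        # os.cpu_count() may return None on some platforms
        n_cores = os.cpu_count() or 1
    return max(1, min(n_cores, n_features // min_features_per_core))


# ~700 candidate features: stay serial regardless of available cores
print(suggest_n_jobs(700, n_cores=8))
# >500,000 candidate features: use all 8 cores in this toy setting
print(suggest_n_jobs(500_000, n_cores=8))
```

When more than one process is used, keep the entry point behind an if __name__ == "__main__": guard, as described in the warning for n_jobs.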
- Parameters:
return_stats (bool)