SequenceFeature.prune_by_correlation
- SequenceFeature.prune_by_correlation(df_feat=None, df_parts=None, df_scales=None, max_cor=0.7, X=None, accept_gaps=False, n_jobs=1)
Prune mutually correlated features from a feature DataFrame.
Model-free feature pruning step: among features whose realized feature values (built from df_parts, or supplied directly via X) are pairwise correlated beyond max_cor, keeps the one with the higher abs_auc and drops the others, returning the row-filtered df_feat. Use it after SequenceFeature.prune_by_variance() and before TreeModel.select_features().
The correlation is empirical: it is measured over the actual samples in df_parts. This is deliberately different from CPP's in-run redundancy reduction, which compares the underlying scale vectors (df_scales.corr()) together with positional overlap. Pruning here catches features that happen to be redundant on a specific dataset even when their scales are not.
Compared with the lower-level NumericalFeature.filter_correlation(), which takes a raw matrix X and returns a boolean mask keeping the first column of each correlated pair (in the order given), this method is df_feat-in / df_feat-out: it builds X for you, ranks features by abs_auc first so the dropped feature of a pair is always the weaker one, and returns the row-filtered df_feat.
Added in version 1.1.0.
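The difference between keeping the first column of a correlated pair and ranking by abs_auc first can be sketched on synthetic data. This is a minimal NumPy illustration, not the library's implementation: greedy_keep_mask is a hypothetical helper, and the data and abs_auc scores are made up.

```python
import numpy as np


def greedy_keep_mask(X, max_cor=0.7):
    """Hypothetical helper: scan columns left to right and keep a column only
    if its |Pearson r| with every already-kept column is <= max_cor."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= max_cor for k in kept):
            kept.append(j)
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[kept] = True
    return mask


rng = np.random.default_rng(0)
base = rng.normal(size=200)
# Columns 0 and 1 are near-duplicates; column 2 is independent of both.
X = np.column_stack([base, base + 0.01 * rng.normal(size=200), rng.normal(size=200)])
abs_auc = np.array([0.10, 0.40, 0.25])  # made-up scores: column 1 is the stronger duplicate

# Order-dependent variant: the first column of the correlated pair survives.
mask_first = greedy_keep_mask(X)

# Rank-first variant: sort columns by descending abs_auc, prune, map back.
order = np.argsort(-abs_auc, kind="stable")
mask_ranked = np.zeros(X.shape[1], dtype=bool)
mask_ranked[order[greedy_keep_mask(X[:, order])]] = True

print(mask_first)   # columns 0 and 2 kept: the weaker duplicate wins by position
print(mask_ranked)  # columns 1 and 2 kept: the stronger duplicate wins by score
```

In the order-dependent variant the survivor of a pair is decided by column position; after ranking, it is decided by abs_auc, which is the behavior described above.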
- Parameters:
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature. Must contain the abs_auc statistic used as the deterministic tie-break.
df_parts (pd.DataFrame, shape (n_samples, n_parts), optional) – DataFrame with sequence parts. Used to build the feature matrix; not required if X is given.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
max_cor (float, default=0.7) – Maximum absolute Pearson correlation [0-1] allowed between any two retained features. For each pair whose |corr| > max_cor, the feature with the lower abs_auc is dropped (and the higher-abs_auc one kept), regardless of the input row order, because the method ranks by abs_auc internally. Lower max_cor to prune more aggressively.
X (array-like, shape (n_samples, n_features), optional) – Pre-computed feature matrix. Column i must correspond to the feature in row i of the df_feat you pass (same order); the method then re-ranks df_feat and X together by abs_auc internally, so you do not need to pre-sort. If given, it is used directly and df_parts / df_scales are ignored (e.g. to reuse a matrix or to prune a CPP.run_num() df_feat).
accept_gaps (bool, default=False) – Whether to accept missing values by omitting them from computations (if True).
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
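The alignment requirement on X (column i belongs to the feature in row i) and the joint re-ranking can be sketched as follows. This is an illustration under stated assumptions: the statistics are made up, the arrays stand in for the abs_auc and abs_mean_dif columns of df_feat, and the library's internal sorting code may differ.

```python
import numpy as np

# Made-up statistics for four features, in the caller's row order.
abs_auc = np.array([0.30, 0.45, 0.30, 0.10])
abs_mean_dif = np.array([0.12, 0.20, 0.18, 0.05])

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))  # column i holds the values of feature i

# A shape check catches a missing or extra column, but not a permuted one.
if X.shape[1] != len(abs_auc):
    raise ValueError("X must have exactly one column per df_feat row")

# np.lexsort sorts by the LAST key first:
# primary key = descending abs_auc, secondary key = descending abs_mean_dif.
order = np.lexsort((-abs_mean_dif, -abs_auc))

# Apply the same permutation to the feature rows and to the matrix columns.
abs_auc_ranked = abs_auc[order]
X_ranked = X[:, order]

print(order)           # feature 1 first, then the tied features 2 and 0, then 3
print(abs_auc_ranked)  # non-increasing abs_auc
```

Because one index array permutes both objects, an X that was aligned on input stays aligned after ranking, while an X that was mis-aligned on input stays mis-aligned, which is why the caller is responsible for the initial order.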
- Returns:
df_feat – Feature DataFrame filtered to a non-redundant subset (sorted by descending abs_auc), with a reset index.
- Return type:
pd.DataFrame, shape (n_selected_features, n_feature_info)
Notes
Tie-break / determinism: features are sorted by descending abs_auc (ties broken by abs_mean_dif) before pruning, so for every correlated pair the lower-abs_auc feature is the one removed. This makes the output independent of the input row order and byte-identical across runs; the returned df_feat is in descending-abs_auc order.
X alignment: if you pass a pre-computed X, its columns must be aligned to the df_feat rows you pass (column i = feature in row i); the method reorders both together, so a mis-aligned X would correlate the wrong features.
The retained set is guaranteed to contain no feature pair with |corr| > max_cor.
Constant (zero-variance) features have undefined correlation and are always retained here; run SequenceFeature.prune_by_variance() first to remove them.
A df_feat with fewer than two features is returned unchanged (nothing to compare).
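The guarantee that no retained pair exceeds max_cor can be checked after the fact from a feature matrix. The sketch below uses synthetic data and a hypothetical helper, max_abs_offdiag_corr; X_kept stands in for the matrix of a pruned df_feat.

```python
import numpy as np


def max_abs_offdiag_corr(X):
    """Hypothetical helper: largest |Pearson r| between two distinct columns of X."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    np.fill_diagonal(corr, 0.0)  # ignore each column's correlation with itself
    return float(np.nanmax(corr))


rng = np.random.default_rng(2)
a = rng.normal(size=300)
b = rng.normal(size=300)
# Columns 0 and 1 are near-duplicates; column 2 is independent.
X_all = np.column_stack([a, a + 0.05 * rng.normal(size=300), b])
X_kept = X_all[:, [0, 2]]  # stand-in for the columns that survive pruning

print(max_abs_offdiag_corr(X_all) > 0.7)    # True: the duplicate pair violates the bound
print(max_abs_offdiag_corr(X_kept) <= 0.7)  # True: the retained columns satisfy it
```

With real data, the matrix for the pruned result could presumably be rebuilt with sf.feature_matrix(features=..., df_parts=...), the call used in the Examples, and passed to the same helper.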
See also
SequenceFeature.prune_by_variance() for the variance-pruning step that should precede this.
NumericalFeature.filter_correlation() for the underlying correlation primitive on a matrix.
TreeModel.select_features() for the model-based selection that follows pruning.
Examples
SequenceFeature.prune_by_correlation removes empirically redundant features: among features whose realized values are pairwise correlated beyond max_cor, it keeps the one with the higher abs_auc and drops the others. The correlation is measured over the actual samples, making it complementary to CPP's in-run redundancy reduction (which compares scale vectors). Use it after SequenceFeature.prune_by_variance and before TreeModel.select_features.

    import aaanalysis as aa
    aa.options["verbose"] = False
    df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
    labels = df_seq["label"].to_list()
    sf = aa.SequenceFeature()
    # gamma-secretase geometry: the TMD (~20 aa) comes from each protein's tmd_start/tmd_stop,
    # flanked by short juxtamembrane domains of 4 residues each.
    df_parts = sf.get_df_parts(df_seq=df_seq, jmd_n_len=4, jmd_c_len=4)
    df_feat = aa.CPP(df_parts=df_parts).run(labels=labels, n_filter=50)
    print(f"features from CPP.run: {len(df_feat)}")

features from CPP.run: 50
At max_cor=0.7 no retained pair of features has absolute correlation above 0.7. The output is sorted by descending abs_auc and is deterministic across runs:

    df_cor = sf.prune_by_correlation(df_feat=df_feat, df_parts=df_parts, max_cor=0.7)
    print(f"kept {len(df_cor)} of {len(df_feat)} features")
    aa.display_df(df_cor, n_rows=10, show_shape=True)

kept 12 of 50 features
DataFrame shape: (12, 13)

   | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
1  | TMD-Pattern(C,4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.045706 | 23,27
2  | TMD_C_JMD_C-Pat...5,8)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005435 | 33,36,39
3  | TMD_C_JMD_C-Seg...,11)-AURR980119 | Conformation | α-helix (C-term, out) | α-helix (C-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.392000 | 0.198000 | 0.198000 | 0.106000 | 0.178000 | 0.000022 | 0.005822 | 39,40
4  | TMD-Segment(10,11)-CHAM820102 | Polarity | Hydrophobicity (interface) | Free energy (interface) | Free energy of ...-Charton, 1982) | 0.389000 | 0.169000 | -0.169000 | 0.050000 | 0.130000 | 0.000026 | 0.005837 | 27,28
5  | TMD-Pattern(C,4...,11)-FUKS010106 | Composition | Membrane proteins (MPs) | Proteins of mesophiles (INT) | Interior compos...ishikawa, 2001) | 0.384000 | 0.219000 | 0.219000 | 0.166000 | 0.122000 | 0.000033 | 0.006680 | 20,23,27
6  | TMD_C_JMD_C-Pat...,12)-AURR980108 | Conformation | α-helix | α-helix (N-terminal, inside) | Normalized posi...ora-Rose, 1998) | 0.371000 | 0.165000 | 0.165000 | 0.087000 | 0.115000 | 0.000059 | 0.005589 | 21,25,28,32
7  | TMD_C_JMD_C-Pat...,12)-SNEP660101 | Others | PC 1 | Principal Component 1 (Sneath) | Principal compo... (Sneath, 1966) | 0.371000 | 0.150000 | 0.150000 | 0.092000 | 0.109000 | 0.000059 | 0.006001 | 29,33,37
8  | TMD-Segment(11,12)-GUYH850105 | ASA/Volume | Accessible surface area (ASA) | Partition energy | Apparent partit...dex (Guy, 1985) | 0.371000 | 0.131000 | -0.131000 | 0.050000 | 0.144000 | 0.000059 | 0.005720 | 27,28
9  | TMD_C_JMD_C-Pat...5,8)-AURR980110 | Conformation | α-helix | α-helix (middle) | Normalized posi...ora-Rose, 1998) | 0.368000 | 0.187000 | 0.187000 | 0.145000 | 0.124000 | 0.000070 | 0.005918 | 33,36,39
10 | TMD_C_JMD_C-Pat...,11)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.361000 | 0.324000 | -0.324000 | 0.119000 | 0.268000 | 0.000093 | 0.005763 | 30,33

As with variance pruning, df_scales / accept_gaps / n_jobs configure the matrix build, and a pre-computed X (aligned column-for-column with df_feat) can be supplied instead of df_parts:

    df_scales = aa.load_scales()
    df_cor_full = sf.prune_by_correlation(
        df_feat=df_feat, df_parts=df_parts, df_scales=df_scales,
        max_cor=0.5, accept_gaps=True, n_jobs=1)
    X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
    df_cor_X = sf.prune_by_correlation(df_feat=df_feat, X=X, max_cor=0.5)
    print(f"via df_parts+params: {len(df_cor_full)} | via pre-computed X: {len(df_cor_X)}")

via df_parts+params: 3 | via pre-computed X: 3
The two pruners compose, and the result drops straight into model-based selection (TreeModel.select_features):

    df_pruned = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, threshold=0.0)
    df_pruned = sf.prune_by_correlation(df_feat=df_pruned, df_parts=df_parts, max_cor=0.5)
    print(f"variance -> correlation kept {len(df_pruned)} of {len(df_feat)} features")
    aa.display_df(df_pruned, n_rows=10, show_shape=True)

variance -> correlation kept 3 of 50 features
DataFrame shape: (3, 13)

  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
1 | TMD-Pattern(C,4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.045706 | 23,27
2 | TMD_C_JMD_C-Pat...5,8)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005435 | 33,36,39
3 | TMD_C_JMD_C-Pat...6,9)-GEIM800103 | Conformation | Unclassified (Conformation) | α-helix (β-proteins) | Alpha-helix ind...-Roberts, 1980) | 0.359000 | 0.163000 | -0.163000 | 0.137000 | 0.080000 | 0.000104 | 0.005877 | 32,35,39

What can go wrong?
A df_feat with fewer than two features has nothing to compare and is returned unchanged:

    one = df_feat.head(1)
    print(f"single-feature input -> {len(sf.prune_by_correlation(df_feat=one, df_parts=df_parts))} feature")

single-feature input -> 1 feature
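A second edge case is the constant (zero-variance) feature mentioned in the Notes. The sketch below shows, with plain NumPy on synthetic data, why such a feature can never be flagged by a threshold test: its Pearson correlation is undefined (NaN), and NaN compares False against any bound. Whether the library reaches its "always retained" behavior this way is an assumption; the snippet only illustrates the arithmetic.

```python
import warnings

import numpy as np

rng = np.random.default_rng(3)
# First column varies; second column is constant (zero variance).
X = np.column_stack([rng.normal(size=100), np.full(100, 2.5)])

with warnings.catch_warnings():
    # NumPy emits a RuntimeWarning for the 0/0 division behind the NaN.
    warnings.simplefilter("ignore", RuntimeWarning)
    r = np.corrcoef(X, rowvar=False)[0, 1]

print(np.isnan(r))   # True: correlation with a constant column is undefined
print(abs(r) > 0.7)  # False: NaN never exceeds a threshold, so nothing is dropped
```

Running SequenceFeature.prune_by_variance() first, as recommended above, removes such columns before they reach the correlation step.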