SequenceFeature.prune_by_correlation

SequenceFeature.prune_by_correlation(df_feat=None, df_parts=None, df_scales=None, max_cor=0.7, X=None, accept_gaps=False, n_jobs=1)

Prune mutually correlated features from a feature DataFrame.

Model-free feature pruning step: among features whose realized feature values (built from df_parts, or supplied directly via X) are pairwise correlated beyond max_cor, keeps the one with the higher abs_auc and drops the others, returning the row-filtered df_feat. Use it after SequenceFeature.prune_by_variance() and before TreeModel.select_features().

The correlation is empirical — measured over the actual samples in df_parts. This is deliberately different from CPP’s in-run redundancy reduction, which compares the underlying scale vectors (df_scales.corr()) together with positional overlap. Pruning here catches features that happen to be redundant on a specific dataset even when their scales are not.
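To see how empirical and scale-level correlation can disagree, here is a minimal sketch with toy data. The four-letter alphabet, both scales, and the use of a segment average as the feature value are assumptions made for illustration; none of them is taken from load_scales() or from the CPP implementation.

```python
import numpy as np

# Hypothetical toy alphabet and scales (not real amino acid scales).
letters = ["A", "C", "D", "E"]
scale_1 = dict(zip(letters, [0.0, 1.0, 2.0, 3.0]))
scale_2 = dict(zip(letters, [0.0, 1.0, 1.0, 0.0]))

# Scale-level view: correlate the two scale vectors over the whole alphabet.
scale_corr = np.corrcoef([scale_1[a] for a in letters],
                         [scale_2[a] for a in letters])[0, 1]

# Empirical view: this toy dataset only ever uses the letters A and C,
# on which the two scales happen to agree.
segments = ["AACC", "ACCC", "AAAA", "CCCA"]


def segment_mean(segment, scale):
    """Feature value assumed here: the average scale value over a segment."""
    return float(np.mean([scale[a] for a in segment]))


x_1 = np.array([segment_mean(s, scale_1) for s in segments])
x_2 = np.array([segment_mean(s, scale_2) for s in segments])
empirical_corr = np.corrcoef(x_1, x_2)[0, 1]

print(f"scale-level corr: {scale_corr:.2f}")
print(f"empirical corr:   {empirical_corr:.2f}")
```

In this toy case the scales are uncorrelated across the alphabet, yet the realized feature values are identical on the dataset, which is the kind of redundancy an empirical check can detect.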

Compared with the lower-level NumericalFeature.filter_correlation(), which takes a raw matrix X and returns a boolean mask keeping the first column of each correlated pair (in the order given), this method is df_feat-in / df_feat-out: it builds X for you, ranks features by abs_auc first so the dropped feature of a pair is always the weaker one, and returns the row-filtered df_feat.
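The practical difference between "keep the first column of each pair" and "rank first, then prune" can be sketched with NumPy. The function first_kept_mask below is a hypothetical stand-in for an order-dependent filter, not the actual NumericalFeature.filter_correlation() code, and the scores array stands in for abs_auc.

```python
import numpy as np


def first_kept_mask(X, max_cor=0.7):
    """Hypothetical order-dependent filter: keep a column unless it is
    correlated (|r| > max_cor) with a column that was already kept."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if not any(corr[j, k] > max_cor for k in kept):
            kept.append(j)
    mask = np.zeros(X.shape[1], dtype=bool)
    mask[kept] = True
    return mask


base = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
# Column 0 is the weaker feature; column 1 is a linear copy with the higher score.
X = np.column_stack([base, 2.0 * base + 1.0])
scores = np.array([0.2, 0.4])  # stand-in for abs_auc

# Without ranking, the first column wins regardless of its score.
mask_as_given = first_kept_mask(X)

# Ranking by score first makes the stronger feature the one that survives.
order = np.argsort(-scores, kind="stable")
mask_ranked = first_kept_mask(X[:, order])
kept_after_ranking = order[mask_ranked]  # original column indices that were kept

print("kept without ranking:", np.flatnonzero(mask_as_given))
print("kept after ranking:  ", kept_after_ranking)
```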

Added in version 1.1.0.

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature. Must contain the abs_auc statistic, which is used to rank features and to decide which member of a correlated pair is kept.

  • df_parts (pd.DataFrame, shape (n_samples, n_parts), optional) – DataFrame with sequence parts. Used to build the feature matrix; not required if X is given.

  • df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].

  • max_cor (float, default=0.7) – Maximum absolute Pearson correlation [0-1] allowed between any two retained features. For each pair whose |corr| > max_cor, the feature with the lower abs_auc is dropped (and the higher-abs_auc one kept) — regardless of the input row order, because the method ranks by abs_auc internally. Lower max_cor to prune more aggressively.

  • X (array-like, shape (n_samples, n_features), optional) – Pre-computed feature matrix. Column i must correspond to the feature in row i of the df_feat you pass (same order); the method then re-ranks df_feat and X together by abs_auc internally, so you do not need to pre-sort. If given, it is used directly and df_parts / df_scales are ignored (e.g. to reuse a matrix or to prune a CPP.run_num() df_feat).

  • accept_gaps (bool, default=False) – Whether to accept missing values (gaps) in the sequence parts. If True, they are omitted from the computations.

  • n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

df_feat – Feature DataFrame filtered to a non-redundant subset (sorted by descending abs_auc), with a reset index.

Return type:

pd.DataFrame, shape (n_selected_features, n_feature_info)

Notes

  • Tie-break / determinism: features are sorted by descending abs_auc (ties broken by abs_mean_dif) before pruning, so for every correlated pair the lower-abs_auc feature is the one removed. This makes the output independent of the input row order and byte-identical across runs; the returned df_feat is in descending-abs_auc order.

  • X alignment: if you pass a pre-computed X, its columns must be aligned to the df_feat rows you pass (column i = feature in row i); the method reorders both together, so a mis-aligned X would correlate the wrong features.

  • The retained set is guaranteed to contain no feature pair with |corr| > max_cor.

  • Constant (zero-variance) features have undefined correlation and are always retained here; run SequenceFeature.prune_by_variance() first to remove them.

  • A df_feat with fewer than two features is returned unchanged (nothing to compare).
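The notes above can be summarized in a short pandas/NumPy sketch. This is an illustrative re-implementation of the documented behavior under stated assumptions, not the library code: the function name prune_by_correlation_sketch, the toy feature names, and the toy matrix are all hypothetical.

```python
import numpy as np
import pandas as pd


def prune_by_correlation_sketch(df_feat, X, max_cor=0.7):
    """Illustrative sketch (an assumption, not the library implementation).
    Column i of X must belong to row i of df_feat."""
    if len(df_feat) < 2:
        return df_feat  # nothing to compare
    df = df_feat.reset_index(drop=True)
    # Rank by abs_auc, ties broken by abs_mean_dif (both descending).
    order = df.sort_values(["abs_auc", "abs_mean_dif"], ascending=False).index.to_numpy()
    X_ranked = np.asarray(X, dtype=float)[:, order]
    # Constant columns yield NaN correlations; silence the NumPy warning for them.
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.abs(np.corrcoef(X_ranked, rowvar=False))
    kept = []
    for j in range(X_ranked.shape[1]):
        # NaN > max_cor is False, so zero-variance features are always retained.
        if not any(corr[j, k] > max_cor for k in kept):
            kept.append(j)
    return df.loc[order[kept]].reset_index(drop=True)


base = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
X_toy = np.column_stack([
    base,                                          # f0
    2.0 * base + 1.0,                              # f1: perfectly correlated with f0
    np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]),   # f2: weakly correlated with f0/f1
    np.full(6, 3.0),                               # f3: constant
])
df_toy = pd.DataFrame({
    "feature": ["f0", "f1", "f2", "f3"],
    "abs_auc": [0.2, 0.4, 0.3, 0.1],
    "abs_mean_dif": [0.1, 0.1, 0.1, 0.1],
})
df_kept = prune_by_correlation_sketch(df_toy, X_toy, max_cor=0.7)

# Shuffling rows and columns together should not change the result.
perm = np.array([3, 0, 2, 1])
df_kept_shuffled = prune_by_correlation_sketch(df_toy.iloc[perm], X_toy[:, perm], max_cor=0.7)

print(df_kept["feature"].tolist())
print(df_kept_shuffled["feature"].tolist())
```

In the toy data, f0 is dropped because it is correlated with the higher-ranked f1, while the constant f3 survives because its correlation with every other feature is undefined.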

Examples

SequenceFeature.prune_by_correlation() removes empirically redundant features: among features whose realized values are pairwise correlated beyond max_cor, it keeps the one with the higher abs_auc and drops the others. The correlation is measured over the actual samples, making it complementary to CPP’s in-run redundancy reduction (which compares scale vectors). Use it after SequenceFeature.prune_by_variance() and before TreeModel.select_features().

import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
# gamma-secretase geometry: the TMD (~20 aa) comes from each protein's tmd_start/tmd_stop,
# flanked by short juxtamembrane domains of 4 residues each.
df_parts = sf.get_df_parts(df_seq=df_seq, jmd_n_len=4, jmd_c_len=4)
df_feat = aa.CPP(df_parts=df_parts).run(labels=labels, n_filter=50)
print(f"features from CPP.run: {len(df_feat)}")
CPP using the Python kernel fallback — the compiled Cython extension is not available in this install. Output is bit-exact with the Cython path but ~2x slower. Reinstall via pip install --force-reinstall aaanalysis to fetch a prebuilt wheel.
features from CPP.run: 50

At max_cor=0.7 no retained pair of features has absolute correlation above 0.7. The output is sorted by descending abs_auc and is deterministic across runs:

df_cor = sf.prune_by_correlation(df_feat=df_feat, df_parts=df_parts, max_cor=0.7)
print(f"kept {len(df_cor)} of {len(df_feat)} features")
aa.display_df(df_cor, n_rows=10, show_shape=True)
kept 12 of 50 features
DataFrame shape: (12, 13)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD-Pattern(C,4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.045706 | 23,27 |
| 2 | TMD_C_JMD_C-Pat...5,8)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005435 | 33,36,39 |
| 3 | TMD_C_JMD_C-Seg...,11)-AURR980119 | Conformation | α-helix (C-term, out) | α-helix (C-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.392000 | 0.198000 | 0.198000 | 0.106000 | 0.178000 | 0.000022 | 0.005822 | 39,40 |
| 4 | TMD-Segment(10,11)-CHAM820102 | Polarity | Hydrophobicity (interface) | Free energy (interface) | Free energy of ...-Charton, 1982) | 0.389000 | 0.169000 | -0.169000 | 0.050000 | 0.130000 | 0.000026 | 0.005837 | 27,28 |
| 5 | TMD-Pattern(C,4...,11)-FUKS010106 | Composition | Membrane proteins (MPs) | Proteins of mesophiles (INT) | Interior compos...ishikawa, 2001) | 0.384000 | 0.219000 | 0.219000 | 0.166000 | 0.122000 | 0.000033 | 0.006680 | 20,23,27 |
| 6 | TMD_C_JMD_C-Pat...,12)-AURR980108 | Conformation | α-helix | α-helix (N-terminal, inside) | Normalized posi...ora-Rose, 1998) | 0.371000 | 0.165000 | 0.165000 | 0.087000 | 0.115000 | 0.000059 | 0.005589 | 21,25,28,32 |
| 7 | TMD_C_JMD_C-Pat...,12)-SNEP660101 | Others | PC 1 | Principal Component 1 (Sneath) | Principal compo... (Sneath, 1966) | 0.371000 | 0.150000 | 0.150000 | 0.092000 | 0.109000 | 0.000059 | 0.006001 | 29,33,37 |
| 8 | TMD-Segment(11,12)-GUYH850105 | ASA/Volume | Accessible surface area (ASA) | Partition energy | Apparent partit...dex (Guy, 1985) | 0.371000 | 0.131000 | -0.131000 | 0.050000 | 0.144000 | 0.000059 | 0.005720 | 27,28 |
| 9 | TMD_C_JMD_C-Pat...5,8)-AURR980110 | Conformation | α-helix | α-helix (middle) | Normalized posi...ora-Rose, 1998) | 0.368000 | 0.187000 | 0.187000 | 0.145000 | 0.124000 | 0.000070 | 0.005918 | 33,36,39 |
| 10 | TMD_C_JMD_C-Pat...,11)-ROBB760113 | Conformation | β-turn | β-turn | Information mea...n-Suzuki, 1976) | 0.361000 | 0.324000 | -0.324000 | 0.119000 | 0.268000 | 0.000093 | 0.005763 | 30,33 |
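The max_cor guarantee can be checked directly on a feature matrix. A minimal sketch follows; max_abs_pairwise_corr is a hypothetical helper, and the random toy matrix stands in for the feature matrix of the retained features (in the workflow above, that would presumably be built from df_cor and df_parts).

```python
import numpy as np


def max_abs_pairwise_corr(X):
    """Largest |Pearson r| between any two distinct columns of X.
    Pairs involving a constant column (undefined r) are ignored."""
    X = np.asarray(X, dtype=float)
    if X.shape[1] < 2:
        return 0.0  # no pairs to compare
    with np.errstate(invalid="ignore", divide="ignore"):
        corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = corr[np.triu_indices(X.shape[1], k=1)]  # each pair once, no diagonal
    upper = upper[~np.isnan(upper)]
    return float(upper.max()) if upper.size else 0.0


rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
X_toy[:, 2] = X_toy[:, 0] + 0.05 * rng.normal(size=200)  # near-duplicate of column 0

worst_all = max_abs_pairwise_corr(X_toy)          # dominated by columns 0 and 2
worst_kept = max_abs_pairwise_corr(X_toy[:, :2])  # after dropping column 2

print(f"max |r| before dropping the duplicate: {worst_all:.3f}")
print(f"max |r| after dropping the duplicate:  {worst_kept:.3f}")
```

A pruned feature set satisfies the guarantee when the value returned for its matrix does not exceed max_cor.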

As with variance pruning, df_scales / accept_gaps / n_jobs configure the matrix build, and a pre-computed X (aligned column-for-column with df_feat) can be supplied instead of df_parts:

df_scales = aa.load_scales()
df_cor_full = sf.prune_by_correlation(df_feat=df_feat, df_parts=df_parts, df_scales=df_scales,
                                      max_cor=0.5, accept_gaps=True, n_jobs=1)
X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
df_cor_X = sf.prune_by_correlation(df_feat=df_feat, X=X, max_cor=0.5)
print(f"via df_parts+params: {len(df_cor_full)} | via pre-computed X: {len(df_cor_X)}")
via df_parts+params: 3 | via pre-computed X: 3
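Because a pre-computed X is matched to df_feat purely by position, any filtering applied to df_feat before the call has to be applied to the columns of X as well. A minimal sketch with toy data (the feature names and the 0.3 cutoff are hypothetical):

```python
import numpy as np
import pandas as pd

df_toy = pd.DataFrame({"feature": ["f0", "f1", "f2", "f3"],
                       "abs_auc": [0.20, 0.40, 0.30, 0.10]})
X_toy = np.arange(24, dtype=float).reshape(6, 4)  # 6 samples x 4 features

# Filter rows and columns with the same boolean mask so that
# column i of X_sub still belongs to row i of df_sub.
mask = (df_toy["abs_auc"] >= 0.3).to_numpy()
df_sub = df_toy[mask].reset_index(drop=True)
X_sub = X_toy[:, mask]

# Cheap sanity check before handing both objects to a pruning step.
assert X_sub.shape[1] == len(df_sub), "X columns and df_feat rows are out of sync"
```

A shape check like the final assertion only catches a length mismatch; it cannot detect columns that are present but in the wrong order, so the same mask or index should always be applied to both objects.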

The two pruners compose, and the result drops straight into model-based selection (TreeModel.select_features()):

df_pruned = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, threshold=0.0)
df_pruned = sf.prune_by_correlation(df_feat=df_pruned, df_parts=df_parts, max_cor=0.5)
print(f"variance -> correlation kept {len(df_pruned)} of {len(df_feat)} features")
aa.display_df(df_pruned, n_rows=10, show_shape=True)
variance -> correlation kept 3 of 50 features
DataFrame shape: (3, 13)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD-Pattern(C,4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.045706 | 23,27 |
| 2 | TMD_C_JMD_C-Pat...5,8)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005435 | 33,36,39 |
| 3 | TMD_C_JMD_C-Pat...6,9)-GEIM800103 | Conformation | Unclassified (Conformation) | α-helix (β-proteins) | Alpha-helix ind...-Roberts, 1980) | 0.359000 | 0.163000 | -0.163000 | 0.137000 | 0.080000 | 0.000104 | 0.005877 | 32,35,39 |

What can go wrong? A df_feat with fewer than two features has nothing to compare and is returned unchanged:

one = df_feat.head(1)
print(f"single-feature input -> {len(sf.prune_by_correlation(df_feat=one, df_parts=df_parts))} feature")
single-feature input -> 1 feature