SequenceFeature.prune_by_correlation

SequenceFeature.prune_by_correlation(df_feat, df_parts=None, df_scales=None, max_cor=0.7, X=None, accept_gaps=False, n_jobs=1)

Prune mutually correlated features from a feature DataFrame.

Model-free feature pruning step: among features whose realized feature values (built from df_parts, or supplied directly via X) are pairwise correlated beyond max_cor, keeps the one with the higher abs_auc and drops the others, returning the row-filtered df_feat. Use it after SequenceFeature.prune_by_variance() and before TreeModel.select_features().

The correlation is empirical: it is measured over the actual samples in df_parts (or in the supplied X). This is deliberately different from CPP’s in-run redundancy reduction, which compares the underlying scale vectors (df_scales.corr()) together with positional overlap. Pruning here therefore catches features that are redundant on a specific dataset even when their scales are not correlated.
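The difference between scale-level and empirical correlation can be illustrated with a toy example. This is a minimal sketch, not part of aaanalysis: the two scales, the four-letter alphabet, and the helper segment_mean are all hypothetical, and a feature is assumed to be the mean scale value over a sequence part.

```python
import numpy as np

# Two toy scales over a four-letter alphabet (hypothetical values, not real scales).
scale_1 = {"A": 0.0, "C": 1.0, "D": 0.0, "E": 1.0}
scale_2 = {"A": 0.0, "C": 1.0, "D": 1.0, "E": 0.0}

# Scale-level correlation: compare the two scale vectors letter by letter.
letters = sorted(scale_1)
v1 = np.array([scale_1[a] for a in letters])
v2 = np.array([scale_2[a] for a in letters])
scale_corr = np.corrcoef(v1, v2)[0, 1]


def segment_mean(part, scale):
    """Hypothetical feature: mean scale value over all residues of a part."""
    return float(np.mean([scale[a] for a in part]))


# A dataset whose parts only contain the letters A and C,
# i.e. exactly the letters on which both scales agree.
parts = ["AACA", "CCAC", "ACAC", "CCCC", "AAAC"]
f1 = np.array([segment_mean(p, scale_1) for p in parts])
f2 = np.array([segment_mean(p, scale_2) for p in parts])

# Empirical correlation: compare the realized feature values over the samples.
empirical_corr = np.corrcoef(f1, f2)[0, 1]

print(f"scale-level correlation: {scale_corr:.2f}")
print(f"empirical correlation:   {empirical_corr:.2f}")
```

The scale vectors are uncorrelated, yet the realized features are perfectly correlated on this dataset, which is the kind of redundancy an empirical pruning step can detect.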

Compared with the lower-level NumericalFeature.filter_correlation(), which takes a raw matrix X and returns a boolean mask keeping the first column of each correlated pair (in the order given), this method is df_feat-in / df_feat-out: it builds X for you, ranks features by abs_auc first so the dropped feature of a pair is always the weaker one, and returns the row-filtered df_feat.
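The rank-then-prune behaviour described above can be sketched in plain NumPy and pandas. This is an illustrative sketch under assumptions, not the library's implementation: the function name prune_by_correlation_sketch is hypothetical, and a simple greedy pass over the abs_auc-ranked features is assumed.

```python
import numpy as np
import pandas as pd


def prune_by_correlation_sketch(df_feat, X, max_cor=0.7):
    """Hypothetical sketch: greedy correlation pruning on abs_auc-ranked features."""
    df = df_feat.reset_index(drop=True)  # row i now matches column i of X
    X = np.asarray(X, dtype=float)
    # Rank by descending abs_auc, ties broken by descending abs_mean_dif.
    order = df.sort_values(["abs_auc", "abs_mean_dif"], ascending=False).index.to_numpy()
    # Absolute Pearson correlation between the re-ordered feature columns.
    corr = np.abs(np.corrcoef(X[:, order], rowvar=False))
    kept = []
    for j in range(len(order)):
        # Keep feature j only if no stronger, already kept feature exceeds max_cor.
        if not any(corr[j, k] > max_cor for k in kept):
            kept.append(j)
    return df.iloc[order[kept]].reset_index(drop=True)


# Toy data: f0 and f1 are nearly collinear, f2 is almost unrelated to both.
x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x1 = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
x2 = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0])
X = np.column_stack([x0, x1, x2])
df_feat = pd.DataFrame({"feature": ["f0", "f1", "f2"],
                        "abs_auc": [0.2, 0.4, 0.3],
                        "abs_mean_dif": [0.1, 0.1, 0.1]})

pruned = prune_by_correlation_sketch(df_feat, X, max_cor=0.7)
# Reversing rows and columns together must not change the outcome.
pruned_reversed = prune_by_correlation_sketch(df_feat.iloc[::-1], X[:, ::-1], max_cor=0.7)
print(pruned["feature"].tolist())           # ['f1', 'f2']
print(pruned_reversed["feature"].tolist())  # ['f1', 'f2']
```

A mask that keeps the first column of each pair in input order would retain f0 and drop the stronger f1 here; ranking by abs_auc first is what makes the weaker feature the one that is removed.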

Added in version 1.1.0.

Parameters:

df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature. Must contain the abs_auc statistic used as the deterministic tie-break.
df_parts (pd.DataFrame, shape (n_samples, n_parts), optional) – DataFrame with sequence parts. Used to build the feature matrix; not required if X is given.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
max_cor (float, default=0.7) – Maximum absolute Pearson correlation (between 0 and 1) allowed between any two retained features. For each pair with |corr| > max_cor, the feature with the lower abs_auc is dropped and the one with the higher abs_auc is kept, regardless of the input row order, because the method ranks by abs_auc internally. Lower max_cor to prune more aggressively.
X (array-like, shape (n_samples, n_features), optional) – Pre-computed feature matrix. Column i must correspond to the feature in row i of the df_feat you pass (same order); the method then re-ranks df_feat and X together by abs_auc internally, so you do not pre-sort. If given, it is used directly and df_parts / df_scales are ignored (e.g. to reuse a matrix or to prune a CPP.run_num() df_feat).
accept_gaps (bool, default=False) – Whether to accept missing values. If True, missing values are omitted from the computations.
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
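The alignment requirement on X (column i belongs to the feature in row i of df_feat) is easy to violate when the matrix was stored separately. One way to guard against this, sketched below, is to keep the matrix as a DataFrame with feature identifiers as column labels and select the columns in df_feat order before passing the array. The labelled X_df and the short feature names are assumptions made for illustration; they are not something the method requires or produces.

```python
import pandas as pd

# Feature table as it would be passed to the method (toy identifiers).
df_feat = pd.DataFrame({"feature": ["f_a", "f_b", "f_c"],
                        "abs_auc": [0.3, 0.1, 0.2]})

# A stored matrix with labelled columns, in a different order than the df_feat rows.
X_df = pd.DataFrame({"f_c": [3.0, 3.5],
                     "f_a": [1.0, 1.5],
                     "f_b": [2.0, 2.5]})

# Select columns in df_feat row order; a missing feature raises a KeyError.
feature_order = df_feat["feature"].tolist()
X = X_df.loc[:, feature_order].to_numpy()

# Column i of X now corresponds to row i of df_feat.
assert X.shape[1] == len(df_feat)
print(X)
```

No pre-sorting by abs_auc is needed at this point, since the method is documented to re-rank df_feat and X together.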

Returns:

df_feat – Feature DataFrame filtered to a non-redundant subset (sorted by descending abs_auc), with a reset index.

Return type:

pd.DataFrame, shape (n_selected_features, n_feature_info)

Notes

Tie-break / determinism: features are sorted by descending abs_auc (ties broken by abs_mean_dif) before pruning, so for every correlated pair the lower-abs_auc feature is the one removed. This makes the output independent of the input row order and byte-identical across runs; the returned df_feat is in descending-abs_auc order.
X alignment: if you pass a pre-computed X, its columns must be aligned to the df_feat rows you pass (column i = feature in row i); the method reorders both together, so a mis-aligned X would correlate the wrong features.
The retained set is guaranteed to contain no feature pair with |corr| > max_cor.
Constant (zero-variance) features have undefined correlation and are always retained here; run SequenceFeature.prune_by_variance() first to remove them.
A df_feat with fewer than two features is returned unchanged (nothing to compare).
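The guarantee and the constant-feature caveat in the notes above can be checked on any feature matrix with a small helper. This is a hedged sketch: max_abs_offdiag_corr is a hypothetical function, not part of aaanalysis, and it simply skips zero-variance columns because their correlation is undefined.

```python
import numpy as np


def max_abs_offdiag_corr(X):
    """Largest absolute Pearson correlation between two distinct, non-constant columns."""
    X = np.asarray(X, dtype=float)
    # Constant columns have undefined correlation, so leave them out of the check.
    varying = X.std(axis=0) > 0
    Xv = X[:, varying]
    if Xv.shape[1] < 2:
        return 0.0  # nothing to compare
    corr = np.abs(np.corrcoef(Xv, rowvar=False))
    np.fill_diagonal(corr, 0.0)  # ignore each feature's correlation with itself
    return float(corr.max())


# Toy columns: x0 and x1 are nearly collinear, x2 is almost unrelated, const never varies.
x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x1 = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 6.0])
x2 = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0])
const = np.full(6, 2.0)

X_redundant = np.column_stack([x0, x1, x2, const])
X_pruned = np.column_stack([x1, x2, const])  # as if x0 had been dropped

max_cor = 0.7
print(max_abs_offdiag_corr(X_redundant) > max_cor)  # True: x0 and x1 are redundant
print(max_abs_offdiag_corr(X_pruned) <= max_cor)    # True: no pair exceeds max_cor
```

Applied to the feature matrix of a pruned df_feat, a check like this should report a value at or below max_cor, while the constant column shows why running the variance pruning step first is recommended.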

See also

SequenceFeature.prune_by_variance() for the variance-pruning step that should precede this.
NumericalFeature.filter_correlation() for the underlying correlation primitive on a matrix.
TreeModel.select_features() for the model-based selection that follows pruning.

Examples

SequenceFeature.prune_by_correlation() removes empirically redundant features: among features whose realized values are pairwise correlated beyond max_cor, it keeps the one with the higher abs_auc and drops the others. The correlation is measured over the actual samples, making it complementary to CPP’s in-run redundancy reduction (which compares scale vectors). Use it after SequenceFeature.prune_by_variance() and before TreeModel.select_features().

import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
# gamma-secretase geometry: the TMD (~20 aa) comes from each protein's tmd_start/tmd_stop,
# flanked by short juxtamembrane domains of 4 residues each.
df_parts = sf.get_df_parts(df_seq=df_seq, jmd_n_len=4, jmd_c_len=4)
df_feat = aa.CPP(df_parts=df_parts).run(labels=labels, n_filter=50)
print(f"features from CPP.run: {len(df_feat)}")

features from CPP.run: 50

At max_cor=0.7 no retained pair of features has absolute correlation above 0.7. The output is sorted by descending abs_auc and is deterministic across runs:

df_cor = sf.prune_by_correlation(df_feat=df_feat, df_parts=df_parts, max_cor=0.7)
print(f"kept {len(df_cor)} of {len(df_feat)} features")
aa.display_df(df_cor, n_rows=10, show_shape=True)

kept 12 of 50 features
DataFrame shape: (12, 13)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions
1	TMD-Pattern(C,4,8)-BEGF750101	Conformation	α-helix	α-helix	Conformational ...in-Dirkx, 1975)	0.444000	0.299000	0.299000	0.090000	0.196000	0.000002	0.045706	23,27
2	TMD_C_JMD_C-Pat...5,8)-ZIMJ680104	Energy	Isoelectric point	Isoelectric point	Isoelectric poi...n et al., 1968)	0.402000	0.130000	0.130000	0.080000	0.088000	0.000013	0.005435	33,36,39
3	TMD_C_JMD_C-Seg...,11)-AURR980119	Conformation	α-helix (C-term, out)	α-helix (C-terminal, outside)	Normalized posi...ora-Rose, 1998)	0.392000	0.198000	0.198000	0.106000	0.178000	0.000022	0.005822	39,40
4	TMD-Segment(10,11)-CHAM820102	Polarity	Hydrophobicity (interface)	Free energy (interface)	Free energy of ...-Charton, 1982)	0.389000	0.169000	-0.169000	0.050000	0.130000	0.000026	0.005837	27,28
5	TMD-Pattern(C,4...,11)-FUKS010106	Composition	Membrane proteins (MPs)	Proteins of mesophiles (INT)	Interior compos...ishikawa, 2001)	0.384000	0.219000	0.219000	0.166000	0.122000	0.000033	0.006680	20,23,27
6	TMD_C_JMD_C-Pat...,12)-AURR980108	Conformation	α-helix	α-helix (N-terminal, inside)	Normalized posi...ora-Rose, 1998)	0.371000	0.165000	0.165000	0.087000	0.115000	0.000059	0.005589	21,25,28,32
7	TMD_C_JMD_C-Pat...,12)-SNEP660101	Others	PC 1	Principal Component 1 (Sneath)	Principal compo... (Sneath, 1966)	0.371000	0.150000	0.150000	0.092000	0.109000	0.000059	0.006001	29,33,37
8	TMD-Segment(11,12)-GUYH850105	ASA/Volume	Accessible surface area (ASA)	Partition energy	Apparent partit...dex (Guy, 1985)	0.371000	0.131000	-0.131000	0.050000	0.144000	0.000059	0.005720	27,28
9	TMD_C_JMD_C-Pat...5,8)-AURR980110	Conformation	α-helix	α-helix (middle)	Normalized posi...ora-Rose, 1998)	0.368000	0.187000	0.187000	0.145000	0.124000	0.000070	0.005918	33,36,39
10	TMD_C_JMD_C-Pat...,11)-ROBB760113	Conformation	β-turn	β-turn	Information mea...n-Suzuki, 1976)	0.361000	0.324000	-0.324000	0.119000	0.268000	0.000093	0.005763	30,33

As with variance pruning, df_scales / accept_gaps / n_jobs configure the matrix build, and a pre-computed X (aligned column-for-column with df_feat) can be supplied instead of df_parts:

df_scales = aa.load_scales()
df_cor_full = sf.prune_by_correlation(df_feat=df_feat, df_parts=df_parts, df_scales=df_scales,
                                      max_cor=0.5, accept_gaps=True, n_jobs=1)
X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
df_cor_X = sf.prune_by_correlation(df_feat=df_feat, X=X, max_cor=0.5)
print(f"via df_parts+params: {len(df_cor_full)} | via pre-computed X: {len(df_cor_X)}")

via df_parts+params: 3 | via pre-computed X: 3

The two pruners compose, and the result drops straight into model-based selection (TreeModel.select_features()):

df_pruned = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, threshold=0.0)
df_pruned = sf.prune_by_correlation(df_feat=df_pruned, df_parts=df_parts, max_cor=0.5)
print(f"variance -> correlation kept {len(df_pruned)} of {len(df_feat)} features")
aa.display_df(df_pruned, n_rows=10, show_shape=True)

variance -> correlation kept 3 of 50 features
DataFrame shape: (3, 13)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions
1	TMD-Pattern(C,4,8)-BEGF750101	Conformation	α-helix	α-helix	Conformational ...in-Dirkx, 1975)	0.444000	0.299000	0.299000	0.090000	0.196000	0.000002	0.045706	23,27
2	TMD_C_JMD_C-Pat...5,8)-ZIMJ680104	Energy	Isoelectric point	Isoelectric point	Isoelectric poi...n et al., 1968)	0.402000	0.130000	0.130000	0.080000	0.088000	0.000013	0.005435	33,36,39
3	TMD_C_JMD_C-Pat...6,9)-GEIM800103	Conformation	Unclassified (Conformation)	α-helix (β-proteins)	Alpha-helix ind...-Roberts, 1980)	0.359000	0.163000	-0.163000	0.137000	0.080000	0.000104	0.005877	32,35,39

What can go wrong? A df_feat with fewer than two features has nothing to compare and is returned unchanged:

one = df_feat.head(1)
print(f"single-feature input -> {len(sf.prune_by_correlation(df_feat=one, df_parts=df_parts))} feature")

single-feature input -> 1 feature