SequenceFeature.prune_by_variance

SequenceFeature.prune_by_variance(df_feat, df_parts=None, df_scales=None, threshold=0.0, X=None, accept_gaps=False, n_jobs=1)

Prune near-constant features from a feature DataFrame by variance.

Model-free feature pruning step: drops every feature whose column variance in the realized feature matrix (built from df_parts, or supplied directly via X) is at or below threshold, and returns the row-filtered df_feat. Use it on a fitted df_feat (e.g. from CPP.run()) as the first reduction stage, before SequenceFeature.prune_by_correlation() and TreeModel.select_features().

This is distinct from CPP’s in-run pre-filter (which screens candidate features by the test-group standard deviation): pruning measures variance over all samples of the already-selected features.

Added in version 1.1.0.

Parameters:

df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
df_parts (pd.DataFrame, shape (n_samples, n_parts), optional) – DataFrame with sequence parts. Used to build the feature matrix; not required if X is given.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
threshold (float, default=0.0) – Minimum population variance (numpy var, ddof=0) that a feature’s column must exceed to be kept. The threshold is in variance units, not standard-deviation units. Feature values are means of (typically [0, 1]-normalized) scale values, so variances are small, commonly below 0.1. As a guide, 0.0 removes only strictly constant features, while ~0.01 to ~0.05 also prunes low-variance features. The variance is computed per feature column over all provided samples (every row of df_parts / X, both classes together), not over the test group only and not per split.
X (array-like, shape (n_samples, n_features), optional) – Pre-computed feature matrix. Column i must correspond to the feature in row i of df_feat (same order). If given, it is used directly and df_parts / df_scales are ignored (e.g. to reuse a matrix across pruning calls or to prune a CPP.run_num() df_feat).
accept_gaps (bool, default=False) – Whether to accept missing values (gaps) in the sequence parts. If True, they are omitted from computations.
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

df_feat – Feature DataFrame filtered to the features with variance above threshold, with a reset index.

Return type:

pd.DataFrame, shape (n_selected_features, n_feature_info)

Notes

Variance metric: population variance (ddof=0) of each feature column over all samples. A feature that is constant across the samples (zero peak-to-peak range) is treated as exactly zero variance, so threshold=0.0 removes precisely the constant features even when floating point would otherwise leave a tiny non-zero variance.
Scope: variance reflects how much a feature varies across your samples; it is unrelated to CPP’s in-run pre-filter, which screens candidate features by the test-group standard deviation (max_std_test) rather than the spread over all samples.
Recommended pruning order: variance (this method) -> correlation (SequenceFeature.prune_by_correlation()) -> TreeModel.select_features().
A pruning that retains no feature (e.g. threshold above every feature’s variance) raises a ValueError rather than returning an empty DataFrame.
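
The variance rule described in these notes can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: variance_mask is a hypothetical helper, and the toy matrix stands in for a real feature matrix.

```python
import numpy as np


def variance_mask(X, threshold=0.0):
    """Return a boolean mask of columns whose population variance exceeds threshold.

    Hypothetical helper mirroring the documented rule; not part of aaanalysis.
    """
    X = np.asarray(X, dtype=float)
    var = X.var(axis=0)  # population variance (ddof=0) per feature column
    # Constant columns (zero peak-to-peak range) count as exactly zero variance,
    # regardless of floating-point rounding in the variance computation.
    var[np.ptp(X, axis=0) == 0] = 0.0
    mask = var > threshold  # strict: variance at or below threshold is dropped
    if not mask.any():
        raise ValueError(f"'threshold' ({threshold}) removed all features.")
    return mask


# Toy matrix: 4 samples x 3 features (the middle column is constant)
X_toy = np.array([
    [0.1, 0.3, 0.0],
    [0.2, 0.3, 1.0],
    [0.3, 0.3, 0.0],
    [0.4, 0.3, 1.0],
])
mask_default = variance_mask(X_toy, threshold=0.0)   # drops only the constant column
mask_strict = variance_mask(X_toy, threshold=0.05)   # also drops the low-variance column
```

The column variances here are 0.0125, 0.0, and 0.25, so the default threshold keeps the first and third columns, and threshold=0.05 keeps only the third.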

See also

SequenceFeature.prune_by_correlation() for the complementary redundancy-pruning step.
SequenceFeature.feature_matrix() for the feature matrix that variance is computed over.
TreeModel.select_features() for the model-based selection that follows pruning.

Examples

Feature pruning is the model-free, post-hoc reduction of a df_feat (here from CPP.run() on the gamma-secretase DOM_GSEC dataset) before model-based selection. SequenceFeature.prune_by_variance() drops near-constant features, those whose values barely change across the samples, measured on the feature matrix that SequenceFeature.feature_matrix() builds from df_parts. Recommended order: variance -> correlation -> select_features.

import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
# gamma-secretase geometry: the TMD (~20 aa) comes from each protein's tmd_start/tmd_stop,
# flanked by short juxtamembrane domains of 4 residues each.
df_parts = sf.get_df_parts(df_seq=df_seq, jmd_n_len=4, jmd_c_len=4)
df_feat = aa.CPP(df_parts=df_parts).run(labels=labels, n_filter=50)
print(f"features from CPP.run: {len(df_feat)}")

features from CPP.run: 50

With the default threshold=0.0 only strictly constant features are removed (population variance over all samples; threshold is in variance units). The result is a row-filtered df_feat with the same schema:

df_var = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, threshold=0.0)
print(f"kept {len(df_var)} of {len(df_feat)} features")
aa.display_df(df_var, n_rows=10, show_shape=True)

kept 50 of 50 features
DataFrame shape: (50, 13)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_mann_whitney	p_val_fdr_bh	positions
1	TMD-Pattern(C,4,8)-BEGF750101	Conformation	α-helix	α-helix	Conformational ...in-Dirkx, 1975)	0.444000	0.299000	0.299000	0.090000	0.196000	0.000002	0.045706	23,27
2	TMD_C_JMD_C-Pat...4,8)-BEGF750101	Conformation	α-helix	α-helix	Conformational ...in-Dirkx, 1975)	0.444000	0.299000	0.299000	0.090000	0.196000	0.000002	0.015235	24,28
3	TMD_C_JMD_C-Pat...,12)-CRAJ730103	Conformation	β-turn	β-turn	Normalized freq...d et al., 1973)	0.431000	0.251000	-0.251000	0.088000	0.143000	0.000003	0.005935	29,33,37
4	TMD_C_JMD_C-Pat...,12)-CRAJ730103	Conformation	β-turn	β-turn	Normalized freq...d et al., 1973)	0.431000	0.251000	-0.251000	0.088000	0.143000	0.000003	0.005564	24,28,32
5	TMD_C_JMD_C-Pat...,12)-MUNV940102	Energy	Free energy (folding)	Free energy (α-helix)	Free energy in ...-Serrano, 1994)	0.422000	0.148000	-0.148000	0.056000	0.097000	0.000005	0.003512	29,33,37
6	TMD_C_JMD_C-Pat...,12)-MUNV940102	Energy	Free energy (folding)	Free energy (α-helix)	Free energy in ...-Serrano, 1994)	0.422000	0.148000	-0.148000	0.056000	0.097000	0.000005	0.003902	24,28,32
7	TMD_C_JMD_C-Pat...5,8)-ZIMJ680104	Energy	Isoelectric point	Isoelectric point	Isoelectric poi...n et al., 1968)	0.402000	0.130000	0.130000	0.080000	0.088000	0.000013	0.005435	33,36,39
8	TMD_C_JMD_C-Pat...,14)-ZIMJ680104	Energy	Isoelectric point	Isoelectric point	Isoelectric poi...n et al., 1968)	0.402000	0.130000	0.130000	0.080000	0.088000	0.000013	0.005359	28,31,34
9	TMD_C_JMD_C-Seg...,11)-AURR980119	Conformation	α-helix (C-term, out)	α-helix (C-terminal, outside)	Normalized posi...ora-Rose, 1998)	0.392000	0.198000	0.198000	0.106000	0.178000	0.000022	0.005822	39,40
10	TMD_C_JMD_C-Seg...,15)-AURR980119	Conformation	α-helix (C-term, out)	α-helix (C-terminal, outside)	Normalized posi...ora-Rose, 1998)	0.392000	0.198000	0.198000	0.106000	0.178000	0.000022	0.005769	38

The remaining parameters control how the feature matrix is built or let you reuse one. A custom df_scales overrides the default scale set; accept_gaps tolerates gapped parts; n_jobs parallelizes the build. A pre-computed X (column i = feature in row i of df_feat) skips the build entirely and also covers a df_feat from CPP.run_num():

df_scales = aa.load_scales()
df_var_full = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, df_scales=df_scales,
                                   threshold=0.005, accept_gaps=True, n_jobs=1)
X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
df_var_X = sf.prune_by_variance(df_feat=df_feat, X=X, threshold=0.005)
print(f"via df_parts+params: {len(df_var_full)} | via pre-computed X: {len(df_var_X)}")

via df_parts+params: 50 | via pre-computed X: 50
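
Because the threshold is in variance units, it can help to look at the per-column variances before picking a value. A minimal sketch, assuming a synthetic matrix in place of the X built above (the random data, the variable names, and the candidate thresholds are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for a feature matrix: 40 samples x 5 features in [0, 1]
X_demo = rng.uniform(0.0, 1.0, size=(40, 5))
X_demo[:, 0] = 0.5                          # strictly constant column
X_demo[:, 1] = 0.5 + 0.01 * X_demo[:, 1]    # near-constant column

var = X_demo.var(axis=0)   # population variance (ddof=0), the unit of `threshold`
std = np.sqrt(var)         # standard deviation, shown for comparison only

# Count how many columns each candidate threshold would keep
for t in (0.0, 0.01, 0.05):
    n_kept = int((var > t).sum())
    print(f"threshold={t}: keeps {n_kept} of {X_demo.shape[1]} columns")
```

A standard deviation of 0.1 corresponds to a variance of 0.01, so thresholds that look small in variance units can still be fairly strict.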

What can go wrong? A threshold above every feature’s variance would remove all features, so it raises a ValueError rather than returning an empty table:

try:
    sf.prune_by_variance(df_feat=df_feat, X=X, threshold=1e6)
except ValueError as e:
    print(f"ValueError: {e}")

ValueError: 'threshold' (1000000.0) removed all features. Lower it to retain features.
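
In a pipeline, one way to handle this error is to try thresholds from strict to lenient and keep the first one that retains features. The sketch below is an assumption about how such a fallback could look, not part of the library: first_working_threshold is a hypothetical helper, and fake_prune is a stand-in that only imitates the documented ValueError behavior.

```python
def first_working_threshold(prune_fn, thresholds):
    """Try thresholds in order; return (threshold, result) for the first that succeeds."""
    last_error = None
    for t in thresholds:
        try:
            return t, prune_fn(t)
        except ValueError as e:  # raised when a threshold removes all features
            last_error = e
    raise ValueError("No threshold retained any feature.") from last_error


# Stand-in for the real pruning call; keeps toy variances above the threshold
def fake_prune(threshold, variances=(0.002, 0.02, 0.08)):
    kept = [v for v in variances if v > threshold]
    if not kept:
        raise ValueError(f"'threshold' ({threshold}) removed all features.")
    return kept


t_used, kept = first_working_threshold(fake_prune, thresholds=[1.0, 0.1, 0.05, 0.0])
print(f"threshold used: {t_used}, features kept: {len(kept)}")
```

With the real method, prune_fn could be lambda t: sf.prune_by_variance(df_feat=df_feat, X=X, threshold=t), reusing the pre-computed X so the feature matrix is not rebuilt on each attempt.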