SequenceFeature.prune_by_variance

SequenceFeature.prune_by_variance(df_feat=None, df_parts=None, df_scales=None, threshold=0.0, X=None, accept_gaps=False, n_jobs=1)

Prune near-constant features from a feature DataFrame by variance.

Model-free feature pruning step: drops every feature whose column variance in the realized feature matrix (built from df_parts, or supplied directly via X) is at or below threshold, and returns the row-filtered df_feat. Use it on a fitted df_feat (e.g. from CPP.run()) as the first reduction stage, before SequenceFeature.prune_by_correlation() and TreeModel.select_features().

This is distinct from CPP’s in-run pre-filter, which screens candidate features by the test-group standard deviation: this method measures the variance of the already-selected features over all samples.
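To make the difference concrete, the following is a minimal NumPy sketch with made-up values for a single feature. It does not reproduce CPP's filter; it only contrasts a spread measured inside the test group with the population variance over all samples. The arrays `test_group` and `ref_group` are hypothetical.

```python
import numpy as np

# Toy values of ONE feature (hypothetical numbers, not taken from any dataset).
test_group = np.array([0.79, 0.80, 0.81, 0.80])  # samples of the test class
ref_group = np.array([0.19, 0.20, 0.21, 0.20])   # samples of the reference class

# Spread inside the test group only (the kind of quantity a test-group filter inspects).
std_test = test_group.std(ddof=0)

# Population variance over all samples, both classes together
# (the quantity this pruning step compares against the threshold).
var_all = np.concatenate([test_group, ref_group]).var(ddof=0)

print(f"std within test group:     {std_test:.4f}")
print(f"variance over all samples: {var_all:.4f}")
```

A feature can therefore be very tight within the test group and still vary strongly across the whole sample set, which is why the two checks are not interchangeable.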

Added in version 1.1.0.

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • df_parts (pd.DataFrame, shape (n_samples, n_parts), optional) – DataFrame with sequence parts. Used to build the feature matrix; not required if X is given.

  • df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].

  • threshold (float, default=0.0) – Minimum population variance (numpy var, ddof=0) that a feature’s column must exceed to be kept. The threshold is in variance units, not standard-deviation units. Feature values are means of (typically [0, 1]-normalized) scale values, so variances are small, commonly below 0.1. As a guide, 0.0 removes only strictly constant features, while values from ~0.01 to ~0.05 also prune low-variance features. The variance is computed per feature column over all provided samples (every row of df_parts / X, both classes together), not over the test group only and not per split.

  • X (array-like, shape (n_samples, n_features), optional) – Pre-computed feature matrix. Column i must correspond to the feature in row i of df_feat (same order). If given, it is used directly and df_parts / df_scales are ignored (e.g. to reuse a matrix across pruning calls or to prune a CPP.run_num() df_feat).

  • accept_gaps (bool, default=False) – Whether to accept missing values (gaps). If True, they are omitted from the computations.

  • n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

df_feat – Feature DataFrame filtered to the features with variance above threshold, with a reset index.

Return type:

pd.DataFrame, shape (n_selected_features, n_feature_info)

Notes

  • Variance metric: population variance (ddof=0) of each feature column over all samples. A feature that is constant across the samples (zero peak-to-peak range) is treated as exactly zero variance, so threshold=0.0 removes precisely the constant features even when floating point would otherwise leave a tiny non-zero variance.

  • Scope: variance reflects how much a feature varies across your samples; it is unrelated to CPP’s in-run pre-filter, which screens candidate features by the test-group standard deviation (max_std_test) rather than the spread over all samples.

  • Recommended pruning order: variance (this method) -> correlation (SequenceFeature.prune_by_correlation()) -> TreeModel.select_features().

  • If pruning would retain no features (e.g. threshold is above every feature’s variance), a ValueError is raised rather than an empty DataFrame being returned.
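The rules in these notes can be sketched in plain NumPy. This is an illustration under stated assumptions, not the library's implementation: the helper name `variance_keep_mask` is hypothetical, the error message is invented for the sketch, and the matrix is assumed to contain no missing values.

```python
import numpy as np

def variance_keep_mask(X, threshold=0.0):
    """Hypothetical helper: boolean mask of columns whose variance exceeds threshold."""
    X = np.asarray(X, dtype=float)
    # Population variance (ddof=0) of each feature column over all samples.
    var = X.var(axis=0, ddof=0)
    # Columns with zero peak-to-peak range are constant: force their variance to
    # exactly 0.0 so floating-point residue cannot slip past threshold=0.0.
    var[np.ptp(X, axis=0) == 0] = 0.0
    keep = var > threshold
    if not keep.any():
        # Mirrors the documented behavior: fail loudly instead of returning nothing.
        raise ValueError(f"threshold={threshold} would remove all features")
    return keep

X_demo = np.array([[0.1, 0.2, 0.5],
                   [0.1, 0.4, 0.5],
                   [0.1, 0.6, 0.5]])
print(variance_keep_mask(X_demo, threshold=0.0))
```

Only the middle column varies across the rows, so it is the only one the mask retains at threshold=0.0.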

Examples

Feature pruning is the model-free, post-hoc reduction of a df_feat (here from CPP.run() on the gamma-secretase DOM_GSEC dataset) before model-based selection. SequenceFeature.prune_by_variance() drops near-constant features, i.e. those whose values barely change across the samples, measured on the feature matrix that SequenceFeature.feature_matrix() builds from df_parts. Recommended order: variance -> correlation -> select_features.

import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
# gamma-secretase geometry: the TMD (~20 aa) comes from each protein's tmd_start/tmd_stop,
# flanked by short juxtamembrane domains of 4 residues each.
df_parts = sf.get_df_parts(df_seq=df_seq, jmd_n_len=4, jmd_c_len=4)
df_feat = aa.CPP(df_parts=df_parts).run(labels=labels, n_filter=50)
print(f"features from CPP.run: {len(df_feat)}")
features from CPP.run: 50

With the default threshold=0.0 only strictly constant features are removed (population variance over all samples; threshold is in variance units). The result is a row-filtered df_feat with the same schema:

df_var = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, threshold=0.0)
print(f"kept {len(df_var)} of {len(df_feat)} features")
aa.display_df(df_var, n_rows=10, show_shape=True)
kept 50 of 50 features
DataFrame shape: (50, 13)
   | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions
1  | TMD-Pattern(C,4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.045706 | 23,27
2  | TMD_C_JMD_C-Pat...4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.015235 | 24,28
3  | TMD_C_JMD_C-Pat...,12)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.431000 | 0.251000 | -0.251000 | 0.088000 | 0.143000 | 0.000003 | 0.005935 | 29,33,37
4  | TMD_C_JMD_C-Pat...,12)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.431000 | 0.251000 | -0.251000 | 0.088000 | 0.143000 | 0.000003 | 0.005564 | 24,28,32
5  | TMD_C_JMD_C-Pat...,12)-MUNV940102 | Energy | Free energy (folding) | Free energy (α-helix) | Free energy in ...-Serrano, 1994) | 0.422000 | 0.148000 | -0.148000 | 0.056000 | 0.097000 | 0.000005 | 0.003512 | 29,33,37
6  | TMD_C_JMD_C-Pat...,12)-MUNV940102 | Energy | Free energy (folding) | Free energy (α-helix) | Free energy in ...-Serrano, 1994) | 0.422000 | 0.148000 | -0.148000 | 0.056000 | 0.097000 | 0.000005 | 0.003902 | 24,28,32
7  | TMD_C_JMD_C-Pat...5,8)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005435 | 33,36,39
8  | TMD_C_JMD_C-Pat...,14)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005359 | 28,31,34
9  | TMD_C_JMD_C-Seg...,11)-AURR980119 | Conformation | α-helix (C-term, out) | α-helix (C-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.392000 | 0.198000 | 0.198000 | 0.106000 | 0.178000 | 0.000022 | 0.005822 | 39,40
10 | TMD_C_JMD_C-Seg...,15)-AURR980119 | Conformation | α-helix (C-term, out) | α-helix (C-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.392000 | 0.198000 | 0.198000 | 0.106000 | 0.178000 | 0.000022 | 0.005769 | 38
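Before raising the threshold above 0.0, it can help to see how many features would survive at a few candidate values. The sketch below does this on a small synthetic matrix with made-up values; in practice X would be the feature matrix of your own data. The thresholds scanned are just the range suggested in the parameter description.

```python
import numpy as np

# Synthetic feature matrix (4 samples x 4 features); the values are invented
# so that the columns span constant, low, medium and high variance.
X = np.array([
    [0.5, 0.45, 0.3, 0.0],
    [0.5, 0.55, 0.7, 1.0],
    [0.5, 0.45, 0.3, 0.0],
    [0.5, 0.55, 0.7, 1.0],
])

# Population variance of each feature column over all samples.
var = X.var(axis=0, ddof=0)

# Count the features that would be kept (variance strictly above threshold).
kept = {}
for threshold in (0.0, 0.01, 0.05):
    kept[threshold] = int((var > threshold).sum())
    print(f"threshold={threshold}: {kept[threshold]} of {X.shape[1]} features kept")
```

A scan like this shows where the count drops sharply, which is usually a more informative guide than picking a threshold blindly.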

The remaining parameters control how the feature matrix is built or let you reuse one. A custom df_scales overrides the default scale set; accept_gaps tolerates gapped parts; n_jobs parallelizes the build. A pre-computed X (column i = feature in row i) skips the build entirely and also covers a df_feat from CPP.run_num():

df_scales = aa.load_scales()
df_var_full = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, df_scales=df_scales,
                                   threshold=0.005, accept_gaps=True, n_jobs=1)
X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
df_var_X = sf.prune_by_variance(df_feat=df_feat, X=X, threshold=0.005)
print(f"via df_parts+params: {len(df_var_full)} | via pre-computed X: {len(df_var_X)}")
via df_parts+params: 50 | via pre-computed X: 50
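When a pre-computed X is reused across stages, its columns have to stay aligned with the rows of the pruned df_feat. The following sketch shows one way to keep both in step, using toy stand-ins: the feature ids and values are hypothetical, and the mask is computed directly with NumPy rather than by the library.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: row i of df_feat describes column i of X.
df_feat = pd.DataFrame({"feature": ["feat_a", "feat_b", "feat_c"]})
X = np.array([[0.2, 0.5, 0.1],
              [0.4, 0.5, 0.9],
              [0.6, 0.5, 0.5]])
if X.shape[1] != len(df_feat):
    raise ValueError("X needs one column per row of df_feat, in the same order")

# Columns with variance above 0.0 are kept; 'feat_b' is constant and drops out.
keep = X.var(axis=0, ddof=0) > 0.0

# Apply the same mask to both objects so they remain aligned.
df_kept = df_feat.loc[keep].reset_index(drop=True)
X_kept = X[:, keep]

print(df_kept["feature"].tolist(), X_kept.shape)
```

Filtering X with the same mask avoids rebuilding the matrix for the next pruning call.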

What can go wrong? A threshold above every feature’s variance would remove all features, so it raises a ValueError rather than returning an empty table:

try:
    sf.prune_by_variance(df_feat=df_feat, X=X, threshold=1e6)
except ValueError as e:
    print(f"ValueError: {e}")
ValueError: 'threshold' (1000000.0) removed all features. Lower it to retain features.