SequenceFeature.prune_by_variance
- SequenceFeature.prune_by_variance(df_feat=None, df_parts=None, df_scales=None, threshold=0.0, X=None, accept_gaps=False, n_jobs=1)
Prune near-constant features from a feature DataFrame by variance.
Model-free feature pruning step: drops every feature whose column variance in the realized feature matrix (built from df_parts, or supplied directly via X) is at or below threshold, and returns the row-filtered df_feat. Use it on a fitted df_feat (e.g. from CPP.run()) as the first reduction stage, before SequenceFeature.prune_by_correlation() and TreeModel.select_features().

This is distinct from CPP’s in-run pre-filter (which screens candidate features by the test-group standard deviation): pruning measures variance over all samples of the already-selected features.
Added in version 1.1.0.
- Parameters:
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
df_parts (pd.DataFrame, shape (n_samples, n_parts), optional) – DataFrame with sequence parts. Used to build the feature matrix; not required if X is given.
df_scales (pd.DataFrame, shape (n_letters, n_scales), optional) – DataFrame of scales with letters typically representing amino acids. Default from load_scales() unless specified in options['df_scales'].
threshold (float, default=0.0) – Minimum population variance (numpy var, ddof=0) a feature’s column must exceed to be kept; the threshold is in variance units, not standard-deviation units. Feature values are means of (typically [0, 1]-normalized) scale values, so variances are small (commonly below 0.1). A useful range: 0.0 removes only strictly constant features, while ~0.01 to ~0.05 also prunes low-variance features. The variance is computed per feature column over all provided samples (every row of df_parts/X, both classes together), not over the test group only and not per split.
X (array-like, shape (n_samples, n_features), optional) – Pre-computed feature matrix. Column i must correspond to the feature in row i of df_feat (same order). If given, it is used directly and df_parts/df_scales are ignored (e.g. to reuse a matrix across pruning calls or to prune a CPP.run_num() df_feat).
accept_gaps (bool, default=False) – Whether to accept missing values by omitting them from computations (if True).
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
- Returns:
df_feat – Feature DataFrame filtered to the features with variance above threshold, with a reset index.
- Return type:
pd.DataFrame, shape (n_selected_features, n_feature_info)
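Because threshold is expressed in variance units, a cutoff that is easier to reason about as a standard deviation has to be squared before it is passed in. The following is a minimal sketch using only the Python standard library; the helper std_to_variance_threshold and the sample values are illustrative assumptions and are not part of aaanalysis. It also shows that values bounded in [0, 1] cannot exceed a population variance of 0.25, which is why useful thresholds are small.

```python
import statistics


def std_to_variance_threshold(std_cutoff: float) -> float:
    """Convert a standard-deviation cutoff into variance units (hypothetical helper)."""
    if std_cutoff < 0:
        raise ValueError("std_cutoff must be non-negative")
    return std_cutoff ** 2


# Illustrative feature column with values in [0, 1] (made-up numbers).
column = [0.40, 0.45, 0.50, 0.55, 0.60]
variance = statistics.pvariance(column)  # population variance, comparable to ddof=0
std = statistics.pstdev(column)          # population standard deviation

# Most extreme [0, 1] column: half zeros, half ones gives the largest possible variance.
extreme_variance = statistics.pvariance([0.0, 1.0, 0.0, 1.0])

# A standard-deviation cutoff of 0.1 corresponds to a variance threshold of about 0.01.
threshold = std_to_variance_threshold(0.1)

print(f"variance={variance:.4f}, std={std:.4f}, threshold={threshold:.4f}")
print(f"kept: {variance > threshold}, max variance on [0, 1]: {extreme_variance}")
```

Comparing the standard deviation of a column directly against threshold would silently apply a much looser filter, since the standard deviation of a [0, 1]-bounded column is never smaller than its variance.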
Notes
- Variance metric: population variance (ddof=0) of each feature column over all samples. A feature that is constant across the samples (zero peak-to-peak range) is treated as exactly zero variance, so threshold=0.0 removes precisely the constant features even when floating point would otherwise leave a tiny non-zero variance.
- Scope: variance reflects how much a feature varies across your samples; it is unrelated to CPP’s in-run pre-filter, which screens candidate features by the test-group standard deviation (max_std_test) rather than the spread over all samples.
- Recommended pruning order: variance (this method) -> correlation (SequenceFeature.prune_by_correlation()) -> TreeModel.select_features().
- A pruning that retains no feature (e.g. threshold above every feature’s variance) raises a ValueError rather than returning an empty DataFrame.
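The rule described in these notes can be sketched in plain Python. This is a hypothetical re-implementation for illustration only, not the aaanalysis source: the function name variance_keep_mask, the nested-list matrix, and the sample values are assumptions. It computes the population variance per column, treats a column with zero peak-to-peak range as exactly zero variance, keeps columns strictly above the threshold, and raises a ValueError when nothing would be kept.

```python
import statistics
from typing import List, Sequence


def variance_keep_mask(X: Sequence[Sequence[float]], threshold: float = 0.0) -> List[bool]:
    """Return one boolean per column: True if its variance exceeds threshold.

    Hypothetical sketch of the rule in the Notes, not aaanalysis code.
    """
    if not X:
        raise ValueError("X must contain at least one sample")
    n_features = len(X[0])
    keep = []
    for j in range(n_features):
        column = [row[j] for row in X]
        if max(column) == min(column):
            # Zero peak-to-peak range: treat as exactly zero variance.
            variance = 0.0
        else:
            variance = statistics.pvariance(column)  # population variance (ddof=0)
        keep.append(variance > threshold)
    if not any(keep):
        raise ValueError(f"'threshold' ({threshold}) removed all features.")
    return keep


# Rows are samples, columns are features (made-up values).
X = [
    [0.30, 0.5, 0.10],
    [0.30, 0.6, 0.90],
    [0.30, 0.5, 0.20],
    [0.30, 0.6, 0.80],
]
mask_default = variance_keep_mask(X, threshold=0.0)   # drops only the constant column
mask_strict = variance_keep_mask(X, threshold=0.01)   # also drops the low-variance column
print(mask_default)
print(mask_strict)
```

Since column i of the matrix corresponds to row i of df_feat, a mask of this kind maps directly onto the rows that are retained.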
See also
- SequenceFeature.prune_by_correlation() for the complementary redundancy-pruning step.
- SequenceFeature.feature_matrix() for the feature matrix that variance is computed over.
- TreeModel.select_features() for the model-based selection that follows pruning.
Examples
Feature pruning is the model-free, post-hoc reduction of a df_feat (here from CPP.run() on the gamma-secretase DOM_GSEC dataset) before model-based selection. SequenceFeature.prune_by_variance() drops near-constant features, i.e. those whose feature values barely change across the samples, measured on the feature matrix that SequenceFeature.feature_matrix() builds from df_parts. Recommended order: variance -> correlation -> select_features.

    import aaanalysis as aa
    aa.options["verbose"] = False

    df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
    labels = df_seq["label"].to_list()
    sf = aa.SequenceFeature()
    # gamma-secretase geometry: the TMD (~20 aa) comes from each protein's tmd_start/tmd_stop,
    # flanked by short juxtamembrane domains of 4 residues each.
    df_parts = sf.get_df_parts(df_seq=df_seq, jmd_n_len=4, jmd_c_len=4)
    df_feat = aa.CPP(df_parts=df_parts).run(labels=labels, n_filter=50)
    print(f"features from CPP.run: {len(df_feat)}")

CPP using the Python kernel fallback — the compiled Cython extension is not available in this install. Output is bit-exact with the Cython path but ~2x slower. Reinstall via pip install --force-reinstall aaanalysis to fetch a prebuilt wheel.
features from CPP.run: 50
With the default threshold=0.0 only strictly constant features are removed (population variance over all samples; threshold is in variance units). The result is a row-filtered df_feat with the same schema:

    df_var = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, threshold=0.0)
    print(f"kept {len(df_var)} of {len(df_feat)} features")
    aa.display_df(df_var, n_rows=10, show_shape=True)

kept 50 of 50 features
DataFrame shape: (50, 13)

|  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD-Pattern(C,4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.045706 | 23,27 |
| 2 | TMD_C_JMD_C-Pat...4,8)-BEGF750101 | Conformation | α-helix | α-helix | Conformational ...in-Dirkx, 1975) | 0.444000 | 0.299000 | 0.299000 | 0.090000 | 0.196000 | 0.000002 | 0.015235 | 24,28 |
| 3 | TMD_C_JMD_C-Pat...,12)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.431000 | 0.251000 | -0.251000 | 0.088000 | 0.143000 | 0.000003 | 0.005935 | 29,33,37 |
| 4 | TMD_C_JMD_C-Pat...,12)-CRAJ730103 | Conformation | β-turn | β-turn | Normalized freq...d et al., 1973) | 0.431000 | 0.251000 | -0.251000 | 0.088000 | 0.143000 | 0.000003 | 0.005564 | 24,28,32 |
| 5 | TMD_C_JMD_C-Pat...,12)-MUNV940102 | Energy | Free energy (folding) | Free energy (α-helix) | Free energy in ...-Serrano, 1994) | 0.422000 | 0.148000 | -0.148000 | 0.056000 | 0.097000 | 0.000005 | 0.003512 | 29,33,37 |
| 6 | TMD_C_JMD_C-Pat...,12)-MUNV940102 | Energy | Free energy (folding) | Free energy (α-helix) | Free energy in ...-Serrano, 1994) | 0.422000 | 0.148000 | -0.148000 | 0.056000 | 0.097000 | 0.000005 | 0.003902 | 24,28,32 |
| 7 | TMD_C_JMD_C-Pat...5,8)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005435 | 33,36,39 |
| 8 | TMD_C_JMD_C-Pat...,14)-ZIMJ680104 | Energy | Isoelectric point | Isoelectric point | Isoelectric poi...n et al., 1968) | 0.402000 | 0.130000 | 0.130000 | 0.080000 | 0.088000 | 0.000013 | 0.005359 | 28,31,34 |
| 9 | TMD_C_JMD_C-Seg...,11)-AURR980119 | Conformation | α-helix (C-term, out) | α-helix (C-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.392000 | 0.198000 | 0.198000 | 0.106000 | 0.178000 | 0.000022 | 0.005822 | 39,40 |
| 10 | TMD_C_JMD_C-Seg...,15)-AURR980119 | Conformation | α-helix (C-term, out) | α-helix (C-terminal, outside) | Normalized posi...ora-Rose, 1998) | 0.392000 | 0.198000 | 0.198000 | 0.106000 | 0.178000 | 0.000022 | 0.005769 | 38 |

The remaining parameters control how the feature matrix is built or let you reuse one. A custom df_scales overrides the default scale set; accept_gaps tolerates gapped parts; n_jobs parallelizes the build. A pre-computed X (column i = feature in row i) skips the build entirely and also covers a CPP.run_num() df_feat:

    df_scales = aa.load_scales()
    df_var_full = sf.prune_by_variance(df_feat=df_feat, df_parts=df_parts, df_scales=df_scales,
                                       threshold=0.005, accept_gaps=True, n_jobs=1)
    X = sf.feature_matrix(features=df_feat, df_parts=df_parts)
    df_var_X = sf.prune_by_variance(df_feat=df_feat, X=X, threshold=0.005)
    print(f"via df_parts+params: {len(df_var_full)} | via pre-computed X: {len(df_var_X)}")

via df_parts+params: 50 | via pre-computed X: 50
What can go wrong? A threshold above every feature’s variance would remove all features, so it raises a ValueError rather than returning an empty table:

    try:
        sf.prune_by_variance(df_feat=df_feat, X=X, threshold=1e6)
    except ValueError as e:
        print(f"ValueError: {e}")

ValueError: 'threshold' (1000000.0) removed all features. Lower it to retain features.
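One possible way to handle this error in a pipeline is to try several thresholds from strictest to loosest and keep the first one that retains features. The sketch below is an assumption about how a caller might wrap the method, not behaviour provided by aaanalysis: prune_with_fallback, fake_prune, and the variance values are hypothetical, and a stand-in function replaces the real call so the example runs without the library installed.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def prune_with_fallback(prune: Callable[[float], T], thresholds: Iterable[float]) -> Tuple[float, T]:
    """Try thresholds from strictest to loosest; return the first that keeps features.

    `prune` is any callable that takes a threshold and raises ValueError when
    nothing is retained (hypothetical wrapper, not part of aaanalysis).
    """
    last_error = None
    for threshold in sorted(thresholds, reverse=True):
        try:
            return threshold, prune(threshold)
        except ValueError as error:
            last_error = error
    raise ValueError("no threshold retained any feature") from last_error


# Stand-in for the real pruning call (made-up per-feature variances).
fake_variances = [0.002, 0.004, 0.03]


def fake_prune(threshold: float) -> list:
    kept = [v for v in fake_variances if v > threshold]
    if not kept:
        raise ValueError(f"'threshold' ({threshold}) removed all features.")
    return kept


used_threshold, kept = prune_with_fallback(fake_prune, thresholds=[0.05, 0.01, 0.0])
print(used_threshold, kept)
```

With the objects from the examples above, the callable could be lambda t: sf.prune_by_variance(df_feat=df_feat, X=X, threshold=t), which reuses the pre-computed matrix across attempts.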