find_features

find_features(labels, df_seq, search='balanced', simplify=True, model='svm', cv=5, metric='balanced_accuracy', kws=None, subcategories=None, top_n=None, label_test=1, label_ref=0, name_test='TEST', name_ref='REF', plot=True, random_state=None, n_jobs=None, verbose=False)

Identify discriminating features in one call via a staged, interpretable CPP AutoML search.

The search is staged so its cost stays interpretable. Stage 1 cross-validates the full Cartesian Part × Split × Scale grid (at a reference n_filter) and ranks each axis by its marginal-mean impact; Stage 2 refines only the single highest-impact axis against n_filter (the others pinned at the stage optimum); Stage 3 refines the winning feature set with CPP.simplify() and recursive feature elimination. Selection is multi-objective: within each stage the Pareto-optimal-then-simplest configuration across all metrics wins, scored by the average cross-validated performance of one or more models. The winner is then ranked by tree-based importance and drawn as the CPP feature map. With search="fast" no search is run, and the result is byte-identical to the explicit single-CPP path.
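The Stage 1 axis ranking can be illustrated with a minimal sketch. It assumes that "marginal-mean impact" means the spread (max minus min) of the per-level mean scores along one axis; the source does not spell out the exact definition, and the grid levels and scores below are made up for the illustration rather than produced by aaanalysis.

```python
from itertools import product
from statistics import mean

# Illustrative axis levels (assumptions, not the library's actual sweep values).
parts = ["tmd", "tmd_jmd"]
splits = ["Segment", "Segment,Pattern"]
scales = ["explain:30", "explain:60"]

def toy_score(part, split, scale):
    """Made-up deterministic score standing in for a cross-validated result."""
    score = 0.80
    score += 0.10 if split == "Segment,Pattern" else 0.0
    score += 0.02 if part == "tmd_jmd" else 0.0
    score += 0.01 if scale == "explain:60" else 0.0
    return score

# Full Cartesian grid: one score per (part, split, scale) combination.
grid = {combo: toy_score(*combo) for combo in product(parts, splits, scales)}

def marginal_impact(grid, axis_index):
    """Spread of the per-level mean scores along one axis (assumed definition)."""
    levels = {}
    for combo, score in grid.items():
        levels.setdefault(combo[axis_index], []).append(score)
    level_means = [mean(scores) for scores in levels.values()]
    return max(level_means) - min(level_means)

axes = ["Part", "Split", "Scale"]
impact = {name: marginal_impact(grid, i) for i, name in enumerate(axes)}
ranking = sorted(impact, key=impact.get, reverse=True)
print(ranking)  # highest-impact axis first
```

In this toy grid the Split axis has the largest spread, so under the stated assumption it would be the single axis refined against n_filter in Stage 2.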

Parameters:
  • labels (array-like, shape (n_samples,)) – Class labels for the samples (typically, 1=positive/test, 0=negative/reference).

  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – Sequence DataFrame, row-aligned to labels. Required, because the Part regions are a swept lever and must be (re)built from the sequences via SequenceFeature.get_df_parts().

  • search (str, default="balanced") – Search effort: "fast" (single default configuration, no search), "balanced" (sweep the Split levers + Scale + symmetric JMD length n_jmd + n_filter), or "exhaustive" (also sweep the Part region set and the performance-ranked scale sets, with a finer grid and a wider n_jmd range).

  • simplify (bool, default=True) – If True, refine the winning feature set with CPP.simplify() (kept only if it is not Pareto-dominated).

  • model (str, estimator, or list, default="svm") – Selection model(s): "svm", "rf", "log_reg", a scikit-learn estimator, or a list of these. A list averages the cross-validated scores across models.

  • cv (int, default=5) – Number of cross-validation folds for the selection score, must be > 1.

  • metric (str or list of str, default="balanced_accuracy") – Cross-validation scoring metric(s). A list triggers multi-objective Pareto selection.

  • kws (dict, optional) – Bounded power-user overrides; each pins a swept lever to a single value (unknown keys raise). Recognized keys: n_explain, n_split_max, n_filter, n_jmd (the symmetric JMD length jmd_n_len = jmd_c_len), simplify_strategy, max_cor, max_overlap.

  • subcategories (list of str, optional) – AAontology subcategories to restrict the scale sets to. If None, all scales of the grade are used.

  • top_n (int, optional) – If given, keep only the top top_n features (after importance ranking).

  • label_test (int, default=1) – Class label of the test/positive group passed to CPP.run().

  • label_ref (int, default=0) – Class label of the reference/negative group passed to CPP.run().

  • name_test (str, default="TEST") – Display name of the test/positive group in the feature map.

  • name_ref (str, default="REF") – Display name of the reference/negative group in the feature map.

  • plot (bool, default=True) – If True, draw the CPP feature map (returned as ax) and the publication eval figures (attached as ax.eval); if False, draw nothing and return ax as None.

  • random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are reproducible.

  • n_jobs (int, optional) – Number of CPU cores (>=1) for the sweep and feature-matrix builds. If None, the optimized number is used.

  • verbose (bool, default=False) – If True, verbose progress information is printed.
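The kws contract described above (only recognized keys are accepted, unknown keys raise) can be sketched as follows. The helper name check_kws and the choice of ValueError are assumptions made for this illustration; the source only states that unknown keys raise, not which exception is used.

```python
# Recognized override keys, as listed in the kws parameter description.
RECOGNIZED_KEYS = {
    "n_explain", "n_split_max", "n_filter", "n_jmd",
    "simplify_strategy", "max_cor", "max_overlap",
}

def check_kws(kws):
    """Return a copy of kws, raising on unrecognized keys (hypothetical helper)."""
    if kws is None:
        return {}
    unknown = sorted(set(kws) - RECOGNIZED_KEYS)
    if unknown:
        # The exception type is an assumption; the docs only say unknown keys raise.
        raise ValueError(f"Unknown 'kws' keys: {unknown}")
    return dict(kws)

print(check_kws({"n_explain": 30, "n_split_max": 15}))
```

A misspelled key such as "n_filters" would be rejected up front instead of being silently ignored, which keeps a pinned lever from quietly falling back to the full sweep.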

Returns:

  • df_feat (pd.DataFrame) – Feature DataFrame of the selected configuration in the canonical CPP schema, ranked by tree-based importance.

  • ax (matplotlib.axes.Axes or None) – The feature-map Axes if plot=True, else None. When a search was run, the publication eval figures are attached as ax.eval (a list of matplotlib.figure.Figure; empty for a single-configuration fast search) — see plot_eval().

  • df_eval (pd.DataFrame) – Per-configuration sweep table: the configuration descriptors, one <metric>_mean / <metric>_std column per metric, plus stage, is_pareto (Pareto-optimal within its stage), rank, and is_selected (the single winner).
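As a small illustration of the df_eval schema, the score columns follow the <metric>_mean / <metric>_std pattern, one pair per metric. The helper below is hypothetical and not part of aaanalysis; it only spells out the naming convention.

```python
def metric_columns(metric):
    """Expected df_eval score columns for one metric name or a list of names."""
    metrics = [metric] if isinstance(metric, str) else list(metric)
    return [f"{m}_{stat}" for m in metrics for stat in ("mean", "std")]

cols = metric_columns(["balanced_accuracy", "f1"])
print(cols)
```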

See also

  • CPPGrid for the configuration sweep this pipeline drives.

  • CPP.run() and CPP.simplify() for the underlying feature engineering.

  • CPPPlot.feature_map() for the visualization.

Examples

The aaanalysis.pipe (aap) module provides high-level golden pipelines — stateless, one-call wrappers over the AAanalysis primitives. aap.find_features runs a staged, interpretable CPP AutoML search: Stage 1 cross-validates the full Cartesian Part × Split × Scale grid and ranks each axis by its marginal-mean impact; Stage 2 refines the single highest-impact axis against n_filter; Stage 3 refines the winning feature set (CPP.simplify + recursive feature elimination). Selection is multi-objective — within each stage the Pareto-optimal-then-simplest configuration across all metrics wins, scored by the averaged cross-validated performance of one or more models. It returns (df_feat, ax, df_eval).

import warnings
import matplotlib.pyplot as plt
import aaanalysis as aa
import aaanalysis.pipe as aap

aa.options["verbose"] = False
aa.plot_settings()
# Silence the small-demo-data 'n_filter' shortfall advisory (this 40-sequence toy set can't
# always supply the larger n_filter values after redundancy filtering; moot on real data).
warnings.filterwarnings("ignore", message=r"'n_filter'", category=RuntimeWarning)

df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()

aa.display_df(df_seq, n_rows=10, show_shape=True)
DataFrame shape: (40, 8)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 Q14802 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0 37 59 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS
2 Q86UE4 MAARSWQDELAQQAE...SPKQIKKKKKARRET 0 50 72 LGLEPKRYPG WVILVGTGALGLLLLFLLGYGWA AACAGARKKR
3 Q969W9 MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL 0 41 63 FQSMEITELE FVQIIIIVVVMMVMVVVITCLLS HYKLSARSFI
4 P53801 MAPGVARGPTPYWRL...GLFKEENPYARFENN 0 97 119 RWGVCWVNFE ALIITMSVVGGTLLLGIAICCCC CCRRKRSRKP
5 Q8IUW5 MAPRALPGSAVLAAA...EVPATPVKRERSGTE 0 59 81 NDTGNGHPEY IAYALVPVFFIMGLFGVLICHLL KKKGYRCTTE
6 P01135 MVPSAGQLALFALGI...LLKGRTACCHSETVV 0 99 121 AVVAASQKKQ AITALVVVSIVALAVLIITCVLI HCCQVRKHCE
7 O43914 MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK 0 42 64 DCSCSTVSPG VLAGIVMGDLVLTVLIALAVYFL GRLVPRGRGA
8 P05556 MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK 0 729 751 ENPECPTGPD IIPIVAGVVAGIVLIGLALLLIW KLLMIIHDRR
9 P16234 MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL 0 527 549 VAPTLRSELT VAAAVLVLLVIVIISLIVLVVIW KQKPRYEIRW
10 P50895 MEPPDAPAQARGAPR...SGGARGGSGGFGDEC 0 549 571 TVSPQTSQAG VAVMAVAVSVGLLLLVVAVFYCV RRKGGPCCRQ

Fast runs a single default configuration — no search — so the result is byte-identical to the explicit CPP chain. df_eval then holds one row (the single configuration with its cross-validated balanced_accuracy):

df_feat, ax, df_eval = aap.find_features(labels=labels, df_seq=df_seq, search="fast",
                                         plot=False, random_state=42, n_jobs=1)

aa.display_df(df_eval, n_rows=10, show_shape=True)
DataFrame shape: (1, 14)
  stage list_parts split_types pattern_mode n_split_max scale n_jmd n_filter n_features balanced_accuracy_mean balanced_accuracy_std is_pareto rank is_selected
1 single tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern...PeriodicPattern p1+p2 15 explain:30 10 100 81 0.950000 0.100000 True 1 True
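The balanced_accuracy_mean and balanced_accuracy_std columns summarize the per-fold cross-validation scores. The sketch below uses hypothetical per-fold values (the library does not expose them here) and assumes a population standard deviation, which is consistent with the printed 0.950000 and 0.100000 for cv=5.

```python
from statistics import mean, pstdev

# Hypothetical per-fold balanced accuracies for cv=5 (not taken from the library).
fold_scores = [1.0, 1.0, 1.0, 1.0, 0.75]

score_mean = round(mean(fold_scores), 6)
# Population standard deviation (assumed convention for the _std column).
score_std = round(pstdev(fold_scores), 6)
print(score_mean, score_std)
```

With only 40 sequences and five folds, a single imperfect fold is enough to produce a standard deviation of this size, so the _std column is worth reading alongside the mean.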

Balanced runs the staged search. df_eval is the per-stage sweep table — the sensitivity Stage-1 grid, the n_filter Stage-2 refinement of the dominant axis, and the refine rows — each carrying is_pareto (Pareto-optimal within its stage) and one is_selected winner (here the Split sweep is pinned via kws to keep the example quick):

df_feat, ax, df_eval = aap.find_features(labels=labels, df_seq=df_seq, search="balanced",
                                         kws={"n_explain": 30, "n_split_max": 15},
                                         plot=False, random_state=42, n_jobs=1)

aa.display_df(df_eval, n_rows=10, show_shape=True)
DataFrame shape: (22, 14)
  stage list_parts split_types pattern_mode n_split_max scale n_jmd n_filter n_features balanced_accuracy_mean balanced_accuracy_std is_pareto is_selected rank
1 sensitivity tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern p1 15 explain:30 10 150 150 0.975000 0.050000 True False 1
2 n_filter tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern p1 15 explain:30 10 125 125 0.975000 0.050000 True False 2
3 n_filter tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern p1 15 explain:30 10 150 150 0.975000 0.050000 True False 3
4 refine tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern p1 15 explain:30 10 125 101 0.975000 0.050000 True False 4
5 refine tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern p1 15 explain:30 10 125 61 0.975000 0.050000 True True 5
6 sensitivity tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment none 15 explain:30 10 150 150 0.950000 0.100000 False False 6
7 sensitivity tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,PeriodicPattern p2 15 explain:30 10 150 150 0.950000 0.061237 False False 7
8 n_filter tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern p1 15 explain:30 10 50 50 0.950000 0.100000 False False 8
9 n_filter tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,Pattern p1 15 explain:30 10 100 100 0.950000 0.100000 False False 9
10 sensitivity tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment none 15 explain:30 6 150 150 0.925000 0.150000 False False 10

Selection is multi-objective. Pass several models (their cross-validated scores are averaged) and several metrics (the winner is Pareto-optimal across them, then simplest). df_eval gains one <metric>_mean/_std column per metric and an is_pareto flag. cv sets the folds; simplify toggles the refinement; subcategories / top_n restrict the scales / the returned features; label_test / label_ref set the groups; exhaustive additionally sweeps the Part regions and the performance-ranked scale sets:

subs = sorted(aa.load_scales(name="scales_cat")["subcategory"].unique())[:15]

df_feat, ax, df_eval = aap.find_features(labels=labels, df_seq=df_seq, search="balanced",
                                         model=["svm", "rf"],
                                         metric=["balanced_accuracy", "f1"],
                                         cv=5, simplify=True,
                                         kws={"n_explain": 30, "n_split_max": 15, "n_filter": 25},
                                         subcategories=subs, top_n=15,
                                         label_test=1, label_ref=0,
                                         plot=False, random_state=42, n_jobs=1, verbose=False)

aa.display_df(df_eval[df_eval["is_pareto"]], n_rows=10, show_shape=True)
DataFrame shape: (3, 16)
  stage list_parts split_types pattern_mode n_split_max scale n_jmd n_filter n_features balanced_accuracy_mean balanced_accuracy_std f1_mean f1_std is_pareto is_selected rank
1 refine tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,PeriodicPattern p2 15 explain:30 10 25 12 0.962500 0.075000 0.963889 0.072222 True True 1
2 sensitivity tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,PeriodicPattern p2 15 explain:30 10 25 25 0.937500 0.100000 0.943889 0.090613 True False 2
4 n_filter tmd,jmd_n_tmd_n,tmd_c_jmd_c Segment,PeriodicPattern p2 15 explain:30 10 25 25 0.937500 0.100000 0.943889 0.090613 True False 4
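The Pareto-optimal-then-simplest rule can be sketched in a few lines. This is a minimal illustration under stated assumptions: higher is better for every metric, "simplest" is read as the fewest features, and the candidate configurations are invented for the example. The library's exact tie-breaking is not described in the source.

```python
def dominates(a, b, metrics):
    """True if a is at least as good as b on every metric and better on one."""
    return (all(a[m] >= b[m] for m in metrics)
            and any(a[m] > b[m] for m in metrics))

def select(configs, metrics):
    """Keep the non-dominated configurations, then pick the one with fewest features."""
    front = [c for c in configs
             if not any(dominates(other, c, metrics) for other in configs)]
    winner = min(front, key=lambda c: c["n_features"])
    return front, winner

# Hypothetical configurations within one stage (values made up for illustration).
configs = [
    {"name": "A", "balanced_accuracy": 0.95, "f1": 0.90, "n_features": 100},
    {"name": "B", "balanced_accuracy": 0.90, "f1": 0.95, "n_features": 40},
    {"name": "C", "balanced_accuracy": 0.90, "f1": 0.90, "n_features": 10},  # dominated by A and B
    {"name": "D", "balanced_accuracy": 0.95, "f1": 0.90, "n_features": 60},  # ties A on both metrics
]

front, winner = select(configs, ["balanced_accuracy", "f1"])
print([c["name"] for c in front], winner["name"])
```

Configuration C is the smallest but is excluded because it is dominated. Among the remaining trade-offs the smallest feature set is chosen, which mirrors how a simplified refine row can win over a larger configuration with the same scores.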

With plot=True the winning features are drawn as the CPP feature map (the returned ax), and — when a search was run — the sweep is decomposed into publication-ready eval figures attached as ax.eval: a series of 2D viridis heatmaps (the two most-informative axes per panel, the least as the slice), a marginal-impact panel, and an n_filter panel. name_test / name_ref label the two groups; a single plt.show() renders the feature map and every eval figure, and each figure can be saved individually for a paper (ax.eval[0].savefig(...)):

df_feat, ax, df_eval = aap.find_features(labels=labels, df_seq=df_seq, search="balanced",
                                         kws={"n_split_max": 15}, top_n=25, plot=True,
                                         name_test="SUBSTRATE", name_ref="NON-SUB",
                                         random_state=42, n_jobs=1)

print(f"feature map + {len(ax.eval)} publication eval figures (ax.eval)")
plt.show()
feature map + 6 publication eval figures (ax.eval)
[Figures: the CPP feature map, followed by the six publication eval figures attached as ax.eval]