find_features

find_features(labels, df_seq, search='balanced', simplify=True, model='svm', cv=5, metric='balanced_accuracy', selection_scope='global', kws=None, subcategories=None, top_n=None, label_test=1, label_ref=0, name_test='TEST', name_ref='REF', plot=True, random_state=None, n_jobs=None, verbose=False)

Identify discriminating features in one call via a staged, interpretable CPP AutoML search.

The search is staged so its cost stays interpretable. Stage 1 cross-validates the full Cartesian Part × Split × Scale grid (at a reference n_filter) and ranks each axis by its marginal-mean impact; Stage 2 refines only the single highest-impact axis against n_filter (the others pinned at the stage optimum); Stage 3 refines the winning feature set with CPP.simplify() and recursive feature elimination. Selection is multi-objective: within each stage the Pareto-optimal-then-simplest configuration across all metrics wins, scored by the average cross-validated performance of one or more models. The winner is then ranked by tree-based importance and drawn as the CPP feature map. With search="fast" no search is run, so the result is byte-identical to the explicit single-CPP path.
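The Stage 1 axis ranking can be pictured with a small stand-alone sketch. It is not the library's implementation: the helper marginal_impact, the toy grid, and the choice to measure "impact" as the spread between the best and worst per-level mean score are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def marginal_impact(results, axes, score_key="score"):
    """Return {axis: spread of per-level mean scores} (hypothetical helper).

    For each axis, scores are averaged over all other axes per level
    (the marginal mean); the spread between the best and worst level is
    used here as an assumed impact measure.
    """
    impact = {}
    for axis in axes:
        by_level = defaultdict(list)
        for row in results:
            by_level[row[axis]].append(row[score_key])
        level_means = [mean(scores) for scores in by_level.values()]
        impact[axis] = max(level_means) - min(level_means)
    return impact


# Made-up 2 x 2 x 2 grid of cross-validated scores (illustrative values only)
grid = [
    {"part": "a", "split": "s1", "scale": "x", "score": 0.80},
    {"part": "a", "split": "s1", "scale": "y", "score": 0.82},
    {"part": "a", "split": "s2", "scale": "x", "score": 0.90},
    {"part": "a", "split": "s2", "scale": "y", "score": 0.92},
    {"part": "b", "split": "s1", "scale": "x", "score": 0.81},
    {"part": "b", "split": "s1", "scale": "y", "score": 0.83},
    {"part": "b", "split": "s2", "scale": "x", "score": 0.91},
    {"part": "b", "split": "s2", "scale": "y", "score": 0.93},
]

impact = marginal_impact(grid, axes=["part", "split", "scale"])
ranked_axes = sorted(impact, key=impact.get, reverse=True)
print(ranked_axes)  # axis with the largest spread comes first
```

In this toy grid the split axis moves the mean score the most, so under the stated assumption it is the one a Stage 2 style refinement would sweep against n_filter.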

Warning

Experimental. This aaanalysis.pipe (ap) golden pipeline is under active development; its API and its reported cross-validation scores may change between minor releases without the usual deprecation cycle. By default (selection_scope="global") feature selection runs on the full labeled set, so the reported scores are an in-sample (optimistic) ranking signal rather than a held-out generalization estimate; pass selection_scope="fold" for the honest nested regime. Pin a version if you depend on the current behaviour.

Parameters:

labels (array-like, shape (n_samples,)) – Class labels for the samples (typically, 1=positive/test, 0=negative/reference).
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – Sequence DataFrame, row-aligned to labels. Required, because the Part regions are a swept lever and must be (re)built from the sequences via SequenceFeature.get_df_parts().
search (str, default="balanced") – Search effort: "fast" (single default configuration, no search), "balanced" (sweep the Split levers + Scale + symmetric JMD length n_jmd + n_filter), or "exhaustive" (also sweep the Part region set and the performance-ranked scale sets, with a finer grid and a wider n_jmd range).
simplify (bool, default=True) – If True, refine the winning feature set with CPP.simplify() (kept only if it is not Pareto-dominated).
model (str, estimator, or list, default="svm") – Selection model(s): "svm", "rf", "log_reg", a scikit-learn estimator, or a list of these. A list averages the cross-validated scores across models.
cv (int, default=5) – Number of cross-validation folds for the selection score, must be > 1.
metric (str or list of str, default="balanced_accuracy") – Cross-validation scoring metric(s). A list triggers multi-objective Pareto selection.
selection_scope ({"global", "fold"}, default="global") –
Where CPP feature selection happens relative to the cross-validation.
- "global" (default): CPP selects features on the full labeled set and the model is cross-validated on that fixed feature matrix. Fast, but the reported scores are an in-sample (optimistic) ranking signal — feature selection has seen the test fold.
- "fold": an honest nested regime — within every fold of every configuration score, CPP re-selects features on the train split only, the model is fit on the train features and scored on the held-out fold. This removes the selection leakage, so the df_eval scores are held-out generalization estimates (typically lower). It re-runs CPP per fold, so it is much more expensive; pair it with search="fast" / "balanced". The returned df_feat is always the winning configuration refit on all data (outer-CV semantics). Nesting applies to the configuration-selection scores (the Stage-1/2 grid and the "fast" single-configuration score); the winner’s second-step refinement (CPP.simplify() + recursive feature elimination) runs on all data in both scopes, so no refinement capability is lost in "fold" mode.
kws (dict, optional) – Bounded power-user overrides; each pins a swept lever to a single value (unknown keys raise). Recognized keys: n_explain, n_split_max (max Segment splits), len_max (max Pattern span), n_filter, n_jmd (the symmetric JMD length jmd_n_len = jmd_c_len), simplify_strategy, max_cor, max_overlap. For free peptides / short parts (no flanking context), pass kws={"n_jmd": 0} so no JMD is carved out; the search then uses TMD-only parts (the whole peptide is one part, rather than half-TMD fragments) and caps the swept n_split_max range to the shortest part length (deduped), with a UserWarning. The split config also auto-caps to the shortest part (Pattern / PeriodicPattern that cannot fit are dropped and n_split_max is clamped). On normal (long-part) inputs the range cap is a no-op. Lower n_split_max / len_max yourself to control which splits are used.
subcategories (list of str, optional) – AAontology subcategories to restrict the scale sets to. If None, all scales of the grade.
top_n (int, optional) – If given, keep only the top top_n features (after importance ranking).
label_test (int, default=1) – Class label of the test/positive group passed to CPP.run().
label_ref (int, default=0) – Class label of the reference/negative group passed to CPP.run().
name_test (str, default="TEST") – Display name of the test/positive group in the feature map.
name_ref (str, default="REF") – Display name of the reference/negative group in the feature map.
plot (bool, default=True) – If True, draw the CPP feature map (returned as ax) and the publication eval figures (attached as ax.eval); if False, draw nothing and return ax as None.
random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are reproducible.
n_jobs (int, optional) – Number of CPU cores (>=1) for the sweep and feature-matrix builds. If None, the optimized number is used.
verbose (bool, default=False) – If True, verbose progress information is printed.
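The n_split_max range cap described under kws can be illustrated with a short sketch. The function cap_split_range and its arguments are hypothetical and only mirror the documented behaviour (clamp to the shortest part length, deduplicate, emit a UserWarning when something changed); the actual internals may differ.

```python
import warnings


def cap_split_range(candidates, shortest_part_len):
    """Clamp each swept n_split_max value to the shortest part length.

    Hypothetical helper: duplicates created by the clamp are removed,
    and a UserWarning is emitted only when the range actually changed.
    """
    original = sorted(set(candidates))
    capped = sorted({min(value, shortest_part_len) for value in candidates})
    if capped != original:
        warnings.warn(
            f"n_split_max range capped to shortest part length "
            f"({shortest_part_len}): {original} -> {capped}",
            UserWarning,
        )
    return capped


# Short peptide: values above the part length collapse onto it
print(cap_split_range([5, 10, 15], shortest_part_len=8))

# Long parts: the cap is a no-op and no warning is raised
print(cap_split_range([5, 10, 15], shortest_part_len=23))
```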

Returns:

df_feat (pd.DataFrame) – Feature DataFrame of the selected configuration in the canonical CPP schema, ranked by tree-based importance.
ax (matplotlib.axes.Axes or None) – The feature-map Axes if plot=True, else None. When a search was run, the publication eval figures are attached as ax.eval (a list of matplotlib.figure.Figure; empty for a single-configuration fast search) — see plot_eval().
df_eval (pd.DataFrame) – Per-configuration sweep table: the configuration descriptors, one <metric>_mean / <metric>_std column per metric, plus stage, is_pareto (Pareto-optimal within its stage), rank, and is_selected (the single winner).

See also

CPPGrid for the configuration sweep this pipeline drives.
CPP.run() and CPP.simplify() for the underlying feature engineering.
CPPPlot.feature_map() for the visualization.

Examples

The aaanalysis.pipe (ap) module provides high-level golden pipelines — stateless, one-call wrappers over the AAanalysis primitives. ap.find_features runs a staged, interpretable CPP AutoML search: Stage 1 cross-validates the full Cartesian Part × Split × Scale grid and ranks each axis by its marginal-mean impact; Stage 2 refines the single highest-impact axis against n_filter; Stage 3 refines the winning feature set (CPP.simplify + recursive feature elimination). Selection is multi-objective — within each stage the Pareto-optimal-then-simplest configuration across all metrics wins, scored by the averaged cross-validated performance of one or more models. It returns (df_feat, ax, df_eval).

import matplotlib.pyplot as plt
import aaanalysis as aa
import aaanalysis.pipe as ap

aa.options["verbose"] = False
aa.plot_settings()

df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()

aa.display_df(df_seq, n_rows=10, show_shape=True)

DataFrame shape: (40, 9)

	entry	gene	sequence	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	FXYD3	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MTDH	MAARSWQDELAQQAE...SPKQIKKKKKARRET	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	PMEPA1	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P53801	PTTG1IP	MAPGVARGPTPYWRL...GLFKEENPYARFENN	97	119	RWGVCWVNFE	ALIITMSVVGGTLLLGIAICCCC	CCRRKRSRKP
5	Q8IUW5	RELL1	MAPRALPGSAVLAAA...EVPATPVKRERSGTE	59	81	NDTGNGHPEY	IAYALVPVFFIMGLFGVLICHLL	KKKGYRCTTE
6	P01135	TGFA	MVPSAGQLALFALGI...LLKGRTACCHSETVV	99	121	AVVAASQKKQ	AITALVVVSIVALAVLIITCVLI	HCCQVRKHCE
7	O43914	TYROBP	MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK	42	64	DCSCSTVSPG	VLAGIVMGDLVLTVLIALAVYFL	GRLVPRGRGA
8	P05556	ITGB1	MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK	729	751	ENPECPTGPD	IIPIVAGVVAGIVLIGLALLLIW	KLLMIIHDRR
9	P16234	PDGFRA	MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL	527	549	VAPTLRSELT	VAAAVLVLLVIVIISLIVLVVIW	KQKPRYEIRW
10	P50895	BCAM	MEPPDAPAQARGAPR...SGGARGGSGGFGDEC	549	571	TVSPQTSQAG	VAVMAVAVSVGLLLLVVAVFYCV	RRKGGPCCRQ

search="fast" runs a single default configuration with no search, so the result is byte-identical to the explicit CPP chain. df_eval then holds one row (the single configuration with its cross-validated balanced_accuracy):

df_feat, ax, df_eval = ap.find_features(labels=labels, df_seq=df_seq, search="fast",
                                         plot=False, random_state=42, n_jobs=1)

aa.display_df(df_eval, n_rows=10, show_shape=True)

DataFrame shape: (1, 15)

	stage	list_parts	split_types	pattern_mode	n_split_max	scale	n_jmd	n_filter	n_features	selection_scope	balanced_accuracy_mean	balanced_accuracy_std	is_pareto	rank	is_selected
1	single	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern...PeriodicPattern	p1+p2	15	explain:30	10	100	81	global	0.950000	0.100000	True	1	True

By default (selection_scope="global") CPP selects features on the full labeled set, so the reported scores are an in-sample (optimistic) ranking signal. Pass selection_scope="fold" for the honest nested regime: within every fold of every configuration score, CPP re-selects features on the train split only and the model is scored on the held-out fold, so the df_eval scores are held-out (typically lower) generalization estimates. The returned df_feat is still the winner refit on all data.

df_feat, ax, df_eval = ap.find_features(labels=labels, df_seq=df_seq, search="fast",
                                         selection_scope="fold",
                                         plot=False, random_state=42, n_jobs=1)

aa.display_df(df_eval, n_rows=10, show_shape=True)

DataFrame shape: (1, 15)

	stage	list_parts	split_types	pattern_mode	n_split_max	scale	n_jmd	n_filter	n_features	selection_scope	balanced_accuracy_mean	balanced_accuracy_std	is_pareto	rank	is_selected
1	single	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern...PeriodicPattern	p1+p2	15	explain:30	10	100	81	fold	0.725000	0.145774	True	1	True
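Why the two scopes report different scores can be reproduced without any library code. The sketch below is a deliberately simplified stand-in, not what find_features does internally: it selects features by the absolute difference of class means and scores a nearest-centroid classifier on pure-noise data. All function names are hypothetical. Because the labels carry no signal, an accuracy well above 0.5 in the "global" scope would come from selection leakage alone, while the "fold" scope typically stays near chance.

```python
import random


def top_k_features(X, y, k):
    """Indices of the k features with the largest class-mean gap."""
    pos = [row for row, lab in zip(X, y) if lab == 1]
    neg = [row for row, lab in zip(X, y) if lab == 0]

    def gap(j):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        return abs(mean_pos - mean_neg)

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def centroid_accuracy(X_tr, y_tr, X_te, y_te, feats):
    """Accuracy of a nearest-centroid classifier on the chosen features."""
    def centroid(label):
        rows = [r for r, lab in zip(X_tr, y_tr) if lab == label]
        return [sum(r[j] for r in rows) / len(rows) for j in feats]

    c0, c1 = centroid(0), centroid(1)
    hits = 0
    for row, lab in zip(X_te, y_te):
        d0 = sum((row[j] - c) ** 2 for j, c in zip(feats, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(feats, c1))
        hits += int((1 if d1 < d0 else 0) == lab)
    return hits / len(y_te)


def cv_score(X, y, k, n_folds, scope):
    """Mean CV accuracy with selection on all data ('global') or per fold ('fold')."""
    idx = list(range(len(y)))
    folds = [idx[i::n_folds] for i in range(n_folds)]
    global_feats = top_k_features(X, y, k) if scope == "global" else None
    scores = []
    for test_idx in folds:
        train_idx = [i for i in idx if i not in test_idx]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        # 'fold': selection sees the train split only, so nothing leaks
        feats = global_feats if scope == "global" else top_k_features(X_tr, y_tr, k)
        scores.append(centroid_accuracy(X_tr, y_tr, X_te, y_te, feats))
    return sum(scores) / len(scores)


rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(300)] for _ in range(40)]
y = [i % 2 for i in range(40)]  # labels are unrelated to X by construction

score_global = cv_score(X, y, k=10, n_folds=5, scope="global")
score_fold = cv_score(X, y, k=10, n_folds=5, scope="fold")
print(f"global: {score_global:.3f}  fold: {score_fold:.3f}")
```

The same reasoning is why the "global" scores in df_eval are best read as a ranking signal between configurations rather than as an estimate of performance on new sequences.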

search="balanced" runs the staged search. df_eval is then the per-stage sweep table (the Stage-1 sensitivity grid, the Stage-2 n_filter refinement of the dominant axis, and the Stage-3 refine rows); each row carries is_pareto (Pareto-optimal within its stage), and exactly one row is flagged is_selected as the winner. Here n_explain and n_split_max are pinned via kws to keep the example quick, and only the first 10 of the 22 rows are displayed:

df_feat, ax, df_eval = ap.find_features(labels=labels, df_seq=df_seq, search="balanced",
                                         kws={"n_explain": 30, "n_split_max": 15},
                                         plot=False, random_state=42, n_jobs=1)

aa.display_df(df_eval, n_rows=10, show_shape=True)

DataFrame shape: (22, 15)

	stage	list_parts	split_types	pattern_mode	n_split_max	scale	n_jmd	n_filter	n_features	selection_scope	balanced_accuracy_mean	balanced_accuracy_std	is_pareto	is_selected	rank
1	sensitivity	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern	p1	15	explain:30	10	150	150	global	0.975000	0.050000	True	False	1
2	n_filter	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern	p1	15	explain:30	10	125	125	global	0.975000	0.050000	True	False	2
3	n_filter	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern	p1	15	explain:30	10	150	150	global	0.975000	0.050000	True	False	3
4	refine	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern	p1	15	explain:30	10	125	101	global	0.975000	0.050000	True	False	4
5	refine	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern	p1	15	explain:30	10	125	63	global	0.975000	0.050000	True	True	5
6	sensitivity	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment	none	15	explain:30	10	150	150	global	0.950000	0.100000	False	False	6
7	sensitivity	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,PeriodicPattern	p2	15	explain:30	10	150	150	global	0.950000	0.061237	False	False	7
8	n_filter	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern	p1	15	explain:30	10	50	50	global	0.950000	0.100000	False	False	8
9	n_filter	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,Pattern	p1	15	explain:30	10	100	100	global	0.950000	0.100000	False	False	9
10	sensitivity	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment	none	15	explain:30	6	150	150	global	0.925000	0.150000	False	False	10

Selection is multi-objective. Pass several models (their cross-validated scores are averaged) and several metrics (the winner is Pareto-optimal across them, then simplest). df_eval gains one <metric>_mean/_std column per metric and an is_pareto flag. cv sets the folds; simplify toggles the refinement; subcategories / top_n restrict the scales / the returned features; label_test / label_ref set the groups; search="exhaustive" additionally sweeps the Part regions and the performance-ranked scale sets:

subs = sorted(aa.load_scales(name="scales_cat")["subcategory"].unique())[:15]

df_feat, ax, df_eval = ap.find_features(labels=labels, df_seq=df_seq, search="balanced",
                                         model=["svm", "rf"],
                                         metric=["balanced_accuracy", "f1"],
                                         cv=5, simplify=True,
                                         kws={"n_explain": 30, "n_split_max": 15, "n_filter": 25},
                                         subcategories=subs, top_n=15,
                                         label_test=1, label_ref=0,
                                         plot=False, random_state=42, n_jobs=1, verbose=False)

aa.display_df(df_eval[df_eval["is_pareto"]], n_rows=10, show_shape=True)

DataFrame shape: (3, 17)

	stage	list_parts	split_types	pattern_mode	n_split_max	scale	n_jmd	n_filter	n_features	selection_scope	balanced_accuracy_mean	balanced_accuracy_std	f1_mean	f1_std	is_pareto	is_selected	rank
1	sensitivity	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,PeriodicPattern	p2	15	explain:30	10	25	25	global	0.937500	0.100000	0.943889	0.090613	True	False	1
3	n_filter	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,PeriodicPattern	p2	15	explain:30	10	25	25	global	0.937500	0.100000	0.943889	0.090613	True	False	3
5	refine	tmd,jmd_n_tmd_n,tmd_c_jmd_c	Segment,PeriodicPattern	p2	15	explain:30	10	25	14	global	0.937500	0.100000	0.943889	0.090613	True	True	5
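The "Pareto-optimal, then simplest" rule can be sketched in a few lines of plain Python. This is an illustrative reading of the rule, not the library's code: the helpers pareto_mask and select_winner are hypothetical, the rows are made-up numbers, and "simplest" is assumed to mean the fewest features (n_features), with the summed metric means as a further tie-break.

```python
def pareto_mask(rows, metrics):
    """Flag rows that no other row dominates (higher is better for every metric)."""
    def dominates(a, b):
        return (all(a[m] >= b[m] for m in metrics)
                and any(a[m] > b[m] for m in metrics))

    return [not any(dominates(other, row) for other in rows) for row in rows]


def select_winner(rows, metrics):
    """Pick the simplest configuration on the Pareto front (assumed rule)."""
    front = [row for row, keep in zip(rows, pareto_mask(rows, metrics)) if keep]
    # Fewest features first; higher summed score breaks remaining ties
    return min(front, key=lambda r: (r["n_features"], -sum(r[m] for m in metrics)))


metrics = ["balanced_accuracy_mean", "f1_mean"]
rows = [  # illustrative values, not taken from a real run
    {"name": "A", "balanced_accuracy_mean": 0.95, "f1_mean": 0.93, "n_features": 100},
    {"name": "B", "balanced_accuracy_mean": 0.93, "f1_mean": 0.95, "n_features": 40},
    {"name": "C", "balanced_accuracy_mean": 0.92, "f1_mean": 0.92, "n_features": 10},
    {"name": "D", "balanced_accuracy_mean": 0.95, "f1_mean": 0.93, "n_features": 60},
]

print(pareto_mask(rows, metrics))
print(select_winner(rows, metrics)["name"])
```

Configuration C has the fewest features but is dominated on both metrics, so it never reaches the front; among the non-dominated rows the one with the fewest features wins. The table above follows the same pattern: the three Pareto rows tie on both metrics and the 14-feature refine row is the one flagged is_selected.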

With plot=True the winning features are drawn as the CPP feature map (the returned ax), and — when a search was run — the sweep is decomposed into publication-ready eval figures attached as ax.eval: a series of 2D viridis heatmaps (the two most-informative axes per panel, the least as the slice), a marginal-impact panel, and an n_filter panel. name_test / name_ref label the two groups; a single plt.show() renders the feature map and every eval figure, and each figure can be saved individually for a paper (ax.eval[0].savefig(...)):

df_feat, ax, df_eval = ap.find_features(labels=labels, df_seq=df_seq, search="balanced",
                                         kws={"n_split_max": 15}, top_n=25, plot=True,
                                         name_test="SUBSTRATE", name_ref="NON-SUB",
                                         random_state=42, n_jobs=1)

print(f"feature map + {len(ax.eval)} publication eval figures (ax.eval)")
plt.show()

feature map + 6 publication eval figures (ax.eval)

../_images/ap_find_features_1_output_11_1.png

../_images/ap_find_features_2_output_11_2.png

../_images/ap_find_features_3_output_11_3.png

../_images/ap_find_features_4_output_11_4.png

../_images/ap_find_features_5_output_11_5.png

../_images/ap_find_features_6_output_11_6.png

../_images/ap_find_features_7_output_11_7.png