CPPGrid

class CPPGrid(df_seq, labels, dict_num=None, accept_gaps=False, verbose=True, random_state=None, n_jobs=-1, backend='threads')

Bases: Tool

Grid-style sweep over Comparative Physicochemical Profiling (CPP) configurations (Tool) [Breimann25].

Runs the full parts → splits → scales → run pipeline across a Cartesian grid of configurations so a sweep needs one call instead of many manual get_df_parts / get_split_kws / CPP constructions. The dataset (df_seq + labels, plus dict_num for the numerical arm) is bound at construction; run() takes four stage-grouped parameter dictionaries whose list-valued entries are swept.
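The idea of sweeping list-valued entries across stage-grouped dictionaries can be sketched in plain Python. This is only an illustration of a Cartesian grid expansion, not CPPGrid's actual implementation; the helper names (expand_grid, expand_stages), the stage names, and the keys (part_len, n_split) are hypothetical placeholders rather than documented parameters.

```python
from itertools import product


def expand_grid(params):
    """Expand one stage dict into a list of dicts, sweeping list-valued entries."""
    keys = list(params)
    # Scalar entries are wrapped in a one-element list so they stay fixed.
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


def expand_stages(**stage_params):
    """Cartesian product across several stage-grouped dicts (hypothetical helper)."""
    stage_names = list(stage_params)
    per_stage = [expand_grid(stage_params[name]) for name in stage_names]
    return [dict(zip(stage_names, combo)) for combo in product(*per_stage)]


# Hypothetical stage dictionaries; the keys are illustrative only.
configs = expand_stages(
    parts={"part_len": [20, 25], "flank_len": 10},
    split={"n_split": [10, 15, 20]},
)
print(len(configs))  # 2 * 3 = 6 configurations
```

Each element of configs is one fully specified configuration, so a sweep reduces to iterating over this list.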

Added in version 1.1.0.

Notes

Inside each configuration CPP.run / run_num runs serially (n_jobs=1); the grid is parallelized across configurations to avoid nested oversubscription.
The default backend="threads" shares df_seq / df_scales in-process, so no dataframe is serialized, and it sidesteps the __main__-guard pitfall of process spawning on Python 3.14 / macOS. Pass backend="loky" for process-based parallelism.
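The pattern described in these notes (serial work inside each configuration, parallelism only across configurations, shared data passed by reference) can be sketched as follows. The class itself is documented as using joblib; this sketch swaps in the standard-library ThreadPoolExecutor to stay dependency-free, and run_one_config is a hypothetical stand-in for a single CPP run.

```python
from concurrent.futures import ThreadPoolExecutor


def run_one_config(config, shared_data):
    """Hypothetical stand-in for one per-configuration run."""
    # The inner work is deliberately serial, mirroring n_jobs=1 inside each configuration.
    return sum(x * config["scale"] for x in shared_data)


# Threads share this object by reference; nothing is pickled or copied.
shared_data = list(range(1000))
configs = [{"scale": s} for s in (1, 2, 3)]

# Parallelism lives only at the outer (grid) level, which avoids nested oversubscription.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda c: run_one_config(c, shared_data), configs))

print(results)
```

With a process-based backend the same shared_data would have to be serialized to every worker, which is the trade-off the backend parameter exposes.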

After run(), the feature tables and the sweep summary are also kept on the instance as list_df_feat_ and df_params_ (aligned by row index), and eval() scores the configurations and returns them best-first.
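The row-index alignment between the summary and the feature tables, and a best-first ranking, can be illustrated with plain Python stand-ins. This is a sketch under assumptions: lists of dicts replace the DataFrames, and the quality measure (mean of a hypothetical score field) is invented for illustration, since the metric used by eval() is not specified here.

```python
# Stand-ins: one summary row and one feature table per configuration,
# matched by position (the shared row index).
params_rows = [{"n_split": 10}, {"n_split": 15}, {"n_split": 20}]
feature_tables = [
    [{"feature": "f1", "score": 0.2}],
    [{"feature": "f1", "score": 0.6}, {"feature": "f2", "score": 0.4}],
    [{"feature": "f3", "score": 0.3}],
]


def mean_score(table):
    """Hypothetical per-configuration quality: mean of the feature scores."""
    return sum(row["score"] for row in table) / len(table)


# Join each summary row to its quality, then sort best-first.
ranked = sorted(
    (
        {"config_index": i, **params, "quality": mean_score(table)}
        for i, (params, table) in enumerate(zip(params_rows, feature_tables))
    ),
    key=lambda row: row["quality"],
    reverse=True,
)

# The retained index leads back to the matching feature table.
best = ranked[0]
best_table = feature_tables[best["config_index"]]
print(best["config_index"], len(best_table))
```

Keeping the original index in each ranked row is what makes it possible to recover the feature table of any configuration after sorting.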

See also

CPP: the per-configuration engine this class orchestrates.
eval(): score the swept configurations and rank them best-first.

Parameters:

df_seq (DataFrame)
labels (Union[Sequence[Union[int, float]], ndarray, Series])
dict_num (Optional[Dict[str, ndarray]])
accept_gaps (bool)
verbose (bool)
random_state (Optional[int])
n_jobs (Optional[int])
backend (Literal['threads', 'loky'])

Methods

`eval`([sort_by, ascending])	Score the swept configurations and return `df_params` joined to per-config quality, best-first.
`run`([params_parts, params_split, ...])	Run the configuration grid and return per-combo feature tables plus a sweep summary.

__init__(df_seq, labels, dict_num=None, accept_gaps=False, verbose=True, random_state=None, n_jobs=-1, backend='threads')

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences (any format accepted by SequenceFeature.get_df_parts()).
labels (array-like, shape (n_samples,)) – Class labels (test vs reference), one per row of the df_parts derived from df_seq and in the same order.
dict_num (dict[str, np.ndarray], optional) – Mapping from each entry to its per-residue array of shape (L, D). If given, the grid runs the numerical arm (NumericalFeature.get_parts → CPP.run_num).
accept_gaps (bool, default=False) – Whether to accept gaps when assigning scale values.
verbose (bool, default=True) – If True, enable verbose output.
random_state (int, optional) – Seed forwarded to each CPP for reproducibility.
n_jobs (int, default=-1) – Number of workers used across configurations (-1 = all cores).
backend ({'threads', 'loky'}, default='threads') –
Joblib parallelization backend used across configurations:
- 'threads' (default): shared-memory threading. df_seq / df_scales are shared in-process without serialization, and it sidesteps the process-spawn pitfall on Python 3.14 / macOS.
- 'loky': process-based parallelism. Each configuration runs in a separate process; use it when the per-configuration work is GIL-bound, at the cost of serializing the shared data to each worker.