Glossary

The canonical vocabulary of AAanalysis. Terms are short and opinionated; use them consistently so that code, docstrings, and tutorials read the same way. Any term defined here can be cross-referenced from anywhere in the documentation with the :term: role (e.g. :term:`dict_num`).

Sequences & data objects

df_seq: Sequence table — one row per protein or region. Columns: entry, sequence, label, and, for domain-level tasks, the TMD bounds tmd_start / tmd_stop.
df_parts: Wide table with one column per part (tmd, jmd_n, jmd_c, …), produced by get_df_parts().
df_feat: Ranked feature table: feature id, abs_auc, mean_dif, p_val, positions, scale, category.
entry: Unique identifier of a sequence — the key/index of df_seq.
part: A named region of a sequence over which a split operates and a scale is averaged; the PART field of a feature id (PART-SPLIT-SCALE). The default vocabulary is TMD-centric (jmd_n / tmd / jmd_c and composites); name parts after the prediction level when that fits better.
scale: A mapping from each amino acid to a real number — a physicochemical property. AAontology ships ~600 curated scales.

The CPP feature model

feature: A (Part × Split × Scale) triple, written PART-SPLIT-SCALE — the atomic, residue-grounded, interpretable unit that CPP ranks.
split: How a scale is read across a part: Segment (a contiguous block), Pattern (fixed positions counted from a terminus), or PeriodicPattern (periodic positions, e.g. i, i+3/4 for an α-helical face).
CPP: Comparative Physicochemical Profiling — discovers ranked Part × Split × Scale features that distinguish a test group from a reference group.
AAontology: A two-level taxonomy of amino-acid scales; CPP uses its categories to organize and rank features.
compositional vs positional: Whether split_kws yields one whole-part average (compositional) or position-resolved sub-segments and patterns (positional). It is not a flag — it emerges from the chosen splits.
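
To make the Part × Split × Scale idea concrete, the following sketch averages a toy scale over a part in three ways: the whole part (compositional), one contiguous segment, and a periodic pattern (both positional). The scale values, the helper names, and the way splits are parameterized here are assumptions for illustration and do not reproduce the library's split_kws.

```python
# Toy scale with made-up values; not an AAontology scale.
TOY_SCALE = {"A": 0.5, "L": 1.0, "K": 0.0, "G": 0.25}


def segment(part, i, n):
    """Return the i-th (1-based) of n contiguous blocks of a part."""
    bounds = [round(k * len(part) / n) for k in range(n + 1)]
    return part[bounds[i - 1]:bounds[i]]


def periodic_positions(length, steps=(3, 4), start=1):
    """1-based positions start, start+3, start+3+4, ... with alternating steps."""
    positions, pos, k = [], start, 0
    while pos <= length:
        positions.append(pos)
        pos += steps[k % len(steps)]
        k += 1
    return positions


def feature_value(residues, scale):
    """Average scale value over the selected residues."""
    return sum(scale[aa] for aa in residues) / len(residues)


tmd = "LLLLAAAAKKGG"

# Compositional: one average over the whole part
whole_value = feature_value(tmd, TOY_SCALE)

# Positional: first third of the part, and one helical face
seg_value = feature_value(segment(tmd, 1, 3), TOY_SCALE)
helix_face = [tmd[p - 1] for p in periodic_positions(len(tmd))]
face_value = feature_value(helix_face, TOY_SCALE)
```

The same part and scale give different values depending on the split, which is why a feature is only fully specified by all three fields.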

Prediction tasks

prediction level: Residue (AA_*), domain (DOM_*), or protein (SEQ_*) — the unit a task predicts at. A convenient proxy for the two axes that actually define a task: the unit of comparison and the reference construction.
residue level: Per-residue / windowed prediction (datasets AA_*); the unit of comparison is a fixed-length window (AAWindowSampler). Two sub-modes, differing only by window parity: single-residue (odd window — a site on a residue, e.g. a PTM) and between-residues (even window — a bond P1│P1′, e.g. cleavage).
domain level: Prediction over a defined sub-region (datasets DOM_*); the unit of comparison is the part-set derived from tmd_start / tmd_stop (jmd_n / tmd / jmd_c). CPP is native here.
protein level: Whole-chain prediction (datasets SEQ_*); the unit of comparison is the entire chain as one part. “Protein level” is the user-facing name of the SEQ_ (sequence) prefix — sequence stays reserved for the amino-acid string.
relational / interaction: Tasks about relationships between residues or chains (PPI interfaces, residue–residue contacts). AAanalysis profiles interface segments only; long-range pairwise contacts are out of scope and hand off to structure / PLM tooling. A documented scope boundary, not a fourth prediction level.
unit of comparison: What CPP profiles for a task — a window (residue level), a part-set (domain level), or the whole chain (protein level). One of the two axes that define a use-case class.
reference construction: How the contrasting set is built — labeled A-vs-B groups, non-site / non-cleaved windows, an unlabeled pool, or a composition-matched shuffled background. The second task-defining axis.
test group: The set CPP profiles, contrasted against the reference group. A feature’s mean difference (mean_dif) is computed as test − reference (abs_auc measures the separation magnitude). For multi-class, each class is the test group in turn versus the rest as the reference group.
reference group: The contrasting set a test group is profiled against — what reference construction produces.

CPP modes

determinant discovery: Using CPP with no prediction target: contrast two groups to surface what physicochemically distinguishes them, interpreted via AAontology. CPP’s purest, most interpretable use.
design / engineering: Inverting prediction: measure how a mutation shifts a sequence’s CPP feature profile (ΔCPP) and use that to steer a sequence toward a target profile (AAMut / SeqMut). Deliberately model-free.
ΔCPP: The change in a sequence’s CPP feature values caused by a mutation — the model-free signal that design / engineering ranks and optimizes.

Numerical features

dict_num: A mapping {entry: ndarray (L, D)} of per-residue numerical values — e.g. protein-language-model embeddings, structural descriptors, or PTM annotations.
pseudo-scale: A single column of a dict_num treated like a scale, letting CPP profile any per-residue numerical signal, not just amino-acid scales.
numerical CPP: run_num() — the numerical-mode pipeline that profiles dict_num inputs (sliced to part-sets) instead of amino-acid scale look-ups. Generalizes CPP beyond physicochemical scales.

Models & explainability

feature importance: An unsigned, group-level ranking of how much each feature contributes to a model (e.g. TreeModel Monte-Carlo importance).
feature impact: A signed, per-sample attribution of how each feature pushes a single prediction (e.g. ShapModel; visualized via shap_plot).

Reducing features

redundancy reduction: Clustering correlated amino-acid scales and keeping one representative per cluster (AAclust) to obtain a redundancy-reduced scale set.
medoid: The representative scale of an AAclust cluster; the redundancy-reduced set is the set of medoids.
feature selection: Choosing an informative subset of features — e.g. recursive feature elimination (RFE) inside TreeModel.
feature pruning: Dropping correlated or uninformative features before modeling (filter_correlation()).
feature simplification: simplify() — swapping features onto fewer, more interpretable scales without retraining.

PU learning

PU labels: dPULearn input labels: 1 = positive, 2 = unlabeled. The output adds 0 = reliable negative.
reliable negative: An unlabeled sample that dPULearn identifies as confidently negative (output label 0), drawn from the unlabeled pool.

Window sampling

window: A fixed-length residue stretch sampled around a position of interest — the residue-level unit of comparison.
P1 anchor: The source/anchor position of a window (the Schechter–Berger P1), about which test and reference windows are defined.
reference window: A background window (non-site, shuffled, or distance-banded) contrasted against test windows.

Class conventions

Wrapper: An scikit-learn-style class implementing .fit / .predict / .eval and setting trailing *_ attributes after fit.
Tool: A pipeline-style class implementing .run / .eval.
Plot class: A *Plot mirror of an analytical class — same arguments, visualization only (e.g. CPPPlot mirrors CPP).