Glossary
The canonical vocabulary of AAanalysis. Terms are short and opinionated; use
them consistently so code, docstrings, and tutorials read the same way. Terms
defined here can be cross-referenced from anywhere in the documentation with
:term: (e.g. :term:`dict_num`).
Sequences & data objects
- df_seq
Sequence table with one row per protein or region. Columns: entry, sequence, label, and, for domain-level tasks, the TMD bounds tmd_start/tmd_stop.
- df_parts
Wide table with one column per part (tmd, jmd_n, jmd_c, …), produced by SequenceFeature.get_df_parts.
- df_feat
Ranked feature table: feature id, abs_auc, mean_dif, p_val, positions, scale, category.
- entry
Unique identifier of a sequence — the key/index of df_seq.
- part
A named region of a sequence over which a split operates and a scale is averaged; the PART field of a feature id (PART-SPLIT-SCALE). The default vocabulary is TMD-centric (jmd_n/tmd/jmd_c and composites); name parts after the prediction level when that fits better.
- scale
A mapping from each amino acid to a real number — a physicochemical property. AAontology ships ~600 curated scales.
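To make the part and scale entries concrete, here is a minimal sketch of slicing one df_seq row into parts and averaging a scale over a part. It assumes tmd_start/tmd_stop are 1-based and inclusive, uses a hypothetical jmd_len of 3, and relies on invented scale values (TOY_SCALE is not an AAontology scale). It does not call SequenceFeature.get_df_parts.

```python
# Toy scale: invented values for illustration only, not an AAontology scale.
TOY_SCALE = {"A": 0.6, "L": 1.0, "K": 0.0, "G": 0.5, "S": 0.3}


def get_parts(sequence, tmd_start, tmd_stop, jmd_len=3):
    """Split a sequence into jmd_n / tmd / jmd_c.

    Assumes 1-based, inclusive TMD bounds; jmd_len is a hypothetical parameter.
    """
    start = tmd_start - 1  # convert to a 0-based slice start
    return {
        "jmd_n": sequence[max(0, start - jmd_len):start],
        "tmd": sequence[start:tmd_stop],
        "jmd_c": sequence[tmd_stop:tmd_stop + jmd_len],
    }


def mean_scale(part_seq, scale):
    """Average a scale (amino acid -> number) over the residues of a part."""
    return sum(scale[aa] for aa in part_seq) / len(part_seq)


parts = get_parts("KKSALLLAGSKK", tmd_start=4, tmd_stop=8)
tmd_mean = mean_scale(parts["tmd"], TOY_SCALE)
```

Each key of parts corresponds to one column of a df_parts-style table for this entry.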
The CPP feature model
- feature
A (Part × Split × Scale) triple, written PART-SPLIT-SCALE: the atomic, residue-grounded, interpretable unit that CPP ranks.
- split
How a scale is read across a part: Segment (a contiguous block), Pattern (fixed positions counted from a terminus), or PeriodicPattern (periodic positions, e.g. i, i+3/4 for an α-helical face).
- CPP
Comparative Physicochemical Profiling: discovers ranked Part × Split × Scale features that distinguish a test group from a reference group.
- AAontology
A two-level taxonomy of amino-acid scales; CPP uses its categories to organize and rank features.
- compositional vs positional
Whether split_kws yields one whole-part average (compositional) or position-resolved sub-segments and patterns (positional). It is not a flag; it emerges from the chosen splits.
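The three split types can be sketched as plain string operations on a part. The function names, argument order, and the example feature id below are illustrative assumptions and do not reproduce the library's own split encoding; positions are taken to be 1-based.

```python
def segment(part_seq, i, n):
    """Return the i-th (1-based) of n contiguous blocks of a part."""
    size = len(part_seq) / n
    return part_seq[round((i - 1) * size):round(i * size)]


def pattern(part_seq, positions, terminus="N"):
    """Pick fixed 1-based positions counted from the N or C terminus."""
    seq = part_seq if terminus == "N" else part_seq[::-1]
    return "".join(seq[p - 1] for p in positions)


def periodic_pattern(part_seq, start=1, steps=(3, 4)):
    """Pick positions start, start+3, start+3+4, ... by alternating the steps."""
    picked, pos, k = [], start, 0
    while pos <= len(part_seq):
        picked.append(part_seq[pos - 1])
        pos += steps[k % len(steps)]
        k += 1
    return "".join(picked)


part = "ACDEFGHIKLMN"  # 12 residues standing in for a part such as tmd

middle_block = segment(part, 2, 3)             # positional: one sub-segment
whole_part = segment(part, 1, 1)               # compositional: the whole part
from_n = pattern(part, [1, 4], terminus="N")
from_c = pattern(part, [1, 4], terminus="C")
helical_face = periodic_pattern(part, start=1, steps=(3, 4))

# A feature id has three fields; this id is a made-up example.
part_name, split_name, scale_name = "TMD-Segment(2,3)-TOY_SCALE".split("-", 2)
```

Note that segment(part, 1, 1) returns the whole part, which is how a compositional feature falls out of the chosen splits rather than from a separate flag.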
Prediction tasks
- prediction level
Residue (AA_*), domain (DOM_*), or protein (SEQ_*), i.e. the unit a task predicts at. A convenient proxy for the two axes that actually define a task: the unit of comparison and the reference construction.
- unit of comparison
What CPP profiles for a task — a window (residue level), a part-set (domain level), or the whole chain (protein level). One of the two axes that define a use-case class.
- reference construction
How the contrasting set is built — labeled A-vs-B groups, non-site / non-cleaved windows, an unlabeled pool, or a composition-matched shuffled background. The second task-defining axis.
- test group
The set CPP profiles, contrasted against the reference group. A feature’s mean difference (mean_dif) is computed as test − reference (abs_auc measures the separation magnitude). For multi-class, each class is the test group in turn versus the rest as the reference group.
- reference group
The contrasting set a test group is profiled against — what reference construction produces.
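The sign convention of mean_dif and the one-vs-rest treatment of multi-class labels can be illustrated with a small sketch. The feature values and class labels are invented, and the helper names are hypothetical; abs_auc is deliberately left out because its exact definition is not given here.

```python
from statistics import mean


def mean_dif(test_values, ref_values):
    """Mean difference of one feature, computed as test minus reference."""
    return mean(test_values) - mean(ref_values)


def one_vs_rest(values, labels):
    """Treat each class in turn as the test group and the rest as the reference."""
    result = {}
    for cls in sorted(set(labels)):
        test = [v for v, lab in zip(values, labels) if lab == cls]
        ref = [v for v, lab in zip(values, labels) if lab != cls]
        result[cls] = mean_dif(test, ref)
    return result


# Invented values of a single feature for six sequences in three classes.
values = [0.9, 0.7, 0.2, 0.4, 0.5, 0.3]
labels = ["a", "a", "b", "b", "c", "c"]

difs = one_vs_rest(values, labels)
```

A positive mean_dif means the feature is higher in the test group than in the reference group; a negative value means the opposite.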
CPP modes
- determinant discovery
Using CPP with no prediction target: contrast two groups to surface what physicochemically distinguishes them, interpreted via AAontology. CPP’s purest, most interpretable use.
- design / engineering
Inverting prediction: measure how a mutation shifts a sequence’s CPP feature profile (ΔCPP) and use that to steer a sequence toward a target profile (AAMut/SeqMut). Deliberately model-free.
- ΔCPP
The change in a sequence’s CPP feature values caused by a mutation — the model-free signal that design / engineering ranks and optimizes.
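As a rough illustration of ΔCPP, the sketch below evaluates two segment-style features before and after a point mutation and reports the per-feature change. The scale values, the feature definitions (0-based, half-open slices), and the helper names are assumptions for illustration; this is not the AAMut/SeqMut interface.

```python
# Toy scale: invented values for illustration only.
TOY_SCALE = {"A": 0.6, "L": 1.0, "K": 0.0, "G": 0.5}


def feature_value(seq, start, stop, scale):
    """Mean scale value over seq[start:stop] (0-based, half-open)."""
    window = seq[start:stop]
    return sum(scale[aa] for aa in window) / len(window)


def mutate(seq, index, new_aa):
    """Return a copy of seq with the residue at a 0-based index replaced."""
    return seq[:index] + new_aa + seq[index + 1:]


def delta_cpp(seq, index, new_aa, features, scale):
    """Per-feature change (mutant minus wild type) caused by one mutation."""
    mutant = mutate(seq, index, new_aa)
    return {
        name: feature_value(mutant, a, b, scale) - feature_value(seq, a, b, scale)
        for name, (a, b) in features.items()
    }


wild_type = "KKALLAGK"
features = {"first_half": (0, 4), "second_half": (4, 8)}  # hypothetical features

delta = delta_cpp(wild_type, index=3, new_aa="K", features=features, scale=TOY_SCALE)
```

Only features whose positions cover the mutated residue change; ranking candidate mutations by such deltas needs no trained model.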
Numerical features
- dict_num
A mapping {entry: ndarray (L, D)} of per-residue numerical values, e.g. protein-language-model embeddings, structural descriptors, or PTM annotations.
- pseudo-scale
A single column of a dict_num treated like a scale, letting CPP profile any per-residue numerical signal, not just amino-acid scales.
- numerical CPP
CPP.run_num: the numerical-mode pipeline that profiles dict_num inputs (sliced to part-sets) instead of amino-acid scale look-ups. Generalizes CPP beyond physicochemical scales.
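The relationship between a dict_num and a pseudo-scale can be sketched without any library calls. Here nested lists stand in for the (L, D) arrays, the entry name and values are invented, and the helper names are hypothetical; CPP.run_num itself is not invoked.

```python
# One entry with L = 4 residues and D = 2 numerical columns (invented values).
dict_num = {
    "entry_1": [
        [0.1, 1.0],
        [0.3, 0.0],
        [0.5, 1.0],
        [0.7, 0.0],
    ],
}


def pseudo_scale(dict_num, entry, column):
    """Extract one column of an entry: a per-residue signal used like a scale."""
    return [row[column] for row in dict_num[entry]]


def part_mean(values, start, stop):
    """Average a per-residue signal over a part given as a 0-based, half-open slice."""
    return sum(values[start:stop]) / (stop - start)


signal = pseudo_scale(dict_num, "entry_1", column=0)
core_mean = part_mean(signal, 1, 3)
```

The difference from an amino-acid scale is that the value at each position is read from the entry's own array rather than looked up by residue type.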
Models & explainability
Reducing features
- redundancy reduction
Clustering correlated amino-acid scales and keeping one representative per cluster (AAclust) to obtain a redundancy-reduced scale set.
- medoid
The representative scale of an AAclust cluster; the redundancy-reduced set is the set of medoids.
- feature selection
Choosing an informative subset of features, e.g. recursive feature elimination (RFE) inside TreeModel.
- feature pruning
Dropping correlated or uninformative features before modeling (NumericalFeature.filter_correlation).
- feature simplification
CPP.simplify: swapping features onto fewer, more interpretable scales without retraining.
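Picking a medoid from a cluster of correlated scales can be sketched as follows. The sketch assumes the cluster is already given, uses 1 − Pearson correlation as the distance (an assumption, not necessarily what AAclust does), and works on invented scale vectors.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def medoid(cluster):
    """Return the member with the smallest total distance to all other members."""
    def total_distance(name):
        return sum(
            1 - pearson(cluster[name], other)
            for key, other in cluster.items()
            if key != name
        )
    return min(cluster, key=total_distance)


# Invented scale vectors over the same four amino acids.
cluster = {
    "scale_a": [1.0, 2.0, 3.0, 4.0],
    "scale_b": [1.1, 2.0, 3.2, 3.9],
    "scale_c": [2.0, 1.0, 4.0, 3.0],
}

representative = medoid(cluster)
```

Repeating this for every cluster and collecting the results gives a redundancy-reduced scale set in the sense described above.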
PU learning
- PU labels
dPULearn input labels: 1 = positive, 2 = unlabeled. The output adds 0 = reliable negative.
- reliable negative
An unlabeled sample that dPULearn identifies as confidently negative (output label 0), drawn from the unlabeled pool.
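The label convention can be illustrated with a deliberately simplified stand-in: unlabeled samples that lie farthest from the positive mean on a single feature are relabeled as reliable negatives. This is not dPULearn's algorithm, only a sketch of the input and output labels; the values and the function name are invented.

```python
def pick_reliable_negatives(values, labels, n_neg):
    """Relabel the n_neg unlabeled samples farthest from the positive mean as 0.

    Labels: 1 = positive, 2 = unlabeled (input); 0 = reliable negative (output).
    """
    positives = [v for v, lab in zip(values, labels) if lab == 1]
    pos_mean = sum(positives) / len(positives)
    unlabeled = [i for i, lab in enumerate(labels) if lab == 2]
    # Sort the unlabeled pool by distance from the positive mean, farthest first.
    ranked = sorted(unlabeled, key=lambda i: abs(values[i] - pos_mean), reverse=True)
    new_labels = list(labels)
    for i in ranked[:n_neg]:
        new_labels[i] = 0
    return new_labels


values = [0.9, 0.8, 0.85, 0.1, 0.7, 0.2]  # one invented feature per sample
labels = [1, 1, 2, 2, 2, 2]

out_labels = pick_reliable_negatives(values, labels, n_neg=2)
```

Positives keep label 1, unlabeled samples that were not selected keep label 2, and only members of the unlabeled pool can receive label 0.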
Window sampling
- window
A fixed-length residue stretch sampled around a position of interest — the residue-level unit of comparison.
- P1 anchor
The source/anchor position of a window (the Schechter–Berger P1), about which test and reference windows are defined.
- reference window
A background window (non-site, shuffled, or distance-banded) contrasted against test windows.