Prediction tasks: which task, which workflow
Most users arrive with a biological question and one practical worry: which AAanalysis workflow solves it? This page is the map. It sorts protein-prediction tasks into a small taxonomy, then routes each one to its dataset, its CPP setup, and the classes that carry it out. Use it as the front door to the tutorials (which teach one function at a time) and the protocols (which walk an end-to-end workflow).
The two axes that actually define a task
It is tempting to organize tasks by biological scale alone — residue, domain,
protein. That label is useful shorthand, and AAanalysis encodes it directly in the
dataset name prefixes (AA_*, DOM_*, SEQ_*; see load_dataset).
But the scale is only a proxy. What genuinely
determines how you set CPP up are two axes:
Unit of comparison: the part CPP profiles. This is a fixed-length window (residue level), a part-set such as jmd_n/tmd/jmd_c (domain level), or the whole chain (protein level).
Reference construction: how the contrasting set is built. Options are labeled A-vs-B groups, non-site / non-cleaved windows, an unlabeled pool, or a composition-matched shuffled background. CPP always reads out a test group against a reference group, so a feature’s effect size (mean_dif) is read as test − reference.
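The sign convention for the effect size can be sketched in a few lines. This is a minimal illustration, not the library's implementation: it assumes a feature has already been reduced to one numeric value per sequence, and the helper name `mean_dif` and the example values are hypothetical.

```python
import numpy as np

def mean_dif(test_values, ref_values):
    """Difference of group means, read as test minus reference (illustrative helper)."""
    test = np.asarray(test_values, dtype=float)
    ref = np.asarray(ref_values, dtype=float)
    return float(test.mean() - ref.mean())

# Hypothetical per-sequence values of one feature in each group
test_group = [0.70, 0.80, 0.60]
ref_group = [0.40, 0.50, 0.30]

# Positive: the feature is higher in the test group than in the reference group
effect = mean_dif(test_group, ref_group)
# Swapping the groups flips the sign, which is why the reference choice matters
flipped = mean_dif(ref_group, test_group)
```

A positive value means the feature is enriched in the test group; changing how the reference is constructed changes what that sign means.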
The prediction level is the convenient label; these two axes are the substance. The table below leads with them.
The prediction-task table
| Task | Unit of comparison | Reference construction | Dataset prefix | CPP strategy | Typical classes |
|---|---|---|---|---|---|
| Residue · single-residue (e.g. a PTM or modified site) | One window centered on a residue (odd window) | Site windows vs non-site windows (or an unlabeled residue pool) | AA_* | Positional | |
| Residue · between-residues (e.g. a cleavage / scissile bond) | One window spanning a bond | Cleaved windows vs non-cleaved windows | AA_* | Positional | |
| Domain (a defined sub-region) | A part-set from | Labeled A-vs-B groups (e.g. substrate vs non-substrate) | DOM_* | Compositional and positional | |
| Protein (the whole chain) | The whole chain as a single part | Labeled A-vs-B groups of proteins | SEQ_* | Compositional | |
| Determinant discovery (cross-cutting; no prediction target) | Any unit: window, part-set, or chain | Two groups contrasted to surface what distinguishes them (interpreted via AAontology) | | Compositional or positional | |
| Design / engineering (cross-cutting; inverts prediction) | A sequence profiled against a target CPP profile | A target / reference profile the sequence is moved toward | | Compositional or positional | |
| Relational / interaction (scope boundary, not a level) | Interface segments only (a part-set on each partner) | In scope for interface segments only; pairwise contacts hand off | Interface segments via | Positional (segments) | Out of scope: structure / PLM tooling |
Reading the table
The three levels. The residue,
domain, and protein levels map
one-to-one onto the AA_ / DOM_ / SEQ_ dataset prefixes. “Protein level” is the user-facing name of the SEQ_
(sequence) prefix; sequence stays reserved for the amino-acid string itself.
The residue level carries two sub-modes that differ only by window parity:
single-residue (odd window, a site on a residue) and between-residues (even
window, a bond between two residues). They share the windowing machinery, so they
are sub-modes of one level rather than two levels.
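The parity difference between the two residue sub-modes can be made concrete with plain string slicing. The sketch below is an assumption-laden illustration: `site_window` and `bond_window` are hypothetical helpers, positions are 0-based, and sequence ends are padded with a `-` gap character; none of this is taken from the AAanalysis API.

```python
def site_window(seq, center, half, gap="-"):
    """Odd-length window (2 * half + 1) centered on the residue at 0-based index `center`."""
    padded = gap * half + seq + gap * half
    # After padding, original index i sits at padded index i + half,
    # so the window starts at padded index `center`.
    return padded[center:center + 2 * half + 1]

def bond_window(seq, left, half, gap="-"):
    """Even-length window (2 * half) spanning the bond between indices `left` and `left + 1`."""
    padded = gap * half + seq + gap * half
    # `half` residues on each side of the bond, so no single residue is central.
    return padded[left + 1:left + 1 + 2 * half]

seq = "ACDEFGHIK"
single_residue = site_window(seq, center=4, half=2)   # centered on F
between_residues = bond_window(seq, left=4, half=2)   # spans the F|G bond
at_terminus = site_window(seq, center=0, half=2)      # padded at the N-terminus
```

Both helpers share the same padding and slicing logic and differ only in whether the window length is odd or even, which mirrors why the two sub-modes belong to one level.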
The two cross-cutting rows. Determinant discovery and design / engineering are not levels: they apply at any level and run in opposite directions. Determinant discovery asks what physicochemically distinguishes two groups; it has no prediction target and is CPP’s most directly interpretable use. Design / engineering inverts that: it measures how a mutation shifts a sequence’s CPP profile (ΔCPP) and uses that shift to steer a sequence toward a target profile. Both depend on CPP’s interpretability, which is why each gets its own row in the table.
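The ΔCPP idea can be illustrated without the library. The sketch below assumes a single feature is the mean of one amino-acid scale over one segment, and it uses a handful of Kyte-Doolittle hydropathy values as a stand-in scale. The helpers `feature_value`, `mutate`, and `delta_feature` are hypothetical names for illustration, not AAanalysis functions.

```python
# Stand-in scale: a subset of Kyte-Doolittle hydropathy values (illustration only;
# AAanalysis works with its own scale sets).
HYDROPATHY = {"A": 1.8, "L": 3.8, "I": 4.5, "V": 4.2,
              "G": -0.4, "K": -3.9, "D": -3.5, "S": -0.8}

def feature_value(segment):
    """Mean scale value over a segment (one simplified 'feature')."""
    return sum(HYDROPATHY[aa] for aa in segment) / len(segment)

def mutate(seq, pos, new_aa):
    """Return a copy of `seq` with the residue at 0-based `pos` replaced."""
    return seq[:pos] + new_aa + seq[pos + 1:]

def delta_feature(seq, pos, new_aa, start, stop):
    """Feature shift caused by a point mutation, read as mutant minus wild type."""
    wild_type_value = feature_value(seq[start:stop])
    mutant_value = feature_value(mutate(seq, pos, new_aa)[start:stop])
    return mutant_value - wild_type_value

wild_type = "GALVKLIS"
# K -> L inside the profiled segment (indices 2..5) raises the segment's hydropathy
delta_inside = delta_feature(wild_type, pos=4, new_aa="L", start=2, stop=6)
# A mutation outside the profiled segment leaves this feature unchanged
delta_outside = delta_feature(wild_type, pos=0, new_aa="A", start=2, stop=6)
```

In a design setting, candidate mutations would be ranked by whether their shift moves the feature toward the target value; here the sign of `delta_inside` is the only thing being shown.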
The boundary row. Relational / interaction tasks — PPI interfaces and residue–residue contacts — are listed to be honest about the taxonomy’s edge. AAanalysis profiles interface segments; long-range pairwise contacts and PPI-pair prediction are out of scope and hand off to structure / PLM tooling. It is a documented boundary, not a fourth level.
CPP strategy in one line
The CPP strategy column is compositional vs positional, and it is not
a parameter — it emerges from split_kws (the CPP argument that controls how
each part is read). A single whole-part average
(Segment(1,1)) is compositional (composition-like, position-agnostic);
sub-segments, Pattern, or PeriodicPattern are positional. The strategy
tracks the level: compositional suits the protein level, positional suits the
residue level, and the domain level uses both. The deeper recipes live in the
dedicated CPP-strategies guide.
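Why a single whole-part average is position-agnostic, while sub-segments are not, can be shown with a toy example. This sketch assumes Segment(i, n) means "split the part into n roughly equal segments and average the i-th one"; the splitting via `numpy.array_split`, the two-letter scale, and the helper `segment_mean` are illustrative assumptions and may differ from how AAanalysis splits parts.

```python
import numpy as np

# Hypothetical two-letter toy scale (not an AAanalysis scale)
SCALE = {"L": 1.0, "K": 0.0}

def segment_mean(part, i, n):
    """Mean scale value of the i-th (1-based) of n roughly equal segments of `part`."""
    values = np.array([SCALE[aa] for aa in part], dtype=float)
    chunks = np.array_split(values, n)
    return float(chunks[i - 1].mean())

# Two parts with identical composition but opposite residue order
part_a = "LLLLKKKK"
part_b = "KKKKLLLL"

# Compositional reading, Segment(1,1): one average over the whole part
comp_a = segment_mean(part_a, 1, 1)
comp_b = segment_mean(part_b, 1, 1)

# Positional reading, Segment(1,2): average over the first half only
pos_a = segment_mean(part_a, 1, 2)
pos_b = segment_mean(part_b, 1, 2)
```

The whole-part average cannot tell the two parts apart, whereas the first-half average separates them completely, which is the practical difference between the compositional and positional strategies.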
Where to go next
See also
Run it: the minimal end-to-end notebook, "A minimal CPP analysis", loads a domain-level dataset, runs CPP, and reads out the signature in a few cells.
Workflows: the protocols catalog turns each task in this table into a start-to-finish workflow.
Mechanics: the per-function tutorials cover CPP, SequenceFeature, AAclust, and the rest.
Vocabulary: every term used here is defined in the glossary.