Prediction tasks: which task, which workflow

Most users arrive with a biological question and one practical worry: which AAanalysis workflow solves it? This page is the map. It sorts protein-prediction tasks into a small taxonomy, then routes each one to its dataset, its CPP setup, and the classes that carry it out. Use it as the front door to the tutorials (which teach one function at a time) and the protocols (which walk an end-to-end workflow).

The two axes that actually define a task

It is tempting to organize tasks by biological scale alone: residue, domain, protein. That label is useful shorthand, and AAanalysis encodes it directly in the dataset name prefixes (AA_*, DOM_*, SEQ_*; see load_dataset). But scale is only a proxy. What actually determines how you set up CPP is a pair of axes:

  • Unit of comparison — the part CPP profiles. A fixed-length window (residue level), a part-set such as jmd_n / tmd / jmd_c (domain level), or the whole chain (protein level).

  • Reference construction — how the contrasting set is built. Labeled A-vs-B groups, non-site / non-cleaved windows, an unlabeled pool, or a composition-matched shuffled background. CPP always reads out a test group against a reference group, so a feature’s effect size (mean_dif) is read as test − reference.
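The two axes can be sketched in plain Python, without AAanalysis itself. Everything here is an illustrative assumption rather than the library's API: the helper names (get_parts, shuffled_background, part_value, mean_dif), the toy scale values, the short jmd_len, and the reading of tmd_start / tmd_stop as 1-based, inclusive positions.

```python
import random
from statistics import mean

# Toy per-residue scale with made-up values (not an AAontology scale).
TOY_SCALE = {"A": 0.6, "L": 1.0, "I": 1.0, "V": 0.9,
             "G": 0.5, "K": 0.0, "D": 0.1, "S": 0.4}


def get_parts(seq, tmd_start, tmd_stop, jmd_len=3):
    """Axis 1 (unit of comparison): slice a jmd_n / tmd / jmd_c part-set.

    Assumes tmd_start and tmd_stop are 1-based and inclusive.
    """
    start, stop = tmd_start - 1, tmd_stop
    return {
        "jmd_n": seq[max(0, start - jmd_len):start],
        "tmd": seq[start:stop],
        "jmd_c": seq[stop:stop + jmd_len],
    }


def shuffled_background(seq, rng):
    """Axis 2 (reference construction): same residues, shuffled order."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def part_value(part):
    """Average the toy scale over one part."""
    return mean(TOY_SCALE[aa] for aa in part)


def mean_dif(test_parts, ref_parts):
    """Effect size read as test minus reference."""
    return (mean(part_value(p) for p in test_parts)
            - mean(part_value(p) for p in ref_parts))


seq = "KDSALLIVVGKDS"
parts = get_parts(seq, tmd_start=4, tmd_stop=10)
background = shuffled_background(seq, random.Random(0))
```

In this toy setup a whole-chain average is identical for a sequence and its shuffled copy, so mean_dif([seq], [background]) is zero by construction. A composition-matched background therefore only carries signal when the part is read positionally.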

The prediction level is the convenient label; these two axes are the substance. The table below leads with them.

The prediction-task table

AAanalysis prediction tasks — by unit of comparison and reference construction

| Task | Unit of comparison | Reference construction | Dataset prefix | CPP strategy | Typical classes |
|---|---|---|---|---|---|
| Residue · single-residue (e.g. a PTM or modified site) | One window centered on a residue (odd aa_window_size) | Site windows vs non-site windows (or an unlabeled residue pool) | AA_ | Positional | AAWindowSampler, CPP, TreeModel |
| Residue · between-residues (e.g. a cleavage / scissile bond) | One window spanning a bond P1│P1′ (even aa_window_size) | Cleaved windows vs non-cleaved windows | AA_ | Positional | AAWindowSampler, CPP, TreeModel |
| Domain (a defined sub-region) | A part-set from tmd_start / tmd_stop (jmd_n / tmd / jmd_c) | Labeled A-vs-B groups (e.g. substrate vs non-substrate) | DOM_ | Compositional and positional | SequenceFeature, CPP, TreeModel |
| Protein (the whole chain) | The whole chain as a single part | Labeled A-vs-B groups of proteins | SEQ_ | Compositional | CPP, TreeModel |
| Determinant discovery (cross-cutting; no prediction target) | Any unit: window, part-set, or chain | Two groups contrasted to surface what distinguishes them (interpreted via AAontology) | AA_ / DOM_ / SEQ_ | Compositional or positional | CPP, CPPPlot |
| Design / engineering (cross-cutting; inverts prediction) | A sequence profiled against a target CPP profile | A target / reference profile the sequence is moved toward (ΔCPP) | AA_ / DOM_ / SEQ_ | Compositional or positional | AAMut, SeqMut |
| Relational / interaction (scope boundary, not a level) | Interface segments only (a part-set on each partner) | In scope for interface segments only; pairwise contacts hand off | Interface segments via DOM_ / AA_ | Positional (segments) | Out of scope: structure / PLM tooling |
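For scripted routing, the four level rows of the table can be held as data. This is a sketch only: TaskRoute, ROUTES, and routes_for_dataset are hypothetical names introduced here, not AAanalysis objects, and "DOM_EXAMPLE" is a placeholder dataset name. The field values are copied from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskRoute:
    task: str
    unit: str
    prefix: str
    strategy: str
    classes: tuple


# One entry per level row; cross-cutting and boundary rows are left out.
ROUTES = (
    TaskRoute("residue / single-residue", "odd window centered on a residue",
              "AA_", "positional", ("AAWindowSampler", "CPP", "TreeModel")),
    TaskRoute("residue / between-residues", "even window spanning a bond",
              "AA_", "positional", ("AAWindowSampler", "CPP", "TreeModel")),
    TaskRoute("domain", "part-set (jmd_n / tmd / jmd_c)",
              "DOM_", "compositional and positional",
              ("SequenceFeature", "CPP", "TreeModel")),
    TaskRoute("protein", "whole chain as a single part",
              "SEQ_", "compositional", ("CPP", "TreeModel")),
)


def routes_for_dataset(name):
    """Return every table row whose dataset prefix matches the name."""
    return [route for route in ROUTES if name.startswith(route.prefix)]


# "DOM_EXAMPLE" is a placeholder name used only to show the lookup.
domain_routes = routes_for_dataset("DOM_EXAMPLE")
```

An AA_ name returns two rows, because the prefix alone does not say whether the window is centered on a residue or spans a bond.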

Reading the table

The three levels. The residue, domain, and protein levels map one-to-one onto the AA_ / DOM_ / SEQ_ dataset prefixes. “Protein level” is the user-facing name of the SEQ_ (sequence) prefix; sequence stays reserved for the amino-acid string itself. The residue level carries two sub-modes that differ only by window parity: single-residue (odd window, a site on a residue) and between-residues (even window, a bond between two residues). They share the windowing machinery, so they are sub-modes of one level rather than two levels.
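Window parity can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions, not the AAWindowSampler implementation: the function names are hypothetical, positions are taken as 1-based, and padding with "-" at the termini is a choice made for this example.

```python
def _slice_padded(seq, start, stop, pad):
    """Slice seq[start:stop], padding where the range leaves the sequence."""
    left = pad * max(0, -start)
    right = pad * max(0, stop - len(seq))
    return left + seq[max(0, start):min(len(seq), stop)] + right


def site_window(seq, pos, size, pad="-"):
    """Single-residue mode: odd window centered on the residue at pos."""
    if size % 2 == 0:
        raise ValueError("a site window needs an odd size")
    if not 1 <= pos <= len(seq):
        raise ValueError("pos must lie inside the sequence")
    half = size // 2
    return _slice_padded(seq, pos - 1 - half, pos + half, pad)


def bond_window(seq, p1_pos, size, pad="-"):
    """Between-residues mode: even window around the bond after P1.

    P1 is the last residue of the left half and P1' the first residue
    of the right half.
    """
    if size % 2 != 0:
        raise ValueError("a bond window needs an even size")
    if not 1 <= p1_pos < len(seq):
        raise ValueError("P1 must be followed by a P1' residue")
    half = size // 2
    return _slice_padded(seq, p1_pos - half, p1_pos + half, pad)


seq = "MKTAYIAKQR"
site = site_window(seq, pos=5, size=5)      # centered on Y
bond = bond_window(seq, p1_pos=5, size=4)   # bond between Y and I
```

Both functions share the same slicing helper and differ only in the parity check and in where the center falls, which mirrors why the two sub-modes count as one level.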

The two cross-cutting rows. Determinant discovery and design / engineering are not levels: they apply at any level and run in opposite directions. Determinant discovery asks what physicochemically distinguishes two groups (CPP’s purest, most interpretable use, with no prediction target). Design / engineering inverts that: it measures how a mutation shifts a sequence’s CPP profile (ΔCPP) and steers a sequence toward a target. Both depend on the interpretability of CPP features, so they get rows of their own.
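The design direction can be sketched as a profile shift followed by a greedy search. This toy version is an assumption-laden illustration and not the AAMut or SeqMut API: the profile is just a per-segment average of a made-up scale, and profile, delta_profile, distance, and best_point_mutation are hypothetical helpers.

```python
from statistics import mean

# Toy per-residue scale with made-up values (not an AAontology scale).
TOY_SCALE = {"A": 0.6, "L": 1.0, "I": 1.0, "V": 0.9,
             "G": 0.5, "K": 0.0, "D": 0.1, "S": 0.4}


def profile(seq, n_segments=2):
    """Toy stand-in for a CPP profile: mean scale value per segment."""
    if len(seq) < n_segments:
        raise ValueError("sequence is shorter than n_segments")
    bounds = [i * len(seq) // n_segments for i in range(n_segments + 1)]
    return [mean(TOY_SCALE[aa] for aa in seq[a:b])
            for a, b in zip(bounds, bounds[1:])]


def delta_profile(wild_type, mutant, n_segments=2):
    """Profile shift caused by a mutation: mutant minus wild type."""
    return [m - w for w, m in zip(profile(wild_type, n_segments),
                                  profile(mutant, n_segments))]


def distance(p, q):
    """Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def best_point_mutation(seq, target, n_segments=2):
    """Greedy step: the single substitution that lands closest to target."""
    best_seq = seq
    best_dist = distance(profile(seq, n_segments), target)
    for i, old in enumerate(seq):
        for new in TOY_SCALE:
            if new == old:
                continue
            mutant = seq[:i] + new + seq[i + 1:]
            dist = distance(profile(mutant, n_segments), target)
            if dist < best_dist:
                best_seq, best_dist = mutant, dist
    return best_seq


shift = delta_profile("KKLL", "LKLL")
steered = best_point_mutation("KKLL", target=[1.0, 1.0])
```

Determinant discovery would use the same profile in the other direction: compare the profiles of two groups and read off which segment and scale separate them.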

The boundary row. Relational / interaction tasks — PPI interfaces and residue–residue contacts — are listed to be honest about the taxonomy’s edge. AAanalysis profiles interface segments; long-range pairwise contacts and PPI-pair prediction are out of scope and hand off to structure / PLM tooling. It is a documented boundary, not a fourth level.

CPP strategy in one line

The CPP strategy column takes one of two values, compositional or positional. It is not a parameter you set directly: it emerges from split_kws (the CPP argument that controls how each part is read). A single whole-part average (Segment(1,1)) is compositional, that is, position-agnostic; sub-segments, Pattern, or PeriodicPattern are positional. The strategy tracks the level: compositional suits the protein level, positional suits the residue level, and the domain level uses both. The deeper recipes live in the dedicated CPP-strategies guide.
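The difference between the two strategies shows up in a small experiment: reorder the residues of a part and see which feature values move. The sketch below reads Segment(i, n) as "the i-th of n roughly equal slices", which is a simplifying assumption; the exact split boundaries used by AAanalysis may differ, and segment, feature_value, and the toy scale are hypothetical.

```python
from statistics import mean

# Toy two-residue scale with made-up values (not an AAontology scale).
TOY_SCALE = {"L": 1.0, "K": 0.0}


def segment(part, i_th, n_split):
    """Return the i-th of n_split roughly equal slices (1-based)."""
    if not 1 <= i_th <= n_split <= len(part):
        raise ValueError("need 1 <= i_th <= n_split <= len(part)")
    start = (i_th - 1) * len(part) // n_split
    stop = i_th * len(part) // n_split
    return part[start:stop]


def feature_value(part, i_th, n_split):
    """Average the toy scale over one segment of the part."""
    return mean(TOY_SCALE[aa] for aa in segment(part, i_th, n_split))


blocky = "LLLLKKKK"      # hydrophobic half, then charged half
alternating = "KLKLKLKL"  # same composition, different order

# Segment(1,1): whole-part average, blind to residue order.
whole_blocky = feature_value(blocky, 1, 1)
whole_alternating = feature_value(alternating, 1, 1)

# Segment(1,2): first half only, sensitive to residue order.
half_blocky = feature_value(blocky, 1, 2)
half_alternating = feature_value(alternating, 1, 2)
```

The whole-part values agree for both orderings, while the first-half values do not, which is the sense in which a single whole-part segment is compositional and any finer split is positional.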

Where to go next

  • Run it: the minimal end-to-end notebook, "A minimal CPP analysis", loads a domain-level dataset, runs CPP, and reads out the signature in a few cells.

  • Workflows: the protocols catalog turns each task in this table into a start-to-finish workflow.

  • Mechanics: the per-function tutorials cover CPP, SequenceFeature, AAclust, and the rest.

  • Vocabulary: every term used here is defined in the glossary.