AnnotationPreprocessor

class AnnotationPreprocessor(verbose=True)

Bases: object

Preprocessing class ([pro], requires aaanalysis[pro]) for per-residue post-translational modification (PTM) / functional-site annotations.

Collects per-residue annotations, either fetched from UniProt (fetch_uniprot()) or ingested from a user / predictor table (ingest()), into one canonical df_annot schema. encode() then converts them into the [0, 1]-normalized per-residue dict_num consumed by CPP.run_num(). Annotations fall into two top-level categories: a closed UniProt 'PTMs' vocabulary and an open 'Functional sites' vocabulary that user keys extend via register_feature(). A secondary scale-based path (build_scales() / build_cat()) feeds the amino acid (AA)-scale CPP.run(), and to_df_seq() exports annotations as AAWindowSampler anchors.

Added in version 1.1.0.

Parameters: verbose (bool)

Methods

`build_cat`(features[, dim_names_override])	Build the `df_cat` metadata frame for `features` (corpus-free).
`build_scales`(df_seq, dict_num, features[, ...])	Build `df_scales` by context-free per-amino acid (AA) averaging of the corpus.
`encode`(df_seq, df_annot, features[, ...])	Encode `df_annot` into a `[0, 1]`-normalized per-residue `dict_num`.
`fetch_uniprot`(df_seq[, features, evidence, ...])	Fetch UniProt features for every entry and map to `df_annot`.
`ingest`(df_user)	Ingest a user / predictor annotation table into `df_annot`.
`register_feature`(key[, subcategory, ...])	Register (or override) an open-vocabulary Functional-sites key.
`to_df_seq`(df_seq, df_annot, feature_type[, ...])	Project annotations onto `df_seq` for AAWindowSampler negative sampling.

__init__(verbose=True)

Parameters: verbose (bool, default=True) – If True, verbose outputs are enabled.

Notes

This is the annotation-side member of the per-residue dict_num family, alongside EmbeddingPreprocessor (protein language model (PLM) embeddings) and StructurePreprocessor (PDB / Define Secondary Structure of Proteins (DSSP) / AlphaFold). All three emit [0, 1]-normalized tensors that NumericalFeature.get_parts() slices into the per-part inputs of CPP.run_num(), and that stack along the D axis via aaanalysis.combine_dict_nums(). The accompanying (df_scales, df_cat) pair names the D dimensions.
df_annot is the canonical per-residue schema with columns protein_id, start, end, aa, feature_type, category, source, evidence, score, bond_id (positions are 1-based, UniProt-canonical frame).
Encoder values are normalized to [0, 1]; non-annotated in-coverage residues are 0.0; NaN marks genuinely unresolved positions.
Bond features (disulfide / cross-link) expand to two single-residue endpoints sharing a bond_id; cleavage P1 anchors come from SIGNAL / PROPEP / TRANSIT span ends, not from the SITE grab-bag.
Two methods have no StructurePreprocessor analog by design, not oversight: register_feature() is the surface of the open 'Functional sites' vocabulary (structure’s registry is closed), and to_df_seq() exports a seq-mode window-split because here an annotation is the window label (a structure feature never is).
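The schema and bond-expansion notes above can be illustrated with plain pandas. This is a minimal sketch, not the library's implementation: `expand_bond` is a hypothetical helper, and the values used for feature_type, category, source, evidence, score, and bond_id are placeholders chosen for illustration. Only the column names and the "two single-residue endpoints sharing a bond_id" rule come from the notes.

```python
import pandas as pd

# Column names of the canonical schema, as listed in the notes above.
COLS = ['protein_id', 'start', 'end', 'aa', 'feature_type', 'category',
        'source', 'evidence', 'score', 'bond_id']


def expand_bond(protein_id, pos_a, pos_b, sequence, feature_type, bond_id):
    """Hypothetical helper: turn one bond between pos_a and pos_b (1-based)
    into two single-residue rows that share the same bond_id."""
    rows = []
    for pos in (pos_a, pos_b):
        rows.append({
            'protein_id': protein_id,
            'start': pos, 'end': pos,      # single-residue endpoint
            'aa': sequence[pos - 1],       # 1-based position -> 0-based index
            'feature_type': feature_type,  # placeholder key, not a confirmed name
            'category': 'PTMs',
            'source': 'example',           # placeholder value
            'evidence': None,              # placeholder value
            'score': 1.0,                  # placeholder value
            'bond_id': bond_id,
        })
    return pd.DataFrame(rows, columns=COLS)


seq = 'ACDEFGHIKLMNPQRSTVWYACDEFGHIKL'
# Positions 2 and 22 are both cysteines in this toy sequence.
df_bond = expand_bond('AF_TINY', 2, 22, seq, 'disulfide', 'AF_TINY_ss1')
print(df_bond[['protein_id', 'start', 'end', 'aa', 'bond_id']])
```

Grouping the result by bond_id recovers the two partner positions of each bond, which is why the endpoints are stored as separate rows rather than as one start-to-end span.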

See also

StructurePreprocessor: the structure-side analog (PDB / DSSP / AlphaFold).
EmbeddingPreprocessor: the PLM-embedding analog.
aaanalysis.combine_dict_nums(): stitch multiple dict_nums.
CPP.run_num(): the downstream consumer.

Examples

encode() maps each annotation in df_annot onto the target df_seq['sequence'] and returns a [0, 1]-normalized per-residue dict_num ({entry: (L, D)}) for CPP.run_num(), with one dimension per feature_type. The expected residue identity (aa) is checked at every position (on_mismatch='raise' by default, as a guard against off-by-isoform coordinates); annotated residues carry their score, and in-coverage non-annotated residues are 0.0.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

annp = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = annp.ingest(df_user)

dict_num = annp.encode(df_seq=df_seq, df_annot=df_annot,
                       features=['hotspot'])
arr = dict_num['AF_TINY']
print('shape (L, D):', arr.shape)
print('annotated positions (1-based):',
      list(np.where(arr[:, 0] > 0)[0] + 1))

shape (L, D): (30, 1)
annotated positions (1-based): [np.int64(3), np.int64(16)]

# Further parameters: ``on_mismatch`` sets the off-by-isoform policy
# ('raise' | 'drop' | 'warn'), and ``return_df=True`` additionally returns a
# DataFrame alongside ``dict_num`` (shown here with an ``encode_ok`` column).
dict_num, df_encoded = annp.encode(
    df_seq=df_seq, df_annot=df_annot, features=['hotspot'],
    on_mismatch='warn', return_df=True)
aa.display_df(df_encoded, n_rows=10, show_shape=True)

DataFrame shape: (1, 3)

	entry	sequence	encode_ok
1	AF_TINY	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	True
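To make the mapping in the examples above concrete, the following pure-NumPy sketch reproduces the idea for a single feature: 1-based positions are converted to 0-based row indices, the expected residue is compared with the sequence, and the score is written into an (L, 1) array that is 0.0 elsewhere. `encode_sketch` is a hypothetical function, not part of aaanalysis. It assumes the expected residue is supplied with each annotation, that all positions lie within the sequence, and that 'warn' and 'drop' both skip a mismatching annotation; the library's actual behavior may differ.

```python
import warnings

import numpy as np


def encode_sketch(sequence, annotations, on_mismatch='raise'):
    """Hypothetical sketch of the per-residue mapping for one feature.

    annotations: iterable of (start, expected_aa, score), start is 1-based.
    Returns an array of shape (L, 1) with 0.0 for non-annotated residues.
    """
    arr = np.zeros((len(sequence), 1), dtype=float)
    for start, expected_aa, score in annotations:
        idx = start - 1  # 1-based position -> 0-based row index
        if sequence[idx] != expected_aa:
            msg = (f"expected {expected_aa!r} at position {start}, "
                   f"found {sequence[idx]!r}")
            if on_mismatch == 'raise':
                raise ValueError(msg)
            if on_mismatch == 'warn':
                warnings.warn(msg)
            continue  # assumption: 'warn' and 'drop' skip the annotation
        arr[idx, 0] = score
    return arr


seq = 'ACDEFGHIKLMNPQRSTVWYACDEFGHIKL'
arr = encode_sketch(seq, [(3, 'D', 0.92), (16, 'S', 0.40)])
print('shape (L, D):', arr.shape)                       # (30, 1)
print('annotated positions (1-based):',
      (np.where(arr[:, 0] > 0)[0] + 1).tolist())        # [3, 16]
```

The residue check is what catches coordinates taken from a different isoform: a position that is valid in one sequence version usually points at a different amino acid in another.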

Stack the result with structure / embedding dict_nums via aa.combine_dict_nums, slice with NumericalFeature.get_parts, then run CPP.run_num.
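Stacking along the D axis can be pictured as a per-entry concatenation of (L, D_i) arrays. The sketch below uses plain NumPy as a stand-in for that idea; `stack_along_d` is a hypothetical function, and this does not show the signature or the entry-matching rules of aa.combine_dict_nums. It assumes only entries present in every input are kept and that all arrays for an entry share the same length L.

```python
import numpy as np


def stack_along_d(*dict_nums):
    """Hypothetical stand-in: concatenate per-entry (L, D_i) arrays along
    the last axis, keeping only entries present in every input dict."""
    shared = set(dict_nums[0]).intersection(*dict_nums[1:])
    stacked = {}
    for entry in sorted(shared):
        arrays = [d[entry] for d in dict_nums]
        if len({a.shape[0] for a in arrays}) != 1:
            raise ValueError(f"length mismatch for entry {entry!r}")
        stacked[entry] = np.concatenate(arrays, axis=1)
    return stacked


# Toy inputs: one annotation dimension and three made-up extra dimensions.
L = 30
annot = {'AF_TINY': np.zeros((L, 1))}
annot['AF_TINY'][[2, 15], 0] = [0.92, 0.40]
other = {'AF_TINY': np.random.default_rng(0).random((L, 3))}

combined = stack_along_d(annot, other)
print(combined['AF_TINY'].shape)  # (30, 4)
```

Because every preprocessor emits values in [0, 1], the concatenated dimensions stay on a comparable scale without further rescaling.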