AnnotationPreprocessor

class AnnotationPreprocessor(verbose=True)

Bases: object

Preprocessing class ([pro], requires aaanalysis[pro]) for per-residue post-translational modification (PTM) / functional-site annotations.

Collects per-residue annotations, either fetched from UniProt (fetch_uniprot()) or ingested from a user / predictor table (ingest()), into one canonical df_annot schema. encode() then converts them into the [0, 1]-normalized per-residue dict_num consumed by CPP.run_num(). Annotations fall into two top-level categories: a closed UniProt 'PTMs' vocabulary and an open 'Functional sites' vocabulary that user keys extend via register_feature(). A secondary scale-based path (build_scales() / build_cat()) feeds the amino acid (AA)-scale CPP.run(), and to_df_seq() exports annotations as AAWindowSampler anchors.

Added in version 1.1.0.

Parameters:

verbose (bool)

Methods

build_cat(features[, dim_names_override])

Build the df_cat metadata frame for features (corpus-free).

build_scales(df_seq, dict_num, features[, ...])

Build df_scales by context-free per-amino acid (AA) averaging of the corpus.

encode(df_seq, df_annot, features[, ...])

Encode df_annot into a [0, 1]-normalized per-residue dict_num.

fetch_uniprot(df_seq[, features, evidence, ...])

Fetch UniProt features for every entry and map to df_annot.

ingest(df_user)

Ingest a user / predictor annotation table into df_annot.

register_feature(key[, subcategory, ...])

Register (or override) an open-vocabulary Functional-sites key.

to_df_seq(df_seq, df_annot, feature_type[, ...])

Project annotations onto df_seq for AAWindowSampler negative sampling.

__init__(verbose=True)
Parameters:

verbose (bool, default=True) – If True, verbose outputs are enabled.

Notes

  • This is the annotation-side member of the per-residue dict_num family, alongside EmbeddingPreprocessor (protein language model (PLM) embeddings) and StructurePreprocessor (PDB / Define Secondary Structure of Proteins (DSSP) / AlphaFold). All three emit [0, 1]-normalized tensors that NumericalFeature.get_parts() slices into the per-part inputs of CPP.run_num(), and that stack along the D axis via aaanalysis.combine_dict_nums(). The accompanying (df_scales, df_cat) pair names the D dimensions.

  • df_annot is the canonical per-residue schema with columns protein_id, start, end, aa, feature_type, category, source, evidence, score, bond_id (positions are 1-based, UniProt-canonical frame).

  • Encoder values are normalized to [0, 1]; non-annotated in-coverage residues are 0.0; NaN marks genuinely unresolved positions.

  • Bond features (disulfide / cross-link) expand to two single-residue endpoints sharing a bond_id; cleavage P1 anchors come from SIGNAL / PROPEP / TRANSIT span ends, not from the SITE grab-bag.

  • Two methods have no StructurePreprocessor analog by design, not oversight: register_feature() is the surface of the open 'Functional sites' vocabulary (structure’s registry is closed), and to_df_seq() exports a seq-mode window-split because here an annotation is the window label (a structure feature never is).
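
The conventions in the notes above (1-based positions, values in [0, 1], 0.0 for non-annotated residues, NaN for unresolved positions, bonds expanded to two endpoints) can be illustrated with a minimal NumPy sketch. This is not the library's implementation: expand_bond and encode_sketch are hypothetical helpers, the records are plain dicts carrying only a subset of the df_annot columns, and the score of 1.0 assigned to bond endpoints is an assumption.

```python
import numpy as np


def expand_bond(start, end, bond_id, feature_type="disulfide"):
    """Expand one bond into two single-residue endpoints sharing bond_id."""
    return [{"start": pos, "end": pos, "feature_type": feature_type,
             "score": 1.0, "bond_id": bond_id}  # score of 1.0 is assumed
            for pos in (start, end)]


def encode_sketch(seq_len, records, features, unresolved=()):
    """Return an (L, D) array that follows the conventions in the notes."""
    arr = np.zeros((seq_len, len(features)))  # non-annotated residues: 0.0
    for rec in records:
        if rec["feature_type"] not in features:
            continue
        d = features.index(rec["feature_type"])
        # 1-based inclusive span -> 0-based half-open slice
        arr[rec["start"] - 1:rec["end"], d] = rec["score"]
    for pos in unresolved:  # 1-based positions without a resolved value
        arr[pos - 1, :] = np.nan
    return arr


# One scored site plus one bond between residues 5 and 9.
records = [{"start": 3, "end": 3, "feature_type": "hotspot",
            "score": 0.92, "bond_id": None}]
records += expand_bond(start=5, end=9, bond_id="bond_1")

arr = encode_sketch(seq_len=10, records=records,
                    features=["hotspot", "disulfide"], unresolved=[10])
print(arr.shape)  # (10, 2)
```

Keeping 0.0 and NaN distinct lets a downstream consumer tell "covered but not annotated" apart from "no information at this position", which a single fill value would hide.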

Examples

encode() maps each annotation in df_annot onto the target df_seq['sequence'] and returns a [0, 1]-normalized per-residue dict_num ({entry: (L, D)}) for CPP.run_num(), with one dimension per feature_type. The expected residue identity (aa) is checked at every position (on_mismatch='raise' by default, which guards against off-by-isoform errors); annotated residues carry their score, and in-coverage non-annotated residues are 0.0.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

ap = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = ap.ingest(df_user)

dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot,
                     features=['hotspot'])
arr = dict_num['AF_TINY']
print('shape (L, D):', arr.shape)
print('annotated positions (1-based):',
      list(np.where(arr[:, 0] > 0)[0] + 1))

shape (L, D): (30, 1)
annotated positions (1-based): [np.int64(3), np.int64(16)]
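
The residue identity check mentioned above can be pictured with a small standalone sketch. It is a hypothetical helper written for illustration, not the library's code: check_residues is an invented name, and the 'drop' mode is an assumption (the source only states that on_mismatch defaults to 'raise').

```python
def check_residues(sequence, annotations, on_mismatch="raise"):
    """Keep annotations whose expected residue matches the sequence.

    annotations: iterable of (position_1_based, expected_aa) tuples.
    on_mismatch: 'raise' stops at the first mismatch; any other value
    (for example the hypothetical 'drop') skips the mismatching row.
    """
    kept = []
    for pos, expected in annotations:
        in_range = 1 <= pos <= len(sequence)
        observed = sequence[pos - 1] if in_range else None
        if observed == expected:
            kept.append((pos, expected))
        elif on_mismatch == "raise":
            raise ValueError(
                f"position {pos}: expected {expected!r}, found {observed!r}")
    return kept


seq = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"

# Positions 3 and 16 carry D and S in this sequence, so both pass.
print(check_residues(seq, [(3, "D"), (16, "S")]))

# An annotation shifted by one residue (as with a different isoform) fails.
try:
    check_residues(seq, [(4, "D")])
except ValueError as err:
    print("mismatch:", err)
```

Failing loudly by default is the conservative choice here: a silent offset would place every score one residue away from where it belongs without changing the array shape.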

Stack the result with structure / embedding dict_nums via aa.combine_dict_nums, slice with NumericalFeature.get_parts, then run CPP.run_num.
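
To make "stacking along the D axis" concrete, here is a minimal NumPy sketch. It does not call or reproduce aa.combine_dict_nums: stack_along_d is a hypothetical helper, the structure array is random placeholder data, and keeping only the entries shared by all inputs is an assumption of this sketch.

```python
import numpy as np


def stack_along_d(*dict_nums):
    """Concatenate per-entry (L, D_i) arrays along the feature axis."""
    shared = set(dict_nums[0]).intersection(*dict_nums[1:])
    combined = {}
    for entry in sorted(shared):
        arrays = [d[entry] for d in dict_nums]
        lengths = {a.shape[0] for a in arrays}
        if len(lengths) != 1:
            raise ValueError(f"{entry}: sequence lengths differ: {lengths}")
        combined[entry] = np.concatenate(arrays, axis=1)  # (L, sum of D_i)
    return combined


rng = np.random.default_rng(0)
dict_annot = {"AF_TINY": np.zeros((30, 1))}       # one annotation dimension
dict_struct = {"AF_TINY": rng.random((30, 3))}    # placeholder structure dims

combined = stack_along_d(dict_annot, dict_struct)
print(combined["AF_TINY"].shape)  # (30, 4)
```

The length check matters because every array for an entry must describe the same residues; only the number of feature dimensions is allowed to differ between sources.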