AnnotationPreprocessor
- class AnnotationPreprocessor(verbose=True)
Bases: object
Preprocessing class ([pro], requires aaanalysis[pro]) for per-residue post-translational modification (PTM) / functional-site annotations.
Collects per-residue annotations, fetched from UniProt (fetch_uniprot()) or ingested from a user / predictor table (ingest()), into one canonical df_annot schema, then encodes them into the [0, 1]-normalized per-residue dict_num consumed by CPP.run_num() (via encode()). Annotations fall into two top-level categories: a closed UniProt 'PTMs' vocabulary and an open 'Functional sites' vocabulary that user keys extend (register_feature()). A secondary scale-based path (build_scales() / build_cat()) feeds the amino acid (AA)-scale CPP.run(), and to_df_seq() exports annotations as AAWindowSampler anchors.
Added in version 1.1.0.
- Parameters:
verbose (bool)
Methods
build_cat(features[, dim_names_override]) – Build the df_cat metadata frame for features (corpus-free).
build_scales(df_seq, dict_num, features[, ...]) – Build df_scales by context-free per-amino acid (AA) averaging of the corpus.
encode(df_seq, df_annot, features[, ...]) – Encode df_annot into a [0, 1]-normalized per-residue dict_num.
fetch_uniprot(df_seq[, features, evidence, ...]) – Fetch UniProt features for every entry and map to df_annot.
ingest(df_user) – Ingest a user / predictor annotation table into df_annot.
register_feature(key[, subcategory, ...]) – Register (or override) an open-vocabulary Functional-sites key.
to_df_seq(df_seq, df_annot, feature_type[, ...]) – Project annotations onto df_seq for AAWindowSampler negative sampling.
- __init__(verbose=True)
- Parameters:
verbose (bool, default=True) – If True, verbose outputs are enabled.
Notes
This is the annotation-side member of the per-residue dict_num family, alongside EmbeddingPreprocessor (protein language model (PLM) embeddings) and StructurePreprocessor (PDB / Define Secondary Structure of Proteins (DSSP) / AlphaFold). All three emit [0, 1]-normalized tensors that NumericalFeature.get_parts() slices into the per-part inputs of CPP.run_num(), and that stack along the D axis via aaanalysis.combine_dict_nums(). The accompanying (df_scales, df_cat) pair names the D dimensions.
df_annot is the canonical per-residue schema with columns protein_id, start, end, aa, feature_type, category, source, evidence, score, bond_id (positions are 1-based, UniProt-canonical frame).
Encoder values are normalized to [0, 1]; non-annotated in-coverage residues are 0.0; NaN marks genuinely unresolved positions.
Bond features (disulfide / cross-link) expand to two single-residue endpoints sharing a bond_id; cleavage P1 anchors come from SIGNAL / PROPEP / TRANSIT span ends, not from the SITE grab-bag.
Two methods have no StructurePreprocessor analog by design, not oversight: register_feature() is the surface of the open 'Functional sites' vocabulary (structure’s registry is closed), and to_df_seq() exports a seq-mode window-split because here an annotation is the window label (a structure feature never is).
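The encoding conventions in these notes (1-based positions, 0.0 for non-annotated in-coverage residues, NaN for unresolved positions, and a bond stored as two single-residue rows sharing a bond_id) can be sketched with plain pandas and NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: expand_bond, encode_sketch, the 'disulfide' feature name, and the 'P_DEMO' entry are hypothetical, and only a subset of the df_annot columns is used.

```python
import numpy as np
import pandas as pd


def expand_bond(protein_id, pos_a, pos_b, feature_type, bond_id, score=1.0):
    """Hypothetical helper: one bond -> two single-residue rows sharing bond_id."""
    return [
        {"protein_id": protein_id, "start": pos, "end": pos,
         "feature_type": feature_type, "score": score, "bond_id": bond_id}
        for pos in (pos_a, pos_b)
    ]


def encode_sketch(seq_len, df_annot, features, unresolved=()):
    """Hypothetical encoder: 1-based inclusive spans -> (L, D) array in [0, 1]."""
    # In-coverage residues without an annotation default to 0.0.
    arr = np.zeros((seq_len, len(features)), dtype=float)
    for d, feat in enumerate(features):
        rows = df_annot[df_annot["feature_type"] == feat]
        for start, end, score in zip(rows["start"], rows["end"], rows["score"]):
            # Convert the 1-based inclusive span to a 0-based slice.
            arr[start - 1:end, d] = np.clip(score, 0.0, 1.0)
    for pos in unresolved:
        # Positions that could not be resolved are marked NaN, not 0.0.
        arr[pos - 1, :] = np.nan
    return arr


# A bond between residues 4 and 18, plus a three-residue site with score 0.5.
rows = expand_bond("P_DEMO", 4, 18, "disulfide", bond_id="P_DEMO_b1")
rows.append({"protein_id": "P_DEMO", "start": 7, "end": 9,
             "feature_type": "hotspot", "score": 0.5, "bond_id": None})
df_annot_demo = pd.DataFrame(rows)

arr_demo = encode_sketch(20, df_annot_demo, ["disulfide", "hotspot"], unresolved=[20])
print(arr_demo.shape)
```

Keeping 0.0 and NaN distinct lets a downstream consumer tell "covered but not annotated" apart from "no information at this position".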
See also
StructurePreprocessor: the structure-side analog (PDB / DSSP / AlphaFold).
EmbeddingPreprocessor: the PLM-embedding analog.
aaanalysis.combine_dict_nums(): stitch multiple dict_nums.
CPP.run_num(): the downstream consumer.
Examples
encode maps each annotation in df_annot onto the target df_seq[sequence] and returns a [0, 1]-normalized per-residue dict_num ({entry: (L, D)}) for CPP.run_num, with one dimension per feature_type. The expected residue identity (aa) is checked at every position (on_mismatch='raise' by default, the off-by-isoform guard); annotated residues carry their score, in-coverage non-annotated residues are 0.0.
import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut

aa.options['verbose'] = False
warnings.filterwarnings('ignore')
ap = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = ap.ingest(df_user)
dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot, features=['hotspot'])
arr = dict_num['AF_TINY']
print('shape (L, D):', arr.shape)
print('annotated positions (1-based):', list(np.where(arr[:, 0] > 0)[0] + 1))
shape (L, D): (30, 1)
annotated positions (1-based): [np.int64(3), np.int64(16)]
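The residue-identity check mentioned in this example guards against annotations that were positioned on a different isoform than the target sequence. The sketch below shows the idea in plain Python; it is an assumption about what such a guard does, not the library's code. find_mismatches and check_residues are hypothetical helpers, and the behavior for any on_mismatch value other than 'raise' (here: return the mismatches) is invented for illustration.

```python
def find_mismatches(sequence, annotations):
    """Return (position, expected_aa, found_aa) for each disagreeing residue.

    annotations is an iterable of (start, aa) pairs with 1-based positions.
    """
    mismatches = []
    for start, aa in annotations:
        in_range = 1 <= start <= len(sequence)
        found = sequence[start - 1] if in_range else None
        if found != aa:
            mismatches.append((start, aa, found))
    return mismatches


def check_residues(sequence, annotations, on_mismatch="raise"):
    """Hypothetical guard: raise on any mismatch, otherwise report them."""
    mismatches = find_mismatches(sequence, annotations)
    if mismatches and on_mismatch == "raise":
        raise ValueError(f"residue mismatch (position, expected, found): {mismatches}")
    return mismatches


seq_demo = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
annots_demo = [(3, "D"), (16, "S")]  # positions and residues on the canonical frame

# Canonical sequence: every annotated residue matches.
print(check_residues(seq_demo, annots_demo))

# An isoform missing the first residue shifts every position by one.
print(find_mismatches(seq_demo[1:], annots_demo))
```

A one-residue shift is enough to make every annotated position disagree, which is why failing loudly by default is a reasonable choice for this kind of check.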
Stack the resulting dict_num with structure / embedding dict_nums via aa.combine_dict_nums, slice with NumericalFeature.get_parts, then run CPP.run_num.
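Stacking along the D axis amounts to concatenating each entry's (L, D_i) arrays column-wise. The NumPy sketch below illustrates that operation only; it does not reproduce aa.combine_dict_nums, whose signature and validation rules are not shown here. stack_dict_nums_sketch and the random structure array are hypothetical.

```python
import numpy as np


def stack_dict_nums_sketch(*dict_nums):
    """Concatenate per-entry (L, D_i) arrays along the D axis for shared entries."""
    shared = set(dict_nums[0])
    for d in dict_nums[1:]:
        shared &= set(d)
    stacked = {}
    for entry in sorted(shared):
        arrays = [d[entry] for d in dict_nums]
        lengths = {a.shape[0] for a in arrays}
        if len(lengths) != 1:
            # All sources must describe the same residues, so L has to agree.
            raise ValueError(f"{entry}: sequence lengths differ: {sorted(lengths)}")
        stacked[entry] = np.concatenate(arrays, axis=1)
    return stacked


rng = np.random.default_rng(0)
dict_annot_demo = {"AF_TINY": np.zeros((30, 1))}          # one annotation dimension
dict_struct_demo = {"AF_TINY": rng.random((30, 3))}       # three placeholder dimensions

dict_stacked = stack_dict_nums_sketch(dict_annot_demo, dict_struct_demo)
print(dict_stacked["AF_TINY"].shape)
```

The order in which the dict_nums are passed fixes the order of the D dimensions, so any accompanying (df_scales, df_cat) metadata would need to follow the same order.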