aaanalysis.AnnotationPreprocessor

class aaanalysis.AnnotationPreprocessor(verbose=True)

Bases: object

Preprocessing class for per-residue PTM / functional-site annotations [Breimann25a].

Mirrors EmbeddingPreprocessor’s instance-based design as its annotation-side companion. It fetches annotations from UniProt or ingests them from a user / predictor table, then encodes them into the dict_num tensor that NumericalFeature.get_parts() slices into per-part inputs for CPP.run_num(). It also builds the (df_scales, df_cat) metadata pair.

Added in version 1.1.0.

Parameters:

verbose (bool)

Methods

build_cat([features, dim_names_override])

Build the df_cat metadata frame for features (corpus-free).

build_scales([df_seq, dict_num, features, ...])

Build df_scales by context-free per-AA averaging of the corpus.

encode([df_seq, df_annot, features, ...])

Encode df_annot into a [0, 1]-normalized per-residue dict_num.

fetch_uniprot([df_seq, features, evidence, ...])

Fetch UniProt features for every entry and map to df_annot.

ingest([df_user])

Ingest a user / predictor annotation table into df_annot.

register_feature([key, subcategory, ...])

Register (or override) an open-vocabulary Functional-sites key.

to_df_seq([df_seq, df_annot, feature_type, ...])

Project annotations onto df_seq for AAWindowSampler negative sampling.

__init__(verbose=True)
Parameters:

verbose (bool, default=True) – If True, verbose outputs are enabled.

Notes

  • df_annot is the canonical per-residue schema with columns protein_id, start, end, aa, feature_type, category, source, evidence, score, bond_id (positions are 1-based, UniProt-canonical frame).

  • Encoder values are normalized to [0, 1]; non-annotated in-coverage residues are 0.0; NaN marks genuinely unresolved positions.

  • Bond features (disulfide / cross-link) expand to two single-residue endpoints sharing a bond_id; cleavage P1 anchors come from SIGNAL / PROPEP / TRANSIT span ends, not from the SITE grab-bag.

  • Two methods deliberately have no StructurePreprocessor analog: register_feature() exposes the open 'Functional sites' vocabulary (the structure registry is closed), and to_df_seq() exports a seq-mode window split, because here an annotation can be the window label (a structure feature never is).
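
The bond expansion described in the notes can be pictured with a short sketch. It is plain Python and independent of aaanalysis: `expand_bond`, the `bond_id` format, and the example accession and sequence are hypothetical. It only illustrates how one bond span might become two single-residue rows sharing an identifier; the library's own implementation may differ.

```python
from itertools import count


def expand_bond(protein_id, start, end, seq, feature_type, bond_counter):
    """Expand one bond span (1-based start..end) into two endpoint rows."""
    # Hypothetical id format; the only requirement is that both rows share it.
    bond_id = f"{protein_id}_{feature_type}_{next(bond_counter)}"
    rows = []
    for pos in (start, end):
        rows.append({
            "protein_id": protein_id,
            "start": pos,          # each endpoint is a single residue,
            "end": pos,            # so start and end coincide
            "aa": seq[pos - 1],    # 1-based position -> 0-based index
            "feature_type": feature_type,
            "bond_id": bond_id,
        })
    return rows


seq = "MKCAAGTCLL"  # made-up sequence with cysteines at positions 3 and 8
rows = expand_bond("P00001", 3, 8, seq, "disulfide", count(1))
for row in rows:
    print(row)
```

Keeping the shared `bond_id` on both rows lets a downstream step pair the endpoints again, even though each row is encoded as an ordinary single-residue annotation.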

Examples

AnnotationPreprocessor turns per-residue annotations (UniProt PTMs / functional sites, or your own predictor labels) into [0, 1]-normalized per-residue dict_num tensors for CPP.run_num — the same shape StructurePreprocessor and EmbeddingPreprocessor produce, so they stack via aa.combine_dict_nums.

Two categories are registered: PTMs (closed UniProt vocabulary — phospho, glyco, disulfide, cleavage, …) and Functional sites (open vocabulary — BINDING/ACT_SITE/DNA_BIND seeds plus user/predictor keys).

This notebook runs fully offline. The live fetch_uniprot call (which hits the UniProt REST API) is shown at the end but not executed.

import warnings

import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut

aa.options['verbose'] = False
warnings.filterwarnings('ignore')

ap = aa.AnnotationPreprocessor(verbose=False)

# A small labeled corpus: label-1 proteins carry phospho on S/T in the TMD
# slice (positions 11-30) far more often than label-0 — a sparse-presence,
# class-discriminating signal.
rng = np.random.default_rng(0)
aas = list(ut.LIST_CANONICAL_AA)
rows, annot = [], []
for label in (0, 1):
    for k in range(5):
        entry = f'A{label}_{k}'
        seq = ''.join(rng.choice(aas, size=40))
        rows.append({ut.COL_ENTRY: entry, ut.COL_SEQ: seq,
                     'label': label, 'tmd_start': 11, 'tmd_stop': 30})
        for i, ch in enumerate(seq):
            pos = i + 1
            if ch in 'ST' and 11 <= pos <= 30 and rng.random() < (0.8 if label == 1 else 0.1):
                annot.append([entry, pos, pos, ch, 'phospho', 'PTMs',
                              'UniProt', 'ECO:0000269', 1.0, None])
df_seq = pd.DataFrame(rows)
df_seq.head()

Any user/predictor table with protein_id, start, and feature_type becomes Functional sites rows. Unknown feature_types auto-register (here a hypothetical hotspot predictor with per-residue confidence in [0, 1]).

df_user = pd.DataFrame({
    ut.COL_PROTEIN_ID: ['A1_0', 'A1_0', 'A0_0'],
    ut.COL_START:      [12, 15, 13],
    ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot', 'hotspot'],
    ut.COL_SCORE:      [0.92, 0.40, 0.31],
})
df_annot_user = ap.ingest(df_user)
print("auto-registered 'hotspot':", 'hotspot' in ap._registry)
df_annot_user

In practice you would call ap.fetch_uniprot(df_seq, features=['phospho', ...]) to build this from the UniProt REST API. Here we use the hand-built phospho annotation table from the corpus above — it follows the exact same canonical schema (ut.COLS_ANNOT).

df_annot = pd.DataFrame(annot, columns=ut.COLS_ANNOT)
print(f'{len(df_annot)} phospho residues across {df_annot[ut.COL_PROTEIN_ID].nunique()} proteins')
df_annot.head()

encode maps each annotation onto the target df_seq[sequence], checking the expected residue identity (aa) at every position — a mismatch raises by default (on_mismatch='raise'), turning off-by-isoform errors into loud failures instead of silent mislabeling. Annotated residues carry the score; in-coverage non-annotated residues are 0.0.

dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot, features=['phospho'])
arr = dict_num['A1_0']
print('per-entry shape (L, D):', arr.shape)
print('phospho positions in A1_0 (1-based):', list(np.where(arr[:, 0] > 0)[0] + 1))
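
To make the residue-identity guard concrete, here is a minimal sketch of the idea in plain Python. It is not the library's code: `encode_entry` is a hypothetical helper, the example sequence is made up, and the behavior for values other than `on_mismatch='raise'` (skipping the row) is an assumption made only for this illustration.

```python
def encode_entry(seq, annotations, on_mismatch="raise"):
    """Encode one feature dimension for one sequence.

    annotations: iterable of (pos_1based, expected_aa, score) tuples.
    Returns one value per residue; non-annotated residues stay 0.0.
    """
    values = [0.0] * len(seq)
    for pos, expected_aa, score in annotations:
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} is outside the sequence (length {len(seq)})")
        observed = seq[pos - 1]
        if observed != expected_aa:
            if on_mismatch == "raise":
                # Loud failure: the annotation likely refers to another isoform.
                raise ValueError(f"position {pos}: expected {expected_aa!r}, found {observed!r}")
            continue  # assumption: any other mode skips the row in this sketch
        # Clamp the score so the encoded value stays within [0, 1].
        values[pos - 1] = min(max(float(score), 0.0), 1.0)
    return values


seq = "MASTKLSPTR"
values = encode_entry(seq, [(3, "S", 1.0), (9, "T", 0.4)])
print(values)

try:
    encode_entry(seq, [(4, "S", 1.0)])  # position 4 is 'T', not 'S'
except ValueError as err:
    print("mismatch caught:", err)
```

Checking identity before writing the score is what turns a shifted coordinate frame into an immediate error rather than a plausible-looking but mislabeled tensor.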

build_scales computes the corpus-derived per-AA means that keep run_num’s redundancy cor-gate active; build_cat builds the corpus-free metadata that tags each dimension with its category (PTMs / Functional sites) and locked color.

df_scales = ap.build_scales(df_seq=df_seq, dict_num=dict_num, features=['phospho'])
df_cat = ap.build_cat(features=['phospho'])
print('phospho color:', ut.DICT_COLOR_CAT[df_cat[ut.COL_CAT].iloc[0]])
df_cat
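
Context-free per-AA averaging can be sketched as follows. This is an illustration in plain Python, not the build_scales implementation: `per_aa_means` is a hypothetical helper, the toy corpus is made up, and skipping NaN positions is an assumption.

```python
import math
from collections import defaultdict


def per_aa_means(sequences, values):
    """Average one encoded dimension per amino acid across a corpus.

    sequences: {entry: sequence string}
    values:    {entry: list of per-residue values, same length as the sequence}
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for entry, seq in sequences.items():
        for aa, value in zip(seq, values[entry]):
            if math.isnan(value):
                continue  # assumption: unresolved positions do not contribute
            sums[aa] += value
            counts[aa] += 1
    # Position and neighbors are ignored: only residue identity matters.
    return {aa: sums[aa] / counts[aa] for aa in sorted(counts)}


sequences = {"P1": "ASTS", "P2": "SATT"}
values = {
    "P1": [0.0, 1.0, 0.0, 1.0],
    "P2": [0.0, 0.0, math.nan, 1.0],
}
scale = per_aa_means(sequences, values)
print(scale)
```

The result is one number per amino acid for the feature, which is the shape a scale takes: a lookup from residue identity to value.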

The (dict_num, df_scales, df_cat) triple is drop-in compatible with CPP.run_num. Labels come from your own df_seq (the annotation is a feature, not the label). The resulting df_feat carries category='PTMs'.

nf = aa.NumericalFeature()
df_parts, dict_num_parts = nf.get_parts(df_seq=df_seq, dict_num=dict_num)
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales, df_cat=df_cat)
df_feat = cpp.run_num(dict_num_parts=dict_num_parts,
                      labels=df_seq['label'].tolist(),
                      n_filter=10, n_jobs=1)
print('categories:', df_feat[ut.COL_CAT].unique().tolist())
# Show abs_auc alongside feature and category when the column is present
cols = [ut.COL_FEATURE, ut.COL_CAT]
if 'abs_auc' in df_feat.columns:
    cols.append('abs_auc')
df_feat[cols].head()

fetch_uniprot queries the UniProt REST API per entry and maps the returned features array into the canonical schema: bond features expand to two endpoints plus a bond_id, signal/propeptide/transit cleavage P1 anchors come from the processing-span ends, and SITE is description-routed. It then filters by evidence; evidence='manual' keeps experimental (ECO:0000269) and combinatorial (ECO:0007744) evidence and drops by-similarity (ECO:0000250):

# Not executed in this notebook: requires network access to the UniProt REST API
ap = aa.AnnotationPreprocessor()
df_annot = ap.fetch_uniprot(
    df_seq=df_seq,                              # entry column = UniProt accessions
    features=['phospho', 'disulfide', 'binding'],
    evidence='manual',
)
dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot,
                     features=['phospho', 'disulfide', 'binding'],
                     on_mismatch='raise')       # off-by-isoform guard
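
The evidence filter can be illustrated offline with a small sketch. It uses plain Python and only the ECO codes named above; `filter_manual`, `MANUAL_ECO`, and the example rows are hypothetical and do not reflect how fetch_uniprot is implemented internally.

```python
# ECO codes kept by evidence='manual', as listed in the text above.
MANUAL_ECO = {"ECO:0000269", "ECO:0007744"}


def filter_manual(rows):
    """Keep only annotation rows backed by manual evidence codes."""
    return [row for row in rows if row["evidence"] in MANUAL_ECO]


rows = [
    {"protein_id": "P00001", "start": 12, "feature_type": "phospho",
     "evidence": "ECO:0000269"},   # experimental: kept
    {"protein_id": "P00001", "start": 15, "feature_type": "phospho",
     "evidence": "ECO:0000250"},   # by similarity: dropped
    {"protein_id": "P00002", "start": 40, "feature_type": "phospho",
     "evidence": "ECO:0007744"},   # combinatorial: kept
]
kept = filter_manual(rows)
for row in kept:
    print(row["protein_id"], row["start"], row["evidence"])
```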