aaanalysis.AnnotationPreprocessor
- class aaanalysis.AnnotationPreprocessor(verbose=True)
Bases: object

Preprocessing class for per-residue PTM / functional-site annotations [Breimann25a].

Mirrors EmbeddingPreprocessor's instance-based design as the annotation-side companion: it fetches (UniProt) or ingests (user / predictor) annotations and encodes them into the dict_num tensor that NumericalFeature.get_parts() slices into per-part inputs for CPP.run_num(), plus the (df_scales, df_cat) metadata pair.

Added in version 1.1.0.
- Parameters:
  verbose (bool)
Methods

build_cat([features, dim_names_override]): Build the df_cat metadata frame for features (corpus-free).
build_scales([df_seq, dict_num, features, ...]): Build df_scales by context-free per-AA averaging of the corpus.
encode([df_seq, df_annot, features, ...]): Encode df_annot into a [0, 1]-normalized per-residue dict_num.
fetch_uniprot([df_seq, features, evidence, ...]): Fetch UniProt features for every entry and map to df_annot.
ingest([df_user]): Ingest a user / predictor annotation table into df_annot.
register_feature([key, subcategory, ...]): Register (or override) an open-vocabulary Functional-sites key.
to_df_seq([df_seq, df_annot, feature_type, ...]): Project annotations onto df_seq for AAWindowSampler negative sampling.

- __init__(verbose=True)
- Parameters:
  verbose (bool, default=True) – If True, verbose outputs are enabled.
Notes

- df_annot is the canonical per-residue schema with columns protein_id, start, end, aa, feature_type, category, source, evidence, score, bond_id (positions are 1-based, UniProt-canonical frame).
- Encoder values are normalized to [0, 1]; non-annotated in-coverage residues are 0.0; NaN marks genuinely unresolved positions.
- Bond features (disulfide / cross-link) expand to two single-residue endpoints sharing a bond_id; cleavage P1 anchors come from SIGNAL / PROPEP / TRANSIT span ends, not from the SITE grab-bag.
- Two methods have no StructurePreprocessor analog by design, not oversight: register_feature() is the surface of the open 'Functional sites' vocabulary (structure's registry is closed), and to_df_seq() exports a seq-mode window-split because here an annotation is the window label (a structure feature never is).
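To make the schema in the Notes concrete, the following is a minimal pandas sketch of what a df_annot table could look like, including one bond expanded into two endpoint rows. The column order is copied from the Notes; the helper expand_bond, the protein ID, the positions, and the bond_id format are all hypothetical and only illustrate the layout, not how the library builds it.

```python
import pandas as pd

# Column order copied from the Notes above; every value below is made up.
COLS_ANNOT = ["protein_id", "start", "end", "aa", "feature_type",
              "category", "source", "evidence", "score", "bond_id"]


def expand_bond(protein_id, pos_a, pos_b, bond_id, feature_type="disulfide"):
    """Hypothetical helper: turn one bond into two single-residue rows."""
    return [
        [protein_id, pos, pos, "C", feature_type, "PTMs", "UniProt",
         "ECO:0000269", 1.0, bond_id]
        for pos in (pos_a, pos_b)
    ]


rows = [
    # A single phosphoserine at 1-based position 15 (start == end).
    ["P_DEMO", 15, 15, "S", "phospho", "PTMs", "UniProt", "ECO:0000269", 1.0, None],
]
# One disulfide between Cys22 and Cys48 becomes two rows with the same bond_id.
rows += expand_bond("P_DEMO", 22, 48, bond_id="P_DEMO_bond_1")

df_annot_demo = pd.DataFrame(rows, columns=COLS_ANNOT)
print(df_annot_demo[["start", "end", "aa", "feature_type", "bond_id"]])
```

Keeping both endpoints as ordinary single-residue rows means a bond can be encoded per residue like any other annotation, while the shared bond_id still records which two residues belong together.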
See also

- StructurePreprocessor: the structure-side analog (PDB / DSSP / AlphaFold).
- EmbeddingPreprocessor: the PLM-embedding analog.
- aaanalysis.combine_dict_nums(): stitch multiple dict_nums.
- CPP.run_num(): the downstream consumer.
Examples
AnnotationPreprocessor turns per-residue annotations (UniProt PTMs / functional sites, or your own predictor labels) into [0, 1]-normalized per-residue dict_num tensors for CPP.run_num. This is the same shape that StructurePreprocessor and EmbeddingPreprocessor produce, so the outputs stack via aa.combine_dict_nums.

Two categories are registered: PTMs (closed UniProt vocabulary: phospho, glyco, disulfide, cleavage, …) and Functional sites (open vocabulary: BINDING/ACT_SITE/DNA_BIND seeds plus user/predictor keys).
This notebook runs fully offline. The live fetch_uniprot call (which hits the UniProt REST API) is shown at the end but not executed.

    import warnings
    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    ap = aa.AnnotationPreprocessor(verbose=False)

    # A small labeled corpus: label-1 proteins carry phospho on S/T in the TMD
    # slice (positions 11-30) far more often than label-0, giving a
    # sparse-presence, class-discriminating signal.
    rng = np.random.default_rng(0)
    aas = list(ut.LIST_CANONICAL_AA)
    rows, annot = [], []
    for label in (0, 1):
        for k in range(5):
            entry = f'A{label}_{k}'
            seq = ''.join(rng.choice(aas, size=40))
            rows.append({ut.COL_ENTRY: entry, ut.COL_SEQ: seq, 'label': label,
                         'tmd_start': 11, 'tmd_stop': 30})
            for i, ch in enumerate(seq):
                pos = i + 1
                if ch in 'ST' and 11 <= pos <= 30 and rng.random() < (0.8 if label == 1 else 0.1):
                    annot.append([entry, pos, pos, ch, 'phospho', 'PTMs', 'UniProt',
                                  'ECO:0000269', 1.0, None])
    df_seq = pd.DataFrame(rows)
    df_seq.head()
Any user/predictor table with protein_id, start, and feature_type becomes Functional sites rows. Unknown feature_type values auto-register (here a hypothetical hotspot predictor with per-residue confidence in [0, 1]).

    df_user = pd.DataFrame({
        ut.COL_PROTEIN_ID: ['A1_0', 'A1_0', 'A0_0'],
        ut.COL_START: [12, 15, 13],
        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot', 'hotspot'],
        ut.COL_SCORE: [0.92, 0.40, 0.31],
    })
    df_annot_user = ap.ingest(df_user)
    print("auto-registered 'hotspot':", 'hotspot' in ap._registry)
    df_annot_user
In practice you would call ap.fetch_uniprot(df_seq, features=['phospho', ...]) to build this from the UniProt REST API. Here we use the hand-built phospho annotation table from the corpus above; it follows the exact same canonical schema (ut.COLS_ANNOT).

    df_annot = pd.DataFrame(annot, columns=ut.COLS_ANNOT)
    print(f'{len(df_annot)} phospho residues across {df_annot[ut.COL_PROTEIN_ID].nunique()} proteins')
    df_annot.head()
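When an annotation table is built by hand, as above, a quick structural check before encoding can catch schema slips early. The sketch below is one possible way to do that with plain pandas. check_annot is a hypothetical helper, not part of aaanalysis, and it only tests properties stated in the Notes (the canonical column set and 1-based positions) plus the assumption that end never precedes start.

```python
import pandas as pd

# Canonical column names as listed in the Notes of this page.
COLS_ANNOT = ["protein_id", "start", "end", "aa", "feature_type",
              "category", "source", "evidence", "score", "bond_id"]


def check_annot(df):
    """Hypothetical helper: return a list of schema problems (empty if none)."""
    missing = [c for c in COLS_ANNOT if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    if (df["start"] < 1).any():
        problems.append("start must be 1-based (>= 1)")
    if (df["end"] < df["start"]).any():  # assumption: spans run left to right
        problems.append("end must not precede start")
    return problems


good = pd.DataFrame(
    [["P_DEMO", 12, 12, "S", "phospho", "PTMs", "UniProt", "ECO:0000269", 1.0, None]],
    columns=COLS_ANNOT,
)
bad = good.assign(start=0)  # 0-based position slipped in

print(check_annot(good))
print(check_annot(bad))
```

Returning a list of messages instead of raising lets the caller decide whether a problem is fatal or just worth logging.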
encode maps each annotation onto the target df_seq[sequence], checking the expected residue identity (aa) at every position. A mismatch raises by default (on_mismatch='raise'), turning off-by-isoform errors into loud failures instead of silent mislabeling. Annotated residues carry the score; in-coverage non-annotated residues are 0.0.

    dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot, features=['phospho'])
    arr = dict_num['A1_0']
    print('per-entry shape (L, D):', arr.shape)
    print('phospho positions in A1_0 (1-based):', list(np.where(arr[:, 0] > 0)[0] + 1))
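The residue-identity check can be pictured with a small NumPy sketch. It is an illustration of the idea described above, not the library's implementation: encode_one and its strict flag are hypothetical names, and writing NaN for a mismatch in the non-strict branch is an assumption made here for demonstration.

```python
import numpy as np


def encode_one(seq, annotations, strict=True):
    """Hypothetical helper: encode (pos, aa, score) triples into an (L, 1) array.

    Positions are 1-based. This only illustrates the residue-identity check.
    """
    arr = np.zeros((len(seq), 1), dtype=float)  # non-annotated residues stay 0.0
    for pos, aa, score in annotations:
        if not 1 <= pos <= len(seq):
            raise ValueError(f"position {pos} outside sequence of length {len(seq)}")
        found = seq[pos - 1]
        if found != aa:
            if strict:
                raise ValueError(f"expected {aa} at position {pos}, found {found}")
            arr[pos - 1, 0] = np.nan  # assumption: treat as unresolved
            continue
        arr[pos - 1, 0] = score
    return arr


seq = "MASTKLSPTA"
arr = encode_one(seq, [(3, "S", 1.0), (9, "T", 0.5)])
positions = (np.where(arr[:, 0] > 0)[0] + 1).tolist()
print(arr.shape, positions)

# An annotation numbered against a different isoform points at the wrong residue.
try:
    encode_one(seq, [(4, "S", 1.0)])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print("mismatch raised:", mismatch_raised)
```

Failing on the first mismatch is the conservative choice: a shifted numbering frame usually affects every annotation of a protein, so one wrong residue is a strong hint that all positions are off.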
build_scales gives the corpus-derived per-AA means that keep run_num's redundancy cor-gate live; build_cat is the corpus-free metadata that tags each dimension with its category (PTMs / Functional sites) and locked color.

    df_scales = ap.build_scales(df_seq=df_seq, dict_num=dict_num, features=['phospho'])
    df_cat = ap.build_cat(features=['phospho'])
    print('phospho color:', ut.DICT_COLOR_CAT[df_cat[ut.COL_CAT].iloc[0]])
    df_cat
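As a rough picture of what "context-free per-AA averaging" means, the sketch below pools every residue of the corpus by amino acid and averages each feature dimension, ignoring where in the sequence the residue sits. per_aa_means and the toy data are hypothetical; skipping NaN values in the mean is an assumption of this sketch, and the real build_scales may differ in details.

```python
import numpy as np
import pandas as pd


def per_aa_means(dict_seq, dict_num, dim_names):
    """Hypothetical helper: average each feature dimension per amino acid."""
    frames = []
    for entry, seq in dict_seq.items():
        df = pd.DataFrame(dict_num[entry], columns=dim_names)
        df["aa"] = list(seq)  # one row per residue, tagged with its letter
        frames.append(df)
    pooled = pd.concat(frames, ignore_index=True)
    # groupby().mean() skips NaN, so unresolved positions do not bias the mean.
    return pooled.groupby("aa")[dim_names].mean()


dict_seq = {"P1": "SAS", "P2": "STS"}
dict_num = {
    "P1": np.array([[1.0], [0.0], [0.0]]),
    "P2": np.array([[1.0], [0.5], [np.nan]]),
}
df_scales_demo = per_aa_means(dict_seq, dict_num, dim_names=["phospho"])
print(df_scales_demo)
```

In this toy corpus serine is annotated in two of its three resolved occurrences, so its mean is 2/3, while alanine (never annotated) averages to 0.0.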
The (dict_num, df_scales, df_cat) triple is drop-in compatible with CPP.run_num. Labels come from your own df_seq (the annotation is a feature, not the label). The resulting df_feat carries category='PTMs'.

    nf = aa.NumericalFeature()
    df_parts, dict_num_parts = nf.get_parts(df_seq=df_seq, dict_num=dict_num)
    cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales, df_cat=df_cat)
    df_feat = cpp.run_num(dict_num_parts=dict_num_parts, labels=df_seq['label'].tolist(),
                          n_filter=10, n_jobs=1)
    print('categories:', df_feat[ut.COL_CAT].unique().tolist())

    # Show abs_auc alongside feature and category only if the column exists.
    cols = [ut.COL_FEATURE, ut.COL_CAT]
    if 'abs_auc' in df_feat.columns:
        cols.append('abs_auc')
    df_feat[cols].head()
fetch_uniprot queries the UniProt REST API per entry, maps the features array into the canonical schema (bond features expand to two endpoints plus a bond_id; signal/propeptide/transit cleavage P1 anchors come from the processing-span ends; SITE is description-routed), and filters by evidence (evidence='manual' keeps experimental ECO:0000269 and combinatorial ECO:0007744, dropping by-similarity ECO:0000250):

    ap = aa.AnnotationPreprocessor()
    df_annot = ap.fetch_uniprot(
        df_seq=df_seq,  # entry column = UniProt accessions
        features=['phospho', 'disulfide', 'binding'],
        evidence='manual',
    )
    dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot,
                         features=['phospho', 'disulfide', 'binding'],
                         on_mismatch='raise')  # off-by-isoform guard
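The evidence filter described above can be mimicked offline on any table that has an evidence column. The following pandas sketch keeps only the two ECO codes the text names for evidence='manual'. keep_manual and the demo rows are hypothetical, and matching the code by exact string equality is an assumption; it says nothing about how fetch_uniprot parses evidence internally.

```python
import pandas as pd

# ECO codes named in the text above for evidence='manual'.
MANUAL_ECO = {"ECO:0000269", "ECO:0007744"}


def keep_manual(df_annot):
    """Hypothetical helper: keep rows whose evidence code is in MANUAL_ECO."""
    mask = df_annot["evidence"].isin(MANUAL_ECO)
    return df_annot[mask].reset_index(drop=True)


df_demo = pd.DataFrame({
    "protein_id": ["P_DEMO"] * 3,
    "start": [15, 40, 77],
    "feature_type": ["phospho", "phospho", "binding"],
    "evidence": ["ECO:0000269", "ECO:0000250", "ECO:0007744"],
})
df_manual = keep_manual(df_demo)
print(df_manual)
```

The by-similarity row (ECO:0000250) is dropped, so only annotations with direct experimental or combinatorial support remain for encoding.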