AnnotationPreprocessor.encode

AnnotationPreprocessor.encode(df_seq, df_annot, features, on_mismatch='raise', return_df=False)

Encode df_annot into a [0, 1]-normalized per-residue dict_num.

Converts the canonical annotation table (from fetch_uniprot() or ingest()) into a {entry: (L, D) ndarray} tensor where each dimension corresponds to a registered feature_type and values are normalized to [0, 1]. The result can be stacked with other per-residue tensors via aaanalysis.combine_dict_nums() and consumed directly by CPP.run_num().

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. It defines the target coordinate frame: the residue-identity guard checks each annotated position against sequence.

  • df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).

  • features (list of str) – Registry keys to encode, in the order they should occupy the D axis.

  • on_mismatch ({'raise', 'drop', 'warn'}, default='raise') – Behavior when df_seq[sequence][pos-1] != df_annot.aa for a row carrying a non-empty aa (an off-by-isoform / coordinate-frame error). 'raise' aborts; 'drop' silently skips the row; 'warn' warns and skips.

  • return_df (bool, default=False) – If True, return a tuple (dict_num, df_seq_out) whose second element is the per-row status DataFrame. If False (default), return only dict_num.

Returns:

  • dict_num (dict[str, np.ndarray]) – {entry: (L_entry, D) ndarray} where D == len(features) and L_entry == len(sequence). Stack with other per-residue tensors via aaanalysis.combine_dict_nums().

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean encode_ok column — False for entries that had at least one position skipped due to a residue-identity mismatch under on_mismatch='drop' / 'warn' (always True under 'raise', which aborts instead).

Raises:

ValueError – On invalid arguments, missing schema columns, or a residue-identity mismatch under the default on_mismatch='raise'.
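The residue-identity guard and the encode_ok flag described above can be pictured with a small standalone sketch. This is not the library's implementation: check_residue is a hypothetical helper, the row tuples are made up for illustration, and treating an out-of-range position as a mismatch is an assumption the reference does not state.

```python
import warnings


def check_residue(sequence, pos, aa, on_mismatch="raise"):
    """Return True if an annotation row should be kept (hypothetical helper).

    pos is 1-based, so the residue to compare is sequence[pos - 1].
    Rows with an empty aa carry no expected residue and are not checked.
    """
    if on_mismatch not in ("raise", "drop", "warn"):
        raise ValueError(f"invalid on_mismatch: {on_mismatch!r}")
    if not aa:
        return True
    # Assumption: a position outside the sequence counts as a mismatch.
    observed = sequence[pos - 1] if 1 <= pos <= len(sequence) else None
    if observed == aa:
        return True
    msg = f"position {pos}: expected {aa!r}, found {observed!r}"
    if on_mismatch == "raise":
        raise ValueError(msg)
    if on_mismatch == "warn":
        warnings.warn(msg)
    return False  # 'drop' and 'warn' both skip the row


seq = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
# (entry, 1-based position, expected residue); position 16 is 'S', not 'K'.
rows = [("AF_TINY", 3, "D"), ("AF_TINY", 16, "K")]

encode_ok = {"AF_TINY": True}
for entry, pos, aa in rows:
    if not check_residue(seq, pos, aa, on_mismatch="drop"):
        encode_ok[entry] = False  # at least one position was skipped

print(encode_ok)
```

With on_mismatch='raise' the second row would abort with a ValueError instead of marking the entry, which is why encode_ok is always True under 'raise'.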

Examples

encode maps each annotation in df_annot onto the target df_seq[sequence] and returns a [0, 1]-normalized per-residue dict_num ({entry: (L, D)}) for CPP.run_num, with one dimension per feature_type. The expected residue identity (aa) is checked at every annotated position; on_mismatch='raise' is the default and acts as the off-by-isoform guard. Annotated residues carry their score, and in-coverage residues without an annotation are 0.0.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

ap = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = ap.ingest(df_user)

dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot,
                     features=['hotspot'])
arr = dict_num['AF_TINY']
print('shape (L, D):', arr.shape)
print('annotated positions (1-based):',
      (np.where(arr[:, 0] > 0)[0] + 1).tolist())

shape (L, D): (30, 1)
annotated positions (1-based): [3, 16]
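To make the (L, D) layout concrete, the following NumPy-only sketch rebuilds an array of the same shape by hand from the two rows of df_user above. It is an illustration of the layout, not of how encode works internally: the rows list is a stand-in for the annotation table, and writing scores unchanged assumes they already lie in [0, 1], since the normalization rule is not described here.

```python
import numpy as np

sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
features = ["hotspot"]
# (1-based position, feature_type, score), mirroring the example table.
rows = [(3, "hotspot", 0.92), (16, "hotspot", 0.40)]

# One row per residue, one column per feature; unannotated residues stay 0.0.
arr = np.zeros((len(sequence), len(features)))
for pos, feature_type, score in rows:
    # Convert the 1-based position to a 0-based row index.
    arr[pos - 1, features.index(feature_type)] = score

annotated = (np.where(arr[:, 0] > 0)[0] + 1).tolist()
print(arr.shape, annotated)
```

The column order follows the order of features, which is why the D axis of the real output is controlled by the features argument.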

Stack the result with structure / embedding dict_nums via aa.combine_dict_nums, slice with NumericalFeature.get_parts, then run CPP.run_num.
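Conceptually, stacking per-residue tensors means concatenating them along the feature axis for each entry. The sketch below shows that idea with plain NumPy; it deliberately does not call aa.combine_dict_nums, whose signature and edge-case behavior are not shown on this page, and both dictionaries are placeholder data.

```python
import numpy as np

L = 30
dict_annot = {"AF_TINY": np.zeros((L, 1))}      # stand-in for encode() output
dict_other = {"AF_TINY": np.full((L, 4), 0.5)}  # stand-in for another per-residue tensor

stacked = {}
# Only entries present in both dictionaries can be stacked.
for entry in dict_annot.keys() & dict_other.keys():
    a, b = dict_annot[entry], dict_other[entry]
    if a.shape[0] != b.shape[0]:
        raise ValueError(f"{entry}: residue axes differ ({a.shape[0]} vs {b.shape[0]})")
    # Keep the residue axis (L) and grow the feature axis (D).
    stacked[entry] = np.concatenate([a, b], axis=1)

print(stacked["AF_TINY"].shape)
```

The residue axis has to match across tensors, which is why encode guarantees L_entry == len(sequence) for every entry.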