AnnotationPreprocessor.encode

AnnotationPreprocessor.encode(df_seq, df_annot, features, on_mismatch='raise', return_df=False)

Encode df_annot into a [0, 1]-normalized per-residue dict_num.

Converts the canonical annotation table (from fetch_uniprot() or ingest()) into a {entry: (L, D) ndarray} tensor where each dimension corresponds to a registered feature_type and values are normalized to [0, 1]. The result can be stacked with other per-residue tensors via aaanalysis.combine_dict_nums() and consumed directly by CPP.run_num().
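The mapping described above can be pictured with a minimal NumPy sketch. It is not the library's implementation: encode_sketch, the plain (entry, pos, feature, score) tuples, and the clipping step standing in for normalization are all assumptions made for illustration.

```python
import numpy as np

def encode_sketch(sequences, annotations, features):
    """Hypothetical helper: build {entry: (L, D) array} from annotation tuples.

    sequences   : {entry: sequence string}
    annotations : iterable of (entry, pos, feature, score), pos is 1-based
    features    : feature names, in the order they occupy the D axis
    """
    column = {name: j for j, name in enumerate(features)}
    # Every residue starts at 0.0 (non-annotated).
    dict_num = {entry: np.zeros((len(seq), len(features)))
                for entry, seq in sequences.items()}
    for entry, pos, feature, score in annotations:
        if entry not in dict_num or feature not in column:
            continue  # not requested, so ignored in this sketch
        if not 1 <= pos <= len(sequences[entry]):
            raise ValueError(f"position {pos} is outside {entry!r}")
        # Assumption: scores are already on a [0, 1] scale; clipping is only
        # a stand-in for whatever normalization the library applies.
        dict_num[entry][pos - 1, column[feature]] = float(np.clip(score, 0.0, 1.0))
    return dict_num

sequences = {"AF_TINY": "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"}
annotations = [("AF_TINY", 3, "hotspot", 0.92),
               ("AF_TINY", 16, "hotspot", 0.40)]
dict_num_sketch = encode_sketch(sequences, annotations, features=["hotspot"])
print(dict_num_sketch["AF_TINY"].shape)
```

The 1-based annotation position becomes row pos - 1, and the position of a name in features decides its column.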

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. These sequences define the target coordinate frame: the residue-identity guard checks each annotated position against sequence.
df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).
features (list of str) – Registry keys to encode, in the order they should occupy the D axis.
on_mismatch ({'raise', 'drop', 'warn'}, default='raise') – Behavior when df_seq[sequence][pos-1] != df_annot.aa for a row carrying a non-empty aa (an off-by-isoform / coordinate-frame error). 'raise' aborts with a ValueError; 'drop' silently skips the row; 'warn' emits a warning and skips the row.
return_df (bool, default=False) – If True, return a tuple (dict_num, df_seq_out), where df_seq_out is the per-entry status DataFrame. If False (default), return only dict_num.
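The three on_mismatch policies can be sketched as a small standalone check. check_residue is a hypothetical helper written for this illustration, not part of aaanalysis, and treating an out-of-range position as a mismatch is an assumption of the sketch.

```python
import warnings

def check_residue(sequence, pos, aa, on_mismatch="raise"):
    """Hypothetical helper: return True if the annotated row should be kept.

    pos is 1-based; an empty aa means there is nothing to verify.
    """
    if on_mismatch not in ("raise", "drop", "warn"):
        raise ValueError(f"invalid on_mismatch: {on_mismatch!r}")
    if not aa:
        return True
    # Assumption: a position outside the sequence counts as a mismatch.
    observed = sequence[pos - 1] if 1 <= pos <= len(sequence) else None
    if observed == aa:
        return True
    message = f"position {pos}: expected {aa!r}, found {observed!r}"
    if on_mismatch == "raise":
        raise ValueError(message)
    if on_mismatch == "warn":
        warnings.warn(message)
    return False  # 'drop' and 'warn' both skip the row

seq = "ACDEFGHIKL"
print(check_residue(seq, 3, "D"))                       # residue 3 is 'D'
print(check_residue(seq, 3, "E", on_mismatch="drop"))   # mismatch, row skipped
```

A mismatch of this kind typically means the annotation coordinates refer to a different isoform than the sequence in df_seq.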

Returns:

dict_num (dict[str, np.ndarray]) – {entry: (L_entry, D) ndarray} where D == len(features) and L_entry == len(sequence). Stack with other per-residue tensors via aaanalysis.combine_dict_nums().
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean encode_ok column — False for entries that had at least one position skipped due to a residue-identity mismatch under on_mismatch='drop' / 'warn' (always True under 'raise', which aborts instead).

Raises:

ValueError – On invalid arguments, missing schema columns, or (default) a residue-identity mismatch.

Examples

encode maps each annotation in df_annot onto the target df_seq[sequence] and returns a [0, 1]-normalized per-residue dict_num ({entry: (L, D)}) for CPP.run_num, with one dimension per feature_type. The expected residue identity (aa) is checked at every annotated position; this is the off-by-isoform guard, and on_mismatch='raise' is the default. Annotated residues carry their score, and in-coverage non-annotated residues are 0.0.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

annp = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = annp.ingest(df_user)

dict_num = annp.encode(df_seq=df_seq, df_annot=df_annot,
                     features=['hotspot'])
arr = dict_num['AF_TINY']
print('shape (L, D):', arr.shape)
print('annotated positions (1-based):',
      list(np.where(arr[:, 0] > 0)[0] + 1))

shape (L, D): (30, 1)
annotated positions (1-based): [np.int64(3), np.int64(16)]
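The np.int64(...) wrappers in the output come from calling list() on a NumPy integer array under NumPy 2. If plain Python integers are preferred, .tolist() converts them. The array below is a stand-in with the same shape and annotated rows as the example, not output of encode.

```python
import numpy as np

# Stand-in for dict_num['AF_TINY']: 30 residues, one feature column.
arr_demo = np.zeros((30, 1))
arr_demo[[2, 15], 0] = [0.92, 0.40]

# flatnonzero gives 0-based row indices; +1 converts to 1-based positions.
positions = (np.flatnonzero(arr_demo[:, 0] > 0) + 1).tolist()
print(positions)  # [3, 16]
```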

# Further parameters: ``on_mismatch`` sets the off-by-isoform policy
# ('raise' | 'drop' | 'warn'), and ``return_df=True`` additionally returns
# a copy of df_seq with a boolean ``encode_ok`` status column.
dict_num, df_encoded = annp.encode(
    df_seq=df_seq, df_annot=df_annot, features=['hotspot'],
    on_mismatch='warn', return_df=True)
aa.display_df(df_encoded, n_rows=10, show_shape=True)

DataFrame shape: (1, 3)

	entry	sequence	encode_ok
1	AF_TINY	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	True

Stack the result with structure / embedding dict_nums via aa.combine_dict_nums, slice with NumericalFeature.get_parts, then run CPP.run_num.
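Conceptually, stacking per-residue tensors means concatenating them along the feature (D) axis for each entry. The sketch below shows that idea in plain NumPy. stack_dict_nums is a hypothetical helper, not the implementation of aa.combine_dict_nums, and keeping only entries present in every input is an assumption of the sketch.

```python
import numpy as np

def stack_dict_nums(*dict_nums):
    """Hypothetical helper: concatenate (L, D_i) arrays per entry along D."""
    # Assumption: only entries present in every input are kept.
    shared = set(dict_nums[0]).intersection(*dict_nums[1:])
    # np.concatenate raises if the arrays of an entry disagree on L.
    return {entry: np.concatenate([d[entry] for d in dict_nums], axis=1)
            for entry in sorted(shared)}

rng = np.random.default_rng(0)
annot = {"AF_TINY": np.zeros((30, 1))}        # e.g. one annotation feature
embed = {"AF_TINY": rng.random((30, 4))}      # e.g. a 4-dimensional embedding
stacked = stack_dict_nums(annot, embed)
print(stacked["AF_TINY"].shape)  # (30, 5)
```

All inputs must share the same sequence length per entry, which is why encode uses df_seq as the single coordinate frame.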