AnnotationPreprocessor.encode
- AnnotationPreprocessor.encode(df_seq, df_annot, features, on_mismatch='raise', return_df=False)
Encode df_annot into a [0, 1]-normalized per-residue dict_num.
Converts the canonical annotation table (from fetch_uniprot() or ingest()) into a {entry: (L, D) ndarray} tensor where each dimension corresponds to a registered feature_type and values are normalized to [0, 1]. The result can be stacked with other per-residue tensors via aaanalysis.combine_dict_nums() and consumed directly by CPP.run_num().
Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. It defines the target coordinate frame: the residue-identity guard checks each annotated position against sequence.
df_annot (pd.DataFrame) – Annotation table in the canonical per-residue schema (from fetch_uniprot() or ingest()).
features (list of str) – Registry keys to encode, in the order they should occupy the D axis.
on_mismatch ({'raise', 'drop', 'warn'}, default='raise') – Behavior when df_seq[sequence][pos-1] != df_annot.aa for a row carrying a non-empty aa (an off-by-isoform / coordinate-frame error). 'raise' aborts; 'drop' silently skips the row; 'warn' warns and skips the row.
return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element, (dict_num, df_seq_out). If False (default), return only dict_num.
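The three on_mismatch modes can be pictured with a small pandas-only sketch. It is not the library's implementation: the helper name check_residue_identity and the column names 'entry', 'start' and 'aa' are assumptions made for illustration, and positions are assumed to lie within the sequence.

```python
import warnings
import pandas as pd

def check_residue_identity(df_seq, df_annot, on_mismatch="raise"):
    """Hypothetical helper (not part of aaanalysis): filter annotation rows
    whose expected residue disagrees with the target sequence."""
    if on_mismatch not in ("raise", "drop", "warn"):
        raise ValueError(f"on_mismatch must be 'raise', 'drop' or 'warn', got {on_mismatch!r}")
    seqs = dict(zip(df_seq["entry"], df_seq["sequence"]))
    keep = []
    for entry, pos, aa in zip(df_annot["entry"], df_annot["start"], df_annot["aa"]):
        observed = seqs[entry][int(pos) - 1]  # positions are 1-based
        if pd.isna(aa) or aa == "" or observed == aa:
            keep.append(True)  # empty aa means there is nothing to check
            continue
        msg = f"{entry}: expected {aa!r} at position {pos}, found {observed!r}"
        if on_mismatch == "raise":
            raise ValueError(msg)
        if on_mismatch == "warn":
            warnings.warn(msg)
        keep.append(False)  # 'drop' and 'warn' both skip the row
    return df_annot.loc[keep].reset_index(drop=True)

df_seq = pd.DataFrame({"entry": ["AF_TINY"],
                       "sequence": ["ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"]})
df_annot = pd.DataFrame({"entry": ["AF_TINY", "AF_TINY"],
                         "start": [3, 16],
                         "aa": ["D", "R"]})  # position 16 is 'S', so 'R' is off by one
kept = check_residue_identity(df_seq, df_annot, on_mismatch="drop")
print(kept)
```

With on_mismatch='drop' only the row at position 3 survives; the same call with the default 'raise' would stop at the row for position 16.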
- Returns:
dict_num (dict[str, np.ndarray]) – {entry: (L_entry, D) ndarray} where D == len(features) and L_entry == len(sequence). Stack with other per-residue tensors via aaanalysis.combine_dict_nums().
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean encode_ok column: False for entries that had at least one position skipped due to a residue-identity mismatch under on_mismatch='drop' / 'warn' (always True under 'raise', which aborts instead).
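The shape contract of dict_num (one row per residue, one column per requested feature) can be illustrated without the library. The sketch below is a hypothetical stand-in, not the actual encoder: sketch_encode and the column names 'entry', 'start', 'feature_type' and 'score' are assumptions, and scores are assumed to lie in [0, 1] already, so no normalization step is shown.

```python
import numpy as np
import pandas as pd

def sketch_encode(df_seq, df_annot, features):
    """Hypothetical stand-in: place scores into {entry: (L, D)} arrays."""
    col = {f: j for j, f in enumerate(features)}  # feature order fixes the D axis
    dict_num = {}
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        arr = np.zeros((len(seq), len(features)), dtype=float)  # non-annotated residues stay 0.0
        rows = df_annot[df_annot["entry"] == entry]
        for pos, ftype, score in zip(rows["start"], rows["feature_type"], rows["score"]):
            if ftype in col:
                arr[int(pos) - 1, col[ftype]] = score  # 1-based position to 0-based row
        dict_num[entry] = arr
    return dict_num

df_seq = pd.DataFrame({"entry": ["AF_TINY"],
                       "sequence": ["ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"]})
df_annot = pd.DataFrame({"entry": ["AF_TINY", "AF_TINY"],
                         "start": [3, 16],
                         "feature_type": ["hotspot", "hotspot"],
                         "score": [0.92, 0.40]})
dict_num = sketch_encode(df_seq, df_annot, features=["hotspot"])
arr = dict_num["AF_TINY"]
print(arr.shape)
```

Here arr has one row per residue of the 30-residue sequence and a single column for 'hotspot'.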
- Raises:
ValueError – On invalid arguments, missing schema columns, or (default) a residue-identity mismatch.
Examples
encode maps each annotation in df_annot onto the target df_seq[sequence] and returns a [0, 1]-normalized per-residue dict_num ({entry: (L, D)}) for CPP.run_num, with one dimension per feature_type. The expected residue identity (aa) is checked at every annotated position (on_mismatch='raise' by default, the off-by-isoform guard). Annotated residues carry their score; in-coverage non-annotated residues are 0.0.

    import warnings
    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    ap = aa.AnnotationPreprocessor(verbose=False)
    df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                           'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

    # A small user/predictor table -> Functional sites (open vocabulary).
    df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                            ut.COL_START: [3, 16],
                            ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                            ut.COL_SCORE: [0.92, 0.40]})
    df_annot = ap.ingest(df_user)
    dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot, features=['hotspot'])
    arr = dict_num['AF_TINY']
    print('shape (L, D):', arr.shape)
    print('annotated positions (1-based):', list(np.where(arr[:, 0] > 0)[0] + 1))

Output:

    shape (L, D): (30, 1)
    annotated positions (1-based): [np.int64(3), np.int64(16)]
Stack the result with structure / embedding dict_nums via aa.combine_dict_nums, slice with NumericalFeature.get_parts, then run CPP.run_num.
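As a rough picture of what stacking per-residue tensors means, the NumPy-only sketch below joins arrays along the feature (D) axis for entries present in every input. This page does not spell out how aa.combine_dict_nums works internally, so the helper stack_dict_nums and its behaviour (shared entries only, strict length check) are assumptions for illustration rather than the library's API.

```python
import numpy as np

def stack_dict_nums(*dict_nums):
    """Hypothetical helper: concatenate (L, D_i) arrays along the D axis per entry."""
    shared = set(dict_nums[0]).intersection(*dict_nums[1:])
    out = {}
    for entry in sorted(shared):
        arrays = [d[entry] for d in dict_nums]
        lengths = {a.shape[0] for a in arrays}
        if len(lengths) != 1:
            # All tensors must describe the same sequence length L
            raise ValueError(f"{entry}: per-residue lengths differ: {sorted(lengths)}")
        out[entry] = np.concatenate(arrays, axis=1)
    return out

annot = {"AF_TINY": np.zeros((30, 1))}       # e.g. one annotation feature
struct = {"AF_TINY": np.full((30, 3), 0.5)}  # e.g. three structure-derived features
combined = stack_dict_nums(annot, struct)
print(combined["AF_TINY"].shape)
```

The combined array keeps L = 30 rows and has D = 1 + 3 columns, with the annotation dimension first because it was passed first.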