aaanalysis.AnnotationPreprocessor.encode
- AnnotationPreprocessor.encode(df_seq=None, df_annot=None, features=None, on_mismatch='raise', return_df=False, verbose=None)
Encode df_annot into a [0, 1]-normalized per-residue dict_num.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. This defines the target coordinate frame; the residue-identity guard checks each annotated position against sequence.
df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).
features (list of str) – Registry keys to encode, in the order they should occupy the D axis.
on_mismatch ({'raise', 'drop', 'warn'}, default='raise') – Behavior when df_seq[sequence][pos-1] != df_annot.aa for a row carrying a non-empty aa (an off-by-isoform / coordinate-frame error). 'raise' aborts; 'drop' silently skips the row; 'warn' warns and skips.
return_df (bool, default=False) – If True, also return the per-row status DataFrame as the second element of a tuple (dict_num, df_seq_out). If False (default), return only dict_num.
- Returns:
dict_num (dict[str, np.ndarray]) – {entry: (L_entry, D) ndarray} where D == len(features) and L_entry == len(sequence). Stack with other per-residue tensors via aaanalysis.combine_dict_nums().
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean encode_ok column: False for entries that had at least one position skipped due to a residue-identity mismatch under on_mismatch='drop'/'warn' (always True under 'raise', which aborts instead).
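To make the documented shape of dict_num concrete, here is a minimal NumPy sketch that builds a mapping with the same layout by hand. It does not call aaanalysis. The sequences, the registry keys feat_a and feat_b, and the second tensor dict_other are placeholders invented for illustration. Concatenating along the D axis is only one plausible reading of "stack", not a statement of what combine_dict_nums() does internally.

```python
import numpy as np

sequences = {"P1": "MKTAY", "P2": "GAV"}   # placeholder sequences
features = ["feat_a", "feat_b"]            # hypothetical registry keys

# Build a dict_num-shaped mapping: one (L_entry, D) array per entry.
dict_num = {
    entry: np.zeros((len(seq), len(features)), dtype=float)
    for entry, seq in sequences.items()
}

# Mark residue 2 of P1 (1-based position) for the first feature.
# Column order follows the order of `features`.
dict_num["P1"][2 - 1, features.index("feat_a")] = 1.0

# A second per-residue tensor with one channel, same entries and lengths.
dict_other = {
    entry: np.full((len(seq), 1), 0.5) for entry, seq in sequences.items()
}

# Assumed stacking: concatenate along the feature (D) axis, entry by entry.
stacked = {
    entry: np.concatenate([dict_num[entry], dict_other[entry]], axis=1)
    for entry in dict_num
}
shapes = {entry: arr.shape for entry, arr in stacked.items()}
```

Each entry keeps its own length along the first axis, so the arrays cannot be packed into a single rectangular tensor without padding; a dict keyed by entry avoids that.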
- Raises:
ValueError – On invalid arguments, missing schema columns, or (default) a residue-identity mismatch.
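The on_mismatch modes and the encode_ok flag described above can be pictured with a small standalone sketch. This is not the library's implementation. The function check_rows is hypothetical, plain dicts stand in for the df_seq and df_annot rows, and the field names entry, pos, and aa are assumed from the expression in the on_mismatch description. Positions are assumed to be 1-based and within the sequence length.

```python
import warnings


def check_rows(sequences, annot_rows, on_mismatch="raise"):
    """Illustrative residue-identity guard (hypothetical helper).

    sequences:  {entry: sequence string}, standing in for df_seq
    annot_rows: list of dicts with 'entry', 'pos' (1-based) and 'aa',
                standing in for df_annot rows
    Returns (kept_rows, encode_ok), where encode_ok maps entry -> bool.
    """
    if on_mismatch not in {"raise", "drop", "warn"}:
        raise ValueError(f"invalid on_mismatch: {on_mismatch!r}")

    kept = []
    encode_ok = {entry: True for entry in sequences}
    for row in annot_rows:
        entry, pos, aa = row["entry"], row["pos"], row["aa"]
        observed = sequences[entry][pos - 1]  # convert 1-based to 0-based
        # Rows with an empty 'aa' carry no residue to compare, so they pass.
        if aa and observed != aa:
            msg = f"{entry}: position {pos} is {observed!r}, annotation says {aa!r}"
            if on_mismatch == "raise":
                raise ValueError(msg)
            if on_mismatch == "warn":
                warnings.warn(msg)
            encode_ok[entry] = False  # at least one position was skipped
            continue
        kept.append(row)
    return kept, encode_ok


sequences = {"P1": "MKTAY", "P2": "GAVLI"}
rows = [
    {"entry": "P1", "pos": 2, "aa": "K"},  # matches the sequence
    {"entry": "P2", "pos": 3, "aa": "L"},  # mismatch: position 3 is 'V'
    {"entry": "P2", "pos": 5, "aa": ""},   # empty aa: not checked
]
kept, encode_ok = check_rows(sequences, rows, on_mismatch="drop")
```

With on_mismatch="drop", the mismatching P2 row is skipped and P2 is flagged False while P1 stays True. Calling the same helper with on_mismatch="raise" raises ValueError at the first mismatch instead.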