aaanalysis.AnnotationPreprocessor.encode

AnnotationPreprocessor.encode(df_seq=None, df_annot=None, features=None, on_mismatch='raise', return_df=False, verbose=None)

Encode df_annot into a [0, 1]-normalized per-residue dict_num.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. It defines the target coordinate frame: the residue-identity guard checks each annotated position against sequence.

  • df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).

  • features (list of str) – Registry keys to encode, in the order they should occupy the D axis.

  • on_mismatch ({'raise', 'drop', 'warn'}, default='raise') – Behavior when the residue at a row's position in the sequence differs from that row's non-empty aa, i.e. df_seq[sequence][pos-1] != df_annot.aa (an off-by-isoform / coordinate-frame error). 'raise' aborts; 'drop' silently skips the row; 'warn' emits a warning and skips the row.

  • return_df (bool, default=False) – If True, return a tuple (dict_num, df_seq_out), where df_seq_out is the per-row status DataFrame. If False (default), return only dict_num.

  • verbose (Optional[bool])
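
The residue-identity guard described for on_mismatch can be illustrated with plain pandas. This is only a sketch of the check, not the library's implementation: the helper check_residue_identity is hypothetical, and the presence of an entry column in df_annot (next to pos and aa) is an assumption made for the example.

```python
import warnings

import pandas as pd


def check_residue_identity(df_seq, df_annot, on_mismatch="raise"):
    """Hypothetical helper: keep annotation rows whose aa matches the sequence.

    Assumes df_annot has 'entry', 'pos' (1-based) and 'aa' columns.
    """
    if on_mismatch not in ("raise", "drop", "warn"):
        raise ValueError(f"Invalid on_mismatch: {on_mismatch!r}")
    seqs = dict(zip(df_seq["entry"], df_seq["sequence"]))
    keep = []
    for row in df_annot.itertuples(index=False):
        seq = seqs[row.entry]
        # Rows with an empty aa carry nothing to compare and are kept
        has_aa = isinstance(row.aa, str) and row.aa != ""
        if has_aa and seq[row.pos - 1] != row.aa:
            msg = (f"{row.entry}: expected {row.aa!r} at position {row.pos}, "
                   f"found {seq[row.pos - 1]!r}")
            if on_mismatch == "raise":
                raise ValueError(msg)
            if on_mismatch == "warn":
                warnings.warn(msg)
            keep.append(False)  # 'drop' and 'warn' both skip the row
        else:
            keep.append(True)
    return df_annot[keep].reset_index(drop=True)


df_seq = pd.DataFrame({"entry": ["P1"], "sequence": ["MKTAY"]})
df_annot = pd.DataFrame({
    "entry": ["P1", "P1", "P1"],
    "pos": [1, 3, 5],
    "aa": ["M", "S", ""],  # position 3 is 'T' in the sequence, so 'S' mismatches
})

kept = check_residue_identity(df_seq, df_annot, on_mismatch="drop")
print(kept)
```

With on_mismatch="drop", the row at position 3 is skipped and the other two rows are kept. The same input with on_mismatch="raise" raises a ValueError in this sketch.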

Returns:

  • dict_num (dict[str, np.ndarray]) – {entry: (L_entry, D) ndarray} where D == len(features) and L_entry == len(sequence). Stack with other per-residue tensors via aaanalysis.combine_dict_nums().

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean encode_ok column — False for entries that had at least one position skipped due to a residue-identity mismatch under on_mismatch='drop' / 'warn' (always True under 'raise', which aborts instead).
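
The documented shape contract of dict_num can be checked with a few lines of NumPy. The snippet below builds mock arrays instead of calling encode(), so the feature names and values are placeholders. The final concatenation along the feature axis is an assumption about what stacking per-residue tensors means, and it does not reproduce the signature or behavior of aaanalysis.combine_dict_nums().

```python
import numpy as np
import pandas as pd

features = ["feat_a", "feat_b"]  # hypothetical registry keys
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["MKTAY", "GSH"]})

# Mock dict_num: one (L_entry, D) array per entry with values in [0, 1)
rng = np.random.default_rng(0)
dict_num = {
    entry: rng.random((len(seq), len(features)))
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"])
}

# Check the documented contract: L_entry == len(sequence), D == len(features)
for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
    arr = dict_num[entry]
    assert arr.shape == (len(seq), len(features))
    assert arr.min() >= 0.0 and arr.max() <= 1.0

# A second mock per-residue tensor with D = 3, stacked along the feature axis
dict_other = {entry: np.zeros((arr.shape[0], 3)) for entry, arr in dict_num.items()}
stacked = {
    entry: np.concatenate([dict_num[entry], dict_other[entry]], axis=1)
    for entry in dict_num
}
print({entry: arr.shape for entry, arr in stacked.items()})
```

Stacking is only well defined when both dictionaries share the same entries and the same per-entry length, which is why each array's first dimension is tied to len(sequence).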

Raises:

ValueError – On invalid arguments, missing schema columns, or, under the default on_mismatch='raise', a residue-identity mismatch.