aaanalysis.StructurePreprocessor.encode_pae

StructurePreprocessor.encode_pae(df_seq=None, pae_folder=None, features=None, local_window=5, pae_band_edges=(5, 15), on_failure='nan', return_df=False, verbose=None)

Load AlphaFold predicted aligned error (PAE) sidecar JSON files and compute per-residue PAE features, returned as dict_pae.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. The PAE matrix of each entry must have shape (L, L) with L equal to len(sequence); rows with a mismatch are treated as failures.

  • pae_folder (str or pathlib.Path) – Directory containing one PAE JSON per row. The resolver tries, in order: <entry>.json, <entry>.json.gz, and the AF-DB canonical AF-<entry>-F1-predicted_aligned_error_v4.json (and its .gz variant).

  • features (list of str) – Feature keys belonging to encode_pae: any subset of {pae_row_mean, pae_row_min, pae_row_max, pae_local_mean, pae_distal_mean, pae_asymmetry, pae_band_means}. All outputs are normalized to [0, 1] (divisor 31.75 Å for most keys, 10 Å for pae_asymmetry).

  • local_window (int, default=5) – Used by pae_local_mean and pae_distal_mean only. Half-width k of the local window, in residue positions: the local mean averages over positions within ±k of the residue (self excluded), and the distal mean averages over the complement.

  • pae_band_edges (tuple of (int, int), default=(5, 15)) – Used by pae_band_means only. Sequence-distance bins: (0, edges[0]], (edges[0], edges[1]], (edges[1], L].

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose PAE load fails (missing sidecar, malformed JSON, shape mismatch).

  • return_df (bool, default=False) – If True, return a tuple (dict_pae, df_seq_out) that also includes the per-row status DataFrame. If False (default), return only dict_pae.

  • verbose (bool, optional) – Override instance verbosity for this call only.
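
The file lookup order described for pae_folder can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper names resolve_pae_path and load_json are hypothetical, and only the candidate file names and their order are taken from the parameter description above.

```python
import gzip
import json
from pathlib import Path


def resolve_pae_path(pae_folder, entry):
    """Return the first existing candidate file for `entry`, or None.

    Hypothetical helper: the candidate order mirrors the pae_folder
    description; everything else is an assumption.
    """
    folder = Path(pae_folder)
    af_name = f"AF-{entry}-F1-predicted_aligned_error_v4.json"
    candidates = [
        f"{entry}.json",
        f"{entry}.json.gz",
        af_name,
        f"{af_name}.gz",
    ]
    for name in candidates:
        path = folder / name
        if path.is_file():
            return path
    return None


def load_json(path):
    """Read a plain or gzip-compressed JSON file (hypothetical helper)."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```

In this sketch a None result from resolve_pae_path corresponds to the "missing sidecar" case, which the real method handles according to on_failure.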

Returns:

  • dict_pae (dict[str, np.ndarray]) – Mapping {entry: ndarray of shape (L_entry, D_total)} holding per-residue PAE features, concatenated column-wise in the order of features.

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean pae_ok column.

Raises:
  • ValueError – On invalid arguments or feature keys not in this method’s registry slice.

  • RuntimeError – If any entry failed under on_failure='raise'.