StructurePreprocessor.encode_pae

StructurePreprocessor.encode_pae(df_seq, pae_folder, features, local_window=5, pae_band_edges=(5, 15), on_failure='nan', return_df=False)

Load AlphaFold PAE sidecar JSONs and produce dict_pae.

The Predicted Aligned Error (PAE) is AlphaFold’s [Jumper21] per-residue-pair confidence map (the expected positional error, in Å). This method reads each entry’s L×L PAE matrix and collapses it into the [0, 1]-normalized per-residue summaries (row statistics, local vs distal means, asymmetry, band means) that CPP.run_num() consumes. It is the PAE-side companion of encode_dssp() and encode_pdb(); stack their outputs with aaanalysis.combine_dict_nums().
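As a rough illustration of what "collapsing" an L×L matrix into per-residue summaries means, the sketch below computes row mean, minimum, and maximum and scales them by the 31.75 Å divisor. It is not the library's implementation: the helper name row_summaries, the inclusion of the diagonal, and the clipping step are assumptions made for this example.

```python
import numpy as np

PAE_CAP = 31.75  # Å, divisor quoted for most feature keys


def row_summaries(pae):
    """Collapse an (L, L) PAE matrix into per-residue row mean/min/max in [0, 1].

    Assumption: the diagonal is kept and values are clipped to [0, 1];
    the library may treat either detail differently.
    """
    pae = np.asarray(pae, dtype=float)
    if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
        raise ValueError("PAE matrix must be square (L, L)")
    scaled = np.clip(pae / PAE_CAP, 0.0, 1.0)
    # One row per residue; columns are mean, min, max of that residue's PAE row
    return np.column_stack([scaled.mean(axis=1),
                            scaled.min(axis=1),
                            scaled.max(axis=1)])


# Toy 4-residue matrix in Å (made-up values, not real AlphaFold output)
toy = np.array([[0.0, 2.0, 8.0, 16.0],
                [2.0, 0.0, 4.0, 12.0],
                [8.0, 4.0, 0.0, 6.0],
                [16.0, 12.0, 6.0, 0.0]])
summ = row_summaries(toy)
print(summ.shape)  # (4, 3)
```

Each output row belongs to one residue, which is the (L, D) layout that the returned arrays follow.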

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Each PAE matrix has shape (L, L), and L must equal len(sequence); rows with a mismatch are treated as failures.
pae_folder (str or pathlib.Path) – Directory containing one PAE JSON per row. The resolver tries, in order: <entry>.json, <entry>.json.gz, and the AlphaFold Database (AF-DB) canonical AF-<entry>-F1-predicted_aligned_error_v4.json (and its .gz variant).
features (list of str) – Feature keys belonging to encode_pae: any subset of {pae_row_mean, pae_row_min, pae_row_max, pae_local_mean, pae_distal_mean, pae_asymmetry, pae_band_means}. All outputs are normalized to [0, 1] (divisor 31.75 Å for most keys, 10 Å for pae_asymmetry).
local_window (int, default=5) – Used by pae_local_mean and pae_distal_mean. Half-width k of the ±k residue window over which the local mean is taken (the residue itself is excluded); the distal mean is taken over the complement of that window.
pae_band_edges (tuple of (int, int), default=(5, 15)) – Used by pae_band_means only. Sequence-distance bins: (0, edges[0]], (edges[0], edges[1]], (edges[1], L].
on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose PAE load fails (missing sidecar, malformed JSON, shape mismatch). If instead a single PAE encoder raises for an otherwise-loaded entry, only that feature’s column(s) are NaN-filled (per-feature isolation) and one UserWarning names it — unless on_failure='raise'.
return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element, i.e. (dict_pae, df_seq_out). If False (default), return only dict_pae.
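The lookup order described for pae_folder can be sketched with the standard library alone. The helper names (candidate_names, resolve_pae_path, read_json) and the placeholder payload are hypothetical and only illustrate the documented order; the actual resolver may differ in details such as error handling.

```python
import gzip
import json
import tempfile
from pathlib import Path


def candidate_names(entry):
    """File names in the lookup order listed for pae_folder."""
    af_db = f"AF-{entry}-F1-predicted_aligned_error_v4.json"
    return [f"{entry}.json", f"{entry}.json.gz", af_db, af_db + ".gz"]


def resolve_pae_path(pae_folder, entry):
    """Return the first existing candidate, or None if no sidecar is found."""
    for name in candidate_names(entry):
        path = Path(pae_folder) / name
        if path.is_file():
            return path
    return None


def read_json(path):
    """Read plain or gzip-compressed JSON based on the file suffix."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Demo with placeholder files; the payload is a stand-in, not the PAE schema
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    payload = json.dumps({"placeholder": True})
    with gzip.open(folder / "P12345.json.gz", "wt", encoding="utf-8") as handle:
        handle.write(payload)
    (folder / "AF-P12345-F1-predicted_aligned_error_v4.json").write_text(payload)

    hit = resolve_pae_path(folder, "P12345")
    hit_name = hit.name
    hit_data = read_json(hit)
    miss = resolve_pae_path(folder, "Q99999")

print(hit_name)  # P12345.json.gz
print(miss)      # None
```

Because <entry>.json.gz precedes the AF-DB name in the list, the compressed short-name file wins when both are present.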

Returns:

dict_pae (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue PAE features concatenated in the order of features.
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean pae_ok column.
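Since the feature blocks are concatenated in the order of features, a column lookup can be built from that list. The sketch below is a hypothetical helper, not part of the library. It assumes every key contributes one column unless a width is given, and the width of 3 for pae_band_means is an assumption based on the three distance bins described for pae_band_edges.

```python
def column_slices(features, widths=None):
    """Map each feature key to its column slice in the (L, D_total) array.

    Assumption: a key contributes one column unless listed in `widths`.
    """
    widths = widths or {}
    slices, start = {}, 0
    for name in features:
        width = widths.get(name, 1)
        slices[name] = slice(start, start + width)
        start += width
    return slices, start


features = ["pae_row_mean", "pae_band_means", "pae_asymmetry"]
# Width 3 for pae_band_means is assumed (one column per distance bin)
slices, d_total = column_slices(features, widths={"pae_band_means": 3})
print(slices["pae_asymmetry"], d_total)
```

With such a mapping, arr[:, slices['pae_asymmetry']] would select the asymmetry column of a returned array; checking d_total against arr.shape[1] guards against a wrong width assumption.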

Raises:

ValueError – On invalid arguments or feature keys not in this method’s registry slice.
RuntimeError – If any entry failed under on_failure='raise'.
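The three on_failure policies can be pictured with a small stand-alone sketch. It is an illustration under stated assumptions, not the library's code: the helper apply_on_failure is hypothetical, 'nan' is assumed to keep a failed entry as an all-NaN block of shape (L, D), and 'drop' is assumed to omit the entry from the output.

```python
import math


def apply_on_failure(loaded, lengths, n_columns, on_failure="nan"):
    """Apply a failure policy to per-entry results.

    `loaded` maps entry -> feature rows, or None where loading failed.
    Returns (kept, status), where status plays the role of a pae_ok flag.
    """
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(f"invalid on_failure: {on_failure!r}")
    failed = [entry for entry, rows in loaded.items() if rows is None]
    if failed and on_failure == "raise":
        raise RuntimeError(f"PAE load failed for: {failed}")
    kept = {}
    for entry, rows in loaded.items():
        if rows is not None:
            kept[entry] = rows
        elif on_failure == "nan":
            # Assumed behaviour: keep the entry as an all-NaN (L, D) block
            kept[entry] = [[math.nan] * n_columns for _ in range(lengths[entry])]
        # on_failure == "drop": the failed entry is simply left out
    status = {entry: rows is not None for entry, rows in loaded.items()}
    return kept, status


loaded = {"A": [[0.1, 0.2], [0.3, 0.4]], "B": None}  # "B" failed to load
lengths = {"A": 2, "B": 3}
kept_nan, status = apply_on_failure(loaded, lengths, 2, on_failure="nan")
kept_drop, _ = apply_on_failure(loaded, lengths, 2, on_failure="drop")
print(sorted(kept_nan), sorted(kept_drop), status)
```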

Examples

encode_pae reads an AlphaFold PAE sidecar JSON and collapses the L×L predicted-aligned-error matrix into per-residue summaries (pae_row_mean/min/max; pae_local_mean and pae_distal_mean within and beyond ±local_window; pae_asymmetry; pae_band_means). All are normalized to [0, 1]: most keys by AlphaFold’s 31.75 Å saturation cap, pae_asymmetry by 10 Å.

import shutil
import tempfile
import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import aaanalysis as aa

aa.options['verbose'] = False
warnings.filterwarnings('ignore')

# Test fixtures bundled with the package
PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
strp = aa.StructurePreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

# Copy the fixture under the plain <entry>.json name, which the resolver tries first
pae_dir = tempfile.mkdtemp()
shutil.copy(PDB_FIXTURES / 'AF_TINY_pae.json',
            Path(pae_dir) / 'AF_TINY.json')

feats = ['pae_row_mean', 'pae_local_mean', 'pae_distal_mean',
         'pae_asymmetry']
dict_pae = strp.encode_pae(df_seq=df_seq, pae_folder=pae_dir,
                          features=feats, local_window=5)
arr = dict_pae['AF_TINY']
print('shape (L, D):', arr.shape)
print('mean local vs distal PAE:',
      round(float(np.nanmean(arr[:, 1])), 3), 'vs',
      round(float(np.nanmean(arr[:, 2])), 3))

shape (L, D): (30, 4)
mean local vs distal PAE: 0.087 vs 0.361
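The gap between the local and distal values printed above follows from how the two masks partition each PAE row. The sketch below reproduces that split on a synthetic matrix. It is not the library's implementation: the helper names are hypothetical, and reading row i as the profile of residue i is an assumption.

```python
import numpy as np

PAE_CAP = 31.75  # Å, divisor quoted for most feature keys


def masked_row_mean(values, mask):
    """Mean of values[i, mask[i]] per row; NaN where a row selects nothing."""
    counts = mask.sum(axis=1)
    sums = np.where(mask, values, 0.0).sum(axis=1)
    out = np.full(values.shape[0], np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out


def local_distal_means(pae, local_window=5):
    """Per-residue mean PAE inside and outside the ±local_window neighbourhood."""
    scaled = np.asarray(pae, dtype=float) / PAE_CAP
    idx = np.arange(scaled.shape[0])
    sep = np.abs(idx[:, None] - idx[None, :])   # sequence separation |i - j|
    local = (sep >= 1) & (sep <= local_window)  # self (sep == 0) excluded
    distal = sep > local_window                 # complement of the window
    return masked_row_mean(scaled, local), masked_row_mean(scaled, distal)


# Synthetic 12-residue matrix whose error grows with sequence separation
idx = np.arange(12)
toy = 2.0 * np.abs(idx[:, None] - idx[None, :])
local_mean, distal_mean = local_distal_means(toy, local_window=3)
print(round(float(local_mean[0]), 3), round(float(distal_mean[0]), 3))  # 0.126 0.472
```

When error grows with sequence separation, as in this toy matrix, the local mean stays below the distal mean for every residue.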

Further parameters. StructurePreprocessor.encode_pae also accepts pae_band_edges (used by pae_band_means only), on_failure (what to do for entries whose PAE load fails: missing sidecar, malformed JSON, or shape mismatch), and return_df (if True, also return the per-row status DataFrame as a second element, i.e. (dict_pae, df_seq_out)).

# Further parameters: ``pae_band_edges`` sets the distance bands used by the
# ``pae_band_means`` feature, ``on_failure`` governs entries whose PAE load
# fails, and ``return_df=True`` also returns a per-row status frame (``pae_ok``).
dict_pae_band, df_pae_status = strp.encode_pae(
    df_seq=df_seq, pae_folder=pae_dir,
    features=['pae_row_mean', 'pae_band_means'],
    pae_band_edges=(5, 15), on_failure='nan', return_df=True)
aa.display_df(df_pae_status, n_rows=10, show_shape=True)

DataFrame shape: (1, 3)

	entry	sequence	pae_ok
1	AF_TINY	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	True
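The sequence-distance bins behind pae_band_means, (0, edges[0]], (edges[0], edges[1]], and (edges[1], L], can be sketched as follows. This is an illustration, not the library's code: the helper names are hypothetical, and both the three-column output and the NaN for an empty band are assumptions.

```python
import numpy as np

PAE_CAP = 31.75  # Å, divisor quoted for most feature keys


def masked_row_mean(values, mask):
    """Mean of values[i, mask[i]] per row; NaN where a row selects nothing."""
    counts = mask.sum(axis=1)
    sums = np.where(mask, values, 0.0).sum(axis=1)
    out = np.full(values.shape[0], np.nan)
    np.divide(sums, counts, out=out, where=counts > 0)
    return out


def band_means(pae, edges=(5, 15)):
    """Per-residue mean PAE in three sequence-distance bands, scaled to [0, 1]."""
    scaled = np.asarray(pae, dtype=float) / PAE_CAP
    n = scaled.shape[0]
    idx = np.arange(n)
    sep = np.abs(idx[:, None] - idx[None, :])
    bounds = [(0, edges[0]), (edges[0], edges[1]), (edges[1], n)]
    # Half-open bins (lo, hi], matching the notation in the parameter list
    cols = [masked_row_mean(scaled, (sep > lo) & (sep <= hi)) for lo, hi in bounds]
    return np.column_stack(cols)


# Synthetic 20-residue matrix where PAE equals sequence separation (in Å)
idx = np.arange(20)
toy = np.abs(idx[:, None] - idx[None, :]).astype(float)
bands = band_means(toy, edges=(5, 15))
print(bands.shape)  # (20, 3)
```

In this toy case a residue in the middle of the chain has no partner more than 15 positions away, so its third band is empty and the sketch reports NaN there; how the library fills such cells is not stated in this page.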