StructurePreprocessor.encode_pae
- StructurePreprocessor.encode_pae(df_seq, pae_folder, features, local_window=5, pae_band_edges=(5, 15), on_failure='nan', return_df=False)
Load AlphaFold PAE sidecar JSONs and produce dict_pae.

The Predicted Aligned Error (PAE) is AlphaFold’s [Jumper21] per-residue-pair confidence map (the expected positional error, in Å). This method reads each entry’s L×L PAE matrix and collapses it into the [0, 1]-normalized per-residue summaries (row statistics, local vs distal means, asymmetry, band means) that CPP.run_num() consumes. It is the PAE-side companion of encode_dssp() and encode_pdb(); stack their outputs with aaanalysis.combine_dict_nums().

Added in version 1.1.0.
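As a rough illustration of the collapse step described above, the following NumPy sketch reduces a square PAE matrix to three per-residue row statistics and scales them by the 31.75 Å divisor. The helper name pae_row_summaries is hypothetical, and both the inclusion of the diagonal and the final clipping are assumptions; this is not the library's implementation.

```python
import numpy as np

PAE_CAP = 31.75  # Å; divisor quoted in this page for most feature keys


def pae_row_summaries(pae):
    """Collapse an (L, L) PAE matrix into (L, 3) normalized row statistics.

    Columns: row mean, row min, row max. Hypothetical helper; the diagonal
    is included in every statistic, which is an assumption.
    """
    pae = np.asarray(pae, dtype=float)
    if pae.ndim != 2 or pae.shape[0] != pae.shape[1]:
        raise ValueError(f"expected a square matrix, got shape {pae.shape}")
    stats = np.column_stack([pae.mean(axis=1), pae.min(axis=1), pae.max(axis=1)])
    # Scale to [0, 1]; clipping guards against values above the cap.
    return np.clip(stats / PAE_CAP, 0.0, 1.0)


# Toy 3-residue matrix (values in Å)
toy = np.array([[0.0, 4.0, 31.75],
                [5.0, 0.0, 10.0],
                [31.75, 9.0, 0.0]])
summaries = pae_row_summaries(toy)
print(summaries.shape)
```

Each row of the result describes one residue, so the array can sit next to other per-residue encodings of the same sequence.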
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Each PAE matrix must have shape (L, L) with L = len(sequence); rows with a mismatched shape are treated as failures.
pae_folder (str or pathlib.Path) – Directory containing one PAE JSON per row. The resolver tries, in order: <entry>.json, <entry>.json.gz, and the AlphaFold Database (AF-DB) canonical AF-<entry>-F1-predicted_aligned_error_v4.json (and its .gz variant).
features (list of str) – Feature keys belonging to encode_pae: any subset of {pae_row_mean, pae_row_min, pae_row_max, pae_local_mean, pae_distal_mean, pae_asymmetry, pae_band_means}. All outputs are normalized to [0, 1] (divisor 31.75 Å for most keys, 10 Å for pae_asymmetry).
local_window (int, default=5) – Used by pae_local_mean / pae_distal_mean. The ±k window in residue positions for the local mean (self excluded); the distal mean takes the complement.
pae_band_edges (tuple of (int, int), default=(5, 15)) – Used by pae_band_means only. Sequence-distance bins: (0, edges[0]], (edges[0], edges[1]], (edges[1], L].
on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose PAE load fails (missing sidecar, malformed JSON, shape mismatch).
return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element (dict_pae, df_seq_out). If False (default), return only dict_pae.
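The local_window and pae_band_edges parameters both partition each row of the PAE matrix by sequence distance |i - j|. The NumPy sketch below shows one way to compute those partitions under the stated bin definitions. The function names are hypothetical, empty partitions are reported as NaN, and the 31.75 Å divisor is applied to all columns; each of these choices is an assumption and may differ from the library.

```python
import numpy as np

PAE_CAP = 31.75  # Å; divisor quoted in this page for most feature keys


def _sequence_distance(n):
    """Return the (n, n) matrix of |i - j| sequence distances."""
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :])


def _masked_row_means(pae, masks):
    """Row-wise mean of pae over each boolean mask; NaN where a mask row is empty."""
    out = np.full((pae.shape[0], len(masks)), np.nan)
    for col, mask in enumerate(masks):
        counts = mask.sum(axis=1)
        sums = np.where(mask, pae, 0.0).sum(axis=1)
        valid = counts > 0
        out[valid, col] = sums[valid] / counts[valid]
    return out / PAE_CAP


def local_distal_means(pae, local_window=5):
    """Columns: mean within +/- local_window (self excluded), mean of the complement."""
    pae = np.asarray(pae, dtype=float)
    dist = _sequence_distance(pae.shape[0])
    local = (dist >= 1) & (dist <= local_window)
    distal = dist > local_window
    return _masked_row_means(pae, [local, distal])


def band_means(pae, band_edges=(5, 15)):
    """Columns: means over (0, e0], (e0, e1], (e1, L] sequence-distance bins."""
    pae = np.asarray(pae, dtype=float)
    lo, hi = band_edges
    dist = _sequence_distance(pae.shape[0])
    masks = [(dist > 0) & (dist <= lo), (dist > lo) & (dist <= hi), dist > hi]
    return _masked_row_means(pae, masks)


# Toy matrix whose PAE equals the sequence distance, so results are easy to check.
toy = _sequence_distance(30).astype(float)
ld = local_distal_means(toy, local_window=5)
bands = band_means(toy, band_edges=(5, 15))
print(ld.shape, bands.shape)
```

For sequences shorter than the window or the outer band edge, the distal or outer-band column has no residue pairs to average, which is why the sketch returns NaN there rather than zero.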
- Returns:
dict_pae (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue PAE features concatenated in the order of features.
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean pae_ok column.
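Because the columns of each array follow the order of features, it can help to keep a small map from feature key to column slice. The sketch below builds one. The widths table is an assumption: the example on this page is consistent with single-column keys, while three columns for pae_band_means (one per bin) is a guess that should be checked against the real output shape.

```python
# Assumed number of columns contributed by each feature key (hypothetical table).
FEATURE_WIDTHS = {
    "pae_row_mean": 1,
    "pae_row_min": 1,
    "pae_row_max": 1,
    "pae_local_mean": 1,
    "pae_distal_mean": 1,
    "pae_asymmetry": 1,
    "pae_band_means": 3,  # assumption: one column per sequence-distance bin
}


def column_slices(features):
    """Map each requested feature key to its column slice; also return D_total."""
    slices = {}
    start = 0
    for key in features:
        if key not in FEATURE_WIDTHS:
            raise ValueError(f"unknown PAE feature key: {key!r}")
        width = FEATURE_WIDTHS[key]
        slices[key] = slice(start, start + width)
        start += width
    return slices, start


feats = ["pae_row_mean", "pae_local_mean", "pae_distal_mean", "pae_asymmetry"]
slices, d_total = column_slices(feats)
print(d_total, slices["pae_local_mean"])
```

With such a map, an expression like arr[:, slices["pae_local_mean"]] replaces hard-coded column indices.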
- Raises:
ValueError – On invalid arguments or feature keys not in this method’s registry slice.
RuntimeError – If any entry failed under on_failure='raise'.
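The three on_failure policies can be pictured with a small stand-alone sketch. The helper apply_on_failure and its arguments are hypothetical. The behaviour shown for 'nan' (an all-NaN array of the expected shape) and for 'drop' (the entry is omitted) is an assumption about what those options mean, not a statement of what the library does internally.

```python
import numpy as np


def apply_on_failure(results, lengths, n_features, on_failure="nan"):
    """Resolve failed loads according to a policy.

    results: dict mapping entry -> array, or None when the load failed.
    lengths: dict mapping entry -> sequence length L.
    """
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(f"invalid on_failure: {on_failure!r}")
    failed = [entry for entry, arr in results.items() if arr is None]
    if failed and on_failure == "raise":
        raise RuntimeError(f"PAE load failed for: {failed}")
    out = {}
    for entry, arr in results.items():
        if arr is not None:
            out[entry] = arr
        elif on_failure == "nan":
            # Assumed behaviour: placeholder array filled with NaN.
            out[entry] = np.full((lengths[entry], n_features), np.nan)
        # on_failure == "drop": the failed entry is left out.
    return out


results = {"A": np.zeros((3, 2)), "B": None}  # "B" stands for a failed load
lengths = {"A": 3, "B": 5}
filled = apply_on_failure(results, lengths, n_features=2, on_failure="nan")
dropped = apply_on_failure(results, lengths, n_features=2, on_failure="drop")
print(sorted(filled), sorted(dropped))
```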
Examples
encode_pae reads an AlphaFold PAE sidecar JSON and collapses the L×L predicted-aligned-error matrix into per-residue summaries (pae_row_mean / pae_row_min / pae_row_max, pae_local_mean / pae_distal_mean within / beyond ±local_window, pae_asymmetry, pae_band_means), all normalized to [0, 1] (divisor 31.75 Å, AlphaFold’s saturation cap, for most keys; 10 Å for pae_asymmetry).

    import shutil
    import tempfile
    import warnings
    from pathlib import Path

    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    # Test fixtures shipped with the package
    PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
    stp = aa.StructurePreprocessor(verbose=False)
    df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                           'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

    # Copy the fixture under the <entry>.json name the resolver looks for
    pae_dir = tempfile.mkdtemp()
    shutil.copy(PDB_FIXTURES / 'AF_TINY_pae.json', Path(pae_dir) / 'AF_TINY.json')

    feats = ['pae_row_mean', 'pae_local_mean', 'pae_distal_mean', 'pae_asymmetry']
    dict_pae = stp.encode_pae(df_seq=df_seq, pae_folder=pae_dir, features=feats, local_window=5)
    arr = dict_pae['AF_TINY']
    print('shape (L, D):', arr.shape)
    # Columns 1 and 2 are pae_local_mean and pae_distal_mean (order of feats)
    print('mean local vs distal PAE:',
          round(float(np.nanmean(arr[:, 1])), 3), 'vs',
          round(float(np.nanmean(arr[:, 2])), 3))

    shape (L, D): (30, 4)
    mean local vs distal PAE: 0.087 vs 0.361
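The example copies the fixture to AF_TINY.json because the resolver looks files up by entry name. The sketch below imitates that lookup order and then reads a matrix from the file it finds. All three helper names are hypothetical. The JSON layout (a list holding one object with a predicted_aligned_error key) follows recent AF-DB downloads as an assumption; the library's own parser may accept other layouts.

```python
import gzip
import json
import tempfile
from pathlib import Path

import numpy as np


def candidate_names(entry):
    """File names in the lookup order described in the pae_folder parameter."""
    af_name = f"AF-{entry}-F1-predicted_aligned_error_v4.json"
    return [f"{entry}.json", f"{entry}.json.gz", af_name, af_name + ".gz"]


def find_pae_file(pae_folder, entry):
    """Return the first existing candidate path, or None if nothing matches."""
    for name in candidate_names(entry):
        path = Path(pae_folder) / name
        if path.is_file():
            return path
    return None


def load_pae_matrix(path):
    """Read a plain or gzipped PAE JSON into a float array (assumed layout)."""
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    record = data[0] if isinstance(data, list) else data
    return np.asarray(record["predicted_aligned_error"], dtype=float)


# Write a tiny gzipped sidecar for a made-up entry and resolve it again.
with tempfile.TemporaryDirectory() as tmp:
    payload = [{"predicted_aligned_error": [[0, 3], [4, 0]],
                "max_predicted_aligned_error": 31.75}]
    with gzip.open(Path(tmp) / "P12345.json.gz", "wt", encoding="utf-8") as handle:
        json.dump(payload, handle)
    found = find_pae_file(tmp, "P12345")
    found_name = found.name
    matrix = load_pae_matrix(found)
    missing = find_pae_file(tmp, "Q99999")
print(found_name, matrix.shape, missing)
```

A None result from the lookup corresponds to the "missing sidecar" failure case that on_failure handles.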
Further parameters.
StructurePreprocessor.encode_pae also accepts: pae_band_edges (used by pae_band_means only); on_failure (what to do for entries whose PAE load fails: missing sidecar, malformed JSON, shape mismatch); return_df (if True, also return the per-row status DataFrame as a second element, (dict_pae, df_seq_out)).