StructurePreprocessor.encode_pdb

StructurePreprocessor.encode_pdb(df_seq, pdb_folder, features, plddt_disorder_threshold=70.0, on_failure='nan', return_df=False)

Extract per-residue features from PDB ATOM records into dict_pdb.

Reads geometric and confidence features directly from a structure’s ATOM records and encodes the chosen features into the [0, 1]-normalized per-residue dict_num that CPP.run_num() consumes. Supported features cover experimental bfactor and residue depth (the latter via the msms surface program [Sanner96]) as well as AlphaFold [Jumper21] model-file features (plddt confidence, sidechain chi*_sincos, contacts, …). It is the ATOM-side companion of encode_dssp() and encode_pae(); stack their outputs with aaanalysis.combine_dict_nums().
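To illustrate what stacking per-residue outputs means, the sketch below concatenates two dict_num-style dictionaries along the feature axis with plain NumPy. It is not the implementation of aaanalysis.combine_dict_nums(), whose signature is not shown on this page; the helper name stack_dict_nums, the entry key P1, and the toy arrays are hypothetical.

```python
import numpy as np

def stack_dict_nums(*dict_nums):
    """Concatenate per-residue arrays entry by entry along the feature axis.

    Hypothetical helper for illustration only; entries missing from any
    input are skipped.
    """
    shared = set(dict_nums[0]).intersection(*dict_nums[1:])
    return {entry: np.concatenate([d[entry] for d in dict_nums], axis=1)
            for entry in dict_nums[0] if entry in shared}

# Toy stand-ins for encoder outputs: same residue count, different feature widths
dict_pdb = {"P1": np.random.default_rng(0).random((30, 8))}
dict_dssp = {"P1": np.random.default_rng(1).random((30, 3))}

stacked = stack_dict_nums(dict_pdb, dict_dssp)
print(stacked["P1"].shape)  # (30, 11)
```

The residue axis (rows) must have the same length in every input for an entry, which is why the encoders share df_seq as their reference.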

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. entry is the PDB-file basename; sequence is the target sequence used for chain selection and alignment.

  • pdb_folder (str or pathlib.Path) – Directory containing one <entry>.pdb file per row of df_seq.

  • features (list of str) – Feature keys from the StructurePreprocessor registry that belong to encode_pdb: any subset of {bfactor, depth, plddt, plddt_disorder, plddt_tier, chi1_sincos, chi2_sincos, ca_centroid_dist, ca_centroid_dist_norm, contact_count_8A, contact_count_12A, hse, disulfide}. The depth feature requires the external msms binary on PATH; absence raises RuntimeError with an install hint.

  • plddt_disorder_threshold (float, default=70.0) – Cutoff on the AlphaFold pLDDT (predicted Local Distance Difference Test) score, in [0, 100], used by the plddt_disorder feature: a residue whose pLDDT is below this value is flagged disordered (1.0), otherwise ordered (0.0).

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – Failure policy for entries whose PDB load fails (missing file, unparseable structure, no matched chain). 'nan' fills with NaN-only tensors; 'drop' removes those entries; 'raise' re-raises.

  • return_df (bool, default=False) – If True, return a tuple (dict_pdb, df_seq_out) whose second element is the per-row status DataFrame. If False (default), return only dict_pdb.
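The plddt_disorder rule described for plddt_disorder_threshold can be sketched in plain NumPy. This is a minimal illustration of the documented comparison (strictly below the threshold means disordered), not the library's code; the function name plddt_disorder_flags and the choice to propagate NaN for residues without a pLDDT value are assumptions.

```python
import numpy as np

def plddt_disorder_flags(plddt, threshold=70.0):
    """Return 1.0 where pLDDT < threshold (disordered), else 0.0 (ordered)."""
    plddt = np.asarray(plddt, dtype=float)
    flags = (plddt < threshold).astype(float)
    # Assumption: residues without a pLDDT value stay NaN instead of 0.0
    flags[np.isnan(plddt)] = np.nan
    return flags

flags = plddt_disorder_flags([92.1, 70.0, 69.9, 35.4])
print(flags)  # [0. 0. 1. 1.]
```

A residue sitting exactly at the threshold (70.0 here) counts as ordered, because the documented rule uses "below".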

Returns:

  • dict_pdb (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue PDB features concatenated in the order of features.

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean pdb_ok column.

Raises:
  • ValueError – On invalid arguments.

  • RuntimeError – If msms is not installed and 'depth' is requested, or if any entry failed under on_failure='raise'.
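The three on_failure policies can be pictured with the following sketch. It assumes that a failed load is represented as None and that the NaN fill has shape (sequence length, number of feature dimensions); both are assumptions made for illustration, and apply_failure_policy is a hypothetical helper, not part of aaanalysis.

```python
import numpy as np

def apply_failure_policy(loaded, seq_lengths, n_dims, on_failure="nan"):
    """Apply a 'nan' / 'drop' / 'raise' policy to entries that failed to load.

    loaded:      {entry: ndarray, or None if the PDB load failed}
    seq_lengths: {entry: length of the target sequence}
    """
    if on_failure not in ("nan", "drop", "raise"):
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    failed = [entry for entry, arr in loaded.items() if arr is None]
    if failed and on_failure == "raise":
        raise RuntimeError(f"PDB load failed for entries: {failed}")
    out = {}
    for entry, arr in loaded.items():
        if arr is not None:
            out[entry] = arr
        elif on_failure == "nan":
            # Placeholder keeps the entry aligned with its sequence
            out[entry] = np.full((seq_lengths[entry], n_dims), np.nan)
        # on_failure == "drop": the failed entry is simply left out
    return out

loaded = {"P1": np.zeros((4, 2)), "P2": None}
seq_lengths = {"P1": 4, "P2": 6}
filled = apply_failure_policy(loaded, seq_lengths, n_dims=2, on_failure="nan")
kept = apply_failure_policy(loaded, seq_lengths, n_dims=2, on_failure="drop")
print(filled["P2"].shape, sorted(kept))  # (6, 2) ['P1']
```

'nan' keeps the output aligned with df_seq at the cost of NaN handling downstream, 'drop' yields a clean but shorter dictionary, and 'raise' is the strict choice when every entry is expected to load.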

Examples

encode_pdb extracts per-residue features straight from PDB / CIF ATOM records — AlphaFold model-file features (plddt, plddt_disorder, plddt_tier, sidechain chi*_sincos, ca_centroid_dist*, contact_count_*, hse) and experimental bfactor / depth — into a [0, 1]-normalized dict_num. Here we use the bundled AF_TINY fixture (depth is omitted as it needs the external msms binary).

import warnings
from pathlib import Path
import numpy as np
import pandas as pd
import aaanalysis as aa
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
stp = aa.StructurePreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

feats = ['plddt', 'plddt_disorder', 'plddt_tier',
         'contact_count_8A', 'bfactor']
dict_pdb = stp.encode_pdb(df_seq=df_seq, pdb_folder=str(PDB_FIXTURES),
                          features=feats)
arr = dict_pdb['AF_TINY']
print('shape (L, D):', arr.shape)
print('value range:', round(float(np.nanmin(arr)), 3),
      '..', round(float(np.nanmax(arr)), 3))
shape (L, D): (30, 8)
value range: 0.0 .. 1.0

Each row is a residue and each column a feature dimension in [0, 1]. Pass return_df=True to get the (dict_pdb, df_seq_out) tuple, whose boolean pdb_ok column indicates, per entry, whether the structure file loaded.
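As a sketch of how the status DataFrame might be used downstream, the snippet below filters a feature dictionary by the pdb_ok column. The DataFrame and arrays are hand-built stand-ins, not real encode_pdb output; the entry names and the convention that pdb_ok is False for a failed load are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two returned objects
df_seq_out = pd.DataFrame({"entry": ["A", "B"],
                           "sequence": ["ACDE", "FGHIK"],
                           "pdb_ok": [True, False]})
dict_pdb = {"A": np.full((4, 8), 0.5),
            "B": np.full((5, 8), np.nan)}  # NaN-only placeholder for the failed entry

# Split entries by load status using a boolean mask
ok_entries = df_seq_out.loc[df_seq_out["pdb_ok"], "entry"].tolist()
failed = df_seq_out.loc[~df_seq_out["pdb_ok"], "entry"].tolist()

# Keep only arrays that come from a successfully loaded structure
dict_ok = {entry: dict_pdb[entry] for entry in ok_entries}
print(ok_entries, failed)  # ['A'] ['B']
```

Filtering on pdb_ok avoids having to infer failures from NaN-only arrays after the fact.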