aaanalysis.StructurePreprocessor.encode_pdb

StructurePreprocessor.encode_pdb(df_seq=None, pdb_folder=None, features=None, plddt_disorder_threshold=70.0, on_failure='nan', return_df=False, verbose=None)

Extract per-residue features from PDB ATOM records into dict_pdb.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. entry is the PDB-file basename; sequence is the target sequence used for chain selection and alignment.

  • pdb_folder (str or pathlib.Path) – Directory containing one <entry>.pdb file per row of df_seq.

  • features (list of str) – Feature keys from the StructurePreprocessor registry that belong to encode_pdb: any subset of {bfactor, depth}. The depth feature requires the external msms binary on PATH; if the binary is missing, a RuntimeError with an install hint is raised.

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – Failure policy for entries whose PDB load fails (missing file, unparseable structure, no matched chain). 'nan' fills those entries with NaN-only tensors; 'drop' removes them; 'raise' raises a RuntimeError.

  • return_df (bool, default=False) – If True, return a tuple (dict_pdb, df_seq_out) whose second element is the per-row status DataFrame. If False (default), return only dict_pdb.

  • verbose (bool, optional) – Override instance verbosity for this call only.

  • plddt_disorder_threshold (float, default=70.0)
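
The file-layout and msms requirements above can be checked before calling encode_pdb. The following is a minimal sketch using only the Python standard library; preflight_check is a hypothetical helper and not part of aaanalysis, and the entry name and folder in the usage lines are placeholders.

```python
import shutil
from pathlib import Path


def preflight_check(entries, pdb_folder, features):
    """Hypothetical helper (not part of aaanalysis): report likely problems up front."""
    folder = Path(pdb_folder)
    # Entries for which no <entry>.pdb file exists in the folder.
    missing = [e for e in entries if not (folder / f"{e}.pdb").is_file()]
    # The 'depth' feature needs the external msms binary to be discoverable on PATH.
    msms_needed = "depth" in features
    msms_found = shutil.which("msms") is not None
    return {"missing_pdb": missing, "msms_ok": (not msms_needed) or msms_found}


# Placeholder inputs; in practice, entries would come from df_seq["entry"].
report = preflight_check(["P12345"], "pdb_files", ["bfactor"])
print(report)
```

Entries listed under missing_pdb are the ones that would fall under the on_failure policy because of a missing file.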

Returns:

  • dict_pdb (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue PDB features concatenated in the order of features.

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean pdb_ok column.

Raises:
  • ValueError – On invalid arguments.

  • RuntimeError – If msms is not installed and 'depth' is requested, or if any entry failed under on_failure='raise'.
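
The three on_failure policies can be illustrated with a small pure-Python sketch. It mirrors the documented behavior only and is not the library's implementation: apply_failure_policy is a hypothetical function, nested lists stand in for ndarrays, and sizing the NaN placeholder by sequence length is an assumption.

```python
import math


def apply_failure_policy(loaded, seq_lengths, n_features, on_failure="nan"):
    """Hypothetical illustration of the documented on_failure semantics.

    loaded: {entry: list of per-residue rows, or None if the PDB load failed}
    seq_lengths: {entry: sequence length}, used to size NaN placeholders (assumption)
    """
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    failed = [entry for entry, rows in loaded.items() if rows is None]
    if failed and on_failure == "raise":
        raise RuntimeError(f"PDB load failed for: {failed}")
    out = {}
    for entry, rows in loaded.items():
        if rows is not None:
            out[entry] = rows
        elif on_failure == "nan":
            # NaN-only placeholder with one row per residue.
            out[entry] = [[math.nan] * n_features for _ in range(seq_lengths[entry])]
        # on_failure == "drop": the failed entry is simply left out.
    # Per-entry status, analogous to the pdb_ok column of df_seq_out.
    pdb_ok = {entry: rows is not None for entry, rows in loaded.items()}
    return out, pdb_ok


loaded = {"A": [[1.0], [2.0]], "B": None}  # entry "B" failed to load
features_by_entry, status = apply_failure_policy(loaded, {"A": 2, "B": 3}, 1, "drop")
print(sorted(features_by_entry), status)
```

With 'nan', downstream code keeps one array per input entry and has to handle NaN values; with 'drop', the output has fewer keys than df_seq has rows, so the status column is the way to tell which entries were removed.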