StructurePreprocessor.encode_pdb
- StructurePreprocessor.encode_pdb(df_seq, pdb_folder, features, plddt_disorder_threshold=70.0, on_failure='nan', return_df=False)
Extract per-residue features from PDB ATOM records into dict_pdb.
Reads geometric and confidence features straight from a structure’s ATOM records: experimental bfactor / residue depth (the latter via the msms surface program [Sanner96]) and AlphaFold [Jumper21] model-file features (plddt confidence, sidechain chi*_sincos, contacts, …). It encodes the chosen features into the [0, 1]-normalized per-residue dict_num that CPP.run_num() consumes. It is the ATOM-side companion of encode_dssp() and encode_pae(); stack their outputs with aaanalysis.combine_dict_nums().
Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. entry is the PDB-file basename; sequence is the target sequence used for chain selection and alignment.
pdb_folder (str or pathlib.Path) – Directory containing one <entry>.pdb file per row of df_seq.
features (list of str) – Feature keys from the StructurePreprocessor registry that belong to encode_pdb: any subset of {bfactor, depth, plddt, plddt_disorder, plddt_tier, chi1_sincos, chi2_sincos, ca_centroid_dist, ca_centroid_dist_norm, contact_count_8A, contact_count_12A, hse, disulfide}. The depth feature requires the external msms binary on PATH; its absence raises RuntimeError with an install hint.
plddt_disorder_threshold (float, default=70.0) – Predicted Local Distance Difference Test (pLDDT) cutoff (in [0, 100]) for the plddt_disorder feature: a residue whose AlphaFold pLDDT is below this value is flagged disordered (1.0), else ordered (0.0).
on_failure ({'nan', 'drop', 'raise'}, default='nan') – Failure policy for entries whose PDB load fails (missing file, unparseable structure, no matched chain). 'nan' fills with NaN-only tensors; 'drop' removes those entries; 'raise' re-raises.
return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element, (dict_num, df_seq_out). If False (default), return only dict_num.
- Returns:
dict_pdb (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue PDB features concatenated in the order of features.
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean pdb_ok column.
- Raises:
ValueError – On invalid arguments.
RuntimeError – If msms is not installed and 'depth' is requested, or if any entry failed under on_failure='raise'.
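The plddt_disorder_threshold rule described under Parameters can be sketched in a few lines of NumPy. This only illustrates the documented rule (a pLDDT strictly below the cutoff gives 1.0, otherwise 0.0) and is not the library's implementation. The helper name flag_disorder is hypothetical, and how encode_pdb treats missing pLDDT values is not covered here.

```python
import numpy as np


def flag_disorder(plddt, threshold=70.0):
    """Hypothetical helper (not part of aaanalysis).

    Returns 1.0 where pLDDT is strictly below `threshold`, else 0.0.
    """
    plddt = np.asarray(plddt, dtype=float)
    return (plddt < threshold).astype(float)


# Four made-up pLDDT values; 70.0 sits exactly on the default cutoff
flags = flag_disorder([95.2, 70.0, 69.9, 31.4])
print(flags)  # [0. 0. 1. 1.]

# A stricter cutoff flags fewer residues
strict_flags = flag_disorder([95.2, 70.0, 69.9, 31.4], threshold=50.0)
print(strict_flags)  # [0. 0. 0. 1.]
```

A residue sitting exactly on the cutoff counts as ordered under this reading of "below this value".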
Examples
encode_pdb extracts per-residue features straight from PDB / CIF ATOM records into a [0, 1]-normalized dict_num: AlphaFold model-file features (plddt, plddt_disorder, plddt_tier, sidechain chi*_sincos, ca_centroid_dist*, contact_count_*, hse) and experimental bfactor / depth. Here we use the bundled AF_TINY fixture (depth is omitted as it needs the external msms binary).

    import warnings
    from pathlib import Path
    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
    stp = aa.StructurePreprocessor(verbose=False)
    df_seq = pd.DataFrame({'entry': ['AF_TINY'], 'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
    feats = ['plddt', 'plddt_disorder', 'plddt_tier', 'contact_count_8A', 'bfactor']
    dict_pdb = stp.encode_pdb(df_seq=df_seq, pdb_folder=str(PDB_FIXTURES), features=feats)
    arr = dict_pdb['AF_TINY']
    print('shape (L, D):', arr.shape)
    print('value range:', round(float(np.nanmin(arr)), 3), '..', round(float(np.nanmax(arr)), 3))

    shape (L, D): (30, 8)
    value range: 0.0 .. 1.0
Each row is a residue, each column a feature dimension in [0, 1]. Pass return_df=True for the (dict_num, df_seq_out) form, whose pdb_ok column flags entries whose structure file failed to load.
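To make the interplay between on_failure and the pdb_ok column concrete, here is a minimal sketch of the three documented policies applied to hand-built data. It does not call aaanalysis. The function apply_failure_policy and the loaded mapping are hypothetical, and three details are assumptions rather than documented behavior: the NaN placeholder takes the shape (sequence length, D_total), pdb_ok is True for entries that loaded, and 'drop' also removes the failed rows from the returned DataFrame.

```python
import numpy as np
import pandas as pd


def apply_failure_policy(df_seq, loaded, n_dims, on_failure="nan"):
    """Hypothetical illustration of the 'nan' / 'drop' / 'raise' policies.

    `loaded` maps entry -> feature array, or None when the PDB load failed.
    """
    if on_failure not in ("nan", "drop", "raise"):
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    dict_num, ok = {}, []
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        arr = loaded.get(entry)
        ok.append(arr is not None)
        if arr is not None:
            dict_num[entry] = arr
        elif on_failure == "raise":
            raise RuntimeError(f"PDB load failed for entry {entry!r}")
        elif on_failure == "nan":
            # Assumption: the placeholder has shape (len(sequence), D_total)
            dict_num[entry] = np.full((len(seq), n_dims), np.nan)
        # on_failure == 'drop': the failed entry is left out of dict_num
    df_out = df_seq.assign(pdb_ok=ok)
    if on_failure == "drop":
        # Assumption: dropped entries also disappear from the status table
        df_out = df_out[df_out["pdb_ok"]].reset_index(drop=True)
    return dict_num, df_out


df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACD", "EFGH"]})
loaded = {"P1": np.zeros((3, 2)), "P2": None}  # P2 stands for a failed load

dict_nan, df_nan = apply_failure_policy(df_seq, loaded, n_dims=2, on_failure="nan")
dict_drop, df_drop = apply_failure_policy(df_seq, loaded, n_dims=2, on_failure="drop")

print(dict_nan["P2"].shape, df_nan["pdb_ok"].tolist())  # (4, 2) [True, False]
print(list(dict_drop), df_drop["entry"].tolist())       # ['P1'] ['P1']
```

With the default 'nan' policy every entry keeps a slot in the dictionary, so downstream code can rely on a one-to-one match with df_seq and use pdb_ok to decide which rows to trust.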