StructurePreprocessor.encode_dssp

StructurePreprocessor.encode_dssp(df_seq, pdb_folder=None, *, features, ss_mode='ss3', gap_handling='pad', on_failure='nan', return_df=False)

Run Define Secondary Structure of Proteins (DSSP) and the per-feature encoders to build a [0, 1]-normalized dict_dssp.

DSSP [Kabsch83] assigns per-residue secondary structure (SS), solvent accessibility, and backbone hydrogen-bond geometry from a 3D structure. This method either runs DSSP via the mkdssp binary [Touw15] or reuses the columns already produced by get_dssp(), then encodes the chosen features into the [0, 1]-normalized per-residue dict_num that CPP.run_num() consumes. It is the DSSP-side companion of encode_pdb() (ATOM-record features) and encode_pae() (AlphaFold Predicted Aligned Error, PAE); stack their outputs with aaanalysis.combine_dict_nums().

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Pre-computed DSSP columns (from get_dssp()) are reused if present; otherwise DSSP is run inline.

  • pdb_folder (str or pathlib.Path) – Directory containing one <entry>.pdb file per row. Required when df_seq does not already carry the necessary DSSP columns.

  • features (list of str) – Feature keys from the StructurePreprocessor registry that belong to encode_dssp: any subset of {'ss3', 'ss8', 'rasa', 'phi_psi_sincos', 'hbond_donor', 'hbond_acceptor'}. Each key’s output is normalized to [0, 1] per the registry’s NORMALIZATION_RECIPES (de-normalization table in the class Notes).

  • ss_mode ({'ss3', 'ss8'}, default='ss3') – Forwarded to get_dssp() when DSSP is run inline. The chosen SS feature key ('ss3' / 'ss8') drives the actual one-hot dimensionality independently of this option.

  • gap_handling ({'pad', 'omit'}, default='pad') – Forwarded to get_dssp() when DSSP is run inline.

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose DSSP run failed. 'nan' fills with NaN-only tensors; 'drop' removes failed entries from the output dict; 'raise' raises RuntimeError if any entry failed.

  • return_df (bool, default=False) – If True, return a tuple (dict_dssp, df_seq_out) whose second element is the per-row status DataFrame. If False (default), return only dict_dssp.
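The three on_failure options can be pictured with a small standalone sketch. The helper apply_on_failure and its arguments are hypothetical and not part of aaanalysis; the sketch only mirrors the behavior described above, assuming that failed entries are known by name and that the NaN tensor for a failed entry has shape (sequence length, number of feature dimensions).

```python
import numpy as np

def apply_on_failure(dict_num, failed, seq_lengths, n_dims, on_failure="nan"):
    """Hypothetical helper (not the library's code) mirroring the documented options."""
    if on_failure not in ("nan", "drop", "raise"):
        raise ValueError(f"Invalid on_failure: {on_failure!r}")
    if on_failure == "raise" and failed:
        raise RuntimeError(f"DSSP failed for entries: {sorted(failed)}")
    out = dict(dict_num)  # shallow copy so the input dict is left untouched
    for entry in failed:
        if on_failure == "nan":
            # Assumed shape for the NaN-only tensor: (L_entry, D_total)
            out[entry] = np.full((seq_lengths[entry], n_dims), np.nan)
        else:  # 'drop': the failed entry is absent from the output
            out.pop(entry, None)
    return out

ok = {"P1": np.zeros((5, 4))}          # one entry that encoded successfully
lengths = {"P1": 5, "P2": 7}           # sequence lengths per entry
filled = apply_on_failure(ok, {"P2"}, lengths, n_dims=4, on_failure="nan")
dropped = apply_on_failure(ok, {"P2"}, lengths, n_dims=4, on_failure="drop")
```

With 'nan', every entry of df_seq keeps a key in the output, which can simplify downstream alignment; with 'drop', downstream code needs to tolerate missing keys.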

Returns:

  • dict_dssp (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue DSSP features concatenated in the order of features. Values are in [0, 1] (NaN for unresolved positions).

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of the (possibly DSSP-augmented) df_seq plus an encode_dssp_ok column flagging per-row success. Rows are dropped when on_failure='drop'.
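To make the (L_entry, D_total) layout concrete, the following NumPy sketch concatenates per-feature blocks column-wise in the order of the requested features. It is an illustration under stated assumptions, not the library's encoder: the class order ('H', 'E', 'C'), the block widths (3 for ss3, 1 for rasa, 4 for phi_psi_sincos), and the (x + 1) / 2 rescaling of sine and cosine are all assumed here; the actual recipes are defined by the registry's NORMALIZATION_RECIPES.

```python
import numpy as np

SS3_CLASSES = ("H", "E", "C")  # assumed class order, for illustration only

def one_hot_ss3(ss):
    """One-hot encode a 3-state secondary-structure string into (L, 3)."""
    arr = np.zeros((len(ss), len(SS3_CLASSES)))
    for i, state in enumerate(ss):
        arr[i, SS3_CLASSES.index(state)] = 1.0
    return arr

def sincos_01(angles_deg):
    """Map angles (degrees) to (sin, cos) rescaled from [-1, 1] to [0, 1] (assumed recipe)."""
    rad = np.deg2rad(np.asarray(angles_deg, dtype=float))
    sc = np.column_stack([np.sin(rad), np.cos(rad)])
    return (sc + 1.0) / 2.0  # NaN angles stay NaN

# Toy per-residue values for a 4-residue entry; NaN marks unresolved positions
ss = "HHEC"
rasa = np.array([0.10, 0.35, np.nan, 0.80])
phi = [-60.0, -65.0, -120.0, np.nan]
psi = [-45.0, -40.0, 130.0, np.nan]

# Concatenate blocks in the order ['ss3', 'rasa', 'phi_psi_sincos']
blocks = [one_hot_ss3(ss), rasa[:, None], sincos_01(phi), sincos_01(psi)]
X = np.concatenate(blocks, axis=1)  # shape (L, D_total)
```

In this toy layout, columns 0 to 2 hold the SS one-hot block, column 3 holds rasa, and columns 4 to 7 hold the rescaled dihedral sine and cosine values.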

Raises:
  • ValueError – On invalid arguments or feature keys not in this method’s registry slice.

  • RuntimeError – If mkdssp is unavailable, or if any entry failed under on_failure='raise'.

Examples

encode_dssp runs DSSP on each entry’s structure file and encodes the per-residue secondary structure / solvent accessibility / backbone dihedrals into a [0, 1]-normalized dict_num ({entry: (L, D)}) for CPP.run_num. features is any subset of {ss3, ss8, rasa, phi_psi_sincos, hbond_donor, hbond_acceptor}.

This method requires aaanalysis[pro] plus a mkdssp binary on PATH, so the call below is commented out and shown for illustration only (see get_dssp for the executed DSSP example).

import aaanalysis as aa

# stp = aa.StructurePreprocessor()
# dict_dssp = stp.encode_dssp(df_seq=df_seq, pdb_folder='pdb_files/',
#                            features=['ss3', 'rasa', 'phi_psi_sincos'])
# # {entry: (L, D)} with values in [0, 1]; stack with other sources
# # via aa.combine_dict_nums and feed NumericalFeature.get_parts.
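Because the call above cannot be executed without mkdssp, a quick sanity check on the returned dictionary may be useful before passing it on. The sketch below runs on a mock dictionary with the documented {entry: (L, D)} layout; summarize_dict_num is a hypothetical helper, not an aaanalysis function.

```python
import numpy as np

def summarize_dict_num(dict_num):
    """Hypothetical check: validate the [0, 1] range and report NaN coverage per entry."""
    summary = {}
    for entry, arr in dict_num.items():
        arr = np.asarray(arr, dtype=float)
        if arr.ndim != 2:
            raise ValueError(f"{entry}: expected a 2D (L, D) array, got ndim={arr.ndim}")
        finite = arr[~np.isnan(arr)]  # ignore NaN (unresolved positions)
        if finite.size and (finite.min() < 0.0 or finite.max() > 1.0):
            raise ValueError(f"{entry}: values outside [0, 1]")
        # (n_residues, n_dims, fraction of NaN values)
        summary[entry] = (arr.shape[0], arr.shape[1], float(np.isnan(arr).mean()))
    return summary

# Mock data standing in for a real encode_dssp result
mock = {
    "P1": np.array([[0.0, 1.0], [0.5, np.nan]]),
    "P2": np.full((3, 2), np.nan),  # e.g. a failed entry under on_failure='nan'
}
summary = summarize_dict_num(mock)
```

An entry with a NaN fraction of 1.0 corresponds to a fully unresolved or failed structure and may be worth excluding before feature engineering.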

Further parameters. StructurePreprocessor.encode_dssp also accepts:
  • ss_mode – Forwarded to get_dssp() when DSSP is run inline.

  • gap_handling – Forwarded to get_dssp() when DSSP is run inline.

  • on_failure – What to do for entries whose DSSP run failed.

  • return_df – If True, return a tuple (dict_dssp, df_seq_out) whose second element is the per-row status DataFrame.