StructurePreprocessor.get_dssp

StructurePreprocessor.get_dssp(df_seq, pdb_folder, features=None, ss_mode='ss3', gap_handling='pad')

Run Define Secondary Structure of Proteins (DSSP) and append per-residue list columns to df_seq.

Runs the DSSP algorithm [Kabsch83] (via the mkdssp binary [Touw15]) on each entry’s PDB file and aligns the output to the target sequence in df_seq, appending per-residue list columns (secondary structure, Accessible Surface Area (ASA), backbone dihedrals, hydrogen bonds) plus a boolean dssp_ok flag. The result is the intermediate input that encode_dssp() consumes; call this method first to inspect the raw DSSP streams before encoding.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. entry is used as the PDB-file basename (<entry>.pdb) and sequence is the target sequence to which DSSP output is aligned.
pdb_folder (str or pathlib.Path) – Directory containing one <entry>.pdb file per row of df_seq. Missing files emit a UserWarning and produce dssp_ok=False for that row.
features (list of str, default=['ss', 'asa', 'phi_psi', 'hbonds']) – Which DSSP feature streams to extract. Any subset of {'ss', 'asa', 'phi_psi', 'hbonds'}. Only the requested columns are appended; dssp_ok is always appended.
ss_mode ({'ss3', 'ss8'}, default='ss3') – Secondary-structure encoding for the ss column.
gap_handling ({'pad', 'omit'}, default='pad') – How to handle sequence positions without DSSP coverage. 'pad' keeps every list the same length as df_seq['sequence'] and fills uncovered positions with ut.STR_SS_GAP (for ss) or NaN (for numeric streams); 'omit' drops those positions from all requested streams at once, so the lists can be shorter than the sequence.
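
The effect of ss_mode and gap_handling can be sketched on toy data. This is a minimal illustration, not the library's implementation: align_streams, SS8_TO_SS3, and the GAP symbol are hypothetical names, the "-" gap value only stands in for ut.STR_SS_GAP, and the 8-to-3 reduction shown (H, G, I to helix; E, B to strand; everything else to coil) is a common convention that get_dssp may or may not follow exactly.

```python
import math

# Common 8-to-3 state reduction (assumption: the library may map differently).
SS8_TO_SS3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

GAP = "-"  # hypothetical placeholder standing in for ut.STR_SS_GAP


def align_streams(seq_len, observed, gap_handling="pad", ss_mode="ss3"):
    """Build ss and asa lists for a sequence of length seq_len.

    observed maps a 0-based sequence position to (ss8_code, asa) for every
    residue that DSSP covered; all other positions count as gaps.
    """
    if gap_handling not in ("pad", "omit"):
        raise ValueError("gap_handling must be 'pad' or 'omit'")
    if ss_mode not in ("ss3", "ss8"):
        raise ValueError("ss_mode must be 'ss3' or 'ss8'")
    ss, asa = [], []
    for pos in range(seq_len):
        if pos in observed:
            code, area = observed[pos]
            # Any code without an explicit mapping is treated as coil ("C").
            ss.append(SS8_TO_SS3.get(code, "C") if ss_mode == "ss3" else code)
            asa.append(area)
        elif gap_handling == "pad":
            # Keep length-alignment with the full sequence.
            ss.append(GAP)
            asa.append(math.nan)
        # 'omit': skip the position in every stream at once.
    return ss, asa


# Toy input: positions 2 and 4 of a 5-residue sequence have no DSSP coverage.
observed = {0: ("H", 120.0), 1: ("G", 80.5), 3: ("E", 10.0)}
ss_pad, asa_pad = align_streams(5, observed, gap_handling="pad")
ss_omit, asa_omit = align_streams(5, observed, gap_handling="omit")
```

With 'pad', both lists have one element per residue of the sequence, so they can be indexed by sequence position; with 'omit', the lists stay mutually aligned but no longer map one-to-one onto the sequence.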

Returns:

df_out – A copy of df_seq with appended list columns for each requested feature stream plus a boolean dssp_ok column.

Return type:

pd.DataFrame

Raises:

RuntimeError – If neither mkdssp nor the legacy dssp binary is found on PATH.
ValueError – On invalid arguments or pre-existing output columns in df_seq.
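
Both failure modes can be checked before calling get_dssp. The sketch below uses only the Python standard library; find_dssp_binary and missing_pdb_files are hypothetical helper names and not part of aaanalysis, and the <entry>.pdb naming follows the pdb_folder description above.

```python
import shutil
from pathlib import Path


def find_dssp_binary(candidates=("mkdssp", "dssp")):
    """Return the path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def missing_pdb_files(entries, pdb_folder):
    """Return the entries that have no <entry>.pdb file in pdb_folder."""
    folder = Path(pdb_folder)
    return [entry for entry in entries if not (folder / f"{entry}.pdb").is_file()]
```

A call such as missing_pdb_files(df_seq['entry'], 'pdb_files/') lists the rows that would otherwise end up with dssp_ok=False, and a None result from find_dssp_binary() indicates that get_dssp is likely to raise the RuntimeError described above.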

Examples

get_dssp runs DSSP on each entry’s PDB / CIF file and returns a copy of df_seq with the raw, inspectable per-residue DSSP output (e.g. an ss column of secondary-structure codes) plus a dssp_ok flag. This is the curatable intermediate that encode_dssp turns into a dict_num. The chain whose ATOM sequence best matches df_seq['sequence'] is selected automatically and aligned back to the full sequence. ss_mode ('ss3' or 'ss8') sets the encoding of the ss column, and gap_handling ('pad' or 'omit') sets how positions without DSSP coverage are treated.

Requires aaanalysis[pro] plus a mkdssp (or legacy dssp) binary on PATH, so the calls below are commented out and shown for illustration only.

import aaanalysis as aa

# strp = aa.StructurePreprocessor()
# df = strp.get_dssp(df_seq=df_seq, pdb_folder='pdb_files/',
#                   features=['ss'], ss_mode='ss3', gap_handling='pad')
# # gap_handling ('pad'|'omit') controls how residues DSSP skips are aligned
# # back to the full sequence.
# df[['entry', 'ss', 'dssp_ok']]