StructurePreprocessor.get_dssp
- StructurePreprocessor.get_dssp(df_seq, pdb_folder, features=None, ss_mode='ss3', gap_handling='pad')
Run Define Secondary Structure of Proteins (DSSP) and append per-residue list columns to df_seq.
Runs the DSSP algorithm [Kabsch83] (via the mkdssp binary [Touw15]) on each entry’s PDB file and aligns the output to the target sequence in df_seq, appending per-residue list columns (secondary structure, Accessible Surface Area (ASA), backbone dihedrals, hydrogen bonds) plus a boolean dssp_ok flag. The result is the intermediate input that encode_dssp() consumes; call this method first to inspect the raw DSSP streams before encoding.
Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. entry is used as the PDB-file basename (<entry>.pdb) and sequence is the target sequence to which DSSP output is aligned.
pdb_folder (str or pathlib.Path) – Directory containing one <entry>.pdb file per row of df_seq. Missing files emit a UserWarning and produce dssp_ok=False for that row.
features (list of str, default=['ss', 'asa', 'phi_psi', 'hbonds']) – Which DSSP feature streams to extract. Any subset of {'ss', 'asa', 'phi_psi', 'hbonds'}. Only the requested columns are appended; dssp_ok is always appended.
ss_mode ({'ss3', 'ss8'}, default='ss3') – Secondary-structure encoding for the ss column.
gap_handling ({'pad', 'omit'}, default='pad') – How to handle positions without DSSP coverage. 'pad' preserves length-alignment to df_seq[sequence] and fills with ut.STR_SS_GAP / NaN; 'omit' drops them across all requested streams simultaneously.
- Returns:
df_out – A copy of df_seq with appended list columns for each requested feature stream plus a boolean dssp_ok column.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If mkdssp / dssp is not on PATH.
ValueError – On invalid arguments or pre-existing output columns in df_seq.
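The two failure modes described above (no DSSP binary on PATH, missing structure files) can be checked before calling get_dssp. The following is a minimal pre-flight sketch using only the Python standard library. The helper names find_dssp_binary and missing_pdb_files are hypothetical and not part of aaanalysis, and the lookup order (mkdssp first, then dssp) is an assumption based on the binary names listed under Raises.

```python
import shutil
from pathlib import Path


def find_dssp_binary(candidates=("mkdssp", "dssp")):
    """Return the path of the first DSSP binary found on PATH, or None.

    Hypothetical helper: the candidate names come from the Raises section;
    the search order is an assumption.
    """
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


def missing_pdb_files(entries, pdb_folder):
    """Return the entries that have no '<entry>.pdb' file in pdb_folder."""
    folder = Path(pdb_folder)
    return [entry for entry in entries if not (folder / f"{entry}.pdb").is_file()]
```

Calling missing_pdb_files(df_seq['entry'], 'pdb_files/') before get_dssp would list the rows expected to end up with dssp_ok=False, assuming files follow the <entry>.pdb naming described under Parameters.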
Examples
get_dssp runs DSSP on each entry’s PDB / CIF file and returns a copy of df_seq with the raw, inspectable per-residue DSSP output (e.g. an ss column of secondary-structure codes) plus a dssp_ok flag. This is the curatable intermediate that encode_dssp turns into a dict_num. The chain whose ATOM sequence best matches df_seq[sequence] is selected automatically and aligned back to the full sequence (ss_mode='ss3'|'ss8', gap_handling='pad'|'omit').
Requires aaanalysis[pro] plus a mkdssp (or legacy dssp) binary on PATH, so the call is commented out and shown for illustration only.
    import aaanalysis as aa
    # stp = aa.StructurePreprocessor()
    # df = stp.get_dssp(df_seq=df_seq, pdb_folder='pdb_files/',
    #                   features=['ss'], ss_mode='ss3')
    # df[['entry', 'ss', 'dssp_ok']]
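To make ss_mode and gap_handling more concrete, here is a small self-contained sketch of what the two options plausibly do to per-residue streams. It does not call aaanalysis or DSSP. The 8-to-3 state reduction shown (H/G/I to H, E/B to E, everything else to C) is one common convention and is assumed here, not confirmed as the library's mapping. The gap symbol '-' stands in for ut.STR_SS_GAP, whose actual value is not given in this section, and reduce_ss8 and apply_gap_handling are hypothetical names.

```python
import math

# Assumed 8-state -> 3-state reduction (a common convention; the mapping
# used by the library itself is not stated in this section).
SS8_TO_SS3 = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}
GAP = "-"  # placeholder for ut.STR_SS_GAP (actual value not documented here)


def reduce_ss8(ss8):
    """Map a list of 8-state codes to 3 states; gaps pass through unchanged."""
    return [code if code == GAP else SS8_TO_SS3.get(code, "C") for code in ss8]


def apply_gap_handling(covered, streams, gap_handling="pad"):
    """Apply 'pad' or 'omit' to several per-residue streams at once.

    covered: list of bool, True where DSSP produced output for the residue.
    streams: dict mapping stream name -> list aligned to the full sequence.
    """
    if gap_handling == "pad":
        # Keep full length; uncovered positions get the gap symbol or NaN.
        return {
            name: [v if ok else (GAP if name == "ss" else math.nan)
                   for v, ok in zip(values, covered)]
            for name, values in streams.items()
        }
    if gap_handling == "omit":
        # Drop uncovered positions from every stream simultaneously.
        return {
            name: [v for v, ok in zip(values, covered) if ok]
            for name, values in streams.items()
        }
    raise ValueError(f"gap_handling must be 'pad' or 'omit', got {gap_handling!r}")


# Toy input: the third residue has no DSSP coverage.
covered = [True, True, False, True]
streams = {"ss": ["G", "E", "T", "S"], "asa": [10.0, 55.5, 0.0, 80.2]}

padded = apply_gap_handling(covered, streams, "pad")
omitted = apply_gap_handling(covered, streams, "omit")
ss3_padded = reduce_ss8(padded["ss"])
```

With 'pad', every stream keeps the length of the sequence, so position i in a list still refers to residue i. With 'omit', the lists are shorter but remain aligned to each other, because the same positions are dropped from every stream.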