StructurePreprocessor.encode_dssp
- StructurePreprocessor.encode_dssp(df_seq, pdb_folder=None, *, features, ss_mode='ss3', gap_handling='pad', on_failure='nan', return_df=False)
Run Define Secondary Structure of Proteins (DSSP) and the per-feature encoders to build a [0, 1]-normalized dict_dssp.

DSSP [Kabsch83] assigns per-residue secondary structure (SS), solvent accessibility, and backbone hydrogen-bond geometry from a 3D structure. This method runs it (via the mkdssp binary [Touw15], or reuses the columns already produced by get_dssp()) and encodes the chosen features into the [0, 1]-normalized per-residue dict_num that CPP.run_num() consumes. It is the DSSP-side companion of encode_pdb() (ATOM-record features) and encode_pae() (AlphaFold Predicted Aligned Error (PAE)); stack their outputs with aaanalysis.combine_dict_nums().

Added in version 1.1.0.
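To make the stacking step concrete, the sketch below concatenates two {entry: (L, D)} mappings along the feature axis with plain NumPy. It is only an illustration of the idea: stack_dict_nums is a hypothetical helper, not the implementation of aaanalysis.combine_dict_nums(), and keeping only entries present in every source is an assumption made here.

```python
import numpy as np

def stack_dict_nums(*dict_nums):
    """Concatenate per-residue arrays feature-wise for shared entries (hypothetical helper)."""
    if not dict_nums:
        return {}
    # Assumption: only entries present in every source are kept
    shared = set(dict_nums[0])
    for d in dict_nums[1:]:
        shared &= set(d)
    stacked = {}
    for entry in sorted(shared):
        arrays = [d[entry] for d in dict_nums]
        lengths = {a.shape[0] for a in arrays}
        if len(lengths) != 1:
            raise ValueError(f"{entry}: residue counts differ across sources: {sorted(lengths)}")
        # Rows are residues, columns are features, so stack along axis 1
        stacked[entry] = np.concatenate(arrays, axis=1)
    return stacked

# Toy arrays standing in for DSSP-derived and PAE-derived outputs
dict_dssp = {"P1": np.zeros((5, 4)), "P2": np.ones((3, 4))}
dict_pae = {"P1": np.full((5, 1), 0.5)}
combined = stack_dict_nums(dict_dssp, dict_pae)
print({entry: arr.shape for entry, arr in combined.items()})
```

The per-entry residue count must agree across sources, which is why the helper raises instead of silently truncating.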
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Pre-computed DSSP columns (from get_dssp()) are reused if present; otherwise DSSP is run inline.
pdb_folder (str or pathlib.Path) – Directory containing one <entry>.pdb file per row. Required when df_seq does not already carry the necessary DSSP columns.
features (list of str) – Feature keys from the StructurePreprocessor registry that belong to encode_dssp: any subset of {'ss3', 'ss8', 'rasa', 'phi_psi_sincos', 'hbond_donor', 'hbond_acceptor'}. Each key’s output is normalized to [0, 1] per the registry’s NORMALIZATION_RECIPES (de-normalization table in the class Notes).
ss_mode ({'ss3', 'ss8'}, default='ss3') – Forwarded to get_dssp() when DSSP is run inline. The chosen SS feature key ('ss3' / 'ss8') drives the actual one-hot dimensionality independently of this option.
gap_handling ({'pad', 'omit'}, default='pad') – Forwarded to get_dssp() when DSSP is run inline.
on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose DSSP run failed. 'nan' fills with NaN-only tensors; 'drop' removes failed entries from the output dict; 'raise' raises RuntimeError if any entry failed.
return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element (dict_num, df_seq_out). If False (default), return only dict_num.
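As a rough picture of how feature keys could turn into [0, 1] columns, the sketch below encodes a toy residue stretch for features=['ss3', 'rasa', 'phi_psi_sincos']. The registry's actual NORMALIZATION_RECIPES are not shown in this section, so everything here is an assumption: the H/E/C column order, the (x + 1) / 2 rescaling of sine and cosine values, and treating relative accessibility as already lying in [0, 1]. The helper names are hypothetical.

```python
import numpy as np

SS3_STATES = "HEC"  # assumed column order: helix, strand, coil

def one_hot_ss3(ss):
    """One-hot encode a 3-state SS string into an (L, 3) array."""
    out = np.zeros((len(ss), len(SS3_STATES)))
    for i, state in enumerate(ss):
        out[i, SS3_STATES.index(state)] = 1.0
    return out

def sincos_01(phi_deg, psi_deg):
    """Encode dihedrals as sin/cos values rescaled from [-1, 1] to [0, 1] (assumed recipe)."""
    angles = np.deg2rad(np.column_stack([phi_deg, psi_deg]))  # (L, 2)
    raw = np.column_stack([np.sin(angles), np.cos(angles)])   # (L, 4)
    return (raw + 1.0) / 2.0

# Toy per-residue values for a four-residue stretch
ss = "HHEC"
rasa = np.array([0.10, 0.25, 0.80, 1.00])
phi = np.array([-60.0, -62.0, -120.0, 90.0])
psi = np.array([-45.0, -41.0, 130.0, 0.0])

# Blocks are concatenated in the order the feature keys are listed
encoded = np.hstack([one_hot_ss3(ss), rasa[:, None], sincos_01(phi, psi)])
print(encoded.shape)
```

Encoding angles through sine and cosine avoids the discontinuity at ±180°, which is presumably why the key is named phi_psi_sincos.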
- Returns:
dict_dssp (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue DSSP features concatenated in the order of features. Values are in [0, 1] (NaN for unresolved positions).
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of the (possibly DSSP-augmented) df_seq plus an encode_dssp_ok column flagging per-row success. Rows are dropped when on_failure='drop'.
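The documented return contract (one 2-D array per entry, a shared D_total, values in [0, 1] or NaN) can be checked before passing the result on. The sketch below is one way to do that with NumPy; check_dict_num is a hypothetical helper, and the expectation that L_entry equals the sequence length is an assumption that may not hold for every gap_handling setting.

```python
import numpy as np

def check_dict_num(dict_num, sequences):
    """Collect shape and range problems in a {entry: (L, D)} mapping (hypothetical helper)."""
    problems = []
    widths = {arr.shape[1] for arr in dict_num.values() if arr.ndim == 2}
    if len(widths) > 1:
        problems.append(f"inconsistent D_total across entries: {sorted(widths)}")
    for entry, arr in dict_num.items():
        if arr.ndim != 2:
            problems.append(f"{entry}: expected a 2-D array, got ndim={arr.ndim}")
            continue
        # Assumption: one row per residue of the full sequence
        if arr.shape[0] != len(sequences[entry]):
            problems.append(f"{entry}: {arr.shape[0]} rows for {len(sequences[entry])} residues")
        # NaN marks unresolved positions, so only finite values are range-checked
        finite = arr[~np.isnan(arr)]
        if finite.size and (finite.min() < 0.0 or finite.max() > 1.0):
            problems.append(f"{entry}: values outside [0, 1]")
    return problems

sequences = {"P1": "MKTAY", "P2": "GAV"}
good = {"P1": np.full((5, 4), 0.5), "P2": np.full((3, 4), np.nan)}
bad = {"P1": np.full((4, 4), 1.5), "P2": np.zeros((3, 4))}
problems_good = check_dict_num(good, sequences)
problems_bad = check_dict_num(bad, sequences)
print(problems_good, problems_bad)
```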
- Raises:
ValueError – On invalid arguments or feature keys not in this method’s registry slice.
RuntimeError – If mkdssp is unavailable, or if any entry failed under on_failure='raise'.
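Because a missing mkdssp binary surfaces as a RuntimeError, it can be convenient to check for it up front. The sketch below uses the standard-library shutil.which; require_binary is a hypothetical helper, and how encode_dssp itself locates the binary is not specified in this section.

```python
import shutil

def require_binary(name="mkdssp"):
    """Return the resolved path of `name`, or raise if it is not on PATH (hypothetical helper)."""
    path = shutil.which(name)
    if path is None:
        raise RuntimeError(f"'{name}' was not found on PATH; install DSSP or adjust PATH")
    return path

# Non-raising probe, suitable for a notebook cell or a test skip condition
mkdssp_path = shutil.which("mkdssp")
status = "available" if mkdssp_path else "missing"
print(f"mkdssp is {status}")
```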
Examples
encode_dssp runs DSSP on each entry’s structure file and encodes the per-residue secondary structure / solvent accessibility / backbone dihedrals into a [0, 1]-normalized dict_num ({entry: (L, D)}) for CPP.run_num. features is any subset of {ss3, ss8, rasa, phi_psi_sincos, hbond_donor, hbond_acceptor}.

Requires aaanalysis[pro] plus a mkdssp binary on PATH, so the call is shown here illustratively (see get_dssp for the executed DSSP example).

    import aaanalysis as aa
    # stp = aa.StructurePreprocessor()
    # dict_dssp = stp.encode_dssp(df_seq=df_seq, pdb_folder='pdb_files/',
    #                             features=['ss3', 'rasa', 'phi_psi_sincos'])
    # # {entry: (L, D)} with values in [0, 1]; stack with other sources
    # # via aa.combine_dict_nums and feed NumericalFeature.get_parts.
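Since pdb_folder is expected to hold one <entry>.pdb file per row, a quick pre-flight check can list entries without a structure file before DSSP is run. The sketch below uses only the standard library; missing_structures is a hypothetical helper, and the temporary directory and entry names are made up for the demonstration.

```python
import tempfile
from pathlib import Path

def missing_structures(entries, pdb_folder):
    """Return the entries that lack an <entry>.pdb file in pdb_folder (hypothetical helper)."""
    folder = Path(pdb_folder)
    return [entry for entry in entries if not (folder / f"{entry}.pdb").is_file()]

# Demonstration on a throwaway folder holding a single placeholder file
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "P12345.pdb").write_text("END\n")
    missing = missing_structures(["P12345", "Q67890"], tmp)

print(missing)
```

In practice the entries would come from the entry column of df_seq.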
Further parameters.
StructurePreprocessor.encode_dssp also accepts:
- ss_mode – Forwarded to get_dssp() when DSSP is run inline;
- gap_handling – Forwarded to get_dssp() when DSSP is run inline;
- on_failure – What to do for entries whose DSSP run failed;
- return_df – If True, also return the per-row status DataFrame as a second element (dict_num, df_seq_out).
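The three on_failure policies can be illustrated on a plain dictionary in which a failed DSSP run is represented by None. This is a sketch of the documented semantics only: apply_on_failure, the None convention, and the lengths argument are assumptions introduced here, not part of the library.

```python
import numpy as np

def apply_on_failure(results, lengths, n_features, on_failure="nan"):
    """Resolve failed entries (value None) according to the chosen policy (hypothetical helper)."""
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    failed = [entry for entry, arr in results.items() if arr is None]
    if failed and on_failure == "raise":
        raise RuntimeError(f"DSSP failed for: {failed}")
    out = {}
    for entry, arr in results.items():
        if arr is not None:
            out[entry] = arr
        elif on_failure == "nan":
            # Keep the entry, filled with NaN at the expected shape
            out[entry] = np.full((lengths[entry], n_features), np.nan)
        # 'drop': the failed entry is simply left out
    return out

results = {"P1": np.zeros((5, 4)), "P2": None}
lengths = {"P1": 5, "P2": 3}
with_nan = apply_on_failure(results, lengths, n_features=4, on_failure="nan")
dropped = apply_on_failure(results, lengths, n_features=4, on_failure="drop")
print(sorted(with_nan), sorted(dropped))
```

With 'nan' the output keeps one array per input row, which keeps it aligned with df_seq; 'drop' trades that alignment for a dictionary free of NaN-only arrays.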