StructurePreprocessor.get_domains

StructurePreprocessor.get_domains(df_seq, pdb_folder=None, pae_folder=None, tool='afragmenter', chainsaw_path=None, resolution=0.7, threshold=2.0, on_failure='nan')

Run a domain-segmentation tool and append a chopping column.

Mirrors the get_dssp / encode_dssp pattern: this method runs the external tool inline and returns a copy of df_seq with two appended columns: chopping, a string in the common Merizo [Lau23] / Chainsaw [Wells24] chopping format, and domain_ok, a boolean. The result feeds into encode_domains(), which now accepts the in-memory chopping column directly.
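To inspect a chopping string outside the library, it can be split into residue ranges. The following is a minimal sketch, not the library's parser: parse_chopping is a hypothetical helper, and the delimiter layout (domains separated by ',', discontinuous segments of one domain joined by '_', each segment written as 'start-end') is an assumption that should be checked against the strings your tool actually produces.

```python
def parse_chopping(chopping):
    """Parse a chopping string into a list of domains.

    Assumed layout (verify against real output): domains are separated
    by ',', discontinuous segments of one domain are joined by '_', and
    each segment is 'start-end'.
    """
    domains = []
    if not chopping:  # '' marks an entry without a segmentation
        return domains
    for domain in chopping.split(","):
        segments = []
        for segment in domain.split("_"):
            start, end = segment.split("-")
            segments.append((int(start), int(end)))
        domains.append(segments)
    return domains


# Made-up example string, not real tool output.
example = "1-100_150-200,101-149"
parsed = parse_chopping(example)
print(parsed)  # [[(1, 100), (150, 200)], [(101, 149)]]
```

Each inner list is one domain; a domain with more than one tuple is discontinuous.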

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences.

  • pdb_folder (str or pathlib.Path, optional) – Directory with one <entry>.pdb / .cif / .pdb.gz / .cif.gz per row. Required when tool='chainsaw'.

  • pae_folder (str or pathlib.Path, optional) – Directory with one PAE JSON per row (same canonical filename resolution as encode_pae()). Required when tool='afragmenter'.

  • tool ({'chainsaw', 'afragmenter'}, default='afragmenter') –

    Which segmentation tool to run.

    • 'afragmenter' ([Verwimp25]): a schema-free, tuneable segmenter that builds a residue network from the AlphaFold Predicted Aligned Error (PAE) matrix and finds domains by Leiden clustering. Pip-installable through the optional extra (pip install aaanalysis[pro]); it is imported lazily, so the install hint appears only when this tool is requested. Operates on the PAE matrix from pae_folder.

    • 'chainsaw' ([Wells24]): a fully-convolutional neural network that predicts domain boundaries from a PDB / CIF structure. Not on PyPI; clone the repo locally and pass its directory as chainsaw_path. Operates on PDB / CIF from pdb_folder via subprocess.

  • chainsaw_path (str or pathlib.Path, optional) – Path to a local clone of the Chainsaw repository. Required when tool='chainsaw' (ignored otherwise).

  • resolution (float, default=0.7) – AFragmenter Leiden-resolution knob (only used when tool='afragmenter').

  • threshold (float, default=2.0) – AFragmenter PAE graph-edge threshold in Å (only used when tool='afragmenter').

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – How to handle entries for which the tool fails or the input file is missing. 'nan' sets chopping to an empty string and domain_ok=False; 'drop' removes the row; 'raise' raises a RuntimeError.
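Because missing files are handled per entry through on_failure, it can help to check pdb_folder before a long run. The sketch below is a caller-side pre-flight check, not part of the library: find_structure_file and missing_entries are hypothetical helpers, the entry names are made up, and the order in which the four suffixes are tried is an assumption (the library's own precedence is not documented here).

```python
import tempfile
from pathlib import Path

# Suffixes listed for pdb_folder in the parameter description.
STRUCTURE_SUFFIXES = (".pdb", ".cif", ".pdb.gz", ".cif.gz")


def find_structure_file(pdb_folder, entry):
    """Return the first existing <entry><suffix> path, or None."""
    folder = Path(pdb_folder)
    for suffix in STRUCTURE_SUFFIXES:
        candidate = folder / f"{entry}{suffix}"
        if candidate.is_file():
            return candidate
    return None


def missing_entries(pdb_folder, entries):
    """List the entries that have no structure file in pdb_folder."""
    return [e for e in entries if find_structure_file(pdb_folder, e) is None]


# Demo on a temporary folder with two placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "P12345.pdb").touch()
    (Path(tmp) / "Q67890.cif.gz").touch()
    missing = missing_entries(tmp, ["P12345", "Q67890", "A00001"])

print(missing)  # ['A00001']
```

With a real DataFrame, the same check would be missing_entries(pdb_folder, df_seq['entry']).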

Returns:

df_out – A copy of df_seq with two appended columns:

  • chopping (str): the chopping string in the common Merizo / Chainsaw format, or '' on failure.

  • domain_ok (bool): True if the tool returned a non-empty chopping for this entry.

Return type:

pd.DataFrame
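The relationship between on_failure and the two appended columns can be modelled in plain Python. This is only an illustration of the documented behaviour, using lists of dicts instead of a DataFrame; apply_on_failure is a hypothetical function, the entries and chopping values are made up, and raising on the first failed entry is an assumption about how 'raise' behaves.

```python
def apply_on_failure(rows, choppings, on_failure="nan"):
    """Attach 'chopping' and 'domain_ok' to each row.

    rows: list of dicts with an 'entry' key.
    choppings: dict mapping entry -> chopping string (None or '' = failure).
    """
    if on_failure not in ("nan", "drop", "raise"):
        raise ValueError(f"invalid on_failure: {on_failure!r}")
    out = []
    for row in rows:
        chopping = choppings.get(row["entry"])
        ok = bool(chopping)  # True only for a non-empty chopping
        if not ok:
            if on_failure == "raise":
                raise RuntimeError(f"segmentation failed for {row['entry']!r}")
            if on_failure == "drop":
                continue  # the row is removed from the output
            chopping = ""  # 'nan': keep the row with an empty chopping
        out.append({**row, "chopping": chopping, "domain_ok": ok})
    return out


rows = [{"entry": "P1"}, {"entry": "P2"}]
choppings = {"P1": "1-50,51-120", "P2": None}  # P2 stands for a failed entry

kept = apply_on_failure(rows, choppings, on_failure="nan")
dropped = apply_on_failure(rows, choppings, on_failure="drop")
print(len(kept), len(dropped))  # 2 1
```

Under 'nan' the output keeps every row, so domain_ok is the column to filter on before downstream encoding.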

Raises:
  • ValueError – On invalid arguments or missing per-tool kwargs.

  • RuntimeError – If the tool’s Python dependency is not installed (AFragmenter via [pro]), if chainsaw_path is invalid, or if any entry failed under on_failure='raise'.
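The per-tool requirements can also be checked on the caller's side before any tool is started. The sketch below is a hypothetical check_tool_kwargs helper, not the library's validation code. Treating an unset chainsaw_path as a missing argument (ValueError) rather than an invalid path (RuntimeError) is an assumption.

```python
def check_tool_kwargs(tool, pdb_folder=None, pae_folder=None, chainsaw_path=None):
    """Raise ValueError if a folder or path required by `tool` is missing."""
    if tool == "afragmenter":
        required = {"pae_folder": pae_folder}
    elif tool == "chainsaw":
        required = {"pdb_folder": pdb_folder, "chainsaw_path": chainsaw_path}
    else:
        raise ValueError(f"tool must be 'chainsaw' or 'afragmenter', got {tool!r}")
    missing = [name for name, value in required.items() if value is None]
    if missing:
        raise ValueError(f"tool={tool!r} requires: {', '.join(missing)}")


# Passes silently: afragmenter only needs the PAE folder.
check_tool_kwargs("afragmenter", pae_folder="pae_files/")

# Fails: chainsaw needs both pdb_folder and chainsaw_path.
try:
    check_tool_kwargs("chainsaw", pdb_folder="pdb_files/")
except ValueError as err:
    message = str(err)
    print(message)  # tool='chainsaw' requires: chainsaw_path
```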

Examples

get_domains runs a structural domain-segmentation tool (afragmenter, which needs an AlphaFold PAE JSON per entry, or chainsaw) and returns df_seq with a per-entry chopping string of domain residue ranges. This is the raw, inspectable segmentation that encode_domains then turns into features.

The call requires the external tool, so it is shown commented out for illustration.

import aaanalysis as aa

# stp = aa.StructurePreprocessor()
# df_dom = stp.get_domains(df_seq=df_seq, pae_folder='pae_files/',
#                          tool='afragmenter', resolution=0.7)
# df_dom[['entry', 'chopping']]

Further parameters. StructurePreprocessor.get_domains also accepts pdb_folder, chainsaw_path, threshold, and on_failure; see Parameters above for their descriptions.