aaanalysis.StructurePreprocessor.get_domains

StructurePreprocessor.get_domains(df_seq=None, pdb_folder=None, pae_folder=None, tool='afragmenter', chainsaw_path=None, resolution=0.7, threshold=2.0, on_failure='nan', verbose=None)

Run a domain-segmentation tool and append a chopping column.

Mirrors the get_dssp / encode_dssp pattern: this method runs the external tool inline and returns a copy of df_seq with two appended columns, chopping (Merizo/ChainSaw common format) and domain_ok (boolean). The result feeds into encode_domains(), which now accepts the in-memory chopping column directly.
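A minimal usage sketch of the call described above, using only the arguments from the signature. The wrapper name segment_domains is hypothetical, and constructing StructurePreprocessor without arguments is an assumption, not something this page confirms.

```python
def segment_domains(df_seq, pae_folder):
    """Run AFragmenter-based segmentation and keep rows with a usable chopping."""
    # Imported lazily so the sketch can be defined without aaanalysis installed.
    import aaanalysis as aa

    # Assumption: the class can be instantiated with default arguments.
    sp = aa.StructurePreprocessor()
    df_dom = sp.get_domains(
        df_seq=df_seq,
        pae_folder=pae_folder,  # directory with one PAE JSON per entry
        tool="afragmenter",
        resolution=0.7,
        threshold=2.0,
        on_failure="nan",       # keep failed rows, flagged by domain_ok=False
    )
    # Keep only entries for which the tool returned a non-empty chopping.
    return df_dom[df_dom["domain_ok"]]
```

The filtered frame still carries the chopping column, so it can be passed on to encode_domains().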

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences.

  • pdb_folder (str or pathlib.Path, optional) – Directory with one <entry>.pdb / .cif / .pdb.gz / .cif.gz per row. Required when tool='chainsaw'.

  • pae_folder (str or pathlib.Path, optional) – Directory with one PAE JSON per row (same canonical filename resolution as encode_pae()). Required when tool='afragmenter'.

  • tool ({'chainsaw', 'afragmenter'}, default='afragmenter') –

    Which segmentation tool to run.

    • 'afragmenter': pip-installable PAE-based segmenter. Requires the optional extra, installed with pip install aaanalysis[pro]. The dependency is imported lazily, so the install hint is raised only when this tool is requested. Operates on the PAE matrix from pae_folder.

    • 'chainsaw': PDB-based segmenter (ChainSaw). Not on PyPI; clone the repository locally and pass its directory as chainsaw_path. Runs as a subprocess on the PDB / CIF files in pdb_folder.

  • chainsaw_path (str or pathlib.Path, optional) – Local clone of the ChainSaw repository. Required when tool='chainsaw' (ignored otherwise).

  • resolution (float, default=0.7) – AFragmenter Leiden-resolution knob (only used when tool='afragmenter').

  • threshold (float, default=2.0) – AFragmenter PAE graph-edge threshold in Å (only used when tool='afragmenter').

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do with entries for which the tool fails or the input file is missing. 'nan' sets chopping to an empty string and domain_ok to False; 'drop' removes the row; 'raise' re-raises the error.

  • verbose (bool, optional) – Override instance verbosity for this call only.
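The three on_failure modes can be illustrated with a small stand-alone sketch. This is not the library's implementation: apply_on_failure and fake_tool are hypothetical names, rows are plain dicts instead of a DataFrame, and treating an empty chopping as a failure is an assumption.

```python
def apply_on_failure(rows, run_tool, on_failure="nan"):
    """Illustrative only: attach 'chopping' and 'domain_ok' to each row dict."""
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    out = []
    for row in rows:
        try:
            chopping, error = run_tool(row["entry"]), None
        except Exception as exc:  # tool crashed or input file was missing
            chopping, error = "", exc
        if not chopping:
            if on_failure == "raise":
                raise RuntimeError(f"Segmentation failed for {row['entry']}") from error
            if on_failure == "drop":
                continue  # skip the failed entry entirely
        # 'nan' mode (or success): keep the row and flag whether it is usable.
        out.append({**row, "chopping": chopping, "domain_ok": bool(chopping)})
    return out


def fake_tool(entry):
    # Hypothetical stand-in for the external tool: fails for one entry.
    if entry == "P2":
        raise FileNotFoundError("no PAE file")
    return "1-50,51-120"


rows = [{"entry": "P1"}, {"entry": "P2"}]
kept = apply_on_failure(rows, fake_tool, on_failure="nan")
dropped = apply_on_failure(rows, fake_tool, on_failure="drop")
```

With 'nan' the output keeps one row per input entry, which is convenient when the result must stay aligned with df_seq; 'drop' trades that alignment for a frame that contains only usable choppings.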

Returns:

df_out – A copy of df_seq with two appended columns:

  • chopping (str): the chopping string in the Merizo/ChainSaw common format, or '' on failure.

  • domain_ok (bool): True if the tool returned a non-empty chopping for this entry.

Return type:

pd.DataFrame
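For downstream inspection, a chopping string can be split into residue ranges. The sketch below assumes the convention commonly used by ChainSaw and Merizo, in which commas separate domains, underscores join the segments of a discontinuous domain, and each segment is an inclusive start-end pair; this page does not spell out the format, so treat it as an assumption. parse_chopping is a hypothetical helper, not part of aaanalysis.

```python
def parse_chopping(chopping):
    """Parse a chopping string into domains, each a list of (start, end) segments."""
    domains = []
    if not chopping:
        # Empty string: the tool failed or found no domains (domain_ok=False).
        return domains
    for domain in chopping.split(","):
        segments = []
        for segment in domain.split("_"):
            start, end = segment.split("-")
            segments.append((int(start), int(end)))
        domains.append(segments)
    return domains


# Two domains; the first is discontinuous (two segments).
example = parse_chopping("1-50_120-180,51-119")
```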

Raises:
  • ValueError – On invalid arguments or missing per-tool kwargs.

  • RuntimeError – If the tool’s Python dependency is not installed (AFragmenter via [pro]), if chainsaw_path is invalid, or if any entry failed under on_failure='raise'.
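The per-tool argument requirements behind the ValueError case can be summarized in a short sketch. check_tool_kwargs is a hypothetical helper written for illustration; the actual checks and messages inside get_domains may differ.

```python
def check_tool_kwargs(tool, pdb_folder=None, pae_folder=None, chainsaw_path=None):
    """Illustrative check of the per-tool required arguments listed above."""
    if tool not in ("chainsaw", "afragmenter"):
        raise ValueError(f"tool must be 'chainsaw' or 'afragmenter', got {tool!r}")
    if tool == "afragmenter" and pae_folder is None:
        raise ValueError("pae_folder is required when tool='afragmenter'")
    if tool == "chainsaw":
        # ChainSaw needs both the structure files and a local clone of the repo.
        required = (("pdb_folder", pdb_folder), ("chainsaw_path", chainsaw_path))
        missing = [name for name, value in required if value is None]
        if missing:
            raise ValueError(f"{', '.join(missing)} required when tool='chainsaw'")


# Valid combinations pass silently.
check_tool_kwargs("afragmenter", pae_folder="pae/")
check_tool_kwargs("chainsaw", pdb_folder="pdb/", chainsaw_path="chainsaw/")
```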