StructurePreprocessor.get_domains
- StructurePreprocessor.get_domains(df_seq, pdb_folder=None, pae_folder=None, tool='afragmenter', chainsaw_path=None, resolution=0.7, threshold=2.0, on_failure='nan')
Run a domain-segmentation tool and append a chopping column.

Mirrors the get_dssp → encode_dssp pattern: this method runs the external tool inline and returns a copy of df_seq with appended chopping (the common Merizo [Lau23] / Chainsaw [Wells24] chopping-string format) and domain_ok boolean columns. The result feeds into encode_domains() (which now accepts the in-memory chopping column directly).

Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences.
pdb_folder (str or pathlib.Path, optional) – Directory with one <entry>.pdb / .cif / .pdb.gz / .cif.gz per row. Required when tool='chainsaw'.
pae_folder (str or pathlib.Path, optional) – Directory with one PAE JSON per row (same canonical filename resolution as encode_pae()). Required when tool='afragmenter'.
tool ({'chainsaw', 'afragmenter'}, default='afragmenter') – Which segmentation tool to run.
* 'afragmenter' ([Verwimp25]): a schema-free, tuneable segmenter that builds a residue network from the AlphaFold Predicted Aligned Error (PAE) matrix and finds domains by Leiden clustering. Pip-installable; requires the optional extra, pip install aaanalysis[pro] (imported lazily; the install hint is shown only when this tool is requested). Operates on the PAE matrix from pae_folder.
* 'chainsaw' ([Wells24]): a fully-convolutional neural network that predicts domain boundaries from a PDB / CIF structure. Not on PyPI; clone the repo locally and pass its directory as chainsaw_path. Operates on PDB / CIF files from pdb_folder via subprocess.
chainsaw_path (str or pathlib.Path, optional) – Local clone of the Chainsaw repository. Required when tool='chainsaw' (ignored otherwise).
resolution (float, default=0.7) – AFragmenter Leiden-resolution knob (only used when tool='afragmenter').
threshold (float, default=2.0) – AFragmenter PAE graph-edge threshold in Å (only used when tool='afragmenter').
on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries where the tool fails or the input file is missing. 'nan' fills chopping with an empty string and marks domain_ok=False; 'drop' removes the row; 'raise' re-raises the error.
- Returns:
df_out – A copy of df_seq with two appended columns:
* chopping (str): the Merizo/Chainsaw common-format chopping string, or '' on failure.
* domain_ok (bool): True if the tool returned a non-empty chopping for this entry.
- Return type:
pd.DataFrame
- Raises:
ValueError – On invalid arguments or missing per-tool kwargs.
RuntimeError – If the tool’s Python dependency is not installed (AFragmenter via [pro]), if chainsaw_path is invalid, or if any entry failed under on_failure='raise'.
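The per-tool requirements and error types described above can be sketched as a small pre-flight check. This is a minimal illustration, not the library's actual validation code: the helper name check_tool_args is hypothetical, and treating "invalid chainsaw_path" as "not an existing directory" is an assumption.

```python
from pathlib import Path


def check_tool_args(tool="afragmenter", pdb_folder=None, pae_folder=None,
                    chainsaw_path=None):
    """Hypothetical pre-flight check mirroring the documented requirements."""
    if tool not in ("chainsaw", "afragmenter"):
        raise ValueError(f"'tool' should be 'chainsaw' or 'afragmenter', got {tool!r}")
    if tool == "afragmenter" and pae_folder is None:
        # AFragmenter works on the PAE matrix, so a PAE folder is required
        raise ValueError("'pae_folder' is required when tool='afragmenter'")
    if tool == "chainsaw":
        # Chainsaw needs both the structures and a local clone of the repo
        required = (("pdb_folder", pdb_folder), ("chainsaw_path", chainsaw_path))
        missing = [name for name, value in required if value is None]
        if missing:
            raise ValueError(f"{missing} required when tool='chainsaw'")
        # Assumption: an "invalid" path means it is not an existing directory
        if not Path(chainsaw_path).is_dir():
            raise RuntimeError(f"'chainsaw_path' is not a directory: {chainsaw_path}")
```

Calling such a check before the (potentially slow) segmentation run surfaces a missing folder or clone immediately rather than once per entry.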
Examples
get_domains runs a structural domain-segmentation tool (afragmenter, which needs an AlphaFold PAE sidecar, or chainsaw) and returns df_seq with a per-entry chopping string of domain residue ranges: the raw, inspectable segmentation that encode_domains then turns into features. The call requires the external tool, so it is shown commented out for illustration only.
import aaanalysis as aa
# stp = aa.StructurePreprocessor()
# df_dom = stp.get_domains(df_seq=df_seq, pae_folder='pae_files/',
#                          tool='afragmenter', resolution=0.7)
# df_dom[['entry', 'chopping']]
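To inspect a chopping string by hand, it can be split into domains and segments. The sketch below assumes the commonly used Merizo/Chainsaw convention in which domains are separated by commas and the segments of a discontinuous domain are joined by underscores (for example 1-120,121-200_310-380). The helper parse_chopping is hypothetical and not part of aaanalysis.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) residue positions, as written in the string


def parse_chopping(chopping: str) -> List[List[Segment]]:
    """Split a chopping string into domains, each a list of (start, end) segments.

    Assumes comma-separated domains and underscore-joined segments.
    """
    domains: List[List[Segment]] = []
    if not chopping:
        return domains  # empty string: no domains (e.g. a failed entry)
    for domain in chopping.split(","):
        segments: List[Segment] = []
        for segment in domain.split("_"):
            start, end = segment.split("-")
            segments.append((int(start), int(end)))
        domains.append(segments)
    return domains


print(parse_chopping("1-120,121-200_310-380"))
# [[(1, 120)], [(121, 200), (310, 380)]]
```

Here the second domain is discontinuous: it consists of residues 121-200 and 310-380.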
Further parameters.
StructurePreprocessor.get_domains also accepts:
* pdb_folder: Directory with one <entry>.pdb / .cif / .pdb.gz / .cif.gz per row.
* chainsaw_path: Local clone of the Chainsaw repository.
* threshold: AFragmenter PAE graph-edge threshold in Å (only used when tool='afragmenter').
* on_failure: What to do for entries where the tool fails or the input file is missing.
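The three on_failure modes can be illustrated without running any tool. The sketch below is a plain-Python stand-in, not the library's implementation: apply_on_failure is a hypothetical helper, and results is an assumed mapping from entry to chopping string, with None marking a failed entry.

```python
from typing import Dict, List, Optional


def apply_on_failure(entries: List[str], results: Dict[str, Optional[str]],
                     on_failure: str = "nan") -> List[dict]:
    """Build output rows from per-entry results (None = tool failed)."""
    if on_failure not in ("nan", "drop", "raise"):
        raise ValueError(f"'on_failure' should be 'nan', 'drop' or 'raise', got {on_failure!r}")
    rows = []
    for entry in entries:
        chopping = results.get(entry)
        if chopping:
            # Non-empty chopping: the entry succeeded
            rows.append({"entry": entry, "chopping": chopping, "domain_ok": True})
        elif on_failure == "nan":
            # Keep the row, with an empty chopping and domain_ok=False
            rows.append({"entry": entry, "chopping": "", "domain_ok": False})
        elif on_failure == "raise":
            raise RuntimeError(f"Domain segmentation failed for entry {entry!r}")
        # on_failure == 'drop': the failed row is simply left out
    return rows


results = {"P1": "1-50,51-120", "P2": None}
print(len(apply_on_failure(["P1", "P2"], results, on_failure="nan")))   # 2
print(len(apply_on_failure(["P1", "P2"], results, on_failure="drop")))  # 1
```

With 'nan' the output keeps one row per input entry, which is convenient when the result must stay aligned with df_seq; 'drop' is the stricter choice when every downstream row needs a valid chopping.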