aaanalysis.StructurePreprocessor.encode_domains

StructurePreprocessor.encode_domains(df_seq=None, domain_folder=None, features=None, on_failure='nan', return_df=False, verbose=None)

Read pre-computed domain segmentation files into dict_domains.

Bring-your-own-segmentation: before calling this method, run Merizo, ChainSaw, or AFragmenter on your PDB files (or prepare a hand-curated domain table) and save each chopping string (Merizo/ChainSaw native format) to one file per entry in domain_folder. The resolver looks files up by entry name and accepts two formats:

  • <entry>.txt — first non-empty line is the chopping string, e.g. 6-18_296-459,19-156.

  • <entry>.tsv — Merizo/ChainSaw TSV output with a chopping header (first data row used).
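The lookup described above can be sketched with the standard library only. This is a minimal illustration, not the AAanalysis resolver: read_chopping is a hypothetical helper, the preference for .txt over .tsv is an assumption, and the TSV column is assumed to be named chopping.

```python
import csv
from pathlib import Path


def read_chopping(domain_folder, entry):
    """Return the chopping string for `entry`, or None if no file is found.

    Hypothetical helper, not part of AAanalysis: it only mirrors the two
    file layouts described above.
    """
    folder = Path(domain_folder)

    txt_path = folder / f"{entry}.txt"
    if txt_path.is_file():
        # <entry>.txt: the first non-empty line is the chopping string.
        for line in txt_path.read_text().splitlines():
            if line.strip():
                return line.strip()
        return None

    tsv_path = folder / f"{entry}.tsv"
    if tsv_path.is_file():
        # <entry>.tsv: take the "chopping" column of the first data row
        # (assumed column name; checking .txt first is also an assumption).
        with tsv_path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            first_row = next(reader, None)
        if first_row is not None:
            return first_row.get("chopping")
    return None
```

A missing file yields None here; how the real method reports that case is governed by on_failure.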

The chopping format: domains are separated by commas, and segments within a domain are separated by underscores; each segment is a 1-based, inclusive start-end range. Discontinuous domains are supported.
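To make the format concrete, the following sketch parses a chopping string into nested lists of (start, end) tuples. parse_chopping is a hypothetical helper written for illustration; the parser used inside AAanalysis may differ, for example in how it validates input.

```python
def parse_chopping(chopping):
    """Parse 'a-b_c-d,e-f' into [[(a, b), (c, d)], [(e, f)]]."""
    domains = []
    for domain_str in chopping.strip().split(","):
        segments = []
        for segment_str in domain_str.split("_"):
            start_str, end_str = segment_str.split("-")
            start, end = int(start_str), int(end_str)
            # Segments are 1-based and inclusive, so start >= 1 and end >= start.
            if start < 1 or end < start:
                raise ValueError(f"Invalid segment: {segment_str!r}")
            segments.append((start, end))
        domains.append(segments)
    return domains


domains = parse_chopping("6-18_296-459,19-156")
# domains == [[(6, 18), (296, 459)], [(19, 156)]]

# Inclusive ranges: 6-18 covers 18 - 6 + 1 = 13 residues.
domain_sizes = [sum(end - start + 1 for start, end in d) for d in domains]
# domain_sizes == [177, 138]
```

The first domain in this example is discontinuous: it consists of two segments (6-18 and 296-459) that flank the second domain.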

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column. len(sequence) is used as the L for each per-residue output tensor.

  • domain_folder (str or pathlib.Path) – Directory containing one chopping file per row of df_seq.

  • features (list of str) – Feature keys belonging to encode_domains: any subset of {domain_boundary, domain_relative_position, domain_size, n_domains_in_protein}. All outputs are normalized to [0, 1], with NaN for residues not assigned to any domain.

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose chopping file is missing or unparseable. 'nan' fills those entries with NaN-only tensors; 'drop' removes them; 'raise' raises a RuntimeError.

  • return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element, giving (dict_domains, df_seq_out). If False (default), return only dict_domains.

  • verbose (bool, optional) – Override instance verbosity for this call only.
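A possible end-to-end call is sketched below under stated assumptions: write_chopping_files and encode_example are hypothetical helpers, and constructing StructurePreprocessor without arguments is an assumption that this page does not confirm. The aaanalysis and pandas imports are kept inside encode_example so the file-preparation part runs on its own.

```python
import tempfile
from pathlib import Path


def write_chopping_files(choppings, domain_folder):
    """Write one <entry>.txt chopping file per entry and return the folder."""
    folder = Path(domain_folder)
    folder.mkdir(parents=True, exist_ok=True)
    for entry, chopping in choppings.items():
        (folder / f"{entry}.txt").write_text(chopping + "\n")
    return folder


def encode_example(sequences, domain_folder):
    """Call encode_domains on a small df_seq (needs pandas and aaanalysis)."""
    import pandas as pd
    import aaanalysis as aa

    df_seq = pd.DataFrame(
        {"entry": list(sequences), "sequence": list(sequences.values())}
    )
    # Assumption: the constructor works without arguments.
    sp = aa.StructurePreprocessor()
    return sp.encode_domains(
        df_seq=df_seq,
        domain_folder=domain_folder,
        features=["domain_boundary", "domain_size"],
        on_failure="nan",
        return_df=True,
    )


# Prepare a folder with one chopping file for a 10-residue toy entry.
domain_folder = write_chopping_files({"P1": "1-5,6-10"}, tempfile.mkdtemp())

# With aaanalysis installed, the call would look like this:
# dict_domains, df_seq_out = encode_example({"P1": "MKTAYIAKQR"}, domain_folder)
```

The entry names in df_seq must match the file stems in domain_folder, since files are looked up by entry name.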

Returns:

  • dict_domains (dict[str, np.ndarray]) – Mapping {entry: ndarray of shape (L_entry, D_total)} with per-residue domain-derived features; columns follow the order of features.

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean domain_ok column.

Raises:
  • ValueError – On invalid arguments or feature keys not in this method’s registry slice.

  • RuntimeError – If any entry failed under on_failure='raise'.
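The three on_failure policies can be illustrated with a small stand-alone sketch. apply_on_failure is a hypothetical function, not the library's code, and it uses nested lists in place of NumPy arrays to stay dependency-free; a failed entry is represented by None.

```python
import math


def apply_on_failure(results, lengths, n_features, on_failure="nan"):
    """Resolve failed entries; `results` maps entry -> matrix or None (failed)."""
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(f"Unknown on_failure: {on_failure!r}")

    failed = [entry for entry, value in results.items() if value is None]
    if failed and on_failure == "raise":
        raise RuntimeError(f"Domain encoding failed for: {failed}")

    resolved = {}
    for entry, value in results.items():
        if value is not None:
            resolved[entry] = value
        elif on_failure == "nan":
            # NaN-only placeholder with one row per residue.
            resolved[entry] = [
                [math.nan] * n_features for _ in range(lengths[entry])
            ]
        # on_failure == "drop": the failed entry is simply left out.
    return resolved
```

With 'nan', the output keeps one matrix per input entry, which is convenient when downstream code expects the same entries as df_seq; with 'drop', the caller has to realign against the remaining entries.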

Notes

  • AAanalysis deliberately does NOT bundle a segmentation tool runtime (no PyTorch, no model weights, no pinned Merizo / ChainSaw / AFragmenter). This keeps aaanalysis[pro] lean: pre-run the tool of your choice, then ingest its chopping output here.

  • Merizo: https://github.com/psipred/Merizo (~2 s per 425-residue chain on CPU, bundled weights, pip-installable).

  • ChainSaw: https://github.com/JudeWells/Chainsaw (manual install).

  • Output chopping strings: both tools write the same chopping column in their TSV output, so their files are drop-in compatible.