aaanalysis.StructurePreprocessor.encode_domains
- StructurePreprocessor.encode_domains(df_seq=None, domain_folder=None, features=None, on_failure='nan', return_df=False, verbose=None)
Read pre-computed domain segmentation files into dict_domains.
Bring-your-own-segmentation: the user pre-runs Merizo, ChainSaw, or AFragmenter on their PDB files (or prepares a hand-curated domain table) and saves the chopping string (Merizo/ChainSaw native format) to one file per entry in domain_folder. Two file formats are accepted by the resolver (looked up by entry name):
<entry>.txt: the first non-empty line is the chopping string, e.g. 6-18_296-459,19-156.
<entry>.tsv: Merizo/ChainSaw TSV output with a chopping header (first data row used).
The chopping format: domains are separated by commas, and segments within a domain are separated by underscores; each segment is a 1-based inclusive start-end range. Discontinuous domains are supported.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column. len(sequence) is used as the L for each per-residue output tensor.
domain_folder (str or pathlib.Path) – Directory containing one chopping file per row of df_seq.
features (list of str) – Feature keys belonging to encode_domains: any subset of {domain_boundary, domain_relative_position, domain_size, n_domains_in_protein}. All outputs are normalized to [0, 1] (NaN for residues unassigned to any domain).
on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose chopping file is missing or unparseable. 'nan' fills with NaN-only tensors; 'drop' removes those entries; 'raise' re-raises the error.
return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element (dict_domains, df_seq_out). If False (default), return only dict_domains.
verbose (bool, optional) – Override instance verbosity for this call only.
- Returns:
dict_domains (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue domain-derived features in the order of features.
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean domain_ok column.
- Raises:
ValueError – On invalid arguments or feature keys not in this method’s registry slice.
RuntimeError – If any entry failed under on_failure='raise'.
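The chopping format described above (commas between domains, underscores between segments, 1-based inclusive start-end ranges) can be illustrated with a small standalone parser. This is a minimal sketch using only the standard library, not the resolver that AAanalysis uses internally; parse_chopping and residue_domain_index are hypothetical helper names, and the 0-based domain index is an illustrative output, not one of the documented features.

```python
import math


def parse_chopping(chopping):
    """Parse a chopping string into a list of domains (hypothetical helper).

    Each domain is a list of (start, end) tuples, 1-based inclusive.
    """
    domains = []
    for domain_str in chopping.strip().split(","):
        segments = []
        for segment_str in domain_str.split("_"):
            start_str, end_str = segment_str.split("-")
            start, end = int(start_str), int(end_str)
            if start < 1 or end < start:
                raise ValueError(f"Invalid segment: {segment_str!r}")
            segments.append((start, end))
        domains.append(segments)
    return domains


def residue_domain_index(chopping, seq_len):
    """Return a list of length seq_len with the 0-based domain index per
    residue, or NaN for residues not assigned to any domain."""
    assignment = [math.nan] * seq_len
    for domain_idx, segments in enumerate(parse_chopping(chopping)):
        for start, end in segments:
            # Convert 1-based inclusive positions to 0-based list indices,
            # ignoring positions beyond the sequence length.
            for pos in range(start - 1, min(end, seq_len)):
                assignment[pos] = domain_idx
    return assignment


# Example string from the text: the first domain is discontinuous.
domains = parse_chopping("6-18_296-459,19-156")
assignment = residue_domain_index("6-18_296-459,19-156", seq_len=460)
```

Here domains holds two entries, the first with two segments, and residues outside every segment (for example positions 1-5 and 157-295) stay NaN, mirroring the NaN convention described for unassigned residues.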
Notes
AAanalysis deliberately does NOT bundle a segmentation tool runtime (no PyTorch, no model weights, no Merizo / ChainSaw / AFragmenter pinned). This keeps aaanalysis[pro] lean: pre-run the tool of your choice, then ingest its chopping output here.
Merizo: https://github.com/psipred/Merizo (~2 s per 425-residue chain on CPU, bundled weights, pip-installable).
ChainSaw: https://github.com/JudeWells/Chainsaw (manual install).
Output chopping strings: same chopping column in both tools’ TSV output, drop-in compatible.
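As a rough illustration of the bring-your-own-segmentation workflow, the sketch below writes one <entry>.txt file per entry into a domain folder, with the chopping string on the first line. The entry names, chopping values, and the write_chopping_files helper are hypothetical. The trailing call to encode_domains is left commented out because the StructurePreprocessor constructor arguments and the df_seq contents are assumptions here.

```python
import tempfile
from pathlib import Path

# Hypothetical hand-curated table: entry name -> chopping string
curated = {
    "P12345": "6-18_296-459,19-156",
    "Q67890": "1-120",
}


def write_chopping_files(curated, domain_folder):
    """Write one <entry>.txt per entry, chopping string on the first line."""
    folder = Path(domain_folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for entry, chopping in curated.items():
        path = folder / f"{entry}.txt"
        path.write_text(chopping + "\n")
        paths.append(path)
    return paths


# Use a temporary directory so the example leaves no files behind in the
# working directory.
domain_folder = Path(tempfile.mkdtemp()) / "domains"
paths = write_chopping_files(curated, domain_folder)

# The folder could then be passed to encode_domains (not run here; the
# constructor call and df_seq are assumptions):
# import aaanalysis as aa
# sp = aa.StructurePreprocessor()
# dict_domains, df_seq_out = sp.encode_domains(
#     df_seq=df_seq,
#     domain_folder=domain_folder,
#     features=["domain_boundary", "domain_size"],
#     on_failure="nan",
#     return_df=True,
# )
```

The file names must match the values in the entry column of df_seq, since the resolver looks files up by entry name.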