StructurePreprocessor.encode_domains
- StructurePreprocessor.encode_domains(df_seq, domain_folder=None, *, features, on_failure='nan', return_df=False)
Read pre-computed domain segmentation files into dict_domains.
Bring-your-own-segmentation: the user runs Merizo, ChainSaw, or AFragmenter on their PDB files (or prepares a hand-curated domain table) and saves the chopping string (Merizo/ChainSaw native format) to one file per entry in domain_folder. The resolver looks files up by entry name and accepts two formats:
- <entry>.txt: the first non-empty line is the chopping string, e.g. 6-18_296-459,19-156.
- <entry>.tsv: Merizo/ChainSaw TSV output with a chopping header (the first data row is used).
Chopping format: domains are separated by commas, and segments within a domain are separated by underscores. Each segment is a 1-based inclusive start-end range. Discontinuous domains are supported.
Added in version 1.1.0.
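The chopping format described above can be read with a few lines of plain Python. The following is a minimal sketch, not the library's own parser: the helper name parse_chopping and its validation rules are assumptions made for illustration.

```python
def parse_chopping(chopping):
    """Parse a chopping string into a list of domains.

    Hypothetical helper (not part of AAanalysis). Each domain is a list of
    (start, end) tuples, where both positions are 1-based and inclusive.
    """
    domains = []
    # Domains are separated by commas
    for domain_str in chopping.strip().split(","):
        segments = []
        # Segments within a domain are separated by underscores
        for segment_str in domain_str.split("_"):
            start_str, end_str = segment_str.split("-")
            start, end = int(start_str), int(end_str)
            if start < 1 or end < start:
                raise ValueError(f"Invalid segment: {segment_str!r}")
            segments.append((start, end))
        domains.append(segments)
    return domains


# The example string from the description: two domains, the first discontinuous
domains = parse_chopping("6-18_296-459,19-156")

# Number of residues per domain (inclusive ranges, hence the +1)
domain_sizes = [sum(end - start + 1 for start, end in segs) for segs in domains]
```

Here the first domain consists of two segments (6-18 and 296-459), which is what "discontinuous" means in this format.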
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column. len(sequence) is used as the L for each per-residue output tensor.
domain_folder (str or pathlib.Path) – Directory containing one chopping file per row of df_seq.
features (list of str) – Feature keys belonging to encode_domains: any subset of {domain_boundary, domain_relative_position, domain_size, n_domains_in_protein}. All outputs are normalized to [0, 1] (NaN for residues not assigned to any domain).
on_failure ({'nan', 'drop', 'raise'}, default='nan') – What to do for entries whose chopping file is missing or unparseable. 'nan' fills with NaN-only tensors; 'drop' removes those entries; 'raise' re-raises.
return_df (bool, default=False) – If True, also return the per-row status DataFrame as a second element (dict_num, df_seq_out). If False (default), return only dict_num.
- Returns:
dict_domains (dict[str, np.ndarray]) – {entry: (L_entry, D_total) ndarray} per-residue domain-derived features in the order of features.
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean domain_ok column.
- Raises:
ValueError – On invalid arguments or feature keys not in this method’s registry slice.
RuntimeError – If any entry failed under on_failure='raise'.
Notes
AAanalysis deliberately does NOT bundle a segmentation tool runtime (no PyTorch, no model weights, no Merizo / ChainSaw / AFragmenter pinned). This keeps aaanalysis[pro] lean: pre-run the tool of your choice, then ingest its chopping output here.
- Merizo [Lau23] (invariant-point-attention residue clustering): https://github.com/psipred/Merizo (~2 s per 425-residue chain on CPU, bundled weights, pip-installable).
- ChainSaw [Wells24] (fully-convolutional boundary prediction): https://github.com/JudeWells/Chainsaw (manual install).
- Output chopping strings: both tools write the same chopping column in their TSV output, so their files are drop-in compatible.
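Since both tools expose a chopping column, pulling the string out of a TSV takes only the standard library. The sketch below assumes a tab-separated file with a header row containing chopping; the helper name read_chopping_from_tsv and the other column names in the sample are hypothetical and are not taken from either tool's documentation.

```python
import csv
import io


def read_chopping_from_tsv(handle):
    """Return the chopping string from the first data row of a TSV.

    Hypothetical helper (not part of AAanalysis). `handle` is any open
    text file or file-like object.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    first_row = next(reader, None)  # only the first data row is used
    if first_row is None:
        raise ValueError("TSV has no data rows")
    chopping = first_row.get("chopping")
    if not chopping or not chopping.strip():
        raise ValueError("TSV has no usable 'chopping' column")
    return chopping.strip()


# Illustrative TSV content; the 'entry' and 'ndom' columns are made up
sample_tsv = "entry\tndom\tchopping\nP12345\t2\t6-18_296-459,19-156\n"
chopping = read_chopping_from_tsv(io.StringIO(sample_tsv))
```

With a real file, the same function could be called as read_chopping_from_tsv(open(path, newline="")).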
Examples
encode_domains turns domain-segmentation files (from get_domains) into a [0, 1]-normalized dict_num: per-residue domain_boundary, domain_relative_position, domain_size, and n_domains_in_protein. The call requires the domain files produced upstream, so it is shown illustratively.
import aaanalysis as aa
# stp = aa.StructurePreprocessor()
# dict_dom = stp.encode_domains(df_seq=df_seq,
#                               domain_folder='domains/',
#                               features=['domain_boundary', 'domain_relative_position'])
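For a hand-curated domain table, the domain folder can be prepared with the standard library alone. This is a minimal sketch under stated assumptions: the helper write_chopping_files, the entry names, and the chopping strings are hypothetical. It only follows the <entry>.txt layout described above, where the first non-empty line holds the chopping string.

```python
import tempfile
from pathlib import Path


def write_chopping_files(dict_chopping, domain_folder):
    """Write one <entry>.txt chopping file per entry.

    Hypothetical helper (not part of AAanalysis). Returns the written paths.
    """
    folder = Path(domain_folder)
    folder.mkdir(parents=True, exist_ok=True)
    paths = []
    for entry, chopping in dict_chopping.items():
        path = folder / f"{entry}.txt"
        # The chopping string goes on the first line of the file
        path.write_text(chopping.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Made-up entries and chopping strings, for illustration only
dict_chopping = {"P12345": "6-18_296-459,19-156", "Q67890": "1-120"}

with tempfile.TemporaryDirectory() as tmp:
    paths = write_chopping_files(dict_chopping, Path(tmp) / "domains")
    # Read the files back to confirm what was written
    written = {p.name: p.read_text(encoding="utf-8").strip() for p in paths}
```

The keys of dict_chopping would need to match the entry column of df_seq, because files are looked up by entry name.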
Further parameters.
StructurePreprocessor.encode_domains also accepts:
- on_failure: What to do for entries whose chopping file is missing or unparseable.
- return_df: If True, also return the per-row status DataFrame as a second element (dict_num, df_seq_out).