aaanalysis.NumericalFeature.get_parts

static NumericalFeature.get_parts(df_seq=None, dict_num=None, list_parts=None, all_parts=False, jmd_n_len=10, jmd_c_len=10, tmd_len=None)

Prepare CPP numerical-mode inputs by slicing sequences AND per-residue tensors with shared boundaries.

Numerical analog of SequenceFeature.get_df_parts() for the CPP.run_num workflow: the same (start, end) boundaries used to slice the sequence STRINGS into parts are reused to slice each entry’s per-residue tensor dict_num[entry] along the L axis. Both results are returned from a single call, so df_seq, tmd_len, jmd_n_len, and jmd_c_len never have to be passed to two separate helpers.
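The shared-boundary idea can be illustrated without the library. The following is a minimal sketch, not the aaanalysis implementation: the helper slice_with_shared_bounds, the toy tensor, and the part boundaries are all hypothetical and chosen only to show that one (start, end) pair drives both slices.

```python
import numpy as np

def slice_with_shared_bounds(seq, arr, start, end):
    """Slice a sequence string and its (L, D) array with the same
    0-based, end-exclusive boundaries (hypothetical helper)."""
    if arr.shape[0] != len(seq):
        raise ValueError("array rows must match sequence length")
    return seq[start:end], arr[start:end]

seq = "ACDEFGHIKLMNPQRSTVWY"
# Toy per-residue tensor with L=20 rows and D=2 columns
arr = np.arange(len(seq) * 2, dtype=float).reshape(len(seq), 2)

# Assumed layout for illustration only: JMD-N = first 5 residues,
# TMD = next 10 residues, JMD-C = last 5 residues
bounds = {"jmd_n": (0, 5), "tmd": (5, 15), "jmd_c": (15, 20)}
parts = {name: slice_with_shared_bounds(seq, arr, s, e)
         for name, (s, e) in bounds.items()}

tmd_str, tmd_arr = parts["tmd"]
print(tmd_str, tmd_arr.shape)
```

Because the string and the array are cut with the same indices, row i of each array slice always describes character i of the matching string slice.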

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Must also carry tmd_start / tmd_stop columns (the position-based schema) so the slicing boundaries can be computed.

  • dict_num (dict[str, np.ndarray]) – Mapping entry -> (L, D) per-residue numerical tensor, where L matches len(df_seq.loc[entry, 'sequence']) and D is consistent across all entries. Source: PLM embeddings, DSSP one-hots, PTM dummies, or any other per-residue numerical representation.

  • list_parts (list of str, optional) – Subset of part names to materialize (e.g. ["tmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]). Defaults to the same default as SequenceFeature.get_df_parts().

  • all_parts (bool, default=False) – If True, return all available parts; ignored when list_parts is supplied.

  • jmd_n_len (int, default=10) – Length of JMD-N (>=0).

  • jmd_c_len (int, default=10) – Length of JMD-C (>=0).

  • tmd_len (int, optional) – TMD length, used only for the anchor-based format (a df_seq with sequence and pos columns). Each 1-based anchor in pos is exploded into its own row, with the TMD centered on the anchor (right-heavy for even tmd_len), and the matching dict_num tensor is sliced with the same boundaries. Ignored for the position-based schema.
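For the anchor-based format described under tmd_len, the following sketch shows one possible reading of "exploded into one row" and "right-heavy". It is an assumption, not the library's code: the helper anchor_to_bounds is hypothetical, pos is assumed to hold a list of integer anchors per entry, and "right-heavy" is interpreted as placing the extra residue to the right of the anchor when tmd_len is even.

```python
import pandas as pd

def anchor_to_bounds(anchor, tmd_len):
    """Return 1-based inclusive (tmd_start, tmd_stop) around `anchor`.

    Assumption: for even tmd_len the window holds one more residue
    to the right of the anchor than to the left.
    """
    n_left = (tmd_len - 1) // 2
    n_right = tmd_len - 1 - n_left
    return anchor - n_left, anchor + n_right

# Assumed anchor-based input: one list of 1-based anchors per entry
df_seq = pd.DataFrame({"entry": ["P0", "P1"],
                       "sequence": ["ACDEFGHIKLMNPQRSTVWY"] * 2,
                       "pos": [[5, 12], [8]]})

# One row per anchor; entry and sequence are repeated for each anchor
df_rows = df_seq.explode("pos", ignore_index=True)
bounds = [anchor_to_bounds(int(a), tmd_len=4) for a in df_rows["pos"]]
df_rows["tmd_start"] = [b[0] for b in bounds]
df_rows["tmd_stop"] = [b[1] for b in bounds]
print(df_rows[["entry", "pos", "tmd_start", "tmd_stop"]])
```

With an odd tmd_len the window is symmetric under this reading; whether the library resolves even lengths exactly this way should be checked against its output.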

Returns:

  • df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Per-part sequence STRINGS, same shape as SequenceFeature.get_df_parts()’s output. Pass directly to CPP(df_parts=df_parts, ...).

  • dict_num_parts (dict[str, np.ndarray]) – Per-part NaN-padded numerical tensors. Each value has shape (n_samples, L_part_max, D) aligned row-for-row with the df_parts index. Pass directly to cpp.run_num(dict_num_parts=dict_num_parts, ...).

Raises:

ValueError – If dict_num is missing an entry from df_seq, any dict_num[entry] has a row count that differs from the sequence length, or D varies across entries.
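The three conditions above can be checked before calling get_parts, which may give clearer error messages on large inputs. This is a minimal sketch under stated assumptions: check_dict_num is a hypothetical helper, not part of aaanalysis, and it only mirrors the conditions as they are documented here.

```python
import numpy as np
import pandas as pd

def check_dict_num(df_seq, dict_num):
    """Raise ValueError on a missing entry, a row-count mismatch,
    or an inconsistent D; return the shared D (hypothetical helper)."""
    dims = set()
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            raise ValueError(f"dict_num is missing entry {entry!r}")
        arr = np.asarray(dict_num[entry])
        if arr.ndim != 2 or arr.shape[0] != len(seq):
            raise ValueError(f"{entry!r}: expected ({len(seq)}, D), got {arr.shape}")
        dims.add(arr.shape[1])
    if len(dims) > 1:
        raise ValueError(f"D varies across entries: {sorted(dims)}")
    return dims.pop() if dims else None

df_seq = pd.DataFrame({"entry": ["P0", "P1"], "sequence": ["ACDEF", "GHIKLMN"]})
dict_num = {"P0": np.zeros((5, 3)), "P1": np.zeros((7, 3))}
n_dims = check_dict_num(df_seq, dict_num)  # consistent input passes

def raises_value_error(bad):
    """Report whether the check rejects the given mapping."""
    try:
        check_dict_num(df_seq, bad)
    except ValueError:
        return True
    return False

missing = raises_value_error({"P0": dict_num["P0"]})
wrong_len = raises_value_error({**dict_num, "P1": np.zeros((6, 3))})
wrong_dim = raises_value_error({**dict_num, "P1": np.zeros((7, 4))})
```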

Notes

  • dict_num_parts carries NaN padding in the trailing rows for entries whose JMD doesn’t fit the requested length. The corresponding per-part string in df_parts is padded with '-' (gap) in the same way, so the two outputs stay aligned. The real per-(entry, part) length is recoverable as the non-gap character count in df_parts.

  • For seq-only mode (no per-residue tensor), use SequenceFeature.get_df_parts() directly; NumericalFeature.get_parts is only useful when you have a dict_num to slice.
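The padding convention in the first note can be handled with plain NumPy. The sketch below builds a toy part by hand rather than calling the library; the strings, the tensor values, and D=2 are made up for illustration. It recovers the real length from the gap count and uses a NaN-aware mean so padded rows do not bias the result.

```python
import numpy as np

# Toy stand-in for one column of df_parts: the second entry is gap-padded
part_strings = ["ACDEFGHIKL", "ACDEFG----"]

# Toy stand-in for the matching dict_num_parts value, shape (2, 10, D=2)
tensor = np.full((2, 10, 2), np.nan)
tensor[0, :10] = 1.0   # all 10 residues are real
tensor[1, :6] = 2.0    # only the first 6 residues are real

# Real length per entry = number of non-gap characters
real_len = np.array([len(s) - s.count("-") for s in part_strings])

# Mask of real residues: a row is padding when every feature is NaN
valid = ~np.isnan(tensor).all(axis=-1)   # shape (n_samples, L_part_max)

# NaN-aware mean over the residue axis ignores the padded rows
part_mean = np.nanmean(tensor, axis=1)   # shape (n_samples, D)
print(real_len, valid.sum(axis=1), part_mean)
```

Counting non-gap characters does not depend on where the padding sits, so it stays valid even if the padding position differs between parts.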

Examples

NumericalFeature.get_parts slices a per-residue dict_num into the jmd_n / tmd / jmd_c parts that CPP.run_num() consumes, returning the (df_parts, dict_num_parts) pair aligned to df_seq.

import numpy as np
import pandas as pd
import aaanalysis as aa
aa.options["verbose"] = False

seqs = ["ACDEFGHIKLMNPQRSTVWY" * 3] * 4   # length 60
df_seq = pd.DataFrame({"entry": [f"P{i}" for i in range(4)], "sequence": seqs,
                       "tmd_start": 11, "tmd_stop": 50})
rng = np.random.default_rng(0)
dict_num = {e: rng.random((60, 4)) for e in df_seq["entry"]}

df_parts, dict_num_parts = aa.NumericalFeature().get_parts(df_seq=df_seq, dict_num=dict_num)
df_parts.head()
                                            tmd                     jmd_n_tmd_n                     tmd_c_jmd_c
entry
P0     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
P1     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
P2     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
P3     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY

dict_num_parts is keyed by part name; each tensor is aligned row-for-row with df_parts. The shapes of the first two parts:

{part: arr.shape for part, arr in list(dict_num_parts.items())[:2]}
{'tmd': (4, 40, 4), 'jmd_n_tmd_n': (4, 30, 4)}