NumericalFeature.get_parts

static NumericalFeature.get_parts(df_seq, dict_num, list_parts=None, all_parts=False, jmd_n_len=10, jmd_c_len=10, tmd_len=None)

Prepare Comparative Physicochemical Profiling (CPP) numerical-mode inputs by slicing sequences AND per-residue tensors with shared boundaries.

Numerical analog of SequenceFeature.get_df_parts() for the CPP.run_num workflow: the same (start, end) boundaries used to slice the sequence STRINGS into parts are reused to slice each entry’s dict_num[entry] per-residue tensor along the L axis. Returns both results from one call so the user never has to pass df_seq + tmd_len + jmd_n_len + jmd_c_len to two separate helpers.
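The shared-boundary idea can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: slice_shared is a hypothetical helper, and it assumes 1-based inclusive boundaries, which is consistent with the tmd_start / tmd_stop values and output in the Examples section.

```python
import numpy as np

def slice_shared(seq, arr, start, stop):
    """Slice a sequence string and its (L, D) array with the same bounds.

    start / stop are assumed to be 1-based and inclusive (hypothetical helper).
    """
    if arr.shape[0] != len(seq):
        raise ValueError("arr must have one row per residue")
    i, j = start - 1, stop  # convert to 0-based, half-open
    return seq[i:j], arr[i:j]

seq = "ACDEFGHIKLMNPQRSTVWY" * 3                      # length 60
arr = np.arange(len(seq) * 2, dtype=float).reshape(len(seq), 2)
tmd_str, tmd_num = slice_shared(seq, arr, start=11, stop=50)
print(len(tmd_str), tmd_num.shape)
```

Because one pair of indices drives both slices, the string part and the numerical part cannot drift out of register.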

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. To compute the slicing boundaries it must also carry either tmd_start / tmd_stop columns (the position-based schema) or a pos column used together with tmd_len (the anchor-based format).
dict_num (dict[str, np.ndarray]) – Mapping entry -> (L, D) per-residue numerical tensor, where L matches len(df_seq.loc[entry, 'sequence']) and D is consistent across all entries. Source: protein language model (PLM) embeddings, DSSP one-hots, post-translational modification (PTM) dummies, or any other per-residue numerical representation.
list_parts (list of str, optional) – Subset of part names to materialize (e.g. ["tmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]). If None, the default parts of SequenceFeature.get_df_parts() are used.
all_parts (bool, default=False) – If True, return all available parts; ignored when list_parts is supplied.
jmd_n_len (int, default=10) – Length of JMD-N (>=0).
jmd_c_len (int, default=10) – Length of JMD-C (>=0).
tmd_len (int, optional) – Target middle domain (TMD) length for the anchor-based format only (a sequence + pos df_seq): each 1-based anchor in pos is exploded into one row with the TMD centered (right-heavy for even tmd_len) on the anchor, and the matching dict_num tensor is sliced with the same boundaries. Ignored for the position-based schema.

Returns:

df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Per-part sequence STRINGS, same shape as SequenceFeature.get_df_parts()’s output. Pass directly to CPP(df_parts=df_parts, ...).
dict_num_parts (dict[str, np.ndarray]) – Per-part NaN-padded numerical tensors. Each value has shape (n_samples, L_part_max, D) aligned row-for-row with the df_parts index. Pass directly to cpp.run_num(dict_num_parts=dict_num_parts, ...).

Raises:

ValueError – If dict_num is missing an entry from df_seq, if any dict_num[entry] has a row count that differs from the length of that entry's sequence, or if D varies across entries.
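The three documented conditions can be checked up front before calling get_parts. The following is a minimal sketch under stated assumptions: check_dict_num is a hypothetical helper (not part of the library), and the error messages are illustrative only.

```python
import numpy as np
import pandas as pd

def check_dict_num(df_seq, dict_num):
    """Hypothetical pre-check mirroring the documented ValueError conditions."""
    dims = set()
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            raise ValueError(f"dict_num is missing entry {entry!r}")
        arr = np.asarray(dict_num[entry])
        if arr.ndim != 2 or arr.shape[0] != len(seq):
            raise ValueError(
                f"{entry!r}: expected shape ({len(seq)}, D), got {arr.shape}")
        dims.add(arr.shape[1])
    if len(dims) > 1:
        raise ValueError(f"D varies across entries: {sorted(dims)}")

df_seq = pd.DataFrame({"entry": ["P0", "P1"], "sequence": ["ACDEF", "GHIKLMN"]})
good = {"P0": np.zeros((5, 3)), "P1": np.zeros((7, 3))}
check_dict_num(df_seq, good)  # passes silently
```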

Notes

Call order: get_parts, then run_num. This is step 1 of the two-step numerical-mode workflow: it only slices df_seq + dict_num into (df_parts, dict_num_parts). Pass df_parts to the CPP constructor and dict_num_parts to CPP.run_num() (step 2); run_num has no raw-df_seq / dict_num entry point, so this order is the only supported one.
dict_num must already be [0, 1]-normalized. get_parts slices values verbatim; it does NOT rescale. The max_std_test pre-filter in CPP.run_num() is calibrated for the [0, 1] range; passing unbounded values (e.g. raw protein language model embeddings) leaves that pre-filter miscalibrated, so the feature funnel silently keeps/drops the wrong features (no error is raised). Normalize first, e.g. via EmbeddingPreprocessor.encode(), StructurePreprocessor, or AnnotationPreprocessor, all of which emit [0, 1] dict_num.
dict_num_parts carries NaN padding at the trailing rows for entries whose JMD doesn’t fit the requested length. The corresponding per-part string in df_parts also pads with '-' (gap), so the two outputs stay aligned. The real per-(entry, part) length is recoverable as the non-gap character count in df_parts.
For seq-only mode (no per-residue tensor), use SequenceFeature.get_df_parts() directly; NumericalFeature.get_parts is only useful when you have a dict_num to slice.
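For the [0, 1] requirement noted above, the following sketch shows one simple way to bring an unbounded dict_num into range when none of the listed preprocessors is used. It is an assumption-laden illustration: minmax_dict_num is a hypothetical helper that applies per-dimension min-max scaling pooled over all entries, which is not necessarily the scaling the library's preprocessors apply.

```python
import numpy as np

def minmax_dict_num(dict_num):
    """Scale each feature dimension to [0, 1] using min/max pooled over all entries."""
    arrays = {e: np.asarray(a, dtype=float) for e, a in dict_num.items()}
    stacked = np.concatenate(list(arrays.values()), axis=0)  # (sum of L, D)
    lo = stacked.min(axis=0)
    hi = stacked.max(axis=0)
    # Constant dimensions would divide by zero; map them to 0 instead
    span = np.where(hi > lo, hi - lo, 1.0)
    return {e: (a - lo) / span for e, a in arrays.items()}

rng = np.random.default_rng(0)
raw = {f"P{i}": rng.normal(size=(60, 4)) for i in range(4)}  # unbounded toy values
scaled = minmax_dict_num(raw)
all_vals = np.concatenate(list(scaled.values()))
print(all_vals.min(), all_vals.max())
```

Pooling the minimum and maximum over all entries keeps values comparable between proteins; scaling each entry separately would not.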

See also

SequenceFeature.get_df_parts() — string-only analog.
CPP.run_num() — consumes df_parts (via constructor) and dict_num_parts (per call).

Examples

NumericalFeature.get_parts slices a per-residue dict_num into the jmd_n / tmd / jmd_c parts that CPP.run_num() consumes, returning the (df_parts, dict_num_parts) pair aligned to df_seq.

import numpy as np
import pandas as pd
import aaanalysis as aa
aa.options["verbose"] = False

seqs = ["ACDEFGHIKLMNPQRSTVWY" * 3] * 4   # length 60
df_seq = pd.DataFrame({"entry": [f"P{i}" for i in range(4)], "sequence": seqs,
                       "tmd_start": 11, "tmd_stop": 50})
rng = np.random.default_rng(0)
dict_num = {e: rng.random((60, 4)) for e in df_seq["entry"]}

df_parts, dict_num_parts = aa.NumericalFeature().get_parts(df_seq=df_seq, dict_num=dict_num)
df_parts.head()

	tmd	jmd_n_tmd_n	tmd_c_jmd_c
entry
P0	MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
P1	MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
P2	MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
P3	MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	MNPQRSTVWYACDEFGHIKLMNPQRSTVWY

dict_num_parts carries the per-part tensors aligned to df_parts:

{part: v.shape for part, v in list(dict_num_parts.items())[:2]}

{'tmd': (4, 40, 4), 'jmd_n_tmd_n': (4, 30, 4)}
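When some entries are shorter than L_part_max, the trailing rows are NaN (see Notes). The sketch below uses a hand-built toy tensor and toy part strings (both hypothetical, not produced by get_parts) to show how the real per-sample length can be recovered from either output and how NaN-aware reductions skip the padding.

```python
import numpy as np

# Toy stand-ins for one part: gap-padded strings and a NaN-padded (n_samples, L, D) tensor
parts = ["ACDEFGHIKL", "ACDEFG----"]
x = np.full((2, 10, 3), np.nan)
x[0] = 0.5          # sample 0 fills all 10 positions
x[1, :6] = 0.25     # sample 1 has 6 real residues, 4 padded rows

valid = ~np.isnan(x).all(axis=2)           # (n_samples, L): True where a residue exists
len_num = valid.sum(axis=1)                # real length from the tensor
len_str = np.array([len(p.replace("-", "")) for p in parts])  # real length from the string
means = np.nanmean(x, axis=1)              # per-sample mean over residues, ignoring padding
print(len_num, len_str, means[1])
```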

Further parameters. NumericalFeature.get_parts also accepts: list_parts, a subset of part names to materialize (e.g. ["tmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]); all_parts, which returns all available parts if True and is ignored when list_parts is supplied; jmd_n_len, the length of JMD-N (>=0); jmd_c_len, the length of JMD-C (>=0); and tmd_len, the target middle domain (TMD) length for the anchor-based format only (a sequence + pos df_seq), where each 1-based anchor in pos is exploded into one row with the TMD centered (right-heavy for even tmd_len) on the anchor and the matching dict_num tensor is sliced with the same boundaries.

nf = aa.NumericalFeature()
# all_parts=True returns every available part; jmd_n_len / jmd_c_len set the flank lengths
df_parts_all, _ = nf.get_parts(df_seq=df_seq, dict_num=dict_num,
                               all_parts=True, jmd_n_len=10, jmd_c_len=10)
print("all parts:", list(df_parts_all.columns))
# list_parts selects a subset of parts (and overrides all_parts)
df_parts_tmd, _ = nf.get_parts(df_seq=df_seq, dict_num=dict_num, list_parts=["tmd"])
print("subset:", list(df_parts_tmd.columns))
# tmd_len drives the anchor-based format (a 'sequence' + 'pos' df_seq): each 1-based anchor
# is exploded into a TMD of length tmd_len centered on it
df_seq_anchor = pd.DataFrame({"entry": [f"P{i}" for i in range(4)], "sequence": seqs,
                              "pos": [20, 25, 30, 35]})
df_parts_anchor, dict_num_parts_anchor = nf.get_parts(df_seq=df_seq_anchor, dict_num=dict_num,
                                                      tmd_len=10)
aa.display_df(df_parts_anchor, n_rows=10, show_shape=True)

all parts: ['tmd', 'tmd_n', 'tmd_c', 'jmd_n', 'jmd_c', 'tmd_jmd', 'jmd_n_tmd_n', 'tmd_c_jmd_c']
subset: ['tmd']
DataFrame shape: (4, 3)

	tmd	jmd_n_tmd_n	tmd_c_jmd_c
P0_6-35	STVWYACDEF	GHIKLMNPQRSTVWY	ACDEFGHIKLMNPQR
P1_11-40	ACDEFGHIKL	MNPQRSTVWYACDEF	GHIKLMNPQRSTVWY
P2_16-45	GHIKLMNPQR	STVWYACDEFGHIKL	MNPQRSTVWYACDEF
P3_21-50	MNPQRSTVWY	ACDEFGHIKLMNPQR	STVWYACDEFGHIKL
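The index labels and tmd column above can be reproduced by hand. The following is a minimal sketch of the anchor arithmetic, inferred from this example output rather than taken from the library source: anchor_bounds is a hypothetical helper, it assumes 1-based inclusive positions, and it does not handle anchors so close to a sequence end that the window would overrun it.

```python
def anchor_bounds(pos, tmd_len, jmd_n_len=10, jmd_c_len=10):
    """Return (start, tmd_start, tmd_stop, stop), all 1-based inclusive.

    For even tmd_len the extra residue falls on the C-terminal side of the
    anchor ("right-heavy"); for odd tmd_len the anchor is the exact center.
    """
    tmd_start = pos - (tmd_len - 1) // 2
    tmd_stop = tmd_start + tmd_len - 1
    return tmd_start - jmd_n_len, tmd_start, tmd_stop, tmd_stop + jmd_c_len

seq = "ACDEFGHIKLMNPQRSTVWY" * 3  # same toy sequence as above
rows = []
for i, pos in enumerate([20, 25, 30, 35]):
    start, tmd_start, tmd_stop, stop = anchor_bounds(pos, tmd_len=10)
    rows.append((f"P{i}_{start}-{stop}", seq[tmd_start - 1:tmd_stop]))
    print(*rows[-1])
```

With pos=20 and tmd_len=10 the TMD spans positions 16 to 25 (four residues before the anchor, five after), and the 10-residue flanks extend the window to 6-35.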