aaanalysis.NumericalFeature.get_parts
- static NumericalFeature.get_parts(df_seq=None, dict_num=None, list_parts=None, all_parts=False, jmd_n_len=10, jmd_c_len=10, tmd_len=None)
Prepare CPP numerical-mode inputs by slicing sequences AND per-residue tensors with shared boundaries.
Numerical analog of SequenceFeature.get_df_parts() for the CPP.run_num workflow: the same (start, end) boundaries used to slice the sequence STRINGS into parts are reused to slice each entry's dict_num[entry] per-residue tensor along the L axis. Returns both results from one call so the user never has to pass df_seq + tmd_len + jmd_n_len + jmd_c_len to two separate helpers.
Added in version 1.1.0.
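The shared-boundary idea can be sketched in plain NumPy. This is an illustration only, not the library's implementation: the helper name slice_with_shared_boundaries is hypothetical, and the 1-based, inclusive interpretation of (start, stop) is an assumption chosen to agree with the tmd_start=11 / tmd_stop=50 example on this page.

```python
import numpy as np


def slice_with_shared_boundaries(sequence, tensor, start, stop):
    """Slice a sequence string and its (L, D) tensor with one window.

    Hypothetical helper. Assumes start/stop are 1-based and inclusive.
    """
    if tensor.shape[0] != len(sequence):
        raise ValueError("tensor must have one row per residue")
    lo, hi = start - 1, stop  # 1-based inclusive -> 0-based half-open
    return sequence[lo:hi], tensor[lo:hi]


sequence = "ACDEFGHIKLMNPQRSTVWY" * 3  # length 60
tensor = np.arange(60 * 4, dtype=float).reshape(60, 4)  # toy (L, D) tensor

# The same boundaries drive both the string slice and the tensor slice.
tmd_str, tmd_num = slice_with_shared_boundaries(sequence, tensor, start=11, stop=50)
jmd_n_str, jmd_n_num = slice_with_shared_boundaries(sequence, tensor, start=1, stop=10)
```

Because both slices come from one pair of indices, row i of the sliced tensor always describes character i of the sliced string.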
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Must also carry tmd_start / tmd_stop columns (the position-based schema) so the slicing boundaries can be computed.
dict_num (dict[str, np.ndarray]) – Mapping entry -> (L, D) per-residue numerical tensor, where L matches len(df_seq.loc[entry, 'sequence']) and D is consistent across all entries. Source: PLM embeddings, DSSP one-hots, PTM dummies, or any other per-residue numerical representation.
list_parts (list of str, optional) – Subset of part names to materialize (e.g. ["tmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]). Defaults to the same default as SequenceFeature.get_df_parts().
all_parts (bool, default=False) – If True, return all available parts; ignored when list_parts is supplied.
jmd_n_len (int, default=10) – Length of JMD-N (>=0).
jmd_c_len (int, default=10) – Length of JMD-C (>=0).
tmd_len (int, optional) – TMD length for the anchor-based format only (a sequence + pos df_seq): each 1-based anchor in pos is exploded into one row with the TMD centered (right-heavy for even tmd_len) on the anchor, and the matching dict_num tensor is sliced with the same boundaries. Ignored for the position-based schema.
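The anchor-based format described for tmd_len can be pictured with a small sketch. Everything here is an assumption made for illustration: the helper names are hypothetical, "right-heavy" is read as "for an even tmd_len the extra residue goes to the right of the anchor", and windows that run off the sequence simply raise an error. The library's actual rules may differ.

```python
def anchor_to_tmd_window(anchor, tmd_len):
    """Return a 1-based inclusive (start, stop) window around an anchor.

    Assumption: for even tmd_len the window holds one more residue to the
    right of the anchor than to the left.
    """
    if tmd_len < 1:
        raise ValueError("tmd_len must be a positive integer")
    n_left = (tmd_len - 1) // 2
    n_right = tmd_len // 2
    return anchor - n_left, anchor + n_right


def explode_anchors(entry, sequence, anchors, tmd_len):
    """Turn one (sequence, anchors) record into one row per anchor."""
    rows = []
    for anchor in anchors:
        start, stop = anchor_to_tmd_window(anchor, tmd_len)
        if start < 1 or stop > len(sequence):
            # Sketch-only choice: refuse windows that leave the sequence.
            raise ValueError(f"window for anchor {anchor} leaves the sequence")
        rows.append({"entry": entry, "anchor": anchor,
                     "tmd_start": start, "tmd_stop": stop,
                     "tmd": sequence[start - 1:stop]})
    return rows


rows = explode_anchors("P0", "ACDEFGHIKLMNPQRSTVWY" * 3, anchors=[20, 40], tmd_len=4)
```

The resulting tmd_start / tmd_stop values could then be applied to the matching dict_num tensor in the same way as to the string.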
- Returns:
df_parts (pd.DataFrame, shape (n_samples, n_parts)) – Per-part sequence STRINGS, same shape as SequenceFeature.get_df_parts()'s output. Pass directly to CPP(df_parts=df_parts, ...).
dict_num_parts (dict[str, np.ndarray]) – Per-part NaN-padded numerical tensors. Each value has shape (n_samples, L_part_max, D) aligned row-for-row with the df_parts index. Pass directly to cpp.run_num(dict_num_parts=dict_num_parts, ...).
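To make the (n_samples, L_part_max, D) layout concrete, the following sketch stacks per-entry slices of unequal length into one NaN-padded array. The function name pad_and_stack is hypothetical, and placing the padding in the trailing rows is an assumption of this illustration rather than a statement about the library's code.

```python
import numpy as np


def pad_and_stack(slices):
    """Stack (L_i, D) arrays into one (n, L_max, D) array, NaN-padded at the end."""
    n_dims = {s.shape[1] for s in slices}
    if len(n_dims) != 1:
        raise ValueError("all slices must share the same feature dimension D")
    d = n_dims.pop()
    l_max = max(s.shape[0] for s in slices)
    out = np.full((len(slices), l_max, d), np.nan)
    for i, s in enumerate(slices):
        out[i, :s.shape[0], :] = s  # real rows first, NaN rows remain after
    return out


# Two entries for one part: the second is three residues shorter.
slices = [np.ones((10, 4)), np.ones((7, 4))]
stacked = pad_and_stack(slices)
```

A single rectangular array per part lets downstream code index by sample row without tracking ragged lengths.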
- Raises:
ValueError – If dict_num is missing an entry from df_seq, any dict_num[entry] has the wrong row count vs the sequence length, or D varies across entries.
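The three conditions above can be checked before calling the method. The sketch below is a hypothetical pre-flight helper (check_dict_num is not part of the library) that mirrors the documented conditions; the error messages are invented for illustration.

```python
import numpy as np


def check_dict_num(entries, sequences, dict_num):
    """Raise ValueError for the three documented failure modes (sketch only)."""
    missing = [e for e in entries if e not in dict_num]
    if missing:
        raise ValueError(f"dict_num is missing entries: {missing}")
    n_dims = set()
    for entry, seq in zip(entries, sequences):
        arr = np.asarray(dict_num[entry])
        if arr.ndim != 2 or arr.shape[0] != len(seq):
            raise ValueError(f"{entry}: expected ({len(seq)}, D), got {arr.shape}")
        n_dims.add(arr.shape[1])
    if len(n_dims) > 1:
        raise ValueError(f"D varies across entries: {sorted(n_dims)}")


entries, sequences = ["P0", "P1"], ["ACD", "ACDE"]
good = {"P0": np.zeros((3, 2)), "P1": np.zeros((4, 2))}
check_dict_num(entries, sequences, good)  # passes silently

# One deliberately broken input per failure mode.
bad_inputs = [
    {"P0": np.zeros((3, 2))},                          # missing entry
    {"P0": np.zeros((3, 2)), "P1": np.zeros((5, 2))},  # wrong row count
    {"P0": np.zeros((3, 2)), "P1": np.zeros((4, 3))},  # D varies
]
errors = []
for bad in bad_inputs:
    try:
        check_dict_num(entries, sequences, bad)
    except ValueError as exc:
        errors.append(str(exc))
```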
Notes
dict_num_parts carries NaN padding at the trailing rows for entries whose JMD doesn't fit the requested length. The corresponding per-part string in df_parts also pads with '-' (gap), so the two outputs stay aligned. The real per-(entry, part) length is recoverable as the non-gap character count in df_parts.
For seq-only mode (no per-residue tensor), use SequenceFeature.get_df_parts() directly; NumericalFeature.get_parts is only useful when you have a dict_num to slice.
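Recovering the real part length from the padding can be sketched as follows. The toy strings and tensor are made up for illustration and the helper names are hypothetical. The tensor-side count also assumes that a genuine residue row is never entirely NaN, which may not hold for every data source.

```python
import numpy as np


def part_lengths_from_strings(part_strings, gap="-"):
    """Real length per sample: number of non-gap characters."""
    return [len(s) - s.count(gap) for s in part_strings]


def part_lengths_from_tensor(part_tensor):
    """Real length per sample: rows of the (n, L, D) tensor that are not all-NaN."""
    real_rows = ~np.isnan(part_tensor).all(axis=2)
    return real_rows.sum(axis=1).tolist()


# Toy jmd_c part: the second sample is three residues short of the requested 10.
jmd_c_strings = ["MNPQRSTVWY", "MNPQRST---"]
jmd_c_tensor = np.full((2, 10, 4), np.nan)
jmd_c_tensor[0] = 1.0
jmd_c_tensor[1, :7] = 1.0

lengths_str = part_lengths_from_strings(jmd_c_strings)
lengths_num = part_lengths_from_tensor(jmd_c_tensor)
```

Comparing the two counts is a cheap consistency check that the string and tensor outputs are still aligned.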
See also
SequenceFeature.get_df_parts(): string-only analog.
CPP.run_num(): consumes df_parts (via constructor) and dict_num_parts (per call).
Examples
NumericalFeature.get_parts slices a per-residue dict_num into the jmd_n / tmd / jmd_c parts that CPP.run_num consumes, returning the (df_parts, dict_num_parts) pair aligned to df_seq.

    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    aa.options["verbose"] = False
    seqs = ["ACDEFGHIKLMNPQRSTVWY" * 3] * 4  # length 60
    df_seq = pd.DataFrame({"entry": [f"P{i}" for i in range(4)], "sequence": seqs, "tmd_start": 11, "tmd_stop": 50})
    rng = np.random.default_rng(0)
    dict_num = {e: rng.random((60, 4)) for e in df_seq["entry"]}
    df_parts, dict_num_parts = aa.NumericalFeature().get_parts(df_seq=df_seq, dict_num=dict_num)
    df_parts.head()

    entry  tmd                                       jmd_n_tmd_n                     tmd_c_jmd_c
    P0     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
    P1     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
    P2     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY
    P3     MNPQRSTVWYACDEFGHIKLMNPQRSTVWYACDEFGHIKL  ACDEFGHIKLMNPQRSTVWYACDEFGHIKL  MNPQRSTVWYACDEFGHIKLMNPQRSTVWY

dict_num_parts carries the per-part tensors aligned to df_parts:

    {e: v.shape for e, v in list(dict_num_parts.items())[:2]}

    {'tmd': (4, 40, 4), 'jmd_n_tmd_n': (4, 30, 4)}