SequenceFeature.get_df_parts_from_windows

static SequenceFeature.get_df_parts_from_windows(dict_parts)

Assemble a df_parts from per-part window sets (e.g. AAWindowSampler outputs).

Builds a reference df_parts by stitching together one window set per sequence part, so each part can be generated with its own recipe. This enables biologically motivated reference backgrounds in which the parts differ in physicochemical prior, e.g. a coil-propensity JMD-N, an alpha-helix TMD, and a coil-propensity JMD-C, each produced by a separate call to AAWindowSampler.sample_synthetic() with a different generator and window_size. The assembled frame is used as the reference class for CPP exactly like a real df_parts. This method does not sample sequences itself; it only consumes window sets produced by AAWindowSampler.

Added in version 1.1.0.

Parameters: dict_parts (dict) – Dictionary mapping each part name (one of aaanalysis.utils.LIST_ALL_PARTS, e.g. 'jmd_n', 'tmd', 'jmd_c') to its window set. Each value is either a DataFrame with a 'window' column (the output of AAWindowSampler.sample_synthetic()) or a sequence of window strings. All window lists must be in the same order across parts (the i-th window of each part forms the i-th reference row); differing orders silently break the biological meaning of the assembled rows.
Returns: df_parts – Reference parts with one column per key in dict_parts and an index of 'REF<i>' identifiers.
Return type: pd.DataFrame, shape (n_windows, n_parts)
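
The row-wise stitching described above can be pictured with a small pure-Python sketch. The helper assemble_reference_rows is hypothetical and is not the aaanalysis implementation; it uses plain lists and dicts instead of a DataFrame, and it assumes every part supplies the same number of windows.

```python
# Hypothetical stand-in for illustration only; not the aaanalysis implementation.
def assemble_reference_rows(dict_parts):
    """Combine the i-th window of every part into the i-th reference row."""
    part_names = list(dict_parts)
    rows = {}
    # zip pairs windows by position, which is why the order must match across parts.
    # Assumption: all parts supply the same number of windows.
    for i, windows in enumerate(zip(*dict_parts.values())):
        rows[f"REF{i}"] = dict(zip(part_names, windows))
    return rows


dict_parts = {
    "jmd_n": ["AAAA", "CCCC"],
    "tmd": ["EEEEEE", "FFFFFF"],
    "jmd_c": ["GGGG", "HHHH"],
}
rows = assemble_reference_rows(dict_parts)
for ref_id, row in rows.items():
    print(ref_id, row)
```

Each key of rows plays the role of a 'REF<i>' index entry and each inner dict the role of one row with one column per part.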

Notes

If the parts supply different numbers of windows, a RuntimeWarning is issued and all parts are truncated to the smallest count.
Concatenate the result with a real df_parts (matching columns) and label the two groups before calling CPP.run().
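
The truncation rule in the first note can be sketched as follows. The function truncate_to_shortest is a hypothetical illustration of the documented behaviour (warn, then cut every part to the smallest count); the library's internal code may differ.

```python
import warnings


# Hypothetical illustration of the documented rule; not the library's code.
def truncate_to_shortest(dict_parts):
    """Warn if window counts differ, then cut every part to the smallest count."""
    counts = {part: len(windows) for part, windows in dict_parts.items()}
    n_windows = min(counts.values())
    if len(set(counts.values())) > 1:
        warnings.warn(
            f"window counts differ across parts {counts}; truncating to {n_windows}.",
            RuntimeWarning,
            stacklevel=2,
        )
    return {part: list(windows)[:n_windows] for part, windows in dict_parts.items()}


# Record the warning instead of printing it, so it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    trimmed = truncate_to_shortest(
        {"jmd_n": ["AAAA", "CCCC", "DDDD"], "tmd": ["EEEEEE", "FFFFFF"]}
    )

print(trimmed)
print([str(w.message) for w in caught])
```

Windows beyond the smallest count are dropped from the end, so any extra windows in the longer parts never reach the reference rows.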

See also

aaanalysis.AAWindowSampler: produces the per-part window sets (sample_synthetic).
aaanalysis.CPP: consumes the assembled df_parts via CPP.run().

Examples

When there is no natural reference class, generate one with AAWindowSampler and assemble it into a df_parts with get_df_parts_from_windows. Each part can use its own generator, so the background carries part-specific physicochemical priors; here these are a coil-propensity JMD-N, an alpha-helix TMD, and a coil-propensity JMD-C. The method only consumes window sets and does not sample sequences itself:

import pandas as pd
import aaanalysis as aa

df_seq = aa.load_dataset(name="DOM_GSEC", n=8)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
n_ref = len(df_parts)

aaws = aa.AAWindowSampler(random_state=0)
dict_parts = {
    "jmd_n": aaws.sample_synthetic(df_seq=df_seq, n=n_ref, window_size=10, generator="coil"),
    "tmd":   aaws.sample_synthetic(df_seq=df_seq, n=n_ref, window_size=20, generator="alpha_helix"),
    "jmd_c": aaws.sample_synthetic(df_seq=df_seq, n=n_ref, window_size=10, generator="coil"),
}
df_parts_ref = aa.SequenceFeature.get_df_parts_from_windows(dict_parts)
df_parts_ref.head()

	jmd_n	tmd	jmd_c
REF0	QGCATWPSPY	NFAASWMQMWSATAREVMGK	QGCATWPSPY
REF1	TAVASEVPHL	ADPPNIYYQPQIDQLGLVWH	TAVASEVPHL
REF2	ADRQQKYYRQ	MHMHIVENCTSFVAHDKSEA	ADRQQKYYRQ
REF3	RKDRNHNWYI	IECMGPEWHCNWKWLKNYWK	RKDRNHNWYI
REF4	PIPIKWGQCT	RLLSIRQWDQWYAVYWDYVT	PIPIKWGQCT

Concatenate the real parts (label 1) with the reference parts (label 0, the reference class) and run CPP; a smaller split_kws suits the short base parts:

df_all = pd.concat([df_parts, df_parts_ref])
labels = [1] * len(df_parts) + [0] * len(df_parts_ref)
split_kws = sf.get_split_kws(n_split_max=5, steps_pattern=[3, 4], n_min=2, n_max=3, len_max=8)
aa.CPP(df_parts=df_all, split_kws=split_kws).run(labels=labels, n_filter=5)[["feature", "abs_auc"]].head()

1. CPP creates 107238 features for 32 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 5361 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=92755 of 107238)
3. CPP filtering algorithm
4. CPP returns df of 5 unique features with general information and statistics

	feature	abs_auc
0	TMD-Segment(4,5)-NAKH920108	0.5
1	TMD-PeriodicPattern(N,i+3/4,2)-CHAM830108	0.5
2	TMD-PeriodicPattern(C,i+4/3,3)-FUKS010105	0.5
3	TMD-PeriodicPattern(C,i+3/3,1)-CHOC760104	0.5
4	TMD-Segment(4,5)-KOEH090104	0.5
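
The label list in the CPP call depends on the row order of the concatenated frame. As a hedged alternative, labels can be derived from the index itself. The helper labels_from_index and the identifiers below are hypothetical; the sketch assumes reference rows keep their 'REF<i>' index and that no real entry starts with 'REF'.

```python
# Hypothetical helper; assumes reference rows are indexed 'REF<i>' and that
# no real entry identifier starts with the same prefix.
def labels_from_index(index_ids, ref_prefix="REF"):
    """Return 0 for reference rows and 1 for real rows, in index order."""
    return [0 if str(idx).startswith(ref_prefix) else 1 for idx in index_ids]


# Placeholder identifiers standing in for the index of the concatenated frame.
index_ids = ["PROT_A", "PROT_B", "REF0", "REF1"]
labels = labels_from_index(index_ids)
print(labels)
```

With the frame from the example, the same call would take df_all.index as its argument, so the labels stay correct even if the rows are shuffled.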

What can go wrong? Window lists must align by position. Differing window counts trigger a RuntimeWarning and truncation to the shortest list, and unknown part names raise a ValueError:

import warnings

# Case 1: unequal window counts warn and truncate to the shortest list
with warnings.catch_warnings():
    warnings.simplefilter("always")
    df = aa.SequenceFeature.get_df_parts_from_windows(
        {"jmd_n": ["AAAA", "CCCC", "DDDD"], "tmd": ["EEEEEE", "FFFFFF"]}
    )
    print("truncated to", len(df), "rows")

# Case 2: a key that is not a valid part name raises a ValueError
try:
    aa.SequenceFeature.get_df_parts_from_windows({"not_a_part": ["AAAA", "CCCC"]})
except ValueError as e:
    print("ValueError:", e)

truncated to 2 rows
ValueError: 'dict_parts' key 'not_a_part' should be one of: ['tmd', 'tmd_e', 'tmd_n', 'tmd_c', 'jmd_n', 'jmd_c', 'ext_c', 'ext_n', 'tmd_jmd', 'jmd_n_tmd_n', 'tmd_c_jmd_c', 'ext_n_tmd_n', 'tmd_c_ext_c']

/Users/stephanbreimann/Programming/1Packages/aaanalysis/aaanalysis/feature_engineering/_sequence_feature.py:1233: RuntimeWarning: window counts differ across parts {'jmd_n': 3, 'tmd': 2}; truncating to 2.
  warnings.warn(f"window counts differ across parts {counts}; truncating to {n_windows}.",
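
Because truncation only warns, a pipeline can continue with fewer reference rows than intended. A minimal sketch of escalating the RuntimeWarning to an exception with the standard warnings module is given below; emit_count_warning is a hypothetical stand-in that issues the same warning as the call above, so the snippet runs without aaanalysis.

```python
import warnings


# Hypothetical stand-in that issues the same warning as the assembly call.
def emit_count_warning():
    warnings.warn(
        "window counts differ across parts {'jmd_n': 3, 'tmd': 2}; truncating to 2.",
        RuntimeWarning,
    )


with warnings.catch_warnings():
    # Inside this block, any RuntimeWarning is raised as an exception.
    warnings.simplefilter("error", category=RuntimeWarning)
    try:
        emit_count_warning()
        outcome = "no warning"
    except RuntimeWarning as e:
        outcome = f"stopped: {e}"

print(outcome)
```

The filter is restored when the with block exits, so other code keeps the default warning behaviour.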