SequenceFeature.get_df_parts_from_windows
- static SequenceFeature.get_df_parts_from_windows(dict_parts=None)
Assemble a df_parts from per-part window sets (e.g. AAWindowSampler outputs).
Builds a reference df_parts by stitching one window set per sequence part, so each part can be generated with its own recipe. This enables biologically motivated reference backgrounds in which the parts differ in physicochemical prior, e.g. a coil-propensity JMD-N, an alpha-helix TMD, and a coil-propensity JMD-C, each produced by a separate call to AAWindowSampler.sample_synthetic() with a different generator and window_size. The assembled frame is used as the reference class for CPP exactly like a real df_parts. This method does not sample sequences itself; it only consumes window sets produced by AAWindowSampler.
Added in version 1.1.0.
- Parameters:
dict_parts (dict) – Dictionary mapping each part name (one of aaanalysis.utils.LIST_ALL_PARTS, e.g. 'jmd_n', 'tmd', 'jmd_c') to its window set. Each value is either a DataFrame with a 'window' column (the output of AAWindowSampler.sample_synthetic()) or a sequence of window strings. All window lists must be in the same order across parts (the i-th window of each part forms the i-th reference row); differing orders silently break the biological meaning of the assembled rows.
- Returns:
df_parts – Reference parts with one column per key in dict_parts and an index of 'REF<i>' identifiers.
- Return type:
pd.DataFrame, shape (n_windows, n_parts)
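The positional pairing described for dict_parts can be pictured with a small pure-Python stand-in. This is only an illustration of the documented row semantics, not the aaanalysis implementation; the helper name assemble_rows and the plain-dict output are assumptions made for the sketch.

```python
# Illustrative stand-in for the documented stitching; NOT the library code.
def assemble_rows(dict_parts):
    """Pair the i-th window of every part into one row keyed 'REF<i>'."""
    part_names = list(dict_parts)
    # zip pairs windows strictly by position, not by any shared identifier
    rows = zip(*(dict_parts[name] for name in part_names))
    return {f"REF{i}": dict(zip(part_names, row)) for i, row in enumerate(rows)}


windows = {
    "jmd_n": ["AAAA", "CCCC"],
    "tmd": ["EEEEEE", "FFFFFF"],
}
rows = assemble_rows(windows)
print(rows["REF0"])  # {'jmd_n': 'AAAA', 'tmd': 'EEEEEE'}

# Reversing one part changes which windows share a row, and nothing complains.
shuffled = assemble_rows({"jmd_n": windows["jmd_n"][::-1], "tmd": windows["tmd"]})
print(shuffled["REF0"])  # {'jmd_n': 'CCCC', 'tmd': 'EEEEEE'}
```

The second call shows why the ordering requirement matters: the rows are still well-formed, so only the caller can tell that the pairing no longer means what was intended.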
Notes
If the parts supply different numbers of windows, a RuntimeWarning is issued and all parts are truncated to the smallest count.
Concatenate the result with a real df_parts (matching columns) and label the two groups before calling CPP.run().
See also
aaanalysis.AAWindowSampler: produces the per-part window sets (sample_synthetic).
aaanalysis.CPP: consumes the assembled df_parts via CPP.run().
Examples
When there is no natural reference class, generate one with AAWindowSampler and assemble it into a df_parts with get_df_parts_from_windows. Each part can use its own generator, so the background carries part-specific physicochemical priors: here a coil-propensity JMD-N, an alpha-helix TMD, and a coil-propensity JMD-C. This method only consumes window sets; it does not sample sequences itself:
    import pandas as pd
    import aaanalysis as aa
    df_seq = aa.load_dataset(name="DOM_GSEC", n=8)
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["jmd_n", "tmd", "jmd_c"])
    n_ref = len(df_parts)
    aaws = aa.AAWindowSampler(random_state=0)
    dict_parts = {
        "jmd_n": aaws.sample_synthetic(df_seq=df_seq, n=n_ref, window_size=10, generator="coil"),
        "tmd": aaws.sample_synthetic(df_seq=df_seq, n=n_ref, window_size=20, generator="alpha_helix"),
        "jmd_c": aaws.sample_synthetic(df_seq=df_seq, n=n_ref, window_size=10, generator="coil"),
    }
    df_parts_ref = aa.SequenceFeature.get_df_parts_from_windows(dict_parts)
    df_parts_ref.head()
          jmd_n       tmd                   jmd_c
    REF0  QGCATWPSPY  NFAASWMQMWSATAREVMGK  QGCATWPSPY
    REF1  TAVASEVPHL  ADPPNIYYQPQIDQLGLVWH  TAVASEVPHL
    REF2  ADRQQKYYRQ  MHMHIVENCTSFVAHDKSEA  ADRQQKYYRQ
    REF3  RKDRNHNWYI  IECMGPEWHCNWKWLKNYWK  RKDRNHNWYI
    REF4  PIPIKWGQCT  RLLSIRQWDQWYAVYWDYVT  PIPIKWGQCT
Concatenate the real parts (label 1) with the reference parts (label 0, the reference class) and run CPP; a smaller split_kws suits the short base parts:
    df_all = pd.concat([df_parts, df_parts_ref])
    labels = [1] * len(df_parts) + [0] * len(df_parts_ref)
    split_kws = sf.get_split_kws(n_split_max=5, steps_pattern=[3, 4], n_min=2, n_max=3, len_max=8)
    cpp = aa.CPP(df_parts=df_all, split_kws=split_kws)
    df_feat = cpp.run(labels=labels, n_filter=5)
    df_feat[["feature", "abs_auc"]].head()
    1. CPP creates 107238 features for 32 samples
       1.1 Assigning scale values to parts
       |.........................| 100.0%
       1.2 Streaming pre-filter stats (mask in stream)
       |.........................| 100.0%
    2. CPP pre-filters 5361 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=92755 of 107238)
    3. CPP filtering algorithm
    4. CPP returns df of 5 unique features with general information and statistics
       feature                                     abs_auc
    0  TMD-Segment(4,5)-NAKH920108                 0.5
    1  TMD-PeriodicPattern(N,i+3/4,2)-CHAM830108   0.5
    2  TMD-PeriodicPattern(C,i+4/3,3)-FUKS010105   0.5
    3  TMD-PeriodicPattern(C,i+3/3,1)-CHOC760104   0.5
    4  TMD-Segment(4,5)-KOEH090104                 0.5
What can go wrong? Window lists must align by position. Differing counts trigger a RuntimeWarning and truncation to the shortest list, and unknown part names raise a ValueError:
    import warnings
    with warnings.catch_warnings():
        warnings.simplefilter("always")
        df = aa.SequenceFeature.get_df_parts_from_windows({"jmd_n": ["AAAA", "CCCC", "DDDD"], "tmd": ["EEEEEE", "FFFFFF"]})
    print("truncated to", len(df), "rows")
    try:
        aa.SequenceFeature.get_df_parts_from_windows({"not_a_part": ["AAAA", "CCCC"]})
    except ValueError as e:
        print("ValueError:", e)
    truncated to 2 rows
    ValueError: 'dict_parts' key 'not_a_part' should be one of: ['tmd', 'tmd_e', 'tmd_n', 'tmd_c', 'jmd_n', 'jmd_c', 'ext_c', 'ext_n', 'tmd_jmd', 'jmd_n_tmd_n', 'tmd_c_jmd_c', 'ext_n_tmd_n', 'tmd_c_ext_c']
    aaanalysis/feature_engineering/_sequence_feature.py:1233: RuntimeWarning: window counts differ across parts {'jmd_n': 3, 'tmd': 2}; truncating to 2.
      warnings.warn(f"window counts differ across parts {counts}; truncating to {n_windows}.",
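Because a count mismatch only warns and truncates, a pipeline that prefers to fail fast could run a stricter check before calling get_df_parts_from_windows. The sketch below is a hypothetical helper: the name check_window_sets and the raise-on-mismatch policy are assumptions, not part of aaanalysis. The list of valid part names is copied from the ValueError message shown above, and the helper assumes each value is a list of window strings.

```python
# Hypothetical pre-flight check; helper name and strict policy are assumptions.
# Part names copied from the ValueError message in the example output.
VALID_PARTS = [
    "tmd", "tmd_e", "tmd_n", "tmd_c", "jmd_n", "jmd_c", "ext_c", "ext_n",
    "tmd_jmd", "jmd_n_tmd_n", "tmd_c_jmd_c", "ext_n_tmd_n", "tmd_c_ext_c",
]


def check_window_sets(dict_windows):
    """Raise ValueError on unknown part names or unequal window counts."""
    unknown = [name for name in dict_windows if name not in VALID_PARTS]
    if unknown:
        raise ValueError(f"unknown part names: {unknown}")
    counts = {name: len(windows) for name, windows in dict_windows.items()}
    if len(set(counts.values())) > 1:
        # Stricter than the documented behavior, which warns and truncates.
        raise ValueError(f"window counts differ across parts: {counts}")
    return counts


print(check_window_sets({"jmd_n": ["AAAA", "CCCC"], "tmd": ["EEEEEE", "FFFFFF"]}))
# {'jmd_n': 2, 'tmd': 2}

try:
    check_window_sets({"jmd_n": ["AAAA", "CCCC", "DDDD"], "tmd": ["EEEEEE", "FFFFFF"]})
except ValueError as err:
    print("ValueError:", err)
```

A check like this cannot detect windows that are present in equal numbers but in a different order; that alignment remains the caller's responsibility.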