AAWindowSampler.sample_benchmark_set
- AAWindowSampler.sample_benchmark_set(df_seq=None, arms=None, seed=None)
Run several named sampling arms and concatenate them into one benchmark set.
Thin multi-arm orchestrator over the individual sample_* methods: it adds no new sampling behavior. Each arm is one ordinary sample_* call in 'segments' mode, tagged with its arm name in an extra arm column so a downstream benchmark can consume any mix of arms uniformly.

Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Passed to every arm.
arms (dict) – Mapping {arm_name: {"method": <strategy>, **kwargs}}. method is one of 'same_protein', 'different_protein', 'synthetic', 'motif_matched' (ut.LIST_STRATEGIES); the remaining keys are forwarded as keyword arguments to the matching sample_* method. The reserved keys df_seq, seed, and output_mode are managed here and must not appear in an arm config.
seed (int, optional) – Master seed; falls back to the class-level random_state. Per-arm sub-seeds are derived deterministically via numpy.random.SeedSequence, so identical seed values reproduce identical benchmark sets.
- Returns:
df_seq_out – Row-wise concatenation of every arm's 'segments' output with an added arm column. There is no automatic cross-arm deduplication: every sampled row is preserved. Deduplicate protein-sourced windows on entry_win and synthetic windows on window if needed.
- Return type:
pd.DataFrame
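
The deduplication suggested for the return value can be sketched with plain pandas. This is a minimal sketch and not part of the library: dedupe_benchmark_set is a hypothetical helper, the arm names and the toy frame only mimic the arm, entry_win, strategy, and window columns, and synthetic rows are assumed to be recognisable by a strategy value starting with "synthetic".

```python
import pandas as pd


def dedupe_benchmark_set(df):
    """Hypothetical helper: drop cross-arm duplicates, keeping the first occurrence."""
    # Assumption: synthetic rows carry a strategy tag starting with "synthetic".
    is_synth = df["strategy"].str.startswith("synthetic")
    # Protein-sourced windows are identified by their entry_win key.
    df_prot = df[~is_synth].drop_duplicates(subset="entry_win")
    # Synthetic windows have no source protein, so compare the window itself.
    df_synth = df[is_synth].drop_duplicates(subset="window")
    # Restore the original row order of the surviving rows.
    return pd.concat([df_prot, df_synth]).sort_index().reset_index(drop=True)


# Toy data that only mimics the relevant output columns (not real sampler output).
df_toy = pd.DataFrame({
    "arm": ["near_pos", "far_pos", "control", "control"],
    "entry_win": ["P1_12-16", "P1_12-16", "synth_0", "synth_1"],
    "strategy": ["same_protein", "same_protein",
                 "synthetic:global_freq", "synthetic:global_freq"],
    "window": ["NPQRS", "NPQRS", "YKKKG", "YKKKG"],
})
df_dedup = dedupe_benchmark_set(df_toy)
print(df_dedup[["arm", "entry_win", "window"]])
```

Keeping the first occurrence means the arm listed earliest in arms wins when two arms sample the same window; a different tie-breaking rule may suit a given benchmark better.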
Notes
The role and strategy tags set by each arm are preserved through the concatenation; together with arm they carry full row provenance. A motif_matched arm adds a motif_score column, which is NaN for rows from other arms.

Examples
Run several named sampling arms in one call and stack them into a single benchmark DataFrame. This is a thin multi-arm orchestrator (each arm is one ordinary sample_* call in 'segments' mode), so a downstream benchmark can consume any mix of negatives, unlabeled, and control rows uniformly.

    import aaanalysis as aa
    import pandas as pd
    aa.options["verbose"] = False
    df_seq = pd.DataFrame({
        "entry": ["P1", "P2", "P3"],
        "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2, "YWVTSRQPNMLKIHGFEDCA" * 2, "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"],
        "pos": [[5, 25], [15], []],
    })
    aaws = aa.AAWindowSampler(random_state=0)

arms is a mapping {arm_name: {"method": <strategy>, **kwargs}}. method is one of 'same_protein', 'different_protein', 'synthetic', 'motif_matched' (ut.LIST_STRATEGIES); the remaining keys forward to the matching sample_* method. The reserved keys df_seq, seed, and output_mode are managed by the orchestrator and must not appear in an arm config.

Each arm keeps its own role and strategy tags; an extra arm column records which arm produced the row.

    arms = {
        "near_pos": {"method": "same_protein", "pos_col": "pos", "n": 4, "window_size": 5, "min_distance_to_pos": 2},
        "unlabeled": {"method": "different_protein", "pos_col": "pos", "n": 4, "window_size": 5},
        "control": {"method": "synthetic", "n": 3, "window_size": 5, "generator": "global_freq"},
    }
    df = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=0)
    aa.display_df(df=df[["arm", "entry_win", "role", "strategy", "window"]], show_shape=True)

DataFrame shape: (11, 5)
    arm        entry_win  role       strategy               window
1   near_pos   P1_12-16   Negative   same_protein           NPQRS
2   near_pos   P1_18-22   Negative   same_protein           VWYAC
3   near_pos   P2_11-15   Negative   same_protein           LKIHG
4   near_pos   P2_16-20   Negative   same_protein           FEDCA
5   unlabeled  P3_26-30   Unlabeled  different_protein      RLGLI
6   unlabeled  P3_3-7     Unlabeled  different_protein      TAYIA
7   unlabeled  P3_20-24   Unlabeled  different_protein      SRQLE
8   unlabeled  P3_14-18   Unlabeled  different_protein      FVKSH
9   control    synth_0    Control    synthetic:global_freq  YKKKG
10  control    synth_1    Control    synthetic:global_freq  FHGQA
11  control    synth_2    Control    synthetic:global_freq  MALDK

The output is a row-wise concatenation of every arm's 'segments' output (the eight-column schema plus arm). There is no automatic cross-arm deduplication: every sampled row is preserved. Deduplicate protein-sourced windows on entry_win and synthetic windows on window if needed. arm, strategy, and role together carry full row provenance.

seed is the master seed (falling back to the constructor random_state). Per-arm sub-seeds are derived deterministically with numpy.random.SeedSequence, so the same seed reproduces the same benchmark set:

    df_a = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=42)
    df_b = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=42)
    print("identical for equal seed:", df_a.equals(df_b))

identical for equal seed: True
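
The reproducibility shown above follows from how numpy.random.SeedSequence behaves. The sketch below illustrates one way per-arm sub-seeds could be derived from a master seed. It is an assumption about the mechanism and not the library's actual code, and derive_arm_seeds is a hypothetical helper.

```python
import numpy as np


def derive_arm_seeds(arm_names, seed):
    """Hypothetical helper: map each arm name to a deterministic integer sub-seed."""
    # spawn() creates one independent child sequence per arm, in order.
    children = np.random.SeedSequence(seed).spawn(len(arm_names))
    # generate_state(1) yields a single 32-bit integer usable as a seed.
    return {name: int(child.generate_state(1)[0])
            for name, child in zip(arm_names, children)}


arm_names = ["near_pos", "unlabeled", "control"]
seeds_a = derive_arm_seeds(arm_names, seed=42)
seeds_b = derive_arm_seeds(arm_names, seed=42)
seeds_c = derive_arm_seeds(arm_names, seed=43)

print("same master seed, same sub-seeds:", seeds_a == seeds_b)
print("different master seed, same sub-seeds:", seeds_a == seeds_c)
```

In this sketch the sub-seed depends on an arm's position in the list, so reordering the arms would change which sub-seed each arm receives; whether the library behaves the same way is not stated in this page.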