AAWindowSampler.sample_benchmark_set

AAWindowSampler.sample_benchmark_set(df_seq, arms, seed=None)

Run several named sampling arms and concatenate them into one benchmark set.

Thin multi-arm orchestrator over the individual sample_* methods: it adds no new sampling behavior. Each arm is one ordinary sample_* call in 'segments' mode, tagged with its arm name in an extra arm column so a downstream benchmark can consume any mix of arms uniformly.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Passed to every arm.
arms (dict) – Mapping {arm_name: {"method": <strategy>, **kwargs}}. method is one of 'same_protein', 'different_protein', 'synthetic', 'motif_matched' (ut.LIST_STRATEGIES); the remaining keys forward as keyword arguments to the matching sample_* method. The reserved keys df_seq, seed, and output_mode are managed here and must not appear in an arm config.
seed (int, optional) – Master seed; falls back to the class-level random_state. Per-arm sub-seeds are derived deterministically via numpy.random.SeedSequence, so identical seed values reproduce identical benchmark sets.

Returns:

df_seq_out – Row-wise concatenation of every arm’s 'segments' output with an added arm column. Rows are not deduplicated across arms, so every sampled row is preserved. If needed, deduplicate protein-sourced windows on entry_win and synthetic windows on window.

Return type:

pd.DataFrame

Notes

role and strategy tags set by each arm are preserved through the concatenation; together with arm they carry full row provenance. A motif_matched arm adds a motif_score column, which is NaN for rows from other arms.
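The NaN fill described above is what a plain pandas row-wise concatenation does when one frame has a column the others lack. The following is a minimal sketch of that behavior only, not the library's implementation; the two toy frames and the score value are made up for illustration.

```python
import pandas as pd

# Toy per-arm outputs (hypothetical values); only the motif arm has motif_score.
df_control = pd.DataFrame({"entry_win": ["synth_0"], "window": ["YKKKG"]})
df_motif = pd.DataFrame({"entry_win": ["P3_3-7"], "window": ["TAYIA"],
                         "motif_score": [0.8]})
frames = {"control": df_control, "motif": df_motif}

# Tag each frame with its arm name, then stack the frames row-wise.
# Columns missing from a frame are filled with NaN by pd.concat.
df_out = pd.concat(
    [df_arm.assign(arm=name) for name, df_arm in frames.items()],
    ignore_index=True,
)
print(df_out)
```

In this sketch the control row ends up with NaN in motif_score, while the motif row keeps its score.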

Examples

Run several named sampling arms in one call and stack them into a single benchmark DataFrame. Because each arm is one ordinary sample_* call in 'segments' mode, a downstream benchmark can consume any mix of negative, unlabeled, and control rows uniformly.

import aaanalysis as aa
import pandas as pd
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2", "P3"],
    "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2,
                 "YWVTSRQPNMLKIHGFEDCA" * 2,
                 "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"],
    "pos":      [[5, 25], [15], []],
})
aaws = aa.AAWindowSampler(random_state=0)

arms is a mapping {arm_name: {"method": <strategy>, **kwargs}}. method is one of 'same_protein', 'different_protein', 'synthetic', 'motif_matched' (ut.LIST_STRATEGIES); the remaining keys forward to the matching sample_* method. The reserved keys df_seq, seed, and output_mode are managed by the orchestrator and must not appear in an arm config.
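The rules above can be checked before calling the sampler. The helper below is a hypothetical sketch, not part of aaanalysis: check_arms, the local LIST_STRATEGIES and RESERVED_KEYS constants, and the choice of ValueError are all assumptions made for illustration, and the library may validate differently.

```python
# Hypothetical pre-check for an arms mapping; names here are illustrative only.
LIST_STRATEGIES = ["same_protein", "different_protein", "synthetic", "motif_matched"]
RESERVED_KEYS = {"df_seq", "seed", "output_mode"}


def check_arms(arms):
    """Raise ValueError if an arm config breaks the rules described above."""
    if not isinstance(arms, dict):
        raise ValueError("'arms' should be a dict of {arm_name: config}")
    for name, config in arms.items():
        if not isinstance(config, dict):
            raise ValueError(f"arm '{name}': config should be a dict")
        method = config.get("method")
        if method not in LIST_STRATEGIES:
            raise ValueError(f"arm '{name}': 'method' should be one of {LIST_STRATEGIES}")
        reserved = RESERVED_KEYS & set(config)
        if reserved:
            raise ValueError(f"arm '{name}': reserved keys not allowed: {sorted(reserved)}")
    return True


check_arms({"control": {"method": "synthetic", "n": 3, "window_size": 5}})
```

A config such as {"method": "synthetic", "seed": 1} would be rejected by this sketch, because seed is one of the reserved keys.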

Each arm keeps its own role and strategy tags; an extra arm column records which arm produced the row.

arms = {
    "near_pos":  {"method": "same_protein",     "pos_col": "pos", "n": 4, "window_size": 5, "min_distance_to_pos": 2},
    "unlabeled": {"method": "different_protein", "pos_col": "pos", "n": 4, "window_size": 5},
    "control":   {"method": "synthetic",         "n": 3, "window_size": 5, "generator": "global_freq"},
}
df = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=0)
aa.display_df(df=df[["arm", "entry_win", "role", "strategy", "window"]], show_shape=True)

DataFrame shape: (11, 5)

	arm	entry_win	role	strategy	window
1	near_pos	P1_12-16	Negative	same_protein	NPQRS
2	near_pos	P1_18-22	Negative	same_protein	VWYAC
3	near_pos	P2_11-15	Negative	same_protein	LKIHG
4	near_pos	P2_16-20	Negative	same_protein	FEDCA
5	unlabeled	P3_26-30	Unlabeled	different_protein	RLGLI
6	unlabeled	P3_3-7	Unlabeled	different_protein	TAYIA
7	unlabeled	P3_20-24	Unlabeled	different_protein	SRQLE
8	unlabeled	P3_14-18	Unlabeled	different_protein	FVKSH
9	control	synth_0	Control	synthetic:global_freq	YKKKG
10	control	synth_1	Control	synthetic:global_freq	FHGQA
11	control	synth_2	Control	synthetic:global_freq	MALDK

The output is a row-wise concatenation of every arm’s 'segments' output (the eight-column schema plus arm). Rows are not deduplicated across arms, so every sampled row is preserved; if needed, deduplicate protein-sourced windows on entry_win and synthetic windows on window. Together, arm, strategy, and role carry full row provenance.
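If duplicates across arms are unwanted, the deduplication suggested above could look like the sketch below. It runs on a toy stand-in frame with made-up rows, and it assumes synthetic rows can be recognized by a strategy value starting with 'synthetic', as in the output shown earlier.

```python
import pandas as pd

# Toy stand-in for a benchmark set; rows are made up to include duplicates.
df = pd.DataFrame({
    "arm": ["near_pos", "unlabeled", "unlabeled", "control", "control_b", "control_b"],
    "entry_win": ["P1_12-16", "P1_12-16", "P3_3-7", "synth_0", "synth_0", "synth_1"],
    "strategy": ["same_protein", "different_protein", "different_protein",
                 "synthetic:global_freq", "synthetic:global_freq", "synthetic:global_freq"],
    "window": ["NPQRS", "NPQRS", "TAYIA", "YKKKG", "FHGQA", "YKKKG"],
})

# Assumption: synthetic rows carry a strategy tag that starts with "synthetic".
is_synth = df["strategy"].str.startswith("synthetic")

# Protein-sourced windows are identified by entry_win, synthetic ones by window.
df_protein = df[~is_synth].drop_duplicates(subset="entry_win", keep="first")
df_synth = df[is_synth].drop_duplicates(subset="window", keep="first")

# Restore the original row order and renumber the index.
df_dedup = pd.concat([df_protein, df_synth]).sort_index().reset_index(drop=True)
print(df_dedup)
```

With keep="first", the arm listed earliest keeps a shared window, so the order of arms decides which arm label survives.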

seed is the master seed (falling back to the constructor random_state). Per-arm sub-seeds are derived deterministically with numpy.random.SeedSequence, so the same seed reproduces the same benchmark set:

df_a = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=42)
df_b = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=42)
print("identical for equal seed:", df_a.equals(df_b))

identical for equal seed: True
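One way to derive per-arm sub-seeds from a master seed with numpy.random.SeedSequence is sketched below. The helper derive_arm_seeds is hypothetical and only illustrates the general technique; the exact derivation used inside sample_benchmark_set may differ.

```python
import numpy as np


def derive_arm_seeds(arm_names, seed):
    """Map each arm name to an integer sub-seed derived from one master seed."""
    # spawn() creates one independent child SeedSequence per arm.
    children = np.random.SeedSequence(seed).spawn(len(arm_names))
    # Draw one 32-bit word from each child to use as that arm's integer seed.
    return {name: int(child.generate_state(1)[0])
            for name, child in zip(arm_names, children)}


arm_names = ["near_pos", "unlabeled", "control"]
seeds_a = derive_arm_seeds(arm_names, seed=42)
seeds_b = derive_arm_seeds(arm_names, seed=42)
print("same sub-seeds for equal master seed:", seeds_a == seeds_b)
```

Spawning child sequences, rather than reusing the master seed for every arm, keeps the arms' random streams independent while the whole set stays reproducible.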