AAWindowSampler.sample_benchmark_set

AAWindowSampler.sample_benchmark_set(df_seq=None, arms=None, seed=None)

Run several named sampling arms and concatenate them into one benchmark set.

Thin multi-arm orchestrator over the individual sample_* methods: it adds no new sampling behavior. Each arm is one ordinary sample_* call in 'segments' mode, tagged with its arm name in an extra arm column so a downstream benchmark can consume any mix of arms uniformly.

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Passed to every arm.

  • arms (dict) – Mapping {arm_name: {"method": <strategy>, **kwargs}}. method is one of 'same_protein', 'different_protein', 'synthetic', 'motif_matched' (ut.LIST_STRATEGIES); the remaining keys forward as keyword arguments to the matching sample_* method. The reserved keys df_seq, seed, and output_mode are managed here and must not appear in an arm config.

  • seed (int, optional) – Master seed; falls back to the class-level random_state. Per-arm sub-seeds are derived deterministically via numpy.random.SeedSequence, so identical seed values reproduce identical benchmark sets.

Returns:

df_seq_out – Row-wise concatenation of every arm’s 'segments' output with an added arm column. No cross-arm deduplication is performed: every sampled row is preserved. If needed, deduplicate protein-sourced windows on entry_win and synthetic windows on window.

Return type:

pd.DataFrame

Notes

role and strategy tags set by each arm are preserved through the concatenation; together with arm they carry full row provenance. A motif_matched arm adds a motif_score column, which is NaN for rows from other arms.
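The NaN fill for motif_score is what a plain row-wise pandas concatenation gives when one frame has a column the others lack. The sketch below illustrates that behavior only; the two toy frames and their values are invented for illustration and are not real sampler output.

```python
import pandas as pd

# Toy stand-ins for two arms' outputs (values invented for illustration).
df_motif = pd.DataFrame({"entry_win": ["P1_3-7"], "window": ["DEFGH"], "motif_score": [0.8]})
df_synth = pd.DataFrame({"entry_win": ["synth_0"], "window": ["YKKKG"]})

# Tag each frame with its arm name, then stack row-wise.
df_motif["arm"] = "motif"
df_synth["arm"] = "control"
df_all = pd.concat([df_motif, df_synth], ignore_index=True)

# Rows from the arm without motif_score are filled with NaN.
print(df_all[["arm", "motif_score"]])
```

Downstream code that reads motif_score should therefore expect missing values whenever the benchmark set mixes motif_matched arms with other arms.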

Examples

Run several named sampling arms in one call and stack them into a single benchmark DataFrame. Because each arm is one ordinary sample_* call in 'segments' mode, a downstream benchmark can consume any mix of negative, unlabeled, and control rows uniformly.

import aaanalysis as aa
import pandas as pd
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2", "P3"],
    "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2,
                 "YWVTSRQPNMLKIHGFEDCA" * 2,
                 "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"],
    "pos":      [[5, 25], [15], []],
})
aaws = aa.AAWindowSampler(random_state=0)

arms is a mapping {arm_name: {"method": <strategy>, **kwargs}}. method is one of 'same_protein', 'different_protein', 'synthetic', 'motif_matched' (ut.LIST_STRATEGIES); the remaining keys forward to the matching sample_* method. The reserved keys df_seq, seed, and output_mode are managed by the orchestrator and must not appear in an arm config.
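The rules above can be checked before calling the orchestrator. The following is a minimal sketch of such a pre-flight check, assuming only what is stated here; check_arms, split_arm, STRATEGIES, and RESERVED_KEYS are hypothetical helper names, not part of the aaanalysis API, and the library's own validation may differ.

```python
# Hypothetical pre-flight check; these names are illustrative, not aaanalysis API.
STRATEGIES = ("same_protein", "different_protein", "synthetic", "motif_matched")
RESERVED_KEYS = {"df_seq", "seed", "output_mode"}


def check_arms(arms):
    """Raise ValueError if an arm config breaks the documented rules."""
    if not isinstance(arms, dict) or not arms:
        raise ValueError("'arms' should be a non-empty dict")
    for name, cfg in arms.items():
        if not isinstance(cfg, dict):
            raise ValueError(f"arm {name!r}: config should be a dict")
        method = cfg.get("method")
        if method not in STRATEGIES:
            raise ValueError(f"arm {name!r}: unknown method {method!r}")
        clash = RESERVED_KEYS & set(cfg)
        if clash:
            raise ValueError(f"arm {name!r}: reserved keys {sorted(clash)}")


def split_arm(cfg):
    """Return (method, kwargs); kwargs are the keys forwarded to sample_*."""
    kwargs = {k: v for k, v in cfg.items() if k != "method"}
    return cfg["method"], kwargs


check_arms({"control": {"method": "synthetic", "n": 3, "window_size": 5}})
print(split_arm({"method": "synthetic", "n": 3, "window_size": 5}))
```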

Each arm keeps its own role and strategy tags; an extra arm column records which arm produced the row.

arms = {
    "near_pos":  {"method": "same_protein",     "pos_col": "pos", "n": 4, "window_size": 5, "min_distance_to_pos": 2},
    "unlabeled": {"method": "different_protein", "pos_col": "pos", "n": 4, "window_size": 5},
    "control":   {"method": "synthetic",         "n": 3, "window_size": 5, "generator": "global_freq"},
}
df = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=0)
aa.display_df(df=df[["arm", "entry_win", "role", "strategy", "window"]], show_shape=True)
DataFrame shape: (11, 5)
    arm        entry_win  role       strategy               window
 1  near_pos   P1_12-16   Negative   same_protein           NPQRS
 2  near_pos   P1_18-22   Negative   same_protein           VWYAC
 3  near_pos   P2_11-15   Negative   same_protein           LKIHG
 4  near_pos   P2_16-20   Negative   same_protein           FEDCA
 5  unlabeled  P3_26-30   Unlabeled  different_protein      RLGLI
 6  unlabeled  P3_3-7     Unlabeled  different_protein      TAYIA
 7  unlabeled  P3_20-24   Unlabeled  different_protein      SRQLE
 8  unlabeled  P3_14-18   Unlabeled  different_protein      FVKSH
 9  control    synth_0    Control    synthetic:global_freq  YKKKG
10  control    synth_1    Control    synthetic:global_freq  FHGQA
11  control    synth_2    Control    synthetic:global_freq  MALDK

The output is a row-wise concatenation of every arm’s 'segments' output (the eight-column schema plus arm). No cross-arm deduplication is performed: every sampled row is preserved. If needed, deduplicate protein-sourced windows on entry_win and synthetic windows on window. Together, arm, strategy, and role carry full row provenance.
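The deduplication advice above could be applied as follows. This is a sketch on a toy frame, not real sampler output, and it assumes that synthetic rows can be recognized by a strategy value starting with "synthetic", as in the output shown earlier; adjust the mask if your arms tag rows differently.

```python
import pandas as pd

# Toy stand-in for a benchmark set; real output carries more columns.
df = pd.DataFrame({
    "arm":       ["near_pos", "unlabeled", "control", "control_b"],
    "entry_win": ["P1_12-16", "P1_12-16", "synth_0", "synth_0"],
    "strategy":  ["same_protein", "different_protein",
                  "synthetic:global_freq", "synthetic:global_freq"],
    "window":    ["NPQRS", "NPQRS", "YKKKG", "YKKKG"],
})

# Assumption: synthetic rows are those whose strategy starts with "synthetic".
is_synth = df["strategy"].str.startswith("synthetic")

# Protein-sourced windows: unique on entry_win; synthetic windows: unique on window.
df_prot = df[~is_synth].drop_duplicates(subset="entry_win", keep="first")
df_synth = df[is_synth].drop_duplicates(subset="window", keep="first")

# Restore the original row order, then renumber.
df_dedup = pd.concat([df_prot, df_synth]).sort_index().reset_index(drop=True)
print(df_dedup)
```

With keep="first", the arm listed earliest in the benchmark set wins when two arms sample the same window, so the order of arms decides which provenance tags survive.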

seed is the master seed (falling back to the constructor random_state). Per-arm sub-seeds are derived deterministically with numpy.random.SeedSequence, so the same seed reproduces the same benchmark set:

df_a = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=42)
df_b = aaws.sample_benchmark_set(df_seq=df_seq, arms=arms, seed=42)
print("identical for equal seed:", df_a.equals(df_b))
identical for equal seed: True
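
How the per-arm sub-seeds are derived internally is not shown here. One common way to derive independent, reproducible child seeds from a master seed with numpy.random.SeedSequence is spawn; the sketch below uses it as an assumption, and derive_arm_seeds is a hypothetical helper, not an aaanalysis function.

```python
import numpy as np


def derive_arm_seeds(arm_names, seed):
    """Map each arm name to an integer sub-seed derived from one master seed."""
    # spawn() returns one child SeedSequence per arm, deterministically.
    children = np.random.SeedSequence(seed).spawn(len(arm_names))
    # Collapse each child into a single 32-bit integer usable as a seed.
    return {name: int(child.generate_state(1)[0])
            for name, child in zip(arm_names, children)}


names = ["near_pos", "unlabeled", "control"]
seeds_a = derive_arm_seeds(names, seed=42)
seeds_b = derive_arm_seeds(names, seed=42)
print("identical for equal seed:", seeds_a == seeds_b)
```

In this scheme the sub-seed depends on the arm's position in the mapping, so reordering the arms would change which sub-seed each arm receives.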