AAWindowSampler

class AAWindowSampler(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10, custom_filter=None)

Bases: object

Utility class for sampling amino-acid windows / segments from full protein sequences.

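Four sampling strategies are provided: sample_same_protein(), sample_different_protein(), sample_motif_matched(), and sample_synthetic(). sample_benchmark_set() runs several named sampling arms and concatenates them into one benchmark set.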

Output modes (per method, except sample_synthetic() which is segments-only):

  • "segments" — one row per sampled window, schema [entry_win, entry, sequence, window, source_position, label, role, strategy]. entry_win = <entry>_<start_pos>-<end_pos> (1-based inclusive); the same biological window across calls produces the same entry_win, so drop_duplicates(subset="entry_win") is the natural cross-call dedupe primitive. Synthetic outputs use entry_win = "synth_{i}" with a per-call counter — concatenating multiple sample_synthetic() outputs may collide; deduplicate on the window column instead.

  • "sequences" — one row per source protein with a labels list of length len(sequence) carrying label_test at known test positions, label_ref at sampled positions, and None elsewhere.
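The cross-call dedupe described above can be sketched with plain pandas. This is a minimal illustration on toy data, not output produced by the sampler: the entry names, the windows, the helper make_entry_win, and the assumption that the synthetic counter starts at 0 are all hypothetical.

```python
import pandas as pd


def make_entry_win(entry, start, end):
    """Hypothetical helper: build <entry>_<start_pos>-<end_pos> (1-based inclusive)."""
    return f"{entry}_{start}-{end}"


# Toy stand-ins for two "segments" outputs from separate sampling calls.
df_a = pd.DataFrame({
    "entry_win": [make_entry_win("ProtA", 5, 13), make_entry_win("ProtB", 20, 28)],
    "entry": ["ProtA", "ProtB"],
    "window": ["ACDEFGHIK", "LMNPQRSTV"],
})
df_b = pd.DataFrame({
    "entry_win": [make_entry_win("ProtA", 5, 13), make_entry_win("ProtC", 1, 9)],
    "entry": ["ProtA", "ProtC"],
    "window": ["ACDEFGHIK", "WYACDEFGH"],
})

# The same biological window yields the same entry_win, so it is dropped once.
df_all = pd.concat([df_a, df_b], ignore_index=True)
df_unique = df_all.drop_duplicates(subset="entry_win").reset_index(drop=True)

# Synthetic ids restart per call (assumed here to start at 0), so two calls
# can share an id for different windows; deduplicate on the window column.
df_s1 = pd.DataFrame({"entry_win": ["synth_0", "synth_1"],
                      "window": ["AAAAAAAAA", "CCCCCCCCC"]})
df_s2 = pd.DataFrame({"entry_win": ["synth_0", "synth_1"],
                      "window": ["DDDDDDDDD", "CCCCCCCCC"]})
df_synth = (pd.concat([df_s1, df_s2], ignore_index=True)
            .drop_duplicates(subset="window")
            .reset_index(drop=True))
```

Deduplicating the synthetic frames on entry_win instead would keep only two rows and silently discard the distinct window "DDDDDDDDD".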

Added in version 1.1.0.

Notes

Class type:

This is a utility class — it does not implement .fit / .run / .eval. Compute output-quality metrics from the returned DataFrame using aaanalysis.metrics.comp_kld() or the backend window_identity helper.

Identity-based similarity:

Two filters operate on per-position residue identity of equal-length windows (no alignment needed):

  • max_similarity_to_test — drop sampled windows whose identity to any known test window exceeds the threshold (anti-leakage).

  • max_similarity_within_ref — greedily drop sampled windows whose identity to a previously kept sampled window exceeds the threshold (redundancy reduction). In sample_same_protein(), this filter spans protein boundaries; protein iteration order is randomized under the seed so output depends only on df_seq content + seed, not row order.
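To make the two filters concrete, the following is a minimal sketch of per-position identity and the two filtering passes. The function names (identity_fraction, filter_to_test, filter_within_ref) are hypothetical stand-ins and are not the library's backend helpers; the sketch only assumes equal-length windows and the "exceeds the threshold" rule stated above.

```python
def identity_fraction(a, b):
    """Fraction of positions with identical residues (equal-length windows only)."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_to_test(candidates, test_windows, max_similarity):
    """Anti-leakage: drop candidates too similar to ANY known test window."""
    return [
        w for w in candidates
        if all(identity_fraction(w, t) <= max_similarity for t in test_windows)
    ]


def filter_within_ref(candidates, max_similarity):
    """Redundancy reduction: greedily keep a window only if it is not too
    similar to any window kept before it (the result depends on input order)."""
    kept = []
    for w in candidates:
        if all(identity_fraction(w, k) <= max_similarity for k in kept):
            kept.append(w)
    return kept


test_windows = ["ACDEF"]
candidates = ["ACDEG", "LMNPQ", "LMNPR", "ACDEF"]

# "ACDEG" (identity 0.8) and "ACDEF" (identity 1.0) exceed 0.6 and are dropped.
no_leak = filter_to_test(candidates, test_windows, max_similarity=0.6)
# "LMNPR" is 0.8 identical to the already kept "LMNPQ" and is dropped.
non_redundant = filter_within_ref(no_leak, max_similarity=0.6)
```

Because the second pass is greedy, the order of candidates determines which member of a similar pair survives, which is why a seeded, randomized iteration order matters for reproducibility.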

Iterative filtering:

If filtering shrinks the candidate pool below the target, additional draws are performed up to max_sampling_attempts. If still insufficient, a warning is emitted and the available samples are returned.
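The re-draw behaviour can be pictured as a bounded loop. The sketch below is an assumption about the general shape of such a loop, not the sampler's actual implementation: sample_with_redraw, its keep argument, and the choice to count the first draw as an attempt are all hypothetical.

```python
import random
import warnings


def sample_with_redraw(pool, n, keep, max_sampling_attempts=10, seed=None):
    """Draw from pool until n windows pass keep(), or attempts/pool run out."""
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)  # seeded shuffle, so draws are reproducible
    kept = []
    attempts = 0
    while len(kept) < n and remaining and attempts < max_sampling_attempts:
        attempts += 1
        need = n - len(kept)
        # Draw only as many candidates as are still missing.
        batch, remaining = remaining[:need], remaining[need:]
        kept.extend(w for w in batch if keep(w))
    if len(kept) < n:
        # Still short of the target: warn and return what is available.
        warnings.warn(f"Only {len(kept)} of {n} requested windows passed filtering.")
    return kept


pool = [f"W{i}" for i in range(10)]
# Toy filter: keep windows whose numeric suffix is even.
picked = sample_with_redraw(pool, n=3, keep=lambda w: int(w[1:]) % 2 == 0, seed=0)
```

With a filter that rejects everything, the same call would return an empty list and emit one warning instead of raising.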

Anchoring convention:

Positions in pos_col and the emitted source_position are interpreted as P1-style residue anchors under Schechter–Berger cleavage nomenclature [Rawlings16]. For window length L, the window covers (L - 1) // 2 residues upstream of the anchor, the anchor itself, and L // 2 residues downstream — right-heavy for even L.
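The arithmetic of the anchoring convention can be checked with a short sketch. The helpers window_bounds and extract_window are hypothetical, and returning None for windows that run past the sequence ends is an assumption made for this illustration; the text above does not specify how the sampler handles that case.

```python
def window_bounds(anchor, window_size):
    """Return 1-based inclusive (start, end) of a window around a P1 anchor."""
    upstream = (window_size - 1) // 2   # residues before the anchor
    downstream = window_size // 2       # residues after the anchor
    return anchor - upstream, anchor + downstream


def extract_window(sequence, anchor, window_size):
    """Slice the window out of a sequence, or None if it would overflow
    (overflow handling is an assumption of this sketch)."""
    start, end = window_bounds(anchor, window_size)
    if start < 1 or end > len(sequence):
        return None
    return sequence[start - 1:end]  # convert 1-based inclusive to a Python slice


seq = "ACDEFGHIKLMNPQRSTVWY"
odd_bounds = window_bounds(10, 9)      # 4 upstream + anchor + 4 downstream
even_bounds = window_bounds(10, 8)     # 3 upstream + anchor + 4 downstream
even_window = extract_window(seq, 10, 8)
```

For the even length 8, the anchor at position 10 gets three residues upstream and four downstream, which is the right-heavy case described above.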

Methods

sample_benchmark_set(df_seq, arms[, seed])

Run several named sampling arms and concatenate them into one benchmark set.

sample_different_protein(df_seq[, n, ...])

Sample windows from proteins outside the test set (proteins with no test positions).

sample_motif_matched(df_seq[, n, ...])

Scan candidate proteins for windows matching a user-supplied Position Weight Matrix (PWM); a Find Individual Motif Occurrences (FIMO) equivalent.

sample_same_protein(df_seq[, n, ...])

Sample windows from proteins that contain at least one test position.

sample_synthetic(df_seq[, n, window_size, ...])

Generate synthetic control windows.

__init__(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10, custom_filter=None)
Parameters:
  • verbose (bool, default=True) – If True, prints sampling progress and warnings when fewer windows than requested are returned.

  • random_state (int, optional) – Default seed for all sampling methods. A per-call seed overrides it.

  • max_similarity_to_test (float in [0, 1], optional) – Drop sampled windows whose per-position identity to any test window exceeds this threshold.

  • max_similarity_within_ref (float in [0, 1], optional) – Greedily drop sampled windows whose per-position identity to a previously kept sampled window exceeds this threshold.

  • filter_iteratively (bool, default=True) – Iteratively re-draw if filtering reduces the candidate pool below the target.

  • max_sampling_attempts (int, default=10) – Cap on iterative re-draw attempts.

  • custom_filter (callable, optional) – User-supplied keep-predicate (window, entry, source_position) -> bool applied to every sampled window across all sample_* methods; a window is kept only when the predicate returns True. window is the window string, entry is its source protein, and source_position is the 1-based P1 anchor. This is the escape hatch for structure- or domain-specific decoy rules. Synthetic windows have no source protein, so for them the predicate is called with entry="" and source_position=-1. If the predicate raises during sampling, the error surfaces as a RuntimeError naming the offending window (the original exception is chained).
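As an illustration of the custom_filter contract, the sketch below defines a hypothetical keep-predicate and a hypothetical wrapper that mimics the documented error behaviour. Neither no_proline_at_anchor nor apply_custom_filter is part of the library, and the exact wording of the RuntimeError message is an assumption.

```python
def no_proline_at_anchor(window, entry, source_position):
    """Hypothetical keep-predicate: reject windows with proline at the anchor.

    The anchor sits at 0-based index (L - 1) // 2 of the window, following
    the anchoring convention. Only the window string is inspected, so the
    rule also works for synthetic calls (entry="", source_position=-1).
    """
    anchor_index = (len(window) - 1) // 2
    return window[anchor_index] != "P"


def apply_custom_filter(custom_filter, window, entry="", source_position=-1):
    """Hypothetical wrapper: keep only on True, re-raise failures as RuntimeError."""
    try:
        return custom_filter(window, entry, source_position) is True
    except Exception as exc:
        # Chain the original exception so the root cause stays visible.
        raise RuntimeError(f"custom_filter failed on window {window!r}") from exc


keep_ok = apply_custom_filter(no_proline_at_anchor, "ACDEFGHIK", "ProtA", 5)
keep_pro = apply_custom_filter(no_proline_at_anchor, "ACDEPGHIK", "ProtA", 5)
```

Such a predicate would be passed as custom_filter=no_proline_at_anchor when constructing AAWindowSampler, per the signature above.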