AAWindowSampler

class AAWindowSampler(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10, custom_filter=None)

Bases: object

Utility class for sampling amino-acid windows / segments from full protein sequences.

Four sampling strategies are provided:

sample_same_protein() — windows from proteins that contain at least one test position [Boyd10Cascleave], [Song12].
sample_different_protein() — windows from proteins outside the test set; naturally suited as the unlabeled pool U for positive-unlabeled learning [ElkanNoto08], [BekkerDavis20].
sample_synthetic() — synthetic control windows from built-in priors, AAontology presets [Rawlings16], multiplicative preset mixes [LiuDeber99], or custom-alphabet frequency tables.
sample_motif_matched() — in-memory FIMO (Find Individual Motif Occurrences)-style scan against a user-supplied Position Weight Matrix (PWM); a Command-Line Interface (CLI) parity wrapper that delegates to fimo lives at aaanalysis.scan_motif().

Output modes (per method, except sample_synthetic() which is segments-only):

"segments" — one row per sampled window, schema [entry_win, entry, sequence, window, source_position, label, role, strategy]. entry_win = <entry>_<start_pos>-<end_pos> (1-based inclusive); the same biological window across calls produces the same entry_win, so drop_duplicates(subset="entry_win") is the natural cross-call dedupe primitive. Synthetic outputs use entry_win = "synth_{i}" with a per-call counter — concatenating multiple sample_synthetic() outputs may collide; deduplicate on the window column instead.
"sequences" — one row per source protein with a labels list of length len(sequence) carrying label_test at known test positions, label_ref at sampled positions, and None elsewhere.

Added in version 1.1.0.
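The two output modes described above can be illustrated with a small pure-Python sketch. It is not the library's implementation: make_entry_win, dedupe_by_entry_win, and build_labels are hypothetical helper names, the label values 1 and 0 stand in for label_test and label_ref, and letting a test position win over a sampled position at the same index is an assumption.

```python
def make_entry_win(entry, start_pos, end_pos):
    """Build an id of the form <entry>_<start_pos>-<end_pos> (1-based inclusive)."""
    return f"{entry}_{start_pos}-{end_pos}"


def dedupe_by_entry_win(rows):
    """Keep the first row per entry_win, preserving input order."""
    seen = set()
    kept = []
    for row in rows:
        if row["entry_win"] not in seen:
            seen.add(row["entry_win"])
            kept.append(row)
    return kept


def build_labels(sequence, test_positions, ref_positions, label_test=1, label_ref=0):
    """Per-residue labels for "sequences" mode; positions are 1-based."""
    labels = [None] * len(sequence)
    for pos in ref_positions:
        labels[pos - 1] = label_ref
    for pos in test_positions:
        # Assumption: a known test position takes precedence over a sampled one.
        labels[pos - 1] = label_test
    return labels


# Two hypothetical calls that both sampled residues 10-18 of protein P12345.
call_a = [{"entry_win": make_entry_win("P12345", 10, 18), "window": "ACDEFGHIK"}]
call_b = [
    {"entry_win": make_entry_win("P12345", 10, 18), "window": "ACDEFGHIK"},
    {"entry_win": make_entry_win("P12345", 30, 38), "window": "LMNPQRSTV"},
]
merged = dedupe_by_entry_win(call_a + call_b)

# One test position (2) and one sampled position (5) in a six-residue sequence.
labels = build_labels("ACDEFG", test_positions=[2], ref_positions=[5])
```

With DataFrames, the same cross-call deduplication is pd.concat([df_a, df_b]).drop_duplicates(subset="entry_win"), where df_a and df_b are two "segments" outputs.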

Notes

Class type:

This is a utility class — it does not implement .fit / .run / .eval. Compute output-quality metrics from the returned DataFrame using aaanalysis.metrics.comp_kld() or the backend window_identity helper.

Identity-based similarity:

Two filters operate on per-position residue identity of equal-length windows (no alignment needed):

max_similarity_to_test — drop sampled windows whose identity to any known test window exceeds the threshold (anti-leakage).
max_similarity_within_ref — greedily drop sampled windows whose identity to a previously kept sampled window exceeds the threshold (redundancy reduction). In sample_same_protein(), this filter spans protein boundaries; protein iteration order is randomized under the seed so output depends only on df_seq content + seed, not row order.
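The two filters can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: identity_fraction, filter_to_test, and filter_within_ref are hypothetical names, and identity is taken to be the fraction of positions with matching residues.

```python
def identity_fraction(a, b):
    """Fraction of positions at which two equal-length windows share a residue."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_to_test(candidates, test_windows, max_similarity):
    """Anti-leakage: drop candidates too similar to any known test window."""
    return [
        c for c in candidates
        if all(identity_fraction(c, t) <= max_similarity for t in test_windows)
    ]


def filter_within_ref(candidates, max_similarity):
    """Redundancy reduction: greedily keep candidates in order of appearance."""
    kept = []
    for c in candidates:
        if all(identity_fraction(c, k) <= max_similarity for k in kept):
            kept.append(c)
    return kept


test_windows = ["ACDEF"]
candidates = ["ACDEG", "KLMNP", "KLMNQ", "WWWWW"]

# "ACDEG" matches the test window at 4 of 5 positions (0.8 > 0.6) and is dropped.
after_test = filter_to_test(candidates, test_windows, max_similarity=0.6)
# "KLMNQ" matches the already kept "KLMNP" at 4 of 5 positions and is dropped.
after_ref = filter_within_ref(after_test, max_similarity=0.6)
```

Because the second filter is greedy, its result depends on candidate order, which is why a seeded, randomized protein order matters for reproducibility.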

Iterative filtering:

If filter_iteratively=True and filtering shrinks the candidate pool below the target, additional draws are performed, up to max_sampling_attempts attempts. If the pool is still too small after that, a warning is emitted and the available samples are returned.
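The re-draw behaviour can be pictured with the loop below. It is a simplified sketch rather than the actual implementation: sample_with_redraws and the keep predicate are hypothetical, and drawing without replacement from a shuffled pool is an assumption about how candidates are drawn.

```python
import random
import warnings


def sample_with_redraws(pool, n, keep, max_sampling_attempts=10, seed=None):
    """Draw up to n windows that pass keep(), re-drawing after filtering losses."""
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)
    kept = []
    attempts = 0
    while len(kept) < n and remaining and attempts < max_sampling_attempts:
        attempts += 1
        need = n - len(kept)
        # Draw only as many candidates as are still missing, then filter them.
        draw, remaining = remaining[:need], remaining[need:]
        kept.extend(w for w in draw if keep(w))
    if len(kept) < n:
        warnings.warn(f"Only {len(kept)} of {n} requested windows passed filtering.")
    return kept


def no_alanine(window):
    """Toy filter: reject any window containing alanine."""
    return "A" not in window


pool = ["AAAA", "AAAC", "CCCC", "DDDD", "EEEE"]

# Three windows pass the filter, so a target of 3 is reachable via re-draws.
full = sample_with_redraws(pool, n=3, keep=no_alanine, seed=0)

# A target of 4 cannot be met: the available windows are returned with a warning.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    short = sample_with_redraws(pool, n=4, keep=no_alanine, seed=0)
```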

Anchoring convention:

Positions in pos_col and the emitted source_position are interpreted as P1-style residue anchors under Schechter–Berger cleavage nomenclature [Rawlings16]. For window length L, the window covers (L - 1) // 2 residues upstream of the anchor, the anchor itself, and L // 2 residues downstream — right-heavy for even L.
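The anchoring arithmetic can be checked with a short sketch. window_bounds and extract_window are hypothetical helpers written for illustration; returning None for windows that run past either end of the sequence is an assumption, since the handling of terminal anchors is not specified here.

```python
def window_bounds(anchor, window_size):
    """Return 1-based inclusive (start, end) for a window around a P1 anchor."""
    upstream = (window_size - 1) // 2
    downstream = window_size // 2
    return anchor - upstream, anchor + downstream


def extract_window(sequence, anchor, window_size):
    """Slice the window out of the sequence, or None if it does not fit."""
    start, end = window_bounds(anchor, window_size)
    if start < 1 or end > len(sequence):
        return None  # assumption: incomplete windows are skipped
    return sequence[start - 1:end]


sequence = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues, anchor 10 is "L"

odd = extract_window(sequence, anchor=10, window_size=9)   # 4 up, anchor, 4 down
even = extract_window(sequence, anchor=10, window_size=8)  # 3 up, anchor, 4 down
edge = extract_window(sequence, anchor=2, window_size=9)   # does not fit
```

For L = 9 the window is symmetric (residues 6 to 14). For L = 8 it covers residues 7 to 14, one more residue downstream than upstream.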

See also

aaanalysis.SequencePreprocessor for aa_window extraction primitives.
aaanalysis.SequenceFeature for canonical df_seq formats and conventions.

Parameters:

verbose (bool)
random_state (Optional[int])
max_similarity_to_test (Optional[float])
max_similarity_within_ref (Optional[float])
filter_iteratively (bool)
max_sampling_attempts (int)
custom_filter (Optional[Callable[[str, str, int], bool]])

Methods

`sample_benchmark_set`(df_seq, arms[, seed])	Run several named sampling arms and concatenate them into one benchmark set.
`sample_different_protein`(df_seq[, n, ...])	Sample windows from proteins outside the test set (proteins with no test positions).
`sample_motif_matched`(df_seq[, n, ...])	Scan candidate proteins for windows matching a user-supplied Position Weight Matrix (PWM); a Find Individual Motif Occurrences (FIMO) equivalent.
`sample_same_protein`(df_seq[, n, ...])	Sample windows from proteins that contain at least one test position.
`sample_synthetic`(df_seq[, n, window_size, ...])	Generate synthetic control windows.

__init__(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10, custom_filter=None)

Parameters:

verbose (bool, default=True) – If True, prints sampling progress and warnings when fewer windows than requested are returned.
random_state (int, optional) – Default seed for all sampling methods. A per-call seed overrides it.
max_similarity_to_test (float in [0, 1], optional) – Drop sampled windows whose per-position identity to any test window exceeds this threshold.
max_similarity_within_ref (float in [0, 1], optional) – Greedily drop sampled windows whose per-position identity to a previously kept sampled window exceeds this threshold.
filter_iteratively (bool, default=True) – Iteratively re-draw if filtering reduces the candidate pool below the target.
max_sampling_attempts (int, default=10) – Cap on iterative re-draw attempts.
custom_filter (callable, optional) – User-supplied keep-predicate (window, entry, source_position) -> bool applied to every sampled window across all sample_* methods; a window is kept only when the predicate returns True. window is the window string, entry is its source protein, and source_position is the 1-based P1 anchor. This is the escape hatch for structure- or domain-specific decoy rules. Synthetic windows have no source protein, so the predicate is called with entry="" and source_position=-1 for them. If the predicate raises during sampling, the error surfaces as a RuntimeError that names the offending window, with the original exception chained.
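A keep-predicate with the documented (window, entry, source_position) -> bool signature might look like the sketch below. The rule itself is invented for illustration: EXCLUDED_REGIONS, its accession, and its coordinates are hypothetical, as is the choice to reject windows containing the ambiguous residue "X".

```python
# Hypothetical per-protein regions (1-based inclusive) where decoys are unwanted.
EXCLUDED_REGIONS = {"P12345": [(50, 120)]}


def keep_window(window, entry, source_position):
    """Return True to keep a sampled window, False to drop it."""
    if "X" in window:
        return False  # illustrative rule: skip windows with ambiguous residues
    if entry == "" and source_position == -1:
        return True  # synthetic window: no source protein to check against
    for start, end in EXCLUDED_REGIONS.get(entry, []):
        if start <= source_position <= end:
            return False  # anchor falls inside an excluded region
    return True


inside = keep_window("ACDEFGHIK", "P12345", 75)
outside = keep_window("ACDEFGHIK", "P12345", 10)
synthetic = keep_window("ACDEFGHIK", "", -1)
ambiguous = keep_window("ACXEFGHIK", "Q99999", 5)
```

Such a function would be passed at construction time as AAWindowSampler(custom_filter=keep_window). Keeping the predicate free of exceptions for the synthetic case (entry="" and source_position=-1) avoids the RuntimeError path described above.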