AAWindowSampler
- class AAWindowSampler(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10, custom_filter=None)
Bases: object
Utility class for sampling amino-acid windows / segments from full protein sequences.
Four sampling strategies are provided:
- sample_same_protein(): windows from proteins that contain at least one test position [Boyd10Cascleave], [Song12].
- sample_different_protein(): windows from proteins outside the test set; naturally suited as the unlabeled pool U for positive-unlabeled learning [ElkanNoto08], [BekkerDavis20].
- sample_synthetic(): synthetic control windows from built-in priors, AAontology presets [Rawlings16], multiplicative preset mixes [LiuDeber99], or custom-alphabet frequency tables.
- sample_motif_matched(): in-memory FIMO (Find Individual Motif Occurrences)-style scan against a user-supplied Position Weight Matrix (PWM); a Command-Line Interface (CLI) parity wrapper that delegates to fimo lives at aaanalysis.scan_motif().
Output modes (per method, except sample_synthetic(), which is segments-only):
- "segments": one row per sampled window, schema [entry_win, entry, sequence, window, source_position, label, role, strategy]. entry_win = <entry>_<start_pos>-<end_pos> (1-based inclusive); the same biological window across calls produces the same entry_win, so drop_duplicates(subset="entry_win") is the natural cross-call dedupe primitive. Synthetic outputs use entry_win = "synth_{i}" with a per-call counter, so concatenating multiple sample_synthetic() outputs may collide; deduplicate on the window column instead.
- "sequences": one row per source protein with a labels list of length len(sequence) carrying label_test at known test positions, label_ref at sampled positions, and None elsewhere.
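To make the "sequences" mode concrete, the following minimal sketch builds the per-residue labels list for a single protein. It does not call aaanalysis: build_labels is a hypothetical helper, the label values 1 and 0 are placeholders for label_test and label_ref, and both the 1-based positions and the rule that a test position wins over a sampled one are assumptions made for illustration.

```python
def build_labels(sequence, test_positions, ref_positions, label_test=1, label_ref=0):
    """Return one label per residue; positions are assumed to be 1-based."""
    labels = [None] * len(sequence)
    # Sampled (reference) positions first ...
    for pos in ref_positions:
        labels[pos - 1] = label_ref
    # ... then known test positions (assumption: test wins on overlap).
    for pos in test_positions:
        labels[pos - 1] = label_test
    return labels


labels = build_labels("MKTAYIAKQR", test_positions=[4], ref_positions=[2, 8])
print(labels)
# [None, 0, None, 1, None, None, None, 0, None, None]
```

The list always has the same length as the sequence, so it can be stored next to the sequence in one row per protein.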
Added in version 1.1.0.
Notes
- Class type:
This is a utility class; it does not implement .fit/.run/.eval. Compute output-quality metrics from the returned DataFrame using aaanalysis.metrics.comp_kld() or the backend window_identity helper.
- Identity-based similarity:
Two filters operate on per-position residue identity of equal-length windows (no alignment needed):
  - max_similarity_to_test: drop sampled windows whose identity to any known test window exceeds the threshold (anti-leakage).
  - max_similarity_within_ref: greedily drop sampled windows whose identity to a previously kept sampled window exceeds the threshold (redundancy reduction). In sample_same_protein(), this filter spans protein boundaries; protein iteration order is randomized under the seed, so output depends only on df_seq content + seed, not row order.
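The two identity filters can be pictured with the sketch below. It is a stand-in, not the library's code: per_position_identity, filter_to_test, and filter_within_ref are hypothetical names, and identity is assumed to be the fraction of matching positions. "Exceeds the threshold" is taken literally, so a window whose identity equals the threshold is kept.

```python
def per_position_identity(a, b):
    """Fraction of positions where two equal-length windows share a residue."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_to_test(sampled, test_windows, max_similarity_to_test):
    """Anti-leakage: drop windows too similar to ANY known test window."""
    return [
        w for w in sampled
        if all(per_position_identity(w, t) <= max_similarity_to_test
               for t in test_windows)
    ]


def filter_within_ref(sampled, max_similarity_within_ref):
    """Redundancy reduction: greedy pass that compares against kept windows only."""
    kept = []
    for w in sampled:
        if all(per_position_identity(w, k) <= max_similarity_within_ref
               for k in kept):
            kept.append(w)
    return kept


test_windows = ["AAAAAA"]
sampled = ["AAAAAG", "AAGGGG", "AAGGGC", "CCCCCC"]

step1 = filter_to_test(sampled, test_windows, 0.5)
print(step1)  # ['AAGGGG', 'AAGGGC', 'CCCCCC']

step2 = filter_within_ref(step1, 0.5)
print(step2)  # ['AAGGGG', 'CCCCCC']
```

Because the second filter is greedy, its result depends on the order in which windows are visited, which is why a seeded iteration order matters for reproducibility.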
- Iterative filtering:
If filtering shrinks the candidate pool below the target, additional draws are performed, up to max_sampling_attempts attempts in total. If the target is still not met, a warning is emitted and the available samples are returned.
- Anchoring convention:
Positions in pos_col and the emitted source_position are interpreted as P1-style residue anchors under Schechter–Berger cleavage nomenclature [Rawlings16]. For window length L, the window covers (L - 1) // 2 residues upstream of the anchor, the anchor itself, and L // 2 residues downstream, so it is right-heavy for even L.
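The anchoring arithmetic can be checked with a few lines of plain Python. This is a sketch of the stated convention only: window_bounds and extract_window are hypothetical helpers, the entry name P12345 is made up, and returning None for windows that overrun the sequence ends is an assumption about edge handling, not documented behavior.

```python
def window_bounds(anchor, window_size):
    """Return 1-based inclusive (start, end) for a P1-style anchor."""
    upstream = (window_size - 1) // 2
    downstream = window_size // 2
    return anchor - upstream, anchor + downstream


def extract_window(sequence, anchor, window_size):
    """Slice the window out of the sequence, or None if it overruns an end."""
    start, end = window_bounds(anchor, window_size)
    if start < 1 or end > len(sequence):
        return None  # assumption: overrunning windows are skipped
    return sequence[start - 1:end]


seq = "ABCDEFGHIJKLMNOPQRST"

print(window_bounds(10, 8))        # (7, 14): 3 upstream, anchor, 4 downstream
print(extract_window(seq, 10, 8))  # GHIJKLMN
print(window_bounds(10, 9))        # (6, 14): symmetric for odd lengths
print(extract_window(seq, 10, 9))  # FGHIJKLMN

# The entry_win identifier follows <entry>_<start_pos>-<end_pos>.
start, end = window_bounds(10, 8)
entry_win = f"P12345_{start}-{end}"
print(entry_win)                   # P12345_7-14
```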
See also
- aaanalysis.SequencePreprocessor for aa_window extraction primitives.
- aaanalysis.SequenceFeature for canonical df_seq formats and conventions.
Methods
sample_benchmark_set(df_seq, arms[, seed]) – Run several named sampling arms and concatenate them into one benchmark set.
sample_different_protein(df_seq[, n, ...]) – Sample windows from proteins outside the test set (proteins with no test positions).
sample_motif_matched(df_seq[, n, ...]) – Scan candidate proteins for windows matching a user-supplied Position Weight Matrix (PWM); a Find Individual Motif Occurrences (FIMO) equivalent.
sample_same_protein(df_seq[, n, ...]) – Sample windows from proteins that contain at least one test position.
sample_synthetic(df_seq[, n, window_size, ...]) – Generate synthetic control windows.
- __init__(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10, custom_filter=None)
- Parameters:
verbose (bool, default=True) – If True, prints sampling progress and warnings when fewer windows than requested are returned.
random_state (int, optional) – Default seed for all sampling methods. A per-call seed overrides it.
max_similarity_to_test (float in [0, 1], optional) – Drop sampled windows whose per-position identity to any test window exceeds this threshold.
max_similarity_within_ref (float in [0, 1], optional) – Greedily drop sampled windows whose per-position identity to a previously kept sampled window exceeds this threshold.
filter_iteratively (bool, default=True) – Iteratively re-draw if filtering reduces the candidate pool below the target.
max_sampling_attempts (int, default=10) – Cap on iterative re-draw attempts.
custom_filter (callable, optional) – User-supplied keep-predicate (window, entry, source_position) -> bool applied to every sampled window across all sample_* methods; a window is kept only when it returns True. window is the window string, entry its source protein, and source_position the 1-based P1 anchor. This is the escape hatch for structure- or domain-specific decoy rules. Synthetic windows have no source protein, so the predicate is called with entry="" and source_position=-1. If the predicate raises during sampling, the error surfaces as a RuntimeError naming the offending window (the original exception is chained).
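How custom_filter, filter_iteratively, and max_sampling_attempts interact can be sketched in plain Python. None of the code below is the library's implementation: no_cysteine, apply_custom_filter, and sample_with_redraw are hypothetical, the candidate pool is invented, and the re-draw loop is only one plausible reading of "iteratively re-draw". The predicate signature and the chained RuntimeError follow the parameter description above.

```python
import random
import warnings


def no_cysteine(window, entry, source_position):
    """Keep-predicate with the documented signature; True keeps the window."""
    return "C" not in window


def apply_custom_filter(candidates, custom_filter):
    """Apply the predicate; re-raise failures as a chained RuntimeError."""
    kept = []
    for window, entry, pos in candidates:
        try:
            keep = custom_filter(window, entry, pos)
        except Exception as exc:
            raise RuntimeError(f"custom_filter failed on window {window!r}") from exc
        if keep:
            kept.append((window, entry, pos))
    return kept


def sample_with_redraw(pool, n, custom_filter, max_sampling_attempts=10, seed=None):
    """Draw, filter, and re-draw until n windows pass or attempts run out."""
    rng = random.Random(seed)
    remaining = list(pool)
    kept = []
    for _ in range(max_sampling_attempts):
        need = n - len(kept)
        if need <= 0 or not remaining:
            break
        draw = rng.sample(remaining, min(need, len(remaining)))
        remaining = [c for c in remaining if c not in draw]
        kept.extend(apply_custom_filter(draw, custom_filter))
    if len(kept) < n:
        warnings.warn(f"only {len(kept)} of {n} requested windows passed filtering")
    return kept


# Invented candidates: (window, entry, 1-based source_position).
pool = [
    ("AKQRLE", "P1", 10), ("ACDEFG", "P1", 20), ("GHIKLM", "P2", 5),
    ("MNPQRS", "P2", 15), ("CCCCCC", "P3", 7),
]
picked = sample_with_redraw(pool, n=3, custom_filter=no_cysteine, seed=0)
print(sorted(w for w, _, _ in picked))  # ['AKQRLE', 'GHIKLM', 'MNPQRS']
```

A predicate intended for use with sample_synthetic() should tolerate entry="" and source_position=-1, since synthetic windows carry no source protein.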