aaanalysis.AAWindowSampler
- class aaanalysis.AAWindowSampler(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10)
Bases: object
Utility class for sampling amino-acid windows / segments from full protein sequences.
Four sampling strategies are provided:
sample_same_protein(): windows from proteins that contain at least one test position [Boyd10Cascleave], [Song12].
sample_different_protein(): windows from proteins outside the test set; naturally suited as the unlabeled pool U for positive-unlabeled learning [ElkanNoto08], [BekkerDavis20].
sample_synthetic(): synthetic control windows from built-in priors, AAontology presets [Rawlings16], multiplicative preset mixes [LiuDeber99], or custom-alphabet frequency tables.
sample_motif_matched(): in-memory FIMO-style scan against a user-supplied PWM; a CLI parity wrapper that delegates to fimo lives at aaanalysis.scan_motif().
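To make the custom-alphabet frequency-table idea behind sample_synthetic() concrete, here is a minimal standard-library sketch that draws each residue independently from a frequency table. The helper name draw_synthetic_windows and the toy frequencies are hypothetical; they are not part of aaanalysis and do not reproduce its built-in priors or presets.

```python
import random


def draw_synthetic_windows(freqs, n, window_size, seed=None):
    """Draw n windows, sampling each residue i.i.d. from a frequency table.

    freqs maps residue -> non-negative weight (weights need not sum to 1).
    Hypothetical helper for illustration; not the aaanalysis implementation.
    """
    rng = random.Random(seed)   # local RNG, leaves the global random state untouched
    alphabet = sorted(freqs)    # fixed order so a given seed is reproducible
    weights = [freqs[aa] for aa in alphabet]
    return [
        "".join(rng.choices(alphabet, weights=weights, k=window_size))
        for _ in range(n)
    ]


# Toy frequency table over a reduced alphabet (made-up numbers)
toy_freqs = {"A": 0.4, "L": 0.3, "K": 0.2, "G": 0.1}
windows = draw_synthetic_windows(toy_freqs, n=5, window_size=9, seed=0)
```

Because every draw is independent of any source protein, such windows carry no positional information, which is consistent with synthetic output being segments-only.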
Output modes (per method, except sample_synthetic(), which is segments-only):
"segments": one row per sampled window, schema [entry_win, entry, sequence, window, source_position, label, role, strategy]. entry_win = <entry>_<start_pos>-<end_pos> (1-based inclusive); the same biological window across calls produces the same entry_win, so drop_duplicates(subset="entry_win") is the natural cross-call dedupe primitive. Synthetic outputs use entry_win = "synth_{i}" with a per-call counter, so identifiers may collide when multiple sample_synthetic() outputs are concatenated; deduplicate on the window column instead.
"sequences": one row per source protein with a labels list of length len(sequence) carrying label_test at known test positions, label_ref at sampled positions, and None elsewhere.
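The two output conventions can be sketched in plain Python. The example below builds entry_win identifiers, deduplicates rows across two calls on that key, and constructs a per-residue labels list. All helper names, the accession "P12345", and the label values 1 and 0 are hypothetical; 1-based positions for the labels list and test labels taking precedence on overlap are assumptions, not documented behaviour.

```python
def make_entry_win(entry, start_pos, end_pos):
    """Build the identifier <entry>_<start_pos>-<end_pos> (1-based inclusive)."""
    return f"{entry}_{start_pos}-{end_pos}"


def dedupe_by_entry_win(rows):
    """Keep the first row per entry_win, preserving input order."""
    seen, kept = set(), []
    for row in rows:
        if row["entry_win"] not in seen:
            seen.add(row["entry_win"])
            kept.append(row)
    return kept


def build_labels(sequence, test_positions, ref_positions, label_test=1, label_ref=0):
    """Per-residue labels: label_test, label_ref, or None elsewhere.

    Assumes 1-based positions; test labels win on overlap (an assumption).
    """
    labels = [None] * len(sequence)
    for pos in ref_positions:
        labels[pos - 1] = label_ref
    for pos in test_positions:
        labels[pos - 1] = label_test
    return labels


# Two hypothetical calls that both return the window at positions 10-18
rows_a = [{"entry_win": make_entry_win("P12345", 10, 18)}]
rows_b = [{"entry_win": make_entry_win("P12345", 10, 18)},
          {"entry_win": make_entry_win("P12345", 30, 38)}]
merged = dedupe_by_entry_win(rows_a + rows_b)  # two unique windows remain

labels = build_labels("MKTAYIAK", test_positions=[3], ref_positions=[6])
# labels == [None, None, 1, None, None, 0, None, None]
```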
Added in version 1.1.0.
Notes
- Class type:
This is a utility class; it does not implement .fit/.run/.eval. Compute output-quality metrics from the returned DataFrame using aaanalysis.metrics.comp_kld() or the backend window_identity helper.
- Identity-based similarity:
Two filters operate on per-position residue identity of equal-length windows (no alignment needed):
max_similarity_to_test: drop sampled windows whose identity to any known test window exceeds the threshold (anti-leakage).
max_similarity_within_ref: greedily drop sampled windows whose identity to a previously kept sampled window exceeds the threshold (redundancy reduction). In sample_same_protein(), this filter spans protein boundaries; protein iteration order is randomized under the seed so output depends only on df_seq content + seed, not row order.
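The two filters can be illustrated with a short sketch. The function per_position_identity is a stand-in written for this example, not the backend window_identity helper, and both filter functions are hypothetical names. Windows are dropped only when identity strictly exceeds the threshold, following the wording above.

```python
def per_position_identity(a, b):
    """Fraction of aligned positions with identical residues (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_to_test(candidates, test_windows, threshold):
    """Anti-leakage: drop candidates too similar to ANY known test window."""
    return [
        w for w in candidates
        if all(per_position_identity(w, t) <= threshold for t in test_windows)
    ]


def filter_within_ref(candidates, threshold):
    """Redundancy reduction: greedily keep a window only if it is not too
    similar to any previously kept window (result depends on input order)."""
    kept = []
    for w in candidates:
        if all(per_position_identity(w, k) <= threshold for k in kept):
            kept.append(w)
    return kept


test_windows = ["AAAAKAAAA"]
candidates = ["AAAAKAAAL", "GGGGSGGGG", "GGGGSGGGA", "LLLLKLLLL"]

after_test = filter_to_test(candidates, test_windows, threshold=0.8)
# "AAAAKAAAL" shares 8/9 positions with the test window and is dropped
after_ref = filter_within_ref(after_test, threshold=0.8)
# "GGGGSGGGA" shares 8/9 positions with the kept "GGGGSGGGG" and is dropped
```

Because the within-reference filter is greedy, its result depends on candidate order, which is why a seeded, randomized iteration order matters for reproducibility.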
- Iterative filtering:
If filtering shrinks the candidate pool below the target, additional draws are performed up to max_sampling_attempts. If the pool is still insufficient, a warning is emitted and the available samples are returned.
- Anchoring convention:
Positions in pos_col and the emitted source_position are interpreted as P1-style residue anchors under Schechter–Berger cleavage nomenclature [Rawlings16]. For window length L, the window covers (L - 1) // 2 residues upstream of the anchor, the anchor itself, and L // 2 residues downstream, which makes it right-heavy for even L.
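The anchoring arithmetic can be checked with a small sketch. The helper names are hypothetical, anchors are assumed to be 1-based, and returning None for windows that overrun the sequence is an assumption made for this example only; the documentation above does not say how edge cases are handled.

```python
def window_bounds(anchor, window_size):
    """Return 1-based inclusive (start, end) of a window around an anchor."""
    n_up = (window_size - 1) // 2   # residues upstream of the anchor
    n_down = window_size // 2       # residues downstream of the anchor
    return anchor - n_up, anchor + n_down


def extract_window(sequence, anchor, window_size):
    """Slice the window out of a sequence, or None if it would overrun.

    The None return is an assumption for illustration only.
    """
    start, end = window_bounds(anchor, window_size)
    if start < 1 or end > len(sequence):
        return None
    return sequence[start - 1:end]


odd = window_bounds(10, 9)    # (6, 14): 4 upstream, anchor, 4 downstream
even = window_bounds(10, 8)   # (7, 14): 3 upstream, anchor, 4 downstream
win = extract_window("ABCDEFGHIJKLMNOP", 10, 8)  # "GHIJKLMN", anchor "J"
```

For the even length 8, the anchor "J" has three residues before it and four after it, which is the right-heavy case.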
See also
aaanalysis.SequencePreprocessor for aa_window extraction primitives.
aaanalysis.SequenceFeature for canonical df_seq formats and conventions.
Methods
sample_different_protein([df_seq, n, ...]): Sample windows from proteins outside the test set (proteins with no test positions).
sample_motif_matched([df_seq, n, ...]): Scan candidate proteins for windows matching a user-supplied PWM (FIMO-equivalent).
sample_same_protein([df_seq, n, ...]): Sample windows from proteins that contain at least one test position.
sample_synthetic([df_seq, n, window_size, ...]): Generate synthetic control windows.
- __init__(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10)
- Parameters:
verbose (bool, default=True) – Verbose mode.
random_state (int, optional) – Default seed for all sampling methods. A per-call seed overrides it.
max_similarity_to_test (float in [0, 1], optional) – Drop sampled windows whose per-position identity to any test window exceeds this threshold.
max_similarity_within_ref (float in [0, 1], optional) – Greedily drop sampled windows whose per-position identity to a previously kept sampled window exceeds this threshold.
filter_iteratively (bool, default=True) – Iteratively re-draw if filtering reduces the candidate pool below the target.
max_sampling_attempts (int, default=10) – Cap on iterative re-draw attempts.
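How filter_iteratively and max_sampling_attempts interact can be sketched as a capped re-draw loop. This is an illustration under assumptions, not the library's code: sample_with_redraw and its keep predicate are hypothetical, the filter is reduced to a per-window predicate, and counting the initial draw against the cap is a choice made for this example.

```python
import random
import warnings


def sample_with_redraw(pool, n, keep, max_sampling_attempts=10, seed=None):
    """Draw until n windows pass the filter, the pool is empty, or the cap is hit.

    keep is a predicate standing in for the similarity filters (assumption).
    """
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)          # seeded shuffle, sampling without replacement
    kept, attempts = [], 0
    while len(kept) < n and remaining and attempts < max_sampling_attempts:
        attempts += 1
        need = n - len(kept)        # only draw what is still missing
        batch, remaining = remaining[:need], remaining[need:]
        kept.extend(w for w in batch if keep(w))
    if len(kept) < n:
        warnings.warn(f"only {len(kept)} of {n} requested windows available")
    return kept


def no_cys(window):
    """Toy filter: reject windows whose name starts with 'C'."""
    return not window.startswith("C")


pool = [f"W{i}" for i in range(10)] + ["C0", "C1", "C2"]
picked = sample_with_redraw(pool, n=5, keep=no_cys, seed=1)
```

With only three rejectable windows in the toy pool, the loop always reaches the target well within the default cap. When the pool cannot supply enough passing windows, the function warns and returns what it has, mirroring the behaviour described in the Notes.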