aaanalysis.AAWindowSampler

class aaanalysis.AAWindowSampler(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10)

Bases: object

Utility class for sampling amino-acid windows / segments from full protein sequences.

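Four sampling strategies are provided:

  • sample_same_protein(): windows from proteins that contain at least one test position.

  • sample_different_protein(): windows from proteins outside the test set (proteins with no test positions).

  • sample_motif_matched(): windows matching a user-supplied PWM, found by scanning candidate proteins (FIMO-equivalent).

  • sample_synthetic(): synthetic control windows.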

Output modes (per method, except sample_synthetic() which is segments-only):

  • "segments" — one row per sampled window, schema [entry_win, entry, sequence, window, source_position, label, role, strategy]. entry_win = <entry>_<start_pos>-<end_pos> (1-based inclusive); the same biological window across calls produces the same entry_win, so drop_duplicates(subset="entry_win") is the natural cross-call dedupe primitive. Synthetic outputs use entry_win = "synth_{i}" with a per-call counter — concatenating multiple sample_synthetic() outputs may collide; deduplicate on the window column instead.

  • "sequences" — one row per source protein with a labels list of length len(sequence) carrying label_test at known test positions, label_ref at sampled positions, and None elsewhere.
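The "sequences" layout can be pictured with a small sketch. The helper build_labels is hypothetical and not part of aaanalysis; it assumes 1-based positions and that a test position takes precedence if a position appears in both lists, neither of which is stated above.

```python
def build_labels(sequence, test_positions, ref_positions,
                 label_test=1, label_ref=0):
    """Return a per-residue label list of length len(sequence).

    Assumptions (not confirmed by the documentation): positions are
    1-based, and label_test wins when a position is in both lists.
    """
    labels = [None] * len(sequence)
    for pos in ref_positions:
        labels[pos - 1] = label_ref
    for pos in test_positions:
        labels[pos - 1] = label_test
    return labels


# One hypothetical protein with a known test position at 3
# and sampled reference positions at 6 and 8.
labels = build_labels("ACDEFGHIKL", test_positions=[3], ref_positions=[6, 8])
```

Every position that is neither a test nor a sampled position stays None, so the list always has the same length as the sequence.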

Added in version 1.1.0.

Notes

Class type:

This is a utility class — it does not implement .fit / .run / .eval. Compute output-quality metrics from the returned DataFrame using aaanalysis.metrics.comp_kld() or the backend window_identity helper.

Identity-based similarity:

Two filters operate on per-position residue identity of equal-length windows (no alignment needed):

  • max_similarity_to_test — drop sampled windows whose identity to any known test window exceeds the threshold (anti-leakage).

  • max_similarity_within_ref — greedily drop sampled windows whose identity to a previously kept sampled window exceeds the threshold (redundancy reduction). In sample_same_protein(), this filter spans protein boundaries; protein iteration order is randomized under the seed so output depends only on df_seq content + seed, not row order.
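The two filters can be sketched in plain Python as follows. The function names are illustrative stand-ins, not the library's backend helpers, and the sketch ignores the seed-controlled protein ordering described above.

```python
def position_identity(a, b):
    """Fraction of positions at which two equal-length windows agree."""
    if len(a) != len(b):
        raise ValueError("windows must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_to_test(candidates, test_windows, max_similarity):
    """Drop candidates whose identity to any test window exceeds the threshold."""
    return [
        w for w in candidates
        if all(position_identity(w, t) <= max_similarity for t in test_windows)
    ]


def filter_within_ref(candidates, max_similarity):
    """Greedily keep candidates that are not too similar to one already kept."""
    kept = []
    for w in candidates:
        if all(position_identity(w, k) <= max_similarity for k in kept):
            kept.append(w)
    return kept


test_windows = ["ACDEF"]
candidates = ["ACDEG", "KLMNP", "KLMNQ", "WYWYW"]

no_leak = filter_to_test(candidates, test_windows, max_similarity=0.6)
non_redundant = filter_within_ref(no_leak, max_similarity=0.6)
```

Because the within-reference filter is greedy, its result depends on the order in which candidates are visited, which is why a seeded, content-determined iteration order matters.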

Iterative filtering:

When filter_iteratively=True and filtering shrinks the candidate pool below the target, additional draws are performed, up to max_sampling_attempts. If the target is still not met, a warning is emitted and the available samples are returned.
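A minimal sketch of such a re-draw loop is shown below. It is not the library's implementation: sample_with_redraw, its keep predicate, and the batching scheme (draw exactly as many candidates as are still missing) are assumptions made for illustration.

```python
import random
import warnings


def sample_with_redraw(pool, n, keep, max_sampling_attempts=10, seed=None):
    """Draw from pool until n candidates pass keep() or attempts run out."""
    rng = random.Random(seed)
    remaining = list(pool)
    rng.shuffle(remaining)
    kept = []
    attempts = 0
    while len(kept) < n and remaining and attempts < max_sampling_attempts:
        attempts += 1
        need = n - len(kept)
        # Draw only as many candidates as are still missing.
        batch, remaining = remaining[:need], remaining[need:]
        kept.extend(c for c in batch if keep(c))
    if len(kept) < n:
        warnings.warn(
            f"Only {len(kept)} of {n} requested samples passed filtering."
        )
    return kept
```

The function never returns more than n items, and it returns fewer (with a warning) when the pool or the attempt budget is exhausted first.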

Anchoring convention:

Positions in pos_col and the emitted source_position are interpreted as P1-style residue anchors under Schechter–Berger cleavage nomenclature [Rawlings16]. For window length L, the window covers (L - 1) // 2 residues upstream of the anchor, the anchor itself, and L // 2 residues downstream — right-heavy for even L.
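The anchoring arithmetic can be checked with a short sketch. The helpers are hypothetical; the sketch assumes a 1-based anchor and returns None for windows that would run past a sequence end, since boundary handling is not specified here.

```python
def window_bounds(anchor, window_size):
    """Return the 1-based inclusive (start, end) of a window around anchor."""
    n_up = (window_size - 1) // 2   # residues upstream of the anchor
    n_down = window_size // 2       # residues downstream of the anchor
    return anchor - n_up, anchor + n_down


def extract_window(sequence, anchor, window_size):
    """Slice the window out of sequence, or None if it does not fit.

    Returning None at the boundaries is an assumption of this sketch.
    """
    start, end = window_bounds(anchor, window_size)
    if start < 1 or end > len(sequence):
        return None
    return sequence[start - 1:end]


seq = "ACDEFGHIKL"
odd_window = extract_window(seq, anchor=5, window_size=5)   # symmetric
even_window = extract_window(seq, anchor=5, window_size=4)  # right-heavy
```

For L = 5 the anchor F sits in the middle ("DEFGH"); for L = 4 one residue lies upstream and two downstream ("EFGH").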

Methods

sample_different_protein([df_seq, n, ...])

Sample windows from proteins outside the test set (proteins with no test positions).

sample_motif_matched([df_seq, n, ...])

Scan candidate proteins for windows matching a user-supplied PWM (FIMO-equivalent).

sample_same_protein([df_seq, n, ...])

Sample windows from proteins that contain at least one test position.

sample_synthetic([df_seq, n, window_size, ...])

Generate synthetic control windows.

__init__(verbose=True, random_state=None, max_similarity_to_test=None, max_similarity_within_ref=None, filter_iteratively=True, max_sampling_attempts=10)
Parameters:
  • verbose (bool, default=True) – Verbose mode.

  • random_state (int, optional) – Default seed for all sampling methods. A per-call seed overrides it.

  • max_similarity_to_test (float in [0, 1], optional) – Drop sampled windows whose per-position identity to any test window exceeds this threshold.

  • max_similarity_within_ref (float in [0, 1], optional) – Greedily drop sampled windows whose per-position identity to a previously kept sampled window exceeds this threshold.

  • filter_iteratively (bool, default=True) – Iteratively re-draw if filtering reduces the candidate pool below the target.

  • max_sampling_attempts (int, default=10) – Cap on iterative re-draw attempts.