aaanalysis.AAWindowSampler.sample_same_protein
- AAWindowSampler.sample_same_protein(df_seq=None, n=100, window_size=9, pos_col='pos', min_distance_to_pos=None, max_distance_to_pos=None, label_test=1, label_ref=0, role='Negative', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, motif_pwm=None, motif_score_threshold=None, motif_match='in', seed=None)
Sample windows from proteins that contain at least one test position.
- Parameters:
  - df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
  - n (int, default=100) – Maximum total number of sampled windows across all eligible proteins. n is split roughly uniformly across eligible source proteins (each protein gets about n / n_proteins windows); shortfalls from proteins with small candidate pools are redistributed round-robin. Fewer than n windows are returned (with a warning) if the eligible space cannot supply them.
  - window_size (int, default=9) – Length of each sampled window in residues.
  - pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.
  - min_distance_to_pos (int, optional) – Minimum residue distance from the nearest positive on the same protein (a sampled P1 anchor c is admitted only if min(|c - p| for p in positives) >= min_distance_to_pos). None (default) drops this lower bound, so sampled windows are allowed to overlap positive windows.
  - max_distance_to_pos (int, optional) – Maximum residue distance from the nearest positive on the same protein (a sampled P1 anchor c is admitted only if min(|c - p| for p in positives) <= max_distance_to_pos). None (default) drops this upper bound, so sampled windows may sit anywhere on the protein.
  - label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
  - label_ref (int or float, default=0) – Label assigned to sampled reference positions / rows.
  - role (str, default='Negative') – Role tag stored in the output's role column.
  - output_mode ({'segments', 'sequences'}, default='segments') – Output schema. See Notes.
  - aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.
  - context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.
  - context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.
  - motif_pwm (pd.DataFrame, optional) – Position-weight matrix of shape (window_size, 20) whose columns are the 20 canonical AA letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required together with motif_score_threshold when motif filtering is desired.
  - motif_score_threshold (float, optional) – PWM score threshold; required when motif_pwm is set.
  - motif_match ({'in', 'out'}, default='in') – 'in' keeps windows with score >= threshold; 'out' keeps the rest.
  - seed (int, optional) – Per-call seed; falls back to the class-level random_state.
- Returns:
  df_seq – Sampled windows; one row per window with entry, sequence, role, strategy, and entry_win columns.
- Return type:
  pd.DataFrame
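The admission rule given for min_distance_to_pos and max_distance_to_pos can be illustrated with a short, self-contained sketch. The helper is_admitted is hypothetical and not part of aaanalysis; it only restates the min(|c - p|) rule from the parameter descriptions and ignores everything else the sampler does (for example, whether a full window fits around the anchor).

```python
def is_admitted(center, positives, min_distance_to_pos=None, max_distance_to_pos=None):
    """Return True if a 1-based candidate anchor passes the distance band.

    Hypothetical helper: restates the documented rule, nothing more.
    """
    # Distance from the candidate anchor to the nearest positive on the protein
    nearest = min(abs(center - p) for p in positives)
    # None disables the corresponding bound
    if min_distance_to_pos is not None and nearest < min_distance_to_pos:
        return False
    if max_distance_to_pos is not None and nearest > max_distance_to_pos:
        return False
    return True


positives = [5, 25]          # 1-based positive positions on one protein
candidates = range(1, 41)    # every residue of a 40-residue protein

far_only = [c for c in candidates if is_admitted(c, positives, min_distance_to_pos=8)]
local_band = [c for c in candidates
              if is_admitted(c, positives, min_distance_to_pos=2, max_distance_to_pos=3)]
print(far_only)
print(local_band)
```

With positives at 5 and 25 and min_distance_to_pos=8, only anchors 13 to 17 and 33 to 40 pass the band; adding a finite max_distance_to_pos instead restricts candidates to a ring around each positive.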
Notes

Each row of df_seq whose pos_col cell is a non-empty list / tuple / array of 1-based integer positions is a "positive" row; rows with empty / None / NaN cells are skipped. Sampled windows are drawn from the same proteins as the positives; the positive windows themselves drive the max_similarity_to_test filter. The (min_distance_to_pos, max_distance_to_pos) band is exposed only on this method; sample_different_protein() and sample_motif_matched() sample from proteins with no listed positives, so the band has nothing to act on.

With the default None / None band, sampled centers can sit directly on or adjacent to positive anchors, producing windows that overlap positive windows by up to window_size - 1 residues. For hard-negative-style sampling that excludes positional overlap, set min_distance_to_pos=window_size; to constrain sampled windows to a defined neighborhood of positives (e.g. local hard negatives), pair with a finite max_distance_to_pos. Content-level overlap is controlled separately by max_similarity_to_test.

Protein iteration order is randomized under the seed; output is independent of df_seq row order.

output_mode='segments' returns one row per sampled window with schema [entry_win, entry, sequence, window, source_position, label, role, strategy].

output_mode='sequences' returns one row per source protein with a labels list of length len(sequence) carrying label_test at positives, label_ref at sampled positions, and None elsewhere.

Examples
Draw fixed-length amino-acid windows from proteins that contain at least one labeled position. The sampled windows are commonly used as negative training rows alongside the positives (PU-learning, hard-negative mining). Positions in pos_col are interpreted as P1-style anchors.

    import aaanalysis as aa
    import pandas as pd

    aa.options["verbose"] = False
    df_seq = pd.DataFrame({
        "entry": ["P1", "P2"],
        "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2, "YWVTSRQPNMLKIHGFEDCA" * 2],
        "pos": [[5, 25], [15]],
    })
    sampler = aa.AAWindowSampler(random_state=0)

First call: anchor schema. Each row is one sampled window with the eight-column segments schema; entry_win = <entry>_<start>-<end> (1-based inclusive) is globally unique by construction, and source_position is the 1-based P1 anchor that drove the sample.

    df = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=6,
                                     window_size=5, min_distance_to_pos=2, seed=0)
    aa.display_df(df=df, show_shape=True)

    DataFrame shape: (6, 8)
       entry_win  entry  sequence                           window  source_position  label  role      strategy
    1  P1_30-34   P1     ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY  LMNPQ   32               0      Negative  same_protein
    2  P1_10-14   P1     ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY  LMNPQ   12               0      Negative  same_protein
    3  P1_16-20   P1     ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY  STVWY   18               0      Negative  same_protein
    4  P1_17-21   P1     ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY  TVWYA   19               0      Negative  same_protein
    5  P2_24-28   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  TSRQP   26               0      Negative  same_protein
    6  P2_29-33   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  NMLKI   31               0      Negative  same_protein

pos_col identifies the positive rows: cells holding a list / tuple / array of 1-based positions. Rows with empty / None / NaN cells are skipped. Sampled windows come from the same proteins as the positives, but at residues away from those positives (see min_distance_to_pos).

n sets the total number of sampled windows across all eligible proteins, and window_size sets the residue length of each window:

    df = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=9,
                                     window_size=7, seed=0)
    aa.display_df(df=df[["entry_win", "source_position", "window"]], show_shape=True)

    DataFrame shape: (9, 3)
       entry_win  source_position  window
    1  P1_26-32   29               GHIKLMN
    2  P1_8-14    11               IKLMNPQ
    3  P1_14-20   17               QRSTVWY
    4  P1_15-21   18               RSTVWYA
    5  P1_34-40   37               QRSTVWY
    6  P1_6-12    9                GHIKLMN
    7  P2_22-28   25               WVTSRQP
    8  P2_27-33   30               QPNMLKI
    9  P2_8-14    11               PNMLKIH

min_distance_to_pos sets the minimum residue distance between a sampled window's anchor and any positive on the same protein. Larger values push samples further away from the labeled positions:

    df = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=6,
                                     window_size=5, min_distance_to_pos=8, seed=0)
    aa.display_df(df=df[["entry", "source_position", "window"]], show_shape=True)

    DataFrame shape: (6, 3)
       entry  source_position  window
    1  P1     16               QRSTV
    2  P1     35               PQRST
    3  P1     34               NPQRS
    4  P1     37               RSTVW
    5  P2     30               PNMLK
    6  P2     29               QPNML

Two output schemas:

- 'segments' (default): one row per sampled window.
- 'sequences': one row per source protein, with a per-residue labels list of length len(sequence) carrying label_test at known positives, label_ref at sampled positions, and None elsewhere. This layout is ready to use as the target vector for a sliding-window classifier.

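The 'sequences' layout can be made concrete with a small standalone sketch. The build_labels helper is hypothetical and not part of aaanalysis, and the sampled positions are made-up inputs; the precedence used when a sampled position coincides with a positive (the positive wins) is an assumption, since the documentation does not state it.

```python
def build_labels(sequence, positives, sampled, label_test=1, label_ref=0):
    """Build a per-residue label list from 1-based positions.

    Hypothetical helper illustrating the documented 'sequences' layout.
    """
    # Start with "unlabeled" for every residue
    labels = [None] * len(sequence)
    # Sampled reference positions get label_ref
    for pos in sampled:
        labels[pos - 1] = label_ref
    # Assumption: known positives are written last and therefore take precedence
    for pos in positives:
        labels[pos - 1] = label_test
    return labels


sequence = "ACDEFGHIKLMNPQRSTVWY" * 2
labels = build_labels(sequence, positives=[5, 25], sampled=[8, 17])

# Report the labeled residues as (1-based position, label) pairs
labeled = [(i + 1, lab) for i, lab in enumerate(labels) if lab is not None]
print(len(labels), labeled)
```

The list has one entry per residue, so it can be aligned directly with per-residue features when training a sliding-window model.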
    df = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=3,
                                     window_size=5, output_mode="sequences", seed=0)
    aa.display_df(df=df, show_shape=True)

    DataFrame shape: (2, 3)
       entry  sequence                           labels
    1  P1     ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY  [None, None, None, None, 1, None, None, 0, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, 1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
    2  P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, 1, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

Semantic tags. Defaults assume PU-learning / hard-negative mining: role='Negative', label_test=1 (applied to positives in output_mode='sequences'), label_ref=0 (applied to sampled reference rows / positions):

    df = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=3,
                                     window_size=5, role="background", label_ref=-1, seed=0)
    aa.display_df(df=df[["entry_win", "role", "label"]], show_shape=True)

    DataFrame shape: (3, 3)
       entry_win  role        label
    1  P1_6-10    background  -1
    2  P1_15-19   background  -1
    3  P2_20-24   background  -1

Filter eligible residues by per-position annotation (e.g. topology, disorder, secondary structure).
aa_context_col is a df_seq column whose cells are strings (or sequences) of single-character tags with one tag per residue. context_in whitelists tag values; context_out blacklists them. The three are validated jointly: providing context_in / context_out without aa_context_col raises an error.

    # Per-residue topology tags: "M" and "T", one character per residue
    df_seq_topo = df_seq.assign(topo=["MMMMMMMMMMTTTTTTTTTT" * 2, "TTTTTTTTTTMMMMMMMMMM" * 2])
    df = sampler.sample_same_protein(df_seq=df_seq_topo, pos_col="pos", n=6, window_size=3,
                                     aa_context_col="topo", context_in="T", seed=0)
    aa.display_df(df=df[["entry", "source_position", "window"]], show_shape=True)

    DataFrame shape: (6, 3)
       entry  source_position  window
    1  P1     13               NPQ
    2  P1     31               LMN
    3  P1     14               PQR
    4  P1     35               QRS
    5  P2     6                SRQ
    6  P2     7                RQP

Optional PWM-based filter on the candidate pool.
motif_pwm is a position-weight matrix of shape (window_size, 20):

- pd.DataFrame (preferred, safer): columns are the 20 canonical amino acids in any order; reindexed internally.
- np.ndarray: columns must be in alphabetical order (ACDEFGHIKLMNPQRSTVWY); the validator cannot detect a wrong order.

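The column-order caveat can be made concrete with a small pure-Python sketch. It assumes the window score is the sum of the matrix entries for the observed residue at each position; reorder_columns, pwm_score, and CANONICAL_AA are hypothetical names used for illustration, not aaanalysis API.

```python
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"  # alphabetical one-letter codes


def reorder_columns(matrix, columns):
    """Reorder every PWM row so that its columns follow CANONICAL_AA."""
    col_of = {aa: j for j, aa in enumerate(columns)}
    return [[row[col_of[aa]] for aa in CANONICAL_AA] for row in matrix]


def pwm_score(window, matrix):
    """Sum the entry of the observed residue at each position.

    The matrix is unlabeled, so its columns are assumed to be in canonical order.
    """
    return sum(matrix[i][CANONICAL_AA.index(aa)] for i, aa in enumerate(window))


# A 5 x 20 matrix preferring A at every position, stored in reverse column order
labels = "YWVTSRQPNMLKIHGFEDCA"
raw_matrix = [[1.0 if aa == "A" else 0.0 for aa in labels] for _ in range(5)]

# Unlabeled use: the "A" column is silently read as "Y", so the score is wrong
score_unlabeled = pwm_score("AAQPA", raw_matrix)
# Labeled use: reordering by column label restores the intended score
score_labeled = pwm_score("AAQPA", reorder_columns(raw_matrix, labels))
print(score_unlabeled, score_labeled)
```

Neither call raises an error, which is why passing a labeled DataFrame is the safer choice: only the column labels make the intended order recoverable.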
motif_score_threshold is the score cutoff (sum of per-position values). motif_match='in' keeps windows with score >= threshold; 'out' keeps the rest. All three are validated jointly:

    aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
    pwm = pd.DataFrame(0.0, index=range(5), columns=aa_cols)
    pwm["A"] = 1.0  # PWM preferring A at every position
    df = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=6, window_size=5,
                                     motif_pwm=pwm, motif_score_threshold=2.0,
                                     motif_match="in", seed=0)
    aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

    DataFrame shape: (0, 2)
       entry_win  window

The result is empty because the A residues in both example sequences are 20 positions apart, so no 5-residue window contains two of them and no window reaches the threshold of 2.0.

Constructor-level knobs on AAWindowSampler itself (shared across all four sampling methods):

- max_similarity_to_test – drop sampled windows whose per-position identity to any test window (drawn from the positives) exceeds the threshold. Anti-leakage filter.
- max_similarity_within_ref – greedily drop sampled windows whose per-position identity to a previously kept sampled window exceeds the threshold. Redundancy reduction.

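The two similarity filters can be sketched in a few lines of plain Python. This is only an illustration under one assumption: "per-position identity" is taken to mean the fraction of aligned positions with identical residues in two equal-length windows. The helpers identity and greedy_filter are hypothetical and do not reproduce the library's implementation.

```python
def identity(win_a, win_b):
    """Fraction of aligned positions with identical residues (assumed definition)."""
    if len(win_a) != len(win_b):
        raise ValueError("windows must have the same length")
    return sum(a == b for a, b in zip(win_a, win_b)) / len(win_a)


def greedy_filter(windows, threshold):
    """Keep a window only if it is not too similar to any window kept before it."""
    kept = []
    for window in windows:
        # A window is dropped when its identity to a kept window exceeds the threshold
        if all(identity(window, k) <= threshold for k in kept):
            kept.append(window)
    return kept


candidates = ["LMNPQ", "LMNPQ", "STVWY", "TVWYA", "LMNPA"]
kept_loose = greedy_filter(candidates, threshold=0.8)
kept_strict = greedy_filter(candidates, threshold=0.6)
print(kept_loose)
print(kept_strict)
```

Because the filter is greedy, the result depends on the order in which windows are visited: an exact duplicate of an earlier window is always dropped, while a window sharing four of five residues survives a threshold of 0.8 but not one of 0.6.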
Set both thresholds at construction; iterative re-draw is governed by filter_iteratively (default True) and capped by max_sampling_attempts (default 10):

    strict = aa.AAWindowSampler(random_state=0, max_similarity_to_test=0.6,
                                max_similarity_within_ref=0.8)
    df = strict.sample_same_protein(df_seq=df_seq, pos_col="pos", n=6,
                                    window_size=5, min_distance_to_pos=2, seed=0)
    aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

    DataFrame shape: (6, 2)
       entry_win  window
    1  P1_30-34   LMNPQ
    2  P1_16-20   STVWY
    3  P1_17-21   TVWYA
    4  P1_8-12    IKLMN
    5  P2_24-28   TSRQP
    6  P2_29-33   NMLKI

seed is a per-call seed that falls back to the class-level random_state set at construction. Protein iteration order is randomized under the seed, so output depends only on df_seq content and seed, not on row order.

    # Two calls with the same data and seed should return the same windows
    df_a = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=6, window_size=5, seed=7)
    df_b = sampler.sample_same_protein(df_seq=df_seq, pos_col="pos", n=6, window_size=5, seed=7)
    print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))

    deterministic: True
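One way to obtain output that depends only on content and seed is to put the proteins into a canonical order before a seeded shuffle. The sketch below illustrates that idea with the standard library; it is an assumption about a possible mechanism, not a description of how AAWindowSampler is implemented, and seeded_protein_order is a hypothetical helper.

```python
import random


def seeded_protein_order(entries, seed):
    """Return a shuffled protein order that ignores the input row order."""
    # Sorting first removes any dependence on the order of the input rows
    ordered = sorted(entries)
    # A dedicated generator keeps the shuffle reproducible for a given seed
    rng = random.Random(seed)
    rng.shuffle(ordered)
    return ordered


order_a = seeded_protein_order(["P1", "P2", "P3", "P4"], seed=7)
order_b = seeded_protein_order(["P4", "P2", "P1", "P3"], seed=7)
print("row-order independent:", order_a == order_b)
```

The same entries with the same seed always yield the same iteration order, regardless of how the rows were arranged in the input.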