aaanalysis.AAWindowSampler.sample_motif_matched
- AAWindowSampler.sample_motif_matched(df_seq=None, n=100, window_size=9, motif_pwm=None, motif_score_threshold=None, pos_col='pos', label_test=1, label_ref=0, role='Negative', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, seed=None)
Scan candidate proteins for windows matching a user-supplied PWM (FIMO-equivalent).
Useful for hard-negative mining: candidates that look like positives at the local-motif level but were not labeled positive.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum number of motif-matched windows to return.
window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.
motif_pwm (pd.DataFrame) – Position-weight matrix of shape (window_size, 20) whose columns are the 20 canonical AA letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.
motif_score_threshold (float) – Score threshold (sum of per-position PWM values). Required.
pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled motif-matched positions / rows.
role (str, default='Negative') – Role tag stored in the output's role column.
output_mode ({'segments', 'sequences'}, default='segments') – Output schema; see sample_same_protein() Notes.
aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.
context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.
context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.
- Returns:
df_seq – Sampled windows; one row per window with entry, sequence, role, strategy, and entry_win columns.
- Return type:
pd.DataFrame
Notes
Rows of df_seq whose pos_col cell is a non-empty list / tuple / array of 1-based positions are positive rows: they are excluded from the scan and contribute their windows only to the max_similarity_to_test filter. Rows with empty / None / NaN cells form the candidate pool, where every position with a fully-fitting window is scored against motif_pwm (sum of per-position values; non-canonical residues contribute zero). Positions with score >= motif_score_threshold are returned, ranked by descending score among those that survive the identity / context filters, and capped at n.

Unlike sample_same_protein() and sample_different_protein(), this method does not accept motif_match; it always returns high-scoring matches. For the inverse operation ("sample windows that do NOT match a motif"), call sample_different_protein() with the same motif_pwm and motif_match='out'.

Examples
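The scoring rule described in the Notes (sum of per-position PWM values, zero for non-canonical residues, threshold, descending rank, cap at n) can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: the names score_windows and CANONICAL_AA are hypothetical, and the identity / context filters are left out entirely.

```python
import pandas as pd

# Assumption: the 20 canonical amino acids; the library's own list may be ordered differently.
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")


def score_windows(sequence, pwm, threshold, n=None):
    """Score every fully-fitting window and return hits at or above the threshold.

    Returns a list of (start, window, score) tuples with 1-based start positions,
    sorted by descending score and truncated to n entries when n is given.
    """
    # One dict per PWM row (window position), mapping residue -> weight.
    rows = pwm.reindex(columns=CANONICAL_AA).to_dict("records")
    size = len(rows)
    hits = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        # Residues missing from the PWM columns (e.g. 'X') contribute zero.
        score = sum(row.get(aa, 0.0) for row, aa in zip(rows, window))
        if score >= threshold:
            hits.append((start + 1, window, float(score)))
    # list.sort is stable, so ties keep their order of appearance in the sequence.
    hits.sort(key=lambda hit: hit[2], reverse=True)
    return hits if n is None else hits[:n]


# Toy PWM: weight 1.0 for 'A' at each of three positions.
pwm = pd.DataFrame(0.0, index=range(3), columns=CANONICAL_AA)
pwm["A"] = 1.0
hits = score_windows("GGGGAAAGGGGAAAGGGGAA", pwm, threshold=2.5)
print(hits)  # [(5, 'AAA', 3.0), (12, 'AAA', 3.0)]
```

The two hits start at positions 5 and 12, which is where the only fully matching AAA windows sit in this sequence.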
Scan candidate proteins (rows of df_seq with no labeled positions) for windows matching a user-supplied PWM, and return the top-scoring hits. Useful for hard-negative mining: unlabeled windows that look biochemically similar to the positives at the local-motif level.

    import aaanalysis as aa
    import pandas as pd
    import numpy as np

    aa.options["verbose"] = False
    df_seq = pd.DataFrame({
        "entry": ["P1", "P2", "P3"],
        "sequence": ["ACDEFGHIK", "AAACDEFGHIKLMNPQRSTV", "GGGGAAAGGGGAAAGGGGAA"],
        "pos": [[5], [], []],
    })
    sampler = aa.AAWindowSampler(random_state=0)
    aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
    pwm = pd.DataFrame(0.0, index=range(3), columns=aa_cols)
    pwm["A"] = 1.0  # prefers A at every position of a length-3 window
    aa.display_df(df=pwm, show_shape=True)
    DataFrame shape: (3, 20)
      A         C         D         E         F         G         H         I         K         L         M         N         P         Q         R         S         T         V         W         Y
    1 1.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
    2 1.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000
    3 1.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000

First call: anchor schema. The segments schema is augmented with a motif_score column carrying the raw PWM score; results are ranked by descending score and capped at n.

    df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos", n=5, window_size=3,
                                      motif_pwm=pwm, motif_score_threshold=2.5, seed=0)
    aa.display_df(df=df, show_shape=True)
    DataFrame shape: (3, 9)
      entry_win  entry  sequence              window  source_position  label  role      strategy       motif_score
    1 P2_1-3     P2     AAACDEFGHIKLMNPQRSTV  AAA     2                0      Negative  motif_matched  3.000000
    2 P3_5-7     P3     GGGGAAAGGGGAAAGGGGAA  AAA     6                0      Negative  motif_matched  3.000000
    3 P3_12-14   P3     GGGGAAAGGGGAAAGGGGAA  AAA     13               0      Negative  motif_matched  3.000000

Same dual role as sample_different_protein: rows with positives are excluded from the scan (and feed the anti-leakage filter), while rows with empty pos_col cells form the candidate pool. Every position with a fully-fitting window in the candidate pool is scored against motif_pwm.

n caps the number of motif-matched windows returned (after threshold and ranking). window_size must equal the first dimension of motif_pwm:

    df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos", n=2, window_size=3,
                                      motif_pwm=pwm, motif_score_threshold=2.0, seed=0)
    aa.display_df(df=df[["entry_win", "window", "motif_score"]], show_shape=True)
    DataFrame shape: (2, 3)
      entry_win  window  motif_score
    1 P2_1-3     AAA     3.000000
    2 P3_5-7     AAA     3.000000

Two output schemas:
- 'segments' (default): one row per sampled window, with an extra motif_score column.
- 'sequences': one row per source protein with a per-residue labels list.

    df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos", n=3, window_size=3,
                                      motif_pwm=pwm, motif_score_threshold=2.5,
                                      output_mode="sequences", seed=0)
    aa.display_df(df=df, show_shape=True)
    DataFrame shape: (3, 3)
      entry  sequence              labels
    1 P1     ACDEFGHIK             [None, None, None, None, 1, None, None, None, None]
    2 P2     AAACDEFGHIKLMNPQRSTV  [None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
    3 P3     GGGGAAAGGGGAAAGGGGAA  [None, None, None, None, None, 0, None, None, None, None, None, None, 0, None, None, None, None, None, None, None]

Semantic tags. Defaults match hard-negative mining: role='Negative', label_test=1, label_ref=0:

    df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos", n=3, window_size=3,
                                      motif_pwm=pwm, motif_score_threshold=2.5,
                                      role="hard_negative", seed=0)
    aa.display_df(df=df[["entry_win", "role", "label", "motif_score"]], show_shape=True)
    DataFrame shape: (3, 4)
      entry_win  role           label  motif_score
    1 P2_1-3     hard_negative  0      3.000000
    2 P3_5-7     hard_negative  0      3.000000
    3 P3_12-14   hard_negative  0      3.000000

seed sets a per-call seed and falls back to the class-level random_state. See aws_sample_same_protein for the class-level anti-leakage / redundancy filters (max_similarity_to_test, max_similarity_within_ref), which apply identically here.

Note the asymmetry with the other sampling methods: sample_motif_matched always returns high-scoring matches and does not accept motif_match. For the inverse (windows that explicitly do not match a motif), call sample_different_protein with the same motif_pwm and motif_match='out'.

Two calls with the same seed return the same windows:

    df_a = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos", n=3, window_size=3,
                                        motif_pwm=pwm, motif_score_threshold=2.5, seed=21)
    df_b = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos", n=3, window_size=3,
                                        motif_pwm=pwm, motif_score_threshold=2.5, seed=21)
    print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))

    deterministic: True
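To relate the two output schemas shown above, the following sketch collapses 'segments'-style rows into one per-residue labels list per protein. It assumes that source_position is the 1-based residue that receives the label, which is inferred only from the displayed outputs (window P2_1-3 has source_position 2, and the 'sequences' output carries its 0 at the second residue). The helper segments_to_labels is hypothetical and not part of the library, and positive rows labeled via pos_col are not covered.

```python
import pandas as pd


def segments_to_labels(df_segments):
    """Collapse per-window rows into one row per protein with a per-residue labels list."""
    rows = []
    for (entry, sequence), group in df_segments.groupby(["entry", "sequence"], sort=False):
        labels = [None] * len(sequence)
        positions = group["source_position"].tolist()
        values = group["label"].tolist()
        for pos, value in zip(positions, values):
            labels[pos - 1] = value  # assumption: source_position is 1-based
        rows.append({"entry": entry, "sequence": sequence, "labels": labels})
    return pd.DataFrame(rows, columns=["entry", "sequence", "labels"])


# Rows copied from the first 'segments' output in the examples.
df_segments = pd.DataFrame({
    "entry": ["P2", "P3", "P3"],
    "sequence": ["AAACDEFGHIKLMNPQRSTV", "GGGGAAAGGGGAAAGGGGAA", "GGGGAAAGGGGAAAGGGGAA"],
    "source_position": [2, 6, 13],
    "label": [0, 0, 0],
})
df_labels = segments_to_labels(df_segments)
print(df_labels["labels"].tolist())
```

For P2 and P3 this gives the same labels lists as the output_mode='sequences' call in the examples; in practice, requesting output_mode='sequences' directly is the simpler route.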