aaanalysis.AAWindowSampler.sample_motif_matched

AAWindowSampler.sample_motif_matched(df_seq=None, n=100, window_size=9, motif_pwm=None, motif_score_threshold=None, pos_col='pos', label_test=1, label_ref=0, role='Negative', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, seed=None)

Scan candidate proteins for windows matching a user-supplied PWM (FIMO-equivalent).

Useful for hard-negative mining: candidates that look like positives at the local-motif level but were not labeled positive.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.

  • n (int, default=100) – Maximum number of motif-matched windows to return.

  • window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.

  • motif_pwm (pd.DataFrame) – Position-weight matrix of shape (window_size, 20) whose columns are the 20 canonical AA letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.

  • motif_score_threshold (float) – Score threshold (sum of per-position PWM values). Required.

  • pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.

  • label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.

  • label_ref (int or float, default=0) – Label assigned to sampled motif-matched positions / rows.

  • role (str, default='Negative') – Role tag stored in the output’s role column.

  • output_mode ({'segments', 'sequences'}, default='segments') – Output schema; see sample_same_protein() Notes.

  • aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.

  • context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.

  • context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.

  • seed (int, optional) – Per-call seed; falls back to the class-level random_state.
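The constraints on window_size and motif_pwm listed above can be checked before calling the method. The following is a minimal sketch, not the library's own validation: check_and_order_pwm and CANONICAL_AA are hypothetical names, the PWM is represented as one dict per window position (for a DataFrame, pwm.to_dict("records") gives this form), and the alphabetical column order is an assumption standing in for ut.LIST_CANONICAL_AA.

```python
# Assumed canonical order; the library reindexes to ut.LIST_CANONICAL_AA instead.
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")


def check_and_order_pwm(pwm_rows, window_size):
    """Validate a PWM given as one {amino_acid: weight} dict per window position.

    Returns the weights as a list of rows, each ordered like CANONICAL_AA.
    """
    if len(pwm_rows) != window_size:
        raise ValueError(
            f"window_size={window_size} but the PWM has {len(pwm_rows)} rows")
    expected = set(CANONICAL_AA)
    for i, row in enumerate(pwm_rows):
        if set(row) != expected:
            raise ValueError(
                f"PWM row {i} must have exactly the 20 canonical letters")
    # Reorder every row into the canonical column order.
    return [[float(row[aa]) for aa in CANONICAL_AA] for row in pwm_rows]


# Columns supplied in a different order are accepted and reordered.
shuffled = list(reversed(CANONICAL_AA))
pwm_rows = [{aa: (1.0 if aa == "A" else 0.0) for aa in shuffled}
            for _ in range(3)]
ordered = check_and_order_pwm(pwm_rows, window_size=3)
```

Calling check_and_order_pwm(pwm_rows, window_size=9) with the same three-row PWM raises a ValueError, which mirrors the requirement that window_size equal the first dimension of motif_pwm.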

Returns:

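df_seq – Sampled windows. With output_mode='segments', one row per window with entry_win, entry, sequence, window, source_position, label, role, strategy, and motif_score columns. With output_mode='sequences', one row per protein with entry, sequence, and labels columns.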

Return type:

pd.DataFrame

Notes

Rows of df_seq whose pos_col cell is a non-empty list / tuple / array of 1-based positions are positive rows — they are excluded from the scan and contribute their windows only to the max_similarity_to_test filter. Rows with empty / None / NaN cells form the candidate pool, where every position with a fully-fitting window is scored against motif_pwm (sum of per-position values; non-canonical residues contribute zero). Positions with score >= motif_score_threshold are returned, ranked by descending score among those that survive the identity / context filters, and capped at n.
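The scoring rule described above (sum of per-position PWM values, zero for non-canonical residues, threshold, descending rank, cap at n) can be sketched for a single sequence in plain Python. This is an illustration under stated assumptions, not the library's implementation: scan_motif is a hypothetical helper, the identity and context filters are omitted, and the tie-breaking order among equal scores is an assumption (here, order of appearance in the sequence).

```python
def scan_motif(sequence, pwm_rows, threshold, n):
    """Score every fully fitting window and return the top hits.

    pwm_rows: one {amino_acid: weight} dict per window position.
    Returns (start, stop, window, score) tuples, 1-based and inclusive.
    """
    size = len(pwm_rows)
    hits = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        # Residues missing from the PWM (e.g. X, B, Z) contribute zero.
        score = sum(row.get(aa, 0.0) for row, aa in zip(pwm_rows, window))
        if score >= threshold:
            hits.append((start + 1, start + size, window, score))
    # Rank by descending score; sorted() is stable, so ties keep sequence order.
    hits = sorted(hits, key=lambda hit: hit[3], reverse=True)
    return hits[:n]


# A length-3 PWM that rewards A at every position.
pwm_rows = [{"A": 1.0} for _ in range(3)]
hits = scan_motif("GGGGAAAGGGGAAAGGGGAA", pwm_rows, threshold=2.5, n=5)
# hits == [(5, 7, 'AAA', 3.0), (12, 14, 'AAA', 3.0)]
```

Applying the threshold before the cap means n limits only the windows that already pass motif_score_threshold, so fewer than n hits can be returned.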

Unlike sample_same_protein() and sample_different_protein(), this method does not accept motif_match — it always returns high-scoring matches. For the inverse operation (“sample windows that do NOT match a motif”), call sample_different_protein() with the same motif_pwm and motif_match='out'.

Examples

Scan candidate proteins (rows of df_seq with no labeled positions) for windows matching a user-supplied PWM, and return the top-scoring hits. Useful for hard-negative mining: unlabeled windows that look biochemically similar to the positives at the local-motif level.

import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2", "P3"],
    "sequence": ["ACDEFGHIK",
                 "AAACDEFGHIKLMNPQRSTV",
                 "GGGGAAAGGGGAAAGGGGAA"],
    "pos":      [[5], [], []],
})
sampler = aa.AAWindowSampler(random_state=0)

aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
pwm = pd.DataFrame(0.0, index=range(3), columns=aa_cols)
pwm["A"] = 1.0  # prefers A at every position of a length-3 window
aa.display_df(df=pwm, show_shape=True)
DataFrame shape: (3, 20)
  A C D E F G H I K L M N P Q R S T V W Y
1 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
2 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
3 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000

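The first call establishes the baseline output schema. In 'segments' mode the output carries an extra motif_score column with the raw PWM score; results are ranked by descending score and capped at n. Only three windows reach the threshold of 2.5 here, so fewer than n=5 rows are returned.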

df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                  n=5, window_size=3,
                                  motif_pwm=pwm, motif_score_threshold=2.5,
                                  seed=0)
aa.display_df(df=df, show_shape=True)
DataFrame shape: (3, 9)
  entry_win entry sequence window source_position label role strategy motif_score
1 P2_1-3 P2 AAACDEFGHIKLMNPQRSTV AAA 2 0 Negative motif_matched 3.000000
2 P3_5-7 P3 GGGGAAAGGGGAAAGGGGAA AAA 6 0 Negative motif_matched 3.000000
3 P3_12-14 P3 GGGGAAAGGGGAAAGGGGAA AAA 13 0 Negative motif_matched 3.000000

As in sample_different_protein, rows of df_seq play one of two roles: rows with labeled positions are excluded from the scan (and feed the anti-leakage filter), while rows with empty pos_col cells form the candidate pool. Every position with a fully-fitting window in the candidate pool is scored against motif_pwm.
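The split between positive rows and the candidate pool can be sketched as follows. This is an illustrative approximation of the rule stated in the Notes (non-empty list / tuple / array versus empty, None, or NaN), not the library's code; has_positions is a hypothetical helper, and the rows are plain dicts such as those produced by df_seq.to_dict("records").

```python
import math


def has_positions(cell):
    """True if a pos_col cell holds a non-empty list / tuple / array."""
    if cell is None:
        return False
    if isinstance(cell, float) and math.isnan(cell):
        return False
    return len(cell) > 0


rows = [
    {"entry": "P1", "sequence": "ACDEFGHIK", "pos": [5]},
    {"entry": "P2", "sequence": "AAACDEFGHIKLMNPQRSTV", "pos": []},
    {"entry": "P3", "sequence": "GGGGAAAGGGGAAAGGGGAA", "pos": None},
]
# Positive rows are skipped by the scan; the rest form the candidate pool.
positives = [row["entry"] for row in rows if has_positions(row["pos"])]
candidates = [row["entry"] for row in rows if not has_positions(row["pos"])]
# positives == ['P1'], candidates == ['P2', 'P3']
```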

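n caps the number of motif-matched windows returned (after threshold and ranking). window_size must equal the first dimension of motif_pwm. Below, lowering the threshold to 2.0 also admits windows with two A residues, but only the n=2 top-scoring windows are returned: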

df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                  n=2, window_size=3,
                                  motif_pwm=pwm, motif_score_threshold=2.0,
                                  seed=0)
aa.display_df(df=df[["entry_win", "window", "motif_score"]], show_shape=True)
DataFrame shape: (2, 3)
  entry_win window motif_score
1 P2_1-3 AAA 3.000000
2 P3_5-7 AAA 3.000000

Two output schemas:

  • 'segments' (default) — one row per sampled window, with an extra motif_score column.

  • 'sequences' — one row per source protein with a per-residue labels list.

df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                  n=3, window_size=3,
                                  motif_pwm=pwm, motif_score_threshold=2.5,
                                  output_mode="sequences", seed=0)
aa.display_df(df=df, show_shape=True)
DataFrame shape: (3, 3)
  entry sequence labels
1 P1 ACDEFGHIK [None, None, None, None, 1, None, None, None, None]
2 P2 AAACDEFGHIKLMNPQRSTV [None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
3 P3 GGGGAAAGGGGAAAGGGGAA [None, None, None, None, None, 0, None, None, None, None, None, None, 0, None, None, None, None, None, None, None]
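To recover 1-based positions from a per-residue labels list like those shown above, a small helper is enough. labels_to_positions is a hypothetical name introduced here for illustration; it assumes that unlabeled residues are None, as in the output above.

```python
def labels_to_positions(labels, label):
    """Return the 1-based positions whose per-residue label equals `label`."""
    return [i + 1 for i, value in enumerate(labels) if value == label]


# Labels for P3 as in the table above: label_ref=0 at positions 6 and 13.
labels_p3 = [None] * 20
labels_p3[5] = 0
labels_p3[12] = 0
ref_positions = labels_to_positions(labels_p3, label=0)
# ref_positions == [6, 13]
```

The recovered positions match the source_position values reported for P3 in 'segments' mode.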

Semantic tags. The defaults (role='Negative', label_test=1, label_ref=0) match hard-negative mining; role can be overridden, here with role='hard_negative':

df = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                  n=3, window_size=3,
                                  motif_pwm=pwm, motif_score_threshold=2.5,
                                  role="hard_negative", seed=0)
aa.display_df(df=df[["entry_win", "role", "label", "motif_score"]], show_shape=True)
DataFrame shape: (3, 4)
  entry_win role label motif_score
1 P2_1-3 hard_negative 0 3.000000
2 P3_5-7 hard_negative 0 3.000000
3 P3_12-14 hard_negative 0 3.000000

Note the asymmetry with the other sampling methods: sample_motif_matched always returns high-scoring matches and does not accept motif_match. For the inverse (windows that explicitly do not match a motif), call sample_different_protein with the same motif_pwm and motif_match='out'.

See aws_sample_same_protein for the class-level anti-leakage / redundancy filters (max_similarity_to_test, max_similarity_within_ref), which apply identically here.

seed sets a per-call seed and falls back to the class-level random_state. Repeated calls with the same seed return the same windows:

df_a = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                    n=3, window_size=3,
                                    motif_pwm=pwm, motif_score_threshold=2.5,
                                    seed=21)
df_b = sampler.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                    n=3, window_size=3,
                                    motif_pwm=pwm, motif_score_threshold=2.5,
                                    seed=21)
print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))
deterministic: True