AAWindowSampler.sample_motif_matched

AAWindowSampler.sample_motif_matched(df_seq, n=100, window_size=9, *, motif_pwm, motif_score_threshold, pos_col='pos', label_test=1, label_ref=0, role='Negative', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, seed=None)

Scan candidate proteins for windows matching a user-supplied Position Weight Matrix (PWM); a Find Individual Motif Occurrences (FIMO) equivalent.

Useful for hard-negative mining: candidates that look like positives at the local-motif level but were not labeled positive.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum number of motif-matched windows to return.
window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.
motif_pwm (pd.DataFrame) – Position Weight Matrix of shape (window_size, 20) whose columns are the 20 canonical amino acid (AA) letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.
motif_score_threshold (float) – Score threshold (sum of per-position PWM values). Required.
pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled motif-matched positions / rows.
role (str, default='Negative') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') – Output schema; see sample_same_protein() Notes.
aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.
context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.
context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.
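
The shape and column requirements on motif_pwm can be checked before calling the sampler. The sketch below is an illustration only: prepare_pwm and CANONICAL_AA are hypothetical names, not part of the library, and the alphabetical letter order is assumed to mirror ut.LIST_CANONICAL_AA.

```python
import pandas as pd

# Assumption: alphabetical one-letter order, mirroring ut.LIST_CANONICAL_AA
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")

def prepare_pwm(motif_pwm, window_size):
    """Hypothetical helper: validate a PWM and put its columns in canonical order."""
    if motif_pwm.shape[0] != window_size:
        raise ValueError("window_size must equal the first dimension of motif_pwm")
    if sorted(motif_pwm.columns) != sorted(CANONICAL_AA):
        raise ValueError("motif_pwm columns must be the 20 canonical amino acids")
    # Column order is free on input; reindex to one fixed order
    return motif_pwm.reindex(columns=CANONICAL_AA)

# A PWM whose columns arrive in reverse order is still accepted
pwm_shuffled = pd.DataFrame(0.0, index=range(3), columns=CANONICAL_AA[::-1])
pwm_shuffled["A"] = 1.0
pwm_ready = prepare_pwm(pwm_shuffled, window_size=3)
print(list(pwm_ready.columns[:3]))
```

Running such a check up front surfaces a mismatch between window_size and the PWM shape before any scanning starts.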

Returns:

df_seq_out – Sampled windows. With output_mode='segments', one row per window, including entry_win, entry, sequence, window, source_position, role, and strategy columns plus a motif_score column. With output_mode='sequences', one row per source protein with entry, sequence, and labels columns; motif_score is absent.

Return type:

pd.DataFrame

Notes

Rows of df_seq whose pos_col cell is a non-empty list / tuple / array of 1-based positions are positive rows — they are excluded from the scan and contribute their windows only to the max_similarity_to_test filter. Rows with empty / None / NaN cells form the candidate pool, where every position with a fully-fitting window is scored against motif_pwm (sum of per-position values; non-canonical residues contribute zero). Positions with score >= motif_score_threshold are returned, ranked by descending score among those that survive the identity / context filters, and capped at n.
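
The scoring rule described above (sum of per-position values, zero for non-canonical residues, threshold, rank, cap) can be sketched in plain pandas. This is a minimal illustration and not the library's implementation: scan_pwm is a hypothetical helper, the identity / context filters are left out, and ties are kept in scan order, which is an assumption.

```python
import pandas as pd

CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")  # assumed canonical order

def scan_pwm(entry, sequence, pwm, threshold):
    """Hypothetical helper: score every fully-fitting window of one candidate."""
    values = pwm.reindex(columns=CANONICAL_AA).to_numpy()
    col = {res: j for j, res in enumerate(CANONICAL_AA)}
    size = values.shape[0]
    hits = []
    for start in range(len(sequence) - size + 1):
        window = sequence[start:start + size]
        # Sum the per-position PWM values; non-canonical residues contribute zero
        score = sum(values[i, col[res]] for i, res in enumerate(window) if res in col)
        if score >= threshold:
            hits.append({"entry_win": f"{entry}_{start + 1}-{start + size}",
                         "window": window, "motif_score": float(score)})
    return hits

pwm = pd.DataFrame(0.0, index=range(3), columns=CANONICAL_AA)
pwm["A"] = 1.0
candidates = {"P2": "AAACDEFGHIKLMNPQRSTV", "P3": "GGGGAAAGGGGAAAGGGGAA"}

hits = [h for entry, seq in candidates.items() for h in scan_pwm(entry, seq, pwm, 2.5)]
# Rank by descending score (stable sort keeps scan order among ties), then cap at n
n = 5
df_hits = (pd.DataFrame(hits)
           .sort_values("motif_score", ascending=False, kind="stable")
           .head(n)
           .reset_index(drop=True))
print(df_hits)
```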

Unlike sample_same_protein() and sample_different_protein(), this method does not accept motif_match — it always returns high-scoring matches. For the inverse operation (“sample windows that do NOT match a motif”), call sample_different_protein() with the same motif_pwm and motif_match='out'.

Examples

Scan candidate proteins (rows of df_seq with no labeled positions) for windows matching a user-supplied PWM, and return the top-scoring hits. Useful for hard-negative mining: unlabeled windows that look biochemically similar to the positives at the local-motif level.

import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2", "P3"],
    "sequence": ["ACDEFGHIK",
                 "AAACDEFGHIKLMNPQRSTV",
                 "GGGGAAAGGGGAAAGGGGAA"],
    "pos":      [[5], [], []],
})
aaws = aa.AAWindowSampler(random_state=0)

aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
pwm = pd.DataFrame(0.0, index=range(3), columns=aa_cols)
pwm["A"] = 1.0  # prefers A at every position of a length-3 window
aa.display_df(df=pwm, show_shape=True)

DataFrame shape: (3, 20)

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y
1	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
2	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
3	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000

First call — anchor schema. The segments schema is augmented with a motif_score column carrying the raw PWM score; results are ranked by descending score and capped at n.

df = aaws.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                               n=5, window_size=3,
                               motif_pwm=pwm, motif_score_threshold=2.5,
                               seed=0)
aa.display_df(df=df, show_shape=True)

DataFrame shape: (3, 9)

	entry_win	entry	sequence	window	source_position	role	strategy	motif_score
1	P2_1-3	P2	AAACDEFGHIKLMNPQRSTV	AAA	2	Negative	motif_matched	3.000000
2	P3_5-7	P3	GGGGAAAGGGGAAAGGGGAA	AAA	6	Negative	motif_matched	3.000000
3	P3_12-14	P3	GGGGAAAGGGGAAAGGGGAA	AAA	13	Negative	motif_matched	3.000000

df_seq plays the same dual role as in sample_different_protein: rows with positives are excluded from the scan (and feed the anti-leakage filter), while rows with empty pos_col cells form the candidate pool. Every position with a fully-fitting window in the candidate pool is scored against motif_pwm.
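
The split between positive rows and the candidate pool can be pictured with a small pandas sketch. The helper is_positive_row is hypothetical, and treating any non-list scalar as "no labeled positions" is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

def is_positive_row(cell):
    """Hypothetical helper: True for a non-empty list / tuple / array of positions."""
    if isinstance(cell, (list, tuple, np.ndarray)):
        return len(cell) > 0
    return False  # None, NaN, or other scalars: treated as no labeled positions

df_demo = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4"],
    "sequence": ["ACDEFGHIK", "AAACDEFGHIKLMNPQRSTV", "GGGGAAAGGGGAAAGGGGAA", "AAAA"],
    "pos": [[5], [], None, np.nan],
})

mask = df_demo["pos"].apply(is_positive_row)
df_positive = df_demo[mask]     # excluded from the scan
df_candidate = df_demo[~mask]   # scanned against the PWM
print(list(df_positive["entry"]), list(df_candidate["entry"]))
```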

n caps the number of motif-matched windows returned (after threshold and ranking). window_size must equal the first dimension of motif_pwm:

df = aaws.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                               n=2, window_size=3,
                               motif_pwm=pwm, motif_score_threshold=2.0,
                               seed=0)
aa.display_df(df=df[["entry_win", "window", "motif_score"]], show_shape=True)

DataFrame shape: (2, 3)

	entry_win	window	motif_score
1	P2_1-3	AAA	3.000000
2	P3_5-7	AAA	3.000000
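
When choosing motif_score_threshold, it can help to know the range of scores a given PWM can produce. The sketch below computes the lowest and highest achievable window score for the example PWM and derives a threshold as a fraction of that range; the 0.8 fraction is purely illustrative and not a library default.

```python
import pandas as pd

aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
pwm = pd.DataFrame(0.0, index=range(3), columns=aa_cols)
pwm["A"] = 1.0

# Best and worst canonical residue at each position, summed over positions
max_score = float(pwm.max(axis=1).sum())
min_score = float(pwm.min(axis=1).sum())

fraction = 0.8  # illustrative choice only
threshold = min_score + fraction * (max_score - min_score)
print(min_score, max_score, round(threshold, 2))
```

For this PWM the score range is 0 to 3, so a threshold of 2.5 keeps only perfect AAA windows, while 2.0 also admits windows with two alanines.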

Two output schemas:

'segments' (default) — one row per sampled window, with an extra motif_score column.
'sequences' — one row per source protein with a per-residue labels list.

df = aaws.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                               n=3, window_size=3,
                               motif_pwm=pwm, motif_score_threshold=2.5,
                               output_mode="sequences", seed=0)
aa.display_df(df=df, show_shape=True)

DataFrame shape: (3, 3)

	entry	sequence	labels
1	P1	ACDEFGHIK	[None, None, None, None, 1, None, None, None, None]
2	P2	AAACDEFGHIKLMNPQRSTV	[None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
3	P3	GGGGAAAGGGGAAAGGGGAA	[None, None, None, None, None, 0, None, None, None, None, None, None, 0, None, None, None, None, None, None, None]
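
The per-residue labels lists shown above can be reproduced by hand, which makes the 1-based indexing explicit. The function label_track is a hypothetical helper written for this illustration; it is not the library's code.

```python
def label_track(seq_len, positive_pos=(), sampled_pos=(), label_test=1, label_ref=0):
    """Hypothetical helper: build a per-residue label list from 1-based positions."""
    labels = [None] * seq_len
    for p in sampled_pos:
        labels[p - 1] = label_ref   # sampled motif-matched position
    for p in positive_pos:
        labels[p - 1] = label_test  # labeled positive position
    return labels

print(label_track(9, positive_pos=[5]))      # P1: positive at position 5
print(label_track(20, sampled_pos=[2]))      # P2: hit anchored at position 2
print(label_track(20, sampled_pos=[6, 13]))  # P3: hits anchored at positions 6 and 13
```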

Semantic tags. The defaults suit hard-negative mining (role='Negative', label_test=1, label_ref=0); the call below overrides role:

df = aaws.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                               n=3, window_size=3,
                               motif_pwm=pwm, motif_score_threshold=2.5,
                               role="hard_negative", seed=0)
aa.display_df(df=df[["entry_win", "role", "label", "motif_score"]], show_shape=True)

DataFrame shape: (3, 4)

	entry_win	role	label	motif_score
1	P2_1-3	hard_negative	0	3.000000
2	P3_5-7	hard_negative	0	3.000000
3	P3_12-14	hard_negative	0	3.000000

Note the asymmetry with the other sampling methods: sample_motif_matched always returns high-scoring matches and does not accept motif_match. For the inverse (windows that explicitly do not match a motif), call sample_different_protein with the same motif_pwm and motif_match='out'.

seed sets a per-call seed; when omitted, the class-level random_state is used. Repeating a call with the same seed returns the same windows. See aws_sample_same_protein for the class-level anti-leakage / redundancy filters (max_similarity_to_test, max_similarity_within_ref), which apply identically here.

df_a = aaws.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                 n=3, window_size=3,
                                 motif_pwm=pwm, motif_score_threshold=2.5,
                                 seed=21)
df_b = aaws.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                 n=3, window_size=3,
                                 motif_pwm=pwm, motif_score_threshold=2.5,
                                 seed=21)
print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))

deterministic: True
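
The seed fallback described above can be pictured with a toy class. SeedDemo is hypothetical and only illustrates the pattern of a per-call seed overriding a class-level random_state; how the library wires its random generator internally is an assumption here.

```python
import numpy as np

class SeedDemo:
    """Hypothetical stand-in showing a per-call seed with a class-level fallback."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def draw(self, n, seed=None):
        # Per-call seed wins; otherwise fall back to the class-level random_state
        effective = seed if seed is not None else self.random_state
        rng = np.random.default_rng(effective)
        return rng.permutation(n).tolist()

demo = SeedDemo(random_state=0)
print("same seed, same draw:", demo.draw(5, seed=21) == demo.draw(5, seed=21))
```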

Further parameters. AAWindowSampler.sample_motif_matched also accepts a per-residue context filter and the label pair used in 'sequences' mode: aa_context_col (a df_seq column with one single-character tag per residue), context_in / context_out (whitelist / blacklist of tag values that restrict which residues can anchor a scored window), and label_test / label_ref (labels written at positives vs. sampled motif-matched positions when output_mode='sequences'). The cell below tags every alanine residue 'M' and every other residue 'C', then scores only windows anchored at 'M' residues.

# Per-residue context filter: only 'M' (alanine) residues may anchor a scored window
df_seq_ctx = df_seq.assign(
    # loop variable named res so it does not shadow the aa module alias
    topo=["".join("M" if res == "A" else "C" for res in seq) for seq in df_seq["sequence"]]
)
df_ctx = aaws.sample_motif_matched(df_seq=df_seq_ctx, pos_col="pos",
                                   n=5, window_size=3,
                                   motif_pwm=pwm, motif_score_threshold=2.0,
                                   aa_context_col="topo",   # per-residue tag column
                                   context_in="M",          # keep residues tagged 'M'
                                   context_out="C",          # drop residues tagged 'C'
                                   seed=0)
aa.display_df(df=df_ctx[["entry_win", "window", "motif_score"]], show_shape=True)

# label_test / label_ref define the per-residue labels in 'sequences' mode
df_lab = aaws.sample_motif_matched(df_seq=df_seq, pos_col="pos",
                                   n=3, window_size=3,
                                   motif_pwm=pwm, motif_score_threshold=2.5,
                                   output_mode="sequences",
                                   label_test=1, label_ref=-1,
                                   seed=0)
aa.display_df(df=df_lab, show_shape=True)

DataFrame shape: (5, 3)

	entry_win	window	motif_score
1	P2_1-3	AAA	3.000000
2	P3_5-7	AAA	3.000000
3	P3_12-14	AAA	3.000000
4	P2_2-4	AAC	2.000000
5	P3_4-6	GAA	2.000000

DataFrame shape: (3, 3)

	entry	sequence	labels
1	P1	ACDEFGHIK	[None, None, None, None, 1, None, None, None, None]
2	P2	AAACDEFGHIKLMNPQRSTV	[None, -1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
3	P3	GGGGAAAGGGGAAAGGGGAA	[None, None, None, None, None, -1, None, None, None, None, None, None, -1, None, None, None, None, None, None, None]
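
The context filter used above can be read as a whitelist followed by a blacklist over per-residue tags. The sketch below lists which 1-based positions would remain eligible as window anchors; eligible_positions is a hypothetical helper, and applying both filters together in this order is an assumption.

```python
def eligible_positions(tags, context_in=None, context_out=None):
    """Hypothetical helper: 1-based positions whose tag passes both filters."""
    def as_set(value):
        if value is None:
            return None
        # A single tag value is wrapped; list-likes are used as given
        return {value} if isinstance(value, str) else set(value)

    allowed, blocked = as_set(context_in), as_set(context_out)
    keep = []
    for i, tag in enumerate(tags, start=1):
        if allowed is not None and tag not in allowed:
            continue  # not whitelisted
        if blocked is not None and tag in blocked:
            continue  # blacklisted
        keep.append(i)
    return keep

seq = "GGGGAAAGGGGAAAGGGGAA"
tags = "".join("M" if res == "A" else "C" for res in seq)
print(eligible_positions(tags, context_in="M", context_out="C"))
```

With these tags, passing context_in="M" and context_out="C" together is redundant: either one alone selects the same residues.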