scan_motif

class scan_motif(df_seq, pos_col='pos', n=100, window_size=9, *, motif_pwm, pvalue_threshold=0.0001, label_test=1, label_ref=0, role='Negative', output_mode='segments', max_stored_scores=None, bg_file=None, motif_pseudo=None)

Scan candidate proteins for statistically significant Position Weight Matrix (PWM) occurrences using FIMO ([pro], requires aaanalysis[pro]).

scan_motif is a Command-Line Interface (CLI) wrapper around FIMO (Find Individual Motif Occurrences) [Bailey09], [Grant11] from the MEME (Multiple EM for Motif Elicitation) Suite. Unlike the pure-Python AAWindowSampler.sample_motif_matched(), which keeps windows whose raw per-position PWM sum is >= motif_score_threshold, scan_motif lets FIMO perform its own probabilistic matching: each window is scored against FIMO’s background model (its built-in protein frequencies, or bg_file) and kept only when its match p-value is below pvalue_threshold. The two therefore select different windows and are complementary ways to mine motif-matched training data. The output schema matches AAWindowSampler.sample_motif_matched(), with motif_score holding FIMO’s log-odds score and an added p_value column (segments mode).

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Rows are split into positive and candidate rows by pos_col; candidate rows feed the FIMO scan.
pos_col (str, default='pos') – Column with per-row 1-based positive positions; non-empty cells mark positive rows (excluded from the scan), empty / None / NaN cells mark candidate rows.
n (int, default=100) – Maximum number of motif-matched windows to return.
window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.
motif_pwm (pd.DataFrame) – Position Weight Matrix of shape (window_size, 20) whose columns are the 20 canonical amino-acid letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.
pvalue_threshold (float, default=1e-4) – FIMO match-p-value cutoff in [0, 1] (maps to fimo --thresh); only occurrences with a match p-value below this are reported. Smaller is stricter.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled motif-matched rows.
role (str, default='Negative') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') –
Output schema (same as AAWindowSampler.sample_motif_matched()); see Notes:
- 'segments': one row per matched window — the window string plus its provenance columns (entry, source_position, role, …).
- 'sequences': one row per source protein with a per-residue labels list (label_test at the known-positive positions, label_ref at motif-matched positions, None elsewhere).
max_stored_scores (int, optional) – Maximum number of motif occurrences FIMO may store internally before truncating. FIMO’s default is 100 000; raise this only when scanning very large candidate sets and FIMO reports truncation.
bg_file (str or pathlib.Path, optional) – Path to a MEME-format background amino-acid frequency file. When omitted, FIMO uses its built-in protein background.
motif_pseudo (float, optional) – Pseudocount applied to the motif before scanning (FIMO’s default is 0.1). Pass 0.0 to disable smoothing.

Returns:

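df_hits – Motif hits that pass pvalue_threshold, ranked by ascending match p-value and capped at n. In output_mode='segments' there is one row per matched window, and the schema of AAWindowSampler.sample_motif_matched() gains a p_value column, with motif_score holding FIMO’s log-odds score. In output_mode='sequences' there is one row per source protein with a per-residue labels list.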

Return type:

pd.DataFrame

Raises:

RuntimeError – If the fimo binary is not on PATH.
ValueError – If motif_pwm is not provided, if pvalue_threshold is outside [0, 1], if bg_file is set but does not point to an existing file, or if df_seq contains no eligible candidate proteins (rows whose pos_col cell is empty).
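
Because a missing fimo binary raises RuntimeError, it can help to check for it before calling scan_motif. The sketch below uses only the standard library; fimo_available is a hypothetical helper name, not part of aaanalysis, and it only tests whether an executable is discoverable on PATH, not whether it runs correctly.

```python
import shutil


def fimo_available(binary: str = "fimo") -> bool:
    """Return True if an executable named `binary` is found on PATH."""
    return shutil.which(binary) is not None


if not fimo_available():
    # Installation hint taken from the Examples section of this page.
    print("fimo not found on PATH; install the MEME Suite first "
          "(e.g. conda install -c bioconda meme).")
```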

Notes

Candidate sequences are written to a temporary FASTA and the PWM to a temporary MEME letter-probability file (column order remapped to ut.STR_MEME_PROTEIN_ALPHABET); fimo runs in --text mode.
FIMO selects and scores the hits at --thresh pvalue_threshold; AAanalysis only ranks them by ascending p-value (deterministic tiebreak by descending score, then entry and start) and caps at n. motif_score is FIMO’s log-odds score, not a PWM-sum.
max_stored_scores / bg_file / motif_pseudo map to the FIMO flags --max-stored-scores / --bgfile / --motif-pseudo and genuinely change the reported hits (they tune FIMO’s scoring).
Protein-only: the 20 canonical amino acids are passed to MEME as the alphabet; gapped or non-protein alphabets are not supported.
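
To make the temporary motif file more concrete, here is a minimal sketch of a PWM written in MEME's minimal letter-probability format. It is not the writer aaanalysis uses internally: pwm_to_meme is a hypothetical helper, the alphabet string is an assumed value for ut.STR_MEME_PROTEIN_ALPHABET, and the nsites and E fields are placeholder values.

```python
# Assumed column order for the MEME protein alphabet (alphabetical one-letter codes).
MEME_PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def pwm_to_meme(rows, motif_name="motif_1"):
    """Render a PWM as MEME minimal-format text.

    rows: one dict per motif position, mapping amino-acid letter -> probability.
    """
    lines = [
        "MEME version 4",
        "",
        f"ALPHABET= {MEME_PROTEIN_ALPHABET}",
        "",
        f"MOTIF {motif_name}",
        # nsites and E are placeholders chosen for this illustration.
        f"letter-probability matrix: alength= {len(MEME_PROTEIN_ALPHABET)} "
        f"w= {len(rows)} nsites= 20 E= 0",
    ]
    for row in rows:
        # Remap each position to the MEME column order.
        lines.append(" ".join(f"{row[aa]:.6f}" for aa in MEME_PROTEIN_ALPHABET))
    return "\n".join(lines) + "\n"


# Alanine-run motif: 5 positions, 0.81 for 'A' and 0.01 for every other residue.
rows = [{aa: (0.81 if aa == "A" else 0.01) for aa in MEME_PROTEIN_ALPHABET}
        for _ in range(5)]
text = pwm_to_meme(rows)
print(text)
```

Each row sums to 1, which is what a letter-probability matrix requires.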

See also

MEME Suite documentation and FIMO manual.
AAWindowSampler.sample_motif_matched() for the pure-Python PWM-sum sampler (no FIMO binary required) that selects windows by a different criterion.

Examples

scan_motif searches candidate protein sequences for statistically significant occurrences of a sequence motif described by a Position Weight Matrix (PWM). It is a thin wrapper around FIMO (Find Individual Motif Occurrences) from the MEME (Multiple EM for Motif Elicitation) Suite, run through its Command-Line Interface (CLI). The MEME Suite (the fimo binary) must therefore be installed separately (e.g. conda install -c bioconda meme) in addition to aaanalysis[pro].

You bring a PWM (what the motif looks like) and a set of proteins, and scan_motif asks FIMO where the motif occurs and how surprising each occurrence is under a background amino-acid model. It keeps the windows whose match p-value is below your threshold, ranks them from most to least significant, and returns them as ready-to-label training windows.
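
The ranking step can be sketched in plain Python, following the order given in the Notes: ascending p-value, then descending score, then entry and start, capped at n. Hit and rank_hits are hypothetical names for illustration. The values are the rounded scores and p-values of five tied or near-tied windows from the uniform-background output shown later in this section.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    entry: str
    start: int
    score: float
    p_value: float


def rank_hits(hits, n):
    """Sort by p-value (ascending), score (descending), entry, start; keep the top n."""
    ordered = sorted(hits, key=lambda h: (h.p_value, -h.score, h.entry, h.start))
    return ordered[:n]


hits = [
    Hit("C2", 6, 13.6746, 3.0e-5),
    Hit("C1", 2, 7.91124, 1.16e-3),
    Hit("C1", 5, 13.6746, 3.0e-5),
    Hit("C2", 4, 13.6746, 3.0e-5),
    Hit("C1", 3, 13.6746, 3.0e-5),
]

top = rank_hits(hits, n=3)
print([(h.entry, h.start) for h in top])
# -> [('C1', 3), ('C1', 5), ('C2', 4)]
```

The four hits with identical p-value and score are separated only by entry and start, which is why the order is reproducible across runs.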

Compared with the pure-Python AAWindowSampler.sample_motif_matched, which selects windows by a raw per-position PWM sum and needs no external tool, scan_motif uses FIMO’s probabilistic, background-aware scoring (a p-value). The two therefore select different windows: reach for scan_motif when you want significance-calibrated hits, and for the sampler when you want a dependency-free PWM-sum scan.
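
The difference between the two criteria can be illustrated with a small stand-alone calculation. This is a simplified sketch, not FIMO's implementation: it assumes a uniform background, applies no pseudocount, and does not convert scores to p-values, so its numbers will not match FIMO's motif_score column. pwm_sum and log_odds are hypothetical helper names.

```python
import math

AAS = "ACDEFGHIKLMNPQRSTVWY"

# Alanine-run motif: 0.81 for 'A', 0.01 for every other residue, 5 positions.
pwm = [{aa: (0.81 if aa == "A" else 0.01) for aa in AAS} for _ in range(5)]

# Assumption for this sketch: uniform background frequencies.
background = {aa: 1 / 20 for aa in AAS}


def pwm_sum(window: str) -> float:
    """Raw per-position PWM sum (the kind of score the pure-Python sampler thresholds)."""
    return sum(pwm[i][aa] for i, aa in enumerate(window))


def log_odds(window: str) -> float:
    """Sum of log2(motif probability / background probability) over positions."""
    return sum(math.log2(pwm[i][aa] / background[aa]) for i, aa in enumerate(window))


for window in ("AAAAA", "AAAAC", "QWEAA"):
    print(window, round(pwm_sum(window), 2), round(log_odds(window), 2))
```

A single mismatch lowers the PWM sum by a fixed 0.80, whereas the log-odds score drops by an amount that depends on the background frequency of the mismatching residue. That dependence is what makes the p-value-based selection background-aware.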

scan_motif returns one row per matched window (default output_mode='segments'): the standard window schema plus FIMO’s motif_score (a log-odds score) and a p_value column.

import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut

aa.options["verbose"] = False

# A tiny example with three proteins. 'P1' is a known positive (its 'pos' entry
# lists residue 3), so it is EXCLUDED from the scan; 'C1' and 'C2' are candidates.
df_seq = pd.DataFrame({
    "entry": ["P1", "C1", "C2"],
    "sequence": ["MKLVAAAAAGTR", "QWEAAAAACDEF", "GHIKAAAAALMN"],
    "pos": [[3], [], []],
})

# A Position Weight Matrix (5 positions x 20 amino acids) favouring an
# Alanine ('A') run. Columns are the 20 canonical amino acids.
motif_pwm = pd.DataFrame(0.01, index=range(5), columns=list(ut.LIST_CANONICAL_AA))
motif_pwm["A"] = 0.81

# Smallest call: scan the candidates for significant matches of the motif.
df_hits = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                        pvalue_threshold=1e-2)
aa.display_df(df_hits, show_shape=True)

DataFrame shape: (12, 10)

	entry_win	entry	sequence	window	source_position	role	strategy	motif_score	p_value
1	C1_4-8	C1	QWEAAAAACDEF	AAAAA	6	Negative	motif_matched	16.719700	0.000002
2	C2_5-9	C2	GHIKAAAAALMN	AAAAA	7	Negative	motif_matched	16.719700	0.000002
3	C1_5-9	C1	QWEAAAAACDEF	AAAAC	7	Negative	motif_matched	12.617800	0.000007
4	C2_4-8	C2	GHIKAAAAALMN	KAAAA	6	Negative	motif_matched	11.356700	0.000075
5	C1_3-7	C1	QWEAAAAACDEF	EAAAA	5	Negative	motif_matched	11.299400	0.000092
6	C2_6-10	C2	GHIKAAAAALMN	AAAAL	8	Negative	motif_matched	10.980900	0.000135
7	C1_2-6	C1	QWEAAAAACDEF	WEAAA	4	Negative	motif_matched	7.579620	0.000238
8	C1_6-10	C1	QWEAAAAACDEF	AAACD	8	Negative	motif_matched	7.369430	0.000286
9	C2_7-11	C2	GHIKAAAAALMN	AAALM	9	Negative	motif_matched	6.592360	0.000787
10	C2_3-7	C2	GHIKAAAAALMN	IKAAA	5	Negative	motif_matched	6.025480	0.001960
11	C1_1-5	C1	QWEAAAAACDEF	QWEAA	3	Negative	motif_matched	2.566880	0.004600
12	C1_7-11	C1	QWEAAAAACDEF	AACDE	9	Negative	motif_matched	1.949040	0.007470

df_seq holds the proteins (entry + sequence columns). pos_col (default 'pos') decides which rows are scanned: rows with a non-empty position are treated as known positives and excluded, while rows with an empty cell are the candidates FIMO scans. Note that P1 contains the same AAAAA motif as the candidates, yet never appears in the results — because it is a positive.

print("Candidate entries with a hit:", sorted(df_hits["entry"].unique()))
print("'P1' excluded despite containing the motif:", "P1" not in set(df_hits["entry"]))

Candidate entries with a hit: ['C1', 'C2']
'P1' excluded despite containing the motif: True

# `pos_col` names the column holding known-positive positions. Renaming it and
# pointing `scan_motif` at the new name keeps P1 excluded just the same.
df_seq_named = df_seq.rename(columns={"pos": "positive_pos"})
df_hits_named = aa.scan_motif(df_seq=df_seq_named, motif_pwm=motif_pwm,
                              window_size=5, pvalue_threshold=1e-2,
                              pos_col="positive_pos")
print("P1 still excluded via pos_col='positive_pos':",
      "P1" not in set(df_hits_named["entry"]))

P1 still excluded via pos_col='positive_pos': True
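
The positive/candidate split described above can be sketched as a small predicate over the pos_col cell. This is an assumption about the rule, based on the documented "empty / None / NaN" wording; is_candidate is a hypothetical helper and the library's own check may treat other cell types differently.

```python
import math


def is_candidate(cell) -> bool:
    """Return True when a pos_col cell is empty (None, NaN, or zero-length)."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    try:
        return len(cell) == 0
    except TypeError:
        # A bare number such as 3 has no length; treat it as a stated position.
        return False


rows = [("P1", [3]), ("C1", []), ("C2", None), ("C3", float("nan"))]
candidates = [entry for entry, pos in rows if is_candidate(pos)]
print(candidates)
# -> ['C1', 'C2', 'C3']
```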

motif_pwm is the motif (shape (window_size, 20)) and window_size must equal its number of rows. pvalue_threshold is FIMO’s significance cutoff (a probability in [0, 1], mapped to fimo --thresh): smaller is stricter, keeping fewer but more significant windows.

strict = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                       pvalue_threshold=1e-4)
loose = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                      pvalue_threshold=1e-1)
print(f"strict (p<1e-4): {len(strict)} hits;  loose (p<1e-1): {len(loose)} hits")

strict (p<1e-4): 5 hits;  loose (p<1e-1): 15 hits
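
The strict count can be checked by hand against the first results table. The list below copies the twelve (rounded) p_value entries from that table; count_below is a hypothetical helper for this check only.

```python
# p_value column of the first results table (pvalue_threshold=1e-2), as displayed.
p_values = [2e-6, 2e-6, 7e-6, 7.5e-5, 9.2e-5, 1.35e-4,
            2.38e-4, 2.86e-4, 7.87e-4, 1.96e-3, 4.6e-3, 7.47e-3]


def count_below(values, threshold):
    """Count occurrences whose p-value is strictly below the threshold."""
    return sum(p < threshold for p in values)


for threshold in (1e-4, 1e-3, 1e-2):
    print(threshold, count_below(p_values, threshold))
# -> 0.0001 5
# -> 0.001 9
# -> 0.01 12
```

Five windows fall below 1e-4, matching the strict run, and all twelve fall below 1e-2. The loose run at 1e-1 reports 15 hits, so three further windows must have p-values between 1e-2 and 1e-1.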

output_mode picks the schema: 'segments' (default) returns one row per window; 'sequences' returns one row per protein with a per-residue labels list. n caps how many windows are returned (most significant first). role, label_ref, and label_test tag the rows for positive-unlabeled (PU) learning / hard-negative-mining workflows: label_ref/role mark the scanned windows, and (in 'sequences' mode) label_test marks the known-positive positions.

# Segments mode: cap to the 3 most significant windows, tag them as negatives.
df_seg = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                       pvalue_threshold=1e-2, output_mode="segments",
                       n=3, role="Negative", label_ref=0)
aa.display_df(df_seg, show_shape=True)

# Sequences mode: per-residue labels (positives = label_test, hits = label_ref).
df_lab = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                       pvalue_threshold=1e-2, output_mode="sequences",
                       label_test=1, label_ref=0)
aa.display_df(df_lab, show_shape=True)

DataFrame shape: (3, 10)

	entry_win	entry	sequence	window	source_position	role	strategy	motif_score	p_value
1	C1_4-8	C1	QWEAAAAACDEF	AAAAA	6	Negative	motif_matched	16.719700	0.000002
2	C2_5-9	C2	GHIKAAAAALMN	AAAAA	7	Negative	motif_matched	16.719700	0.000002
3	C1_5-9	C1	QWEAAAAACDEF	AAAAC	7	Negative	motif_matched	12.617800	0.000007

DataFrame shape: (3, 3)

	entry	sequence	labels
1	P1	MKLVAAAAAGTR	[None, None, 1, None, None, None, None, None, None, None, None, None]
2	C1	QWEAAAAACDEF	[None, None, 0, 0, 0, 0, 0, 0, 0, None, None, None]
3	C2	GHIKAAAAALMN	[None, None, None, None, 0, 0, 0, 0, 0, None, None, None]
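
The labels lists above are consistent with marking each hit at its source_position (1-based) rather than across the whole window. The sketch below reproduces the C1 and P1 rows under that reading. build_labels is a hypothetical helper, and letting positives overwrite hits at the same position is an assumption, since the example contains no such overlap.

```python
def build_labels(seq_len, positive_pos=(), hit_pos=(), label_test=1, label_ref=0):
    """Per-residue labels: label_ref at hit positions, label_test at positives, else None."""
    labels = [None] * seq_len
    for pos in hit_pos:
        labels[pos - 1] = label_ref      # positions are 1-based
    for pos in positive_pos:
        labels[pos - 1] = label_test     # assumption: positives take precedence
    return labels


# source_position values of the seven C1 hits in the first results table.
c1_labels = build_labels(12, hit_pos=[6, 7, 5, 4, 8, 3, 9])
# P1 is a known positive at residue 3 and is never scanned.
p1_labels = build_labels(12, positive_pos=[3])

print(c1_labels)
print(p1_labels)
```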

Three optional parameters pass straight through to FIMO and change how it scores occurrences:

motif_pseudo — pseudocount added to the motif before scoring (FIMO default 0.1); pass 0.0 to disable smoothing.
max_stored_scores — cap on the number of occurrences FIMO stores internally (FIMO default 100000); raise it only for very large candidate sets.
bg_file — path to a MEME-format background amino-acid frequency file; when omitted, FIMO uses its built-in protein background. Changing the background changes the match p-values.
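
The mapping from these parameters to FIMO flags can be sketched as a command-line builder. The flag names are the ones listed in the Notes. build_fimo_cmd and the file names are hypothetical, and the exact command scan_motif assembles internally may differ.

```python
def build_fimo_cmd(motif_file, fasta_file, pvalue_threshold=1e-4,
                   max_stored_scores=None, bg_file=None, motif_pseudo=None):
    """Assemble a fimo argument list; optional flags are added only when set."""
    cmd = ["fimo", "--text", "--thresh", str(pvalue_threshold)]
    if max_stored_scores is not None:
        cmd += ["--max-stored-scores", str(max_stored_scores)]
    if bg_file is not None:
        cmd += ["--bgfile", str(bg_file)]
    if motif_pseudo is not None:
        cmd += ["--motif-pseudo", str(motif_pseudo)]
    # fimo expects the motif file first, then the sequence file.
    cmd += [str(motif_file), str(fasta_file)]
    return cmd


cmd = build_fimo_cmd("motif.meme", "candidates.fasta", pvalue_threshold=1e-2,
                     motif_pseudo=0.0, max_stored_scores=1_000_000)
print(" ".join(cmd))
```

Leaving a parameter at None omits its flag entirely, so FIMO falls back to its own default for that setting.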

df_tuned = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                         pvalue_threshold=1e-2, motif_pseudo=0.0,
                         max_stored_scores=1_000_000)
aa.display_df(df_tuned, show_shape=True)

# `bg_file`: point FIMO at a MEME-format background. Here a uniform amino-acid
# background written to a temporary file (changing it changes the match p-values).
import os
import tempfile
bg_file = os.path.join(tempfile.mkdtemp(), "uniform_bg.txt")
with open(bg_file, "w") as fh:
    fh.write("# uniform amino-acid background (MEME format)\n")
    for _aa in ut.LIST_CANONICAL_AA:
        fh.write(f"{_aa} {1 / 20:.6f}\n")
df_bg = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                      pvalue_threshold=1e-2, bg_file=bg_file)
aa.display_df(df_bg, show_shape=True)

DataFrame shape: (13, 10)

	entry_win	entry	sequence	window	source_position	role	strategy	motif_score	p_value
1	C1_4-8	C1	QWEAAAAACDEF	AAAAA	6	Negative	motif_matched	17.330800	0.000002
2	C2_5-9	C2	GHIKAAAAALMN	AAAAA	7	Negative	motif_matched	17.330800	0.000002
3	C1_5-9	C1	QWEAAAAACDEF	AAAAC	7	Negative	motif_matched	13.000000	0.000007
4	C2_4-8	C2	GHIKAAAAALMN	KAAAA	6	Negative	motif_matched	11.315800	0.000075
5	C1_3-7	C1	QWEAAAAACDEF	EAAAA	5	Negative	motif_matched	11.225600	0.000092
6	C2_6-10	C2	GHIKAAAAALMN	AAAAL	8	Negative	motif_matched	10.669200	0.000135
7	C1_2-6	C1	QWEAAAAACDEF	WEAAA	4	Negative	motif_matched	7.345860	0.000243
8	C1_6-10	C1	QWEAAAAACDEF	AAACD	8	Negative	motif_matched	7.165410	0.000307
9	C2_7-11	C2	GHIKAAAAALMN	AAALM	9	Negative	motif_matched	6.000000	0.000960
10	C2_3-7	C2	GHIKAAAAALMN	IKAAA	5	Negative	motif_matched	5.353380	0.001960
11	C1_1-5	C1	QWEAAAAACDEF	QWEAA	3	Negative	motif_matched	1.849620	0.004680
12	C1_7-11	C1	QWEAAAAACDEF	AACDE	9	Negative	motif_matched	1.060150	0.007620
13	C2_2-6	C2	GHIKAAAAALMN	HIKAA	4	Negative	motif_matched	0.721805	0.009930

DataFrame shape: (10, 10)

	entry_win	entry	sequence	window	source_position	role	strategy	motif_score	p_value
1	C1_4-8	C1	QWEAAAAACDEF	AAAAA	6	Negative	motif_matched	19.437900	0.000000
2	C2_5-9	C2	GHIKAAAAALMN	AAAAA	7	Negative	motif_matched	19.437900	0.000000
3	C1_3-7	C1	QWEAAAAACDEF	EAAAA	5	Negative	motif_matched	13.674600	0.000030
4	C1_5-9	C1	QWEAAAAACDEF	AAAAC	7	Negative	motif_matched	13.674600	0.000030
5	C2_4-8	C2	GHIKAAAAALMN	KAAAA	6	Negative	motif_matched	13.674600	0.000030
6	C2_6-10	C2	GHIKAAAAALMN	AAAAL	8	Negative	motif_matched	13.674600	0.000030
7	C1_2-6	C1	QWEAAAAACDEF	WEAAA	4	Negative	motif_matched	7.911240	0.001160
8	C1_6-10	C1	QWEAAAAACDEF	AAACD	8	Negative	motif_matched	7.911240	0.001160
9	C2_3-7	C2	GHIKAAAAALMN	IKAAA	5	Negative	motif_matched	7.911240	0.001160
10	C2_7-11	C2	GHIKAAAAALMN	AAALM	9	Negative	motif_matched	7.911240	0.001160