scan_motif
- class scan_motif(df_seq, pos_col='pos', n=100, window_size=9, *, motif_pwm, pvalue_threshold=0.0001, label_test=1, label_ref=0, role='Negative', output_mode='segments', max_stored_scores=None, bg_file=None, motif_pseudo=None)
Scan candidate proteins for statistically significant Position Weight Matrix (PWM) occurrences using FIMO ([pro], requires aaanalysis[pro]).

scan_motif is a Command-Line Interface (CLI) wrapper around FIMO (Find Individual Motif Occurrences) [Bailey09], [Grant11] from the MEME (Multiple Em for Motif Elicitation) suite. Unlike the pure-Python AAWindowSampler.sample_motif_matched() (which keeps windows whose raw per-position PWM sum is >= motif_score_threshold), scan_motif lets FIMO perform its own probabilistic matching: each window is scored against FIMO’s background model (its built-in protein frequencies, or bg_file) and kept only when its match p-value is below pvalue_threshold. The two thus select different windows and are complementary ways to mine motif-matched training data. The output schema matches AAWindowSampler.sample_motif_matched(), with motif_score holding FIMO’s log-odds score and an added p_value column (segments mode).

Added in version 1.1.0.
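To make the difference between the two selection criteria concrete, the following minimal sketch scores a window both ways: as a raw per-position PWM sum and as a log-odds score against a background. The uniform background and the helper names pwm_sum and log_odds are assumptions made for illustration. FIMO applies its own protein background, pseudocounts, and score scaling, so these numbers are not expected to match FIMO's output.

```python
import math

AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids

# Toy PWM with 5 positions favouring alanine; every row sums to 1.0.
pwm = [{aa: (0.81 if aa == "A" else 0.01) for aa in AA} for _ in range(5)]

# Assumption: a uniform background (FIMO's built-in protein background is not uniform).
background = {aa: 1.0 / len(AA) for aa in AA}


def pwm_sum(window, pwm):
    """Raw per-position PWM sum (the pure-Python sampler's criterion)."""
    return sum(row[aa] for row, aa in zip(pwm, window))


def log_odds(window, pwm, background):
    """Sum over positions of log2(motif probability / background probability)."""
    return sum(math.log2(row[aa] / background[aa]) for row, aa in zip(pwm, window))


for window in ("AAAAA", "AAAAC", "QWEAA"):
    print(window,
          round(pwm_sum(window, pwm), 2),
          round(log_odds(window, pwm, background), 2))
```

A PWM sum is thresholded directly, whereas a log-odds score is first converted to a match p-value under the background model, which is why the two criteria can keep different windows.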
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Rows are split into positive and candidate rows by pos_col; candidate rows feed the FIMO scan.
pos_col (str, default='pos') – Column with per-row 1-based positive positions; non-empty cells mark positive rows (excluded from the scan), empty / None / NaN cells mark candidate rows.
n (int, default=100) – Maximum number of motif-matched windows to return.
window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.
motif_pwm (pd.DataFrame) – Position Weight Matrix of shape (window_size, 20) whose columns are the 20 canonical amino-acid letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.
pvalue_threshold (float, default=1e-4) – FIMO match-p-value cutoff in [0, 1] (maps to fimo --thresh); only occurrences with a match p-value below this are reported. Smaller is stricter.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled motif-matched rows.
role (str, default='Negative') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') –
Output schema (same as AAWindowSampler.sample_motif_matched()); see Notes:
  'segments': one row per matched window, i.e. the window string plus its provenance columns (entry, source_position, role, …).
  'sequences': one row per source protein with a per-residue labels list (label_test at known-positive positions, label_ref at matched positions, None elsewhere).
max_stored_scores (int, optional) – Maximum number of motif occurrences FIMO may store internally before truncating. FIMO’s default is 100 000; raise this only when scanning very large candidate sets and FIMO reports truncation.
bg_file (str or pathlib.Path, optional) – Path to a MEME-format background amino-acid frequency file. When omitted, FIMO uses its built-in protein background.
motif_pseudo (float, optional) – Pseudocount applied to the motif before scanning (FIMO’s default is 0.1). Pass 0.0 to disable smoothing.
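The 'sequences' output mode can be pictured as painting labels onto a per-residue list. The sketch below is an illustration only: the helper build_labels is hypothetical and not part of AAanalysis. It mirrors the 'sequences' output shown in the Examples section, where known-positive positions carry label_test, matched source positions carry label_ref, and all other residues stay None.

```python
def build_labels(seq_len, positive_positions=(), hit_positions=(),
                 label_test=1, label_ref=0):
    """Return a per-residue label list; all positions are 1-based."""
    labels = [None] * seq_len
    for pos in hit_positions:
        labels[pos - 1] = label_ref
    for pos in positive_positions:
        # Assumption: a known positive takes precedence if it overlaps a hit.
        labels[pos - 1] = label_test
    return labels


# Source positions of the hits reported for 'C1' in the Examples (threshold 1e-2).
labels_c1 = build_labels(12, hit_positions=[6, 7, 5, 4, 8, 3, 9])
# 'P1' is a known positive at residue 3 and is never scanned.
labels_p1 = build_labels(12, positive_positions=[3])

print(labels_c1)
print(labels_p1)
```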
- Returns:
df_hits – Significant motif hits, one row per matched window, ranked by ascending match p-value. In output_mode='segments' the schema of AAWindowSampler.sample_motif_matched() is extended with FIMO’s motif_score and p_value columns.
- Return type:
pd.DataFrame
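The ranking of the returned hits (ascending p-value, with the tiebreak given in the Notes: descending score, then entry and start) amounts to a multi-key sort followed by a cap at n. A minimal sketch, assuming hits are plain (entry, start, score, p_value) tuples; the function name rank_hits is hypothetical and the values are copied from the example output in the Examples section:

```python
def rank_hits(hits, n):
    """Sort by ascending p-value, descending score, then entry and start; keep n."""
    ranked = sorted(hits, key=lambda h: (h[3], -h[2], h[0], h[1]))
    return ranked[:n]


# (entry, start, score, p_value), deliberately listed out of order.
hits = [
    ("C2", 5, 16.7197, 0.000002),
    ("C1", 5, 12.6178, 0.000007),
    ("C1", 4, 16.7197, 0.000002),
    ("C2", 4, 11.3567, 0.000075),
]

top3 = rank_hits(hits, n=3)
print([(entry, start) for entry, start, _, _ in top3])
```

The two AAAAA windows tie on both p-value and score, so the entry name decides their order, which keeps the result deterministic.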
- Raises:
RuntimeError – If the fimo binary is not on PATH.
ValueError – If motif_pwm is not provided, if pvalue_threshold is outside [0, 1], if bg_file is set but does not point to an existing file, or if df_seq contains no eligible candidate proteins (rows without test positions).
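The checks behind these exceptions can be sketched with the standard library alone. This is not the library's implementation: the names is_candidate and check_inputs, and the require_fimo switch (added so the sketch runs without FIMO installed), are assumptions for illustration.

```python
import math
import shutil
from pathlib import Path


def is_candidate(cell):
    """True when a pos_col cell is empty, None, or NaN (a candidate row)."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    try:
        return len(cell) == 0
    except TypeError:
        return False  # a scalar position marks a positive row


def check_inputs(pos_cells, pvalue_threshold, bg_file=None, require_fimo=True):
    """Raise the documented error types for invalid inputs; return None otherwise."""
    if require_fimo and shutil.which("fimo") is None:
        raise RuntimeError("the 'fimo' binary was not found on PATH")
    if not 0 <= pvalue_threshold <= 1:
        raise ValueError("pvalue_threshold must lie in [0, 1]")
    if bg_file is not None and not Path(bg_file).is_file():
        raise ValueError(f"bg_file does not point to an existing file: {bg_file}")
    if not any(is_candidate(cell) for cell in pos_cells):
        raise ValueError("df_seq contains no eligible candidate proteins")


# One positive row and two candidate rows: passes without raising.
check_inputs([[3], [], []], pvalue_threshold=1e-2, require_fimo=False)
```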
Notes
- Candidate sequences are written to a temporary FASTA and the PWM to a temporary MEME letter-probability file (column order remapped to ut.STR_MEME_PROTEIN_ALPHABET); fimo runs in --text mode.
- FIMO selects and scores the hits at --thresh pvalue_threshold; AAanalysis only ranks them by ascending p-value (deterministic tiebreak by descending score, then entry and start) and caps at n. motif_score is FIMO’s log-odds score, not a PWM-sum.
- max_stored_scores / bg_file / motif_pseudo map to the FIMO flags --max-stored-scores / --bgfile / --motif-pseudo and genuinely change the reported hits (they tune FIMO’s scoring).
- Protein-only: the 20 canonical amino acids are passed to MEME as the alphabet; gapped or non-protein alphabets are not supported.
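As an illustration of the temporary motif file mentioned in the Notes, the sketch below renders a PWM as MEME minimal-format text with a letter-probability matrix. The helper name pwm_to_meme, the motif identifier, and the alphabet constant (standing in for ut.STR_MEME_PROTEIN_ALPHABET, whose value is not shown in this page) are assumptions; the exact header fields AAanalysis writes may differ.

```python
# Assumption: stands in for ut.STR_MEME_PROTEIN_ALPHABET.
MEME_PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def pwm_to_meme(pwm_rows, motif_id="motif_1"):
    """Render a list of {amino_acid: probability} rows as MEME minimal-format text."""
    lines = [
        "MEME version 4",
        "",
        f"ALPHABET= {MEME_PROTEIN_ALPHABET}",
        "",
        f"MOTIF {motif_id}",
        f"letter-probability matrix: alen= {len(MEME_PROTEIN_ALPHABET)} w= {len(pwm_rows)}",
    ]
    for row in pwm_rows:
        # Columns are emitted in MEME alphabet order, whatever order the PWM used.
        lines.append(" ".join(f"{row[aa]:.6f}" for aa in MEME_PROTEIN_ALPHABET))
    return "\n".join(lines) + "\n"


# Toy PWM: 5 positions favouring alanine, each row summing to 1.0.
pwm_rows = [{aa: (0.81 if aa == "A" else 0.01) for aa in MEME_PROTEIN_ALPHABET}
            for _ in range(5)]
meme_text = pwm_to_meme(pwm_rows)
print(meme_text)
```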
See also
MEME Suite documentation and FIMO manual.
AAWindowSampler.sample_motif_matched() for the pure-Python PWM-sum sampler (no FIMO binary required) that selects windows by a different criterion.
Examples
scan_motif searches candidate protein sequences for statistically significant occurrences of a sequence motif described by a Position Weight Matrix (PWM). It is a thin wrapper around FIMO (Find Individual Motif Occurrences) from the MEME (Multiple Em for Motif Elicitation) Suite, run through its Command-Line Interface (CLI), so the MEME Suite (the fimo binary) must be installed separately (e.g. conda install -c bioconda meme) in addition to aaanalysis[pro].

You bring a PWM (what the motif looks like) and a set of proteins, and scan_motif asks FIMO where the motif occurs and how surprising each occurrence is under a background amino-acid model. It keeps the windows whose match p-value is below your threshold, ranks them from most to least significant, and returns them as ready-to-label training windows.

Compared with the pure-Python AAWindowSampler.sample_motif_matched, which selects windows by a raw per-position PWM sum and needs no external tool, scan_motif uses FIMO’s probabilistic, background-aware scoring (a p-value). The two therefore select different windows: reach for scan_motif when you want significance-calibrated hits, and for the sampler when you want a dependency-free PWM-sum scan.

It returns one row per matched window (default output_mode='segments'): the standard window schema plus FIMO’s motif_score (a log-odds score) and a p_value column.

    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut
    aa.options["verbose"] = False

    # A tiny example with three proteins. 'P1' is a known positive (its 'pos' entry
    # lists residue 3), so it is EXCLUDED from the scan; 'C1' and 'C2' are candidates.
    df_seq = pd.DataFrame({
        "entry": ["P1", "C1", "C2"],
        "sequence": ["MKLVAAAAAGTR", "QWEAAAAACDEF", "GHIKAAAAALMN"],
        "pos": [[3], [], []],
    })

    # A Position Weight Matrix (5 positions x 20 amino acids) favouring an
    # Alanine ('A') run. Columns are the 20 canonical amino acids.
    motif_pwm = pd.DataFrame(0.01, index=range(5), columns=list(ut.LIST_CANONICAL_AA))
    motif_pwm["A"] = 0.81

    # Smallest call: scan the candidates for significant matches of the motif.
    df_hits = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                            pvalue_threshold=1e-2)
    aa.display_df(df_hits, show_shape=True)
DataFrame shape: (12, 10)
        entry_win  entry  sequence      window  source_position  label  role      strategy       motif_score  p_value
    1   C1_4-8     C1     QWEAAAAACDEF  AAAAA   6                0      Negative  motif_matched  16.719700    0.000002
    2   C2_5-9     C2     GHIKAAAAALMN  AAAAA   7                0      Negative  motif_matched  16.719700    0.000002
    3   C1_5-9     C1     QWEAAAAACDEF  AAAAC   7                0      Negative  motif_matched  12.617800    0.000007
    4   C2_4-8     C2     GHIKAAAAALMN  KAAAA   6                0      Negative  motif_matched  11.356700    0.000075
    5   C1_3-7     C1     QWEAAAAACDEF  EAAAA   5                0      Negative  motif_matched  11.299400    0.000092
    6   C2_6-10    C2     GHIKAAAAALMN  AAAAL   8                0      Negative  motif_matched  10.980900    0.000135
    7   C1_2-6     C1     QWEAAAAACDEF  WEAAA   4                0      Negative  motif_matched  7.579620     0.000238
    8   C1_6-10    C1     QWEAAAAACDEF  AAACD   8                0      Negative  motif_matched  7.369430     0.000286
    9   C2_7-11    C2     GHIKAAAAALMN  AAALM   9                0      Negative  motif_matched  6.592360     0.000787
    10  C2_3-7     C2     GHIKAAAAALMN  IKAAA   5                0      Negative  motif_matched  6.025480     0.001960
    11  C1_1-5     C1     QWEAAAAACDEF  QWEAA   3                0      Negative  motif_matched  2.566880     0.004600
    12  C1_7-11    C1     QWEAAAAACDEF  AACDE   9                0      Negative  motif_matched  1.949040     0.007470

df_seq holds the proteins (entry + sequence columns). pos_col (default 'pos') decides which rows are scanned: rows with a non-empty position are treated as known positives and excluded, while rows with an empty cell are the candidates FIMO scans. Note that P1 contains the same AAAAA motif as the candidates, yet never appears in the results, because it is a positive.

    print("Candidate entries with a hit:", sorted(df_hits["entry"].unique()))
    print("'P1' excluded despite containing the motif:", "P1" not in set(df_hits["entry"]))
Candidate entries with a hit: ['C1', 'C2']
'P1' excluded despite containing the motif: True
motif_pwm is the motif (shape (window_size, 20)) and window_size must equal its number of rows. pvalue_threshold is FIMO’s significance cutoff (a probability in [0, 1], mapped to fimo --thresh): smaller is stricter, keeping fewer but more significant windows.

    strict = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                           pvalue_threshold=1e-4)
    loose = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                          pvalue_threshold=1e-1)
    print(f"strict (p<1e-4): {len(strict)} hits; loose (p<1e-1): {len(loose)} hits")
strict (p<1e-4): 5 hits; loose (p<1e-1): 15 hits
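The effect of the cutoff can be checked by hand against the first example table. The sketch below copies the twelve match p-values reported at pvalue_threshold=1e-2 and counts how many fall below a stricter cutoff. The helper count_hits is hypothetical, and the loose count cannot be reproduced this way because the table only lists hits below 1e-2.

```python
# Match p-values copied from the first example table (scan at pvalue_threshold=1e-2).
p_values = [0.000002, 0.000002, 0.000007, 0.000075, 0.000092, 0.000135,
            0.000238, 0.000286, 0.000787, 0.001960, 0.004600, 0.007470]


def count_hits(p_values, threshold):
    """Count occurrences whose match p-value is strictly below the threshold."""
    return sum(p < threshold for p in p_values)


n_strict = count_hits(p_values, 1e-4)
n_default = count_hits(p_values, 1e-2)
print(f"p<1e-4: {n_strict} hits; p<1e-2: {n_default} hits")
```

The stricter count agrees with the five hits reported above for pvalue_threshold=1e-4.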
output_mode picks the schema: 'segments' (default) returns one row per window; 'sequences' returns one row per protein with a per-residue labels list. n caps how many windows are returned (most significant first). role, label_ref, and label_test tag the rows for positive-unlabeled (PU) learning / hard-negative-mining workflows: label_ref / role mark the scanned windows, and (in 'sequences' mode) label_test marks the known-positive positions.

    # Segments mode: cap to the 3 most significant windows, tag them as negatives.
    df_seg = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                           pvalue_threshold=1e-2, output_mode="segments", n=3,
                           role="Negative", label_ref=0)
    aa.display_df(df_seg, show_shape=True)

    # Sequences mode: per-residue labels (positives = label_test, hits = label_ref).
    df_lab = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                           pvalue_threshold=1e-2, output_mode="sequences",
                           label_test=1, label_ref=0)
    aa.display_df(df_lab, show_shape=True)
DataFrame shape: (3, 10)
        entry_win  entry  sequence      window  source_position  label  role      strategy       motif_score  p_value
    1   C1_4-8     C1     QWEAAAAACDEF  AAAAA   6                0      Negative  motif_matched  16.719700    0.000002
    2   C2_5-9     C2     GHIKAAAAALMN  AAAAA   7                0      Negative  motif_matched  16.719700    0.000002
    3   C1_5-9     C1     QWEAAAAACDEF  AAAAC   7                0      Negative  motif_matched  12.617800    0.000007

DataFrame shape: (3, 3)

        entry  sequence      labels
    1   P1     MKLVAAAAAGTR  [None, None, 1, None, None, None, None, None, None, None, None, None]
    2   C1     QWEAAAAACDEF  [None, None, 0, 0, 0, 0, 0, 0, 0, None, None, None]
    3   C2     GHIKAAAAALMN  [None, None, None, None, 0, 0, 0, 0, 0, None, None, None]

Three further parameters pass straight through to FIMO and change how it scores occurrences:
- motif_pseudo: pseudocount added to the motif before scoring (FIMO default 0.1); pass 0.0 to disable smoothing.
- max_stored_scores: cap on the number of occurrences FIMO stores internally (FIMO default 100000); raise it only for very large candidate sets.
- bg_file: path to a MEME-format background amino-acid frequency file; when omitted, FIMO uses its built-in protein background. Changing the background changes the match p-values.
    df_tuned = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                             pvalue_threshold=1e-2, motif_pseudo=0.0,
                             max_stored_scores=1_000_000)
    aa.display_df(df_tuned, show_shape=True)
DataFrame shape: (13, 10)
        entry_win  entry  sequence      window  source_position  label  role      strategy       motif_score  p_value
    1   C1_4-8     C1     QWEAAAAACDEF  AAAAA   6                0      Negative  motif_matched  17.330800    0.000002
    2   C2_5-9     C2     GHIKAAAAALMN  AAAAA   7                0      Negative  motif_matched  17.330800    0.000002
    3   C1_5-9     C1     QWEAAAAACDEF  AAAAC   7                0      Negative  motif_matched  13.000000    0.000007
    4   C2_4-8     C2     GHIKAAAAALMN  KAAAA   6                0      Negative  motif_matched  11.315800    0.000075
    5   C1_3-7     C1     QWEAAAAACDEF  EAAAA   5                0      Negative  motif_matched  11.225600    0.000092
    6   C2_6-10    C2     GHIKAAAAALMN  AAAAL   8                0      Negative  motif_matched  10.669200    0.000135
    7   C1_2-6     C1     QWEAAAAACDEF  WEAAA   4                0      Negative  motif_matched  7.345860     0.000243
    8   C1_6-10    C1     QWEAAAAACDEF  AAACD   8                0      Negative  motif_matched  7.165410     0.000307
    9   C2_7-11    C2     GHIKAAAAALMN  AAALM   9                0      Negative  motif_matched  6.000000     0.000960
    10  C2_3-7     C2     GHIKAAAAALMN  IKAAA   5                0      Negative  motif_matched  5.353380     0.001960
    11  C1_1-5     C1     QWEAAAAACDEF  QWEAA   3                0      Negative  motif_matched  1.849620     0.004680
    12  C1_7-11    C1     QWEAAAAACDEF  AACDE   9                0      Negative  motif_matched  1.060150     0.007620
    13  C2_2-6     C2     GHIKAAAAALMN  HIKAA   4                0      Negative  motif_matched  0.721805     0.009930
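How the tuning parameters reach FIMO can be sketched as a plain argument list. The flag names (--text, --thresh, --max-stored-scores, --bgfile, --motif-pseudo) are the ones named in the Notes; the function build_fimo_command and the file names motif.meme and candidates.fasta are hypothetical, and the actual command AAanalysis assembles may differ in detail.

```python
def build_fimo_command(motif_file, fasta_file, pvalue_threshold=1e-4,
                       max_stored_scores=None, bg_file=None, motif_pseudo=None):
    """Assemble a fimo argument list; optional flags are added only when set."""
    cmd = ["fimo", "--text", "--thresh", str(pvalue_threshold)]
    if max_stored_scores is not None:
        cmd += ["--max-stored-scores", str(max_stored_scores)]
    if bg_file is not None:
        cmd += ["--bgfile", str(bg_file)]
    if motif_pseudo is not None:
        cmd += ["--motif-pseudo", str(motif_pseudo)]
    # fimo expects the motif file first, then the sequence file.
    return cmd + [str(motif_file), str(fasta_file)]


# Mirrors the tuned call above: no smoothing and a larger score store.
cmd = build_fimo_command("motif.meme", "candidates.fasta", pvalue_threshold=1e-2,
                         motif_pseudo=0.0, max_stored_scores=1_000_000)
print(" ".join(cmd))
```

A list like this could be handed to subprocess.run(cmd, capture_output=True, text=True) on a machine where the fimo binary is installed; leaving unset options off the command line lets FIMO fall back to its own defaults.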