scan_motif

class scan_motif(df_seq, pos_col='pos', n=100, window_size=9, *, motif_pwm, pvalue_threshold=0.0001, label_test=1, label_ref=0, role='Negative', output_mode='segments', max_stored_scores=None, bg_file=None, motif_pseudo=None)

Scan candidate proteins for statistically significant Position Weight Matrix (PWM) occurrences using FIMO ([pro], requires aaanalysis[pro]).

scan_motif is a Command-Line Interface (CLI) wrapper around FIMO (Find Individual Motif Occurrences) [Bailey09], [Grant11] from the MEME (Multiple EM for Motif Elicitation) Suite. Unlike the pure-Python AAWindowSampler.sample_motif_matched(), which keeps windows whose raw per-position PWM sum is >= motif_score_threshold, scan_motif lets FIMO perform its own probabilistic matching: each window is scored against FIMO’s background model (its built-in protein frequencies, or bg_file) and kept only when its match p-value is below pvalue_threshold. The two methods can therefore select different windows and are complementary ways to mine motif-matched training data. In segments mode, the output schema matches AAWindowSampler.sample_motif_matched(), with motif_score holding FIMO’s log-odds score and an added p_value column.

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Rows are split into positive and candidate rows by pos_col; candidate rows feed the FIMO scan.

  • pos_col (str, default='pos') – Column with per-row 1-based positive positions; non-empty cells mark positive rows (excluded from the scan), empty / None / NaN cells mark candidate rows.

  • n (int, default=100) – Maximum number of motif-matched windows to return.

  • window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.

  • motif_pwm (pd.DataFrame) – Position Weight Matrix of shape (window_size, 20) whose columns are the 20 canonical amino-acid letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.

  • pvalue_threshold (float, default=1e-4) – FIMO match-p-value cutoff in [0, 1] (maps to fimo --thresh); only occurrences with a match p-value below this are reported. Smaller is stricter.

  • label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.

  • label_ref (int or float, default=0) – Label assigned to sampled motif-matched rows.

  • role (str, default='Negative') – Role tag stored in the output’s role column.

  • output_mode ({'segments', 'sequences'}, default='segments') –

    Output schema (same as AAWindowSampler.sample_motif_matched()); see Notes:

    • 'segments': one row per matched window — the window string plus its provenance columns (entry, source_position, role, …).

    • 'sequences': one row per source protein with a per-residue labels list (label_test at known-positive positions, label_ref at the source position of each matched window, None elsewhere).

  • max_stored_scores (int, optional) – Maximum number of motif occurrences FIMO may store internally before truncating. FIMO’s default is 100 000; raise this only when scanning very large candidate sets and FIMO reports truncation.

  • bg_file (str or pathlib.Path, optional) – Path to a MEME-format background amino-acid frequency file. When omitted, FIMO uses its built-in protein background.

  • motif_pseudo (float, optional) – Pseudocount applied to the motif before scanning (FIMO’s default is 0.1). Pass 0.0 to disable smoothing.

Returns:

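df_hits – Significant motif hits, ranked by ascending match p-value. In output_mode='segments' there is one row per matched window, and the schema of AAWindowSampler.sample_motif_matched() gains a p_value column while motif_score holds FIMO’s log-odds score. In output_mode='sequences' there is one row per source protein with a per-residue labels list.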

Return type:

pd.DataFrame

Raises:
  • RuntimeError – If the fimo binary is not on PATH.

  • ValueError – If motif_pwm is not provided, if pvalue_threshold is outside [0, 1], if bg_file is set but does not point to an existing file, or if df_seq contains no eligible candidate proteins (rows without test positions).

Notes

  • Candidate sequences are written to a temporary FASTA and the PWM to a temporary MEME letter-probability file (column order remapped to ut.STR_MEME_PROTEIN_ALPHABET); fimo runs in --text mode.

  • FIMO selects and scores the hits at --thresh pvalue_threshold; AAanalysis only ranks them by ascending p-value (deterministic tiebreak by descending score, then entry and start) and caps at n. motif_score is FIMO’s log-odds score, not a PWM-sum.

  • max_stored_scores / bg_file / motif_pseudo map to the FIMO flags --max-stored-scores / --bgfile / --motif-pseudo and genuinely change the reported hits (they tune FIMO’s scoring).

  • Protein-only: the 20 canonical amino acids are passed to MEME as the alphabet; gapped or non-protein alphabets are not supported.
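
The ranking rule in the Notes (ascending p-value, ties broken by descending score, then entry and start, capped at n) can be sketched with plain pandas. This is an illustrative sketch only, not the library's implementation: the rank_hits helper and the start column name are hypothetical, and the values are copied from the first example table.

```python
import pandas as pd

# Toy hit table; 'start' is a hypothetical column holding the window start.
hits = pd.DataFrame({
    "entry": ["C2", "C1", "C1", "C2"],
    "start": [5, 4, 5, 4],
    "motif_score": [16.7197, 16.7197, 12.6178, 11.3567],
    "p_value": [2e-6, 2e-6, 7e-6, 7.5e-5],
})

def rank_hits(df, n):
    """Sort by p-value (ascending), then score (descending), entry, start; keep n rows."""
    ranked = df.sort_values(
        by=["p_value", "motif_score", "entry", "start"],
        ascending=[True, False, True, True],
    )
    return ranked.head(n).reset_index(drop=True)

top = rank_hits(hits, n=3)
print(top)
```

Because every tie is resolved by a further key, the same input always yields the same order, which is what makes the cap at n reproducible.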

Examples

scan_motif searches candidate protein sequences for statistically significant occurrences of a sequence motif described by a Position Weight Matrix (PWM). It is a thin wrapper around FIMO (Find Individual Motif Occurrences) from the MEME (Multiple EM for Motif Elicitation) Suite, run through its Command-Line Interface (CLI). The MEME Suite (the fimo binary) must therefore be installed separately (e.g. conda install -c bioconda meme) in addition to aaanalysis[pro].

You bring a PWM (what the motif looks like) and a set of proteins, and scan_motif asks FIMO where the motif occurs and how surprising each occurrence is under a background amino-acid model. It keeps the windows whose match p-value is below your threshold, ranks them from most to least significant, and returns them as ready-to-label training windows.

Compared with the pure-Python AAWindowSampler.sample_motif_matched, which selects windows by a raw per-position PWM sum and needs no external tool, scan_motif uses FIMO’s probabilistic, background-aware scoring (a p-value). The two therefore select different windows: reach for scan_motif when you want significance-calibrated hits, and for the sampler when you want a dependency-free PWM-sum scan.

It returns one row per matched window (default output_mode='segments'): the standard window schema plus FIMO’s motif_score (a log-odds score) and a p_value column.
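
To make the contrast between a raw PWM sum and a background-aware score concrete, here is a minimal sketch that needs neither aaanalysis nor FIMO. It is not FIMO's actual scoring (FIMO also applies pseudocounts and score scaling before deriving p-values); the pwm_sum and log_odds helpers and both background vectors are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd

AA = list("ACDEFGHIKLMNPQRSTVWY")

# Toy PWM favouring alanine at all 5 positions (each row sums to 1.0).
pwm = pd.DataFrame(0.01, index=range(5), columns=AA)
pwm["A"] = 0.81

# Two hypothetical backgrounds: uniform, and one where alanine is very common.
bg_uniform = pd.Series(1 / 20, index=AA)
bg_a_rich = pd.Series(0.5 / 19, index=AA)
bg_a_rich["A"] = 0.5

def pwm_sum(window):
    """Raw per-position PWM sum; independent of any background."""
    return sum(pwm.loc[i, aa] for i, aa in enumerate(window))

def log_odds(window, bg):
    """Sum over positions of log2(motif probability / background probability)."""
    return sum(np.log2(pwm.loc[i, aa] / bg[aa]) for i, aa in enumerate(window))

for window in ["AAAAA", "AAAAC"]:
    print(window,
          round(pwm_sum(window), 3),
          round(log_odds(window, bg_uniform), 3),
          round(log_odds(window, bg_a_rich), 3))
```

The PWM sum of a window never changes, whereas its log-odds score drops when the background already makes alanine likely. This is why a p-value cutoff and a PWM-sum cutoff can keep different windows.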

import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut

aa.options["verbose"] = False

# A tiny example with three proteins. 'P1' is a known positive (its 'pos' entry
# lists residue 3), so it is EXCLUDED from the scan; 'C1' and 'C2' are candidates.
df_seq = pd.DataFrame({
    "entry": ["P1", "C1", "C2"],
    "sequence": ["MKLVAAAAAGTR", "QWEAAAAACDEF", "GHIKAAAAALMN"],
    "pos": [[3], [], []],
})

# A Position Weight Matrix (5 positions x 20 amino acids) favouring an
# Alanine ('A') run. Columns are the 20 canonical amino acids.
motif_pwm = pd.DataFrame(0.01, index=range(5), columns=list(ut.LIST_CANONICAL_AA))
motif_pwm["A"] = 0.81

# Smallest call: scan the candidates for significant matches of the motif.
df_hits = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                        pvalue_threshold=1e-2)
aa.display_df(df_hits, show_shape=True)
DataFrame shape: (12, 10)
  entry_win entry sequence window source_position label role strategy motif_score p_value
1 C1_4-8 C1 QWEAAAAACDEF AAAAA 6 0 Negative motif_matched 16.719700 0.000002
2 C2_5-9 C2 GHIKAAAAALMN AAAAA 7 0 Negative motif_matched 16.719700 0.000002
3 C1_5-9 C1 QWEAAAAACDEF AAAAC 7 0 Negative motif_matched 12.617800 0.000007
4 C2_4-8 C2 GHIKAAAAALMN KAAAA 6 0 Negative motif_matched 11.356700 0.000075
5 C1_3-7 C1 QWEAAAAACDEF EAAAA 5 0 Negative motif_matched 11.299400 0.000092
6 C2_6-10 C2 GHIKAAAAALMN AAAAL 8 0 Negative motif_matched 10.980900 0.000135
7 C1_2-6 C1 QWEAAAAACDEF WEAAA 4 0 Negative motif_matched 7.579620 0.000238
8 C1_6-10 C1 QWEAAAAACDEF AAACD 8 0 Negative motif_matched 7.369430 0.000286
9 C2_7-11 C2 GHIKAAAAALMN AAALM 9 0 Negative motif_matched 6.592360 0.000787
10 C2_3-7 C2 GHIKAAAAALMN IKAAA 5 0 Negative motif_matched 6.025480 0.001960
11 C1_1-5 C1 QWEAAAAACDEF QWEAA 3 0 Negative motif_matched 2.566880 0.004600
12 C1_7-11 C1 QWEAAAAACDEF AACDE 9 0 Negative motif_matched 1.949040 0.007470

df_seq holds the proteins (entry + sequence columns). pos_col (default 'pos') decides which rows are scanned: rows with a non-empty position are treated as known positives and excluded, while rows with an empty cell are the candidates FIMO scans. Note that P1 contains the same AAAAA motif as the candidates, yet never appears in the results — because it is a positive.

print("Candidate entries with a hit:", sorted(df_hits["entry"].unique()))
print("'P1' excluded despite containing the motif:", "P1" not in set(df_hits["entry"]))
Candidate entries with a hit: ['C1', 'C2']
'P1' excluded despite containing the motif: True
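
The positive/candidate split described above can be approximated in a few lines of pandas. This is a sketch under stated assumptions, not the library's code: has_positions is a hypothetical helper, and treating a blank string as empty is a guess about how such cells would be handled.

```python
import pandas as pd

def has_positions(cell):
    """Return True when a pos cell lists at least one position."""
    if cell is None:
        return False
    if isinstance(cell, float) and pd.isna(cell):
        return False
    if isinstance(cell, (list, tuple, set)):
        return len(cell) > 0
    if isinstance(cell, str):
        return cell.strip() != ""  # assumption: blank strings count as empty
    return True  # e.g. a single integer position

df_demo = pd.DataFrame({
    "entry": ["P1", "C1", "C2", "C3"],
    "sequence": ["MKLVAAAAAGTR", "QWEAAAAACDEF", "GHIKAAAAALMN", "AAAAAKLMNPQR"],
    "pos": [[3], [], None, float("nan")],
})

# Boolean mask: True marks known positives, False marks scan candidates.
mask = pd.Series([has_positions(c) for c in df_demo["pos"]], index=df_demo.index)
positives = df_demo[mask]
candidates = df_demo[~mask]
print("positives:", list(positives["entry"]))
print("candidates:", list(candidates["entry"]))
```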

motif_pwm is the motif (shape (window_size, 20)) and window_size must equal its number of rows. pvalue_threshold is FIMO’s significance cutoff (a probability in [0, 1], mapped to fimo --thresh): smaller is stricter, keeping fewer but more significant windows.

strict = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                       pvalue_threshold=1e-4)
loose = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                      pvalue_threshold=1e-1)
print(f"strict (p<1e-4): {len(strict)} hits;  loose (p<1e-1): {len(loose)} hits")
strict (p<1e-4): 5 hits;  loose (p<1e-1): 15 hits

output_mode picks the schema: 'segments' (default) returns one row per window; 'sequences' returns one row per protein with a per-residue labels list. n caps how many windows are returned (most significant first). role, label_ref, and label_test tag the rows for positive-unlabeled (PU) learning / hard-negative-mining workflows: label_ref/role mark the scanned windows, and (in 'sequences' mode) label_test marks the known-positive positions.

# Segments mode: cap to the 3 most significant windows, tag them as negatives.
df_seg = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                       pvalue_threshold=1e-2, output_mode="segments",
                       n=3, role="Negative", label_ref=0)
aa.display_df(df_seg, show_shape=True)

# Sequences mode: per-residue labels (positives = label_test, hits = label_ref).
df_lab = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                       pvalue_threshold=1e-2, output_mode="sequences",
                       label_test=1, label_ref=0)
aa.display_df(df_lab, show_shape=True)
DataFrame shape: (3, 10)
  entry_win entry sequence window source_position label role strategy motif_score p_value
1 C1_4-8 C1 QWEAAAAACDEF AAAAA 6 0 Negative motif_matched 16.719700 0.000002
2 C2_5-9 C2 GHIKAAAAALMN AAAAA 7 0 Negative motif_matched 16.719700 0.000002
3 C1_5-9 C1 QWEAAAAACDEF AAAAC 7 0 Negative motif_matched 12.617800 0.000007
DataFrame shape: (3, 3)
  entry sequence labels
1 P1 MKLVAAAAAGTR [None, None, 1, None, None, None, None, None, None, None, None, None]
2 C1 QWEAAAAACDEF [None, None, 0, 0, 0, 0, 0, 0, 0, None, None, None]
3 C2 GHIKAAAAALMN [None, None, None, None, 0, 0, 0, 0, 0, None, None, None]
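
The labels lists shown above are consistent with a simple rule: start from None for every residue, write label_ref at the source_position of each hit, and write label_test at each known-positive position. The sketch below reproduces the P1 and C2 rows under that assumption. The rule is inferred from the displayed output rather than from the library source, and build_labels is a hypothetical helper.

```python
def build_labels(sequence, positive_pos=(), hit_positions=(),
                 label_test=1, label_ref=0):
    """Per-residue labels: None by default, label_ref at hits, label_test at positives."""
    labels = [None] * len(sequence)
    for pos in hit_positions:   # 1-based source positions of matched windows
        labels[pos - 1] = label_ref
    for pos in positive_pos:    # 1-based known-positive positions
        labels[pos - 1] = label_test
    return labels

# P1 is a positive at residue 3 and is never scanned.
labels_p1 = build_labels("MKLVAAAAAGTR", positive_pos=[3])

# C2 hits from the first results table have source_position 7, 6, 8, 9 and 5.
labels_c2 = build_labels("GHIKAAAAALMN", hit_positions=[7, 6, 8, 9, 5])

print(labels_p1)
print(labels_c2)
```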

The remaining three parameters pass straight through to FIMO and change which occurrences it reports and how it scores them:

  • motif_pseudo — pseudocount added to the motif before scoring (FIMO default 0.1); pass 0.0 to disable smoothing.

  • max_stored_scores — cap on the number of occurrences FIMO stores internally (FIMO default 100000); raise it only for very large candidate sets.

  • bg_file — path to a MEME-format background amino-acid frequency file; when omitted, FIMO uses its built-in protein background. Changing the background changes the match p-values.

df_tuned = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm, window_size=5,
                         pvalue_threshold=1e-2, motif_pseudo=0.0,
                         max_stored_scores=1_000_000)
aa.display_df(df_tuned, show_shape=True)
DataFrame shape: (13, 10)
  entry_win entry sequence window source_position label role strategy motif_score p_value
1 C1_4-8 C1 QWEAAAAACDEF AAAAA 6 0 Negative motif_matched 17.330800 0.000002
2 C2_5-9 C2 GHIKAAAAALMN AAAAA 7 0 Negative motif_matched 17.330800 0.000002
3 C1_5-9 C1 QWEAAAAACDEF AAAAC 7 0 Negative motif_matched 13.000000 0.000007
4 C2_4-8 C2 GHIKAAAAALMN KAAAA 6 0 Negative motif_matched 11.315800 0.000075
5 C1_3-7 C1 QWEAAAAACDEF EAAAA 5 0 Negative motif_matched 11.225600 0.000092
6 C2_6-10 C2 GHIKAAAAALMN AAAAL 8 0 Negative motif_matched 10.669200 0.000135
7 C1_2-6 C1 QWEAAAAACDEF WEAAA 4 0 Negative motif_matched 7.345860 0.000243
8 C1_6-10 C1 QWEAAAAACDEF AAACD 8 0 Negative motif_matched 7.165410 0.000307
9 C2_7-11 C2 GHIKAAAAALMN AAALM 9 0 Negative motif_matched 6.000000 0.000960
10 C2_3-7 C2 GHIKAAAAALMN IKAAA 5 0 Negative motif_matched 5.353380 0.001960
11 C1_1-5 C1 QWEAAAAACDEF QWEAA 3 0 Negative motif_matched 1.849620 0.004680
12 C1_7-11 C1 QWEAAAAACDEF AACDE 9 0 Negative motif_matched 1.060150 0.007620
13 C2_2-6 C2 GHIKAAAAALMN HIKAA 4 0 Negative motif_matched 0.721805 0.009930
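
For readers curious about what happens under the hood, the Notes say the PWM is written to a MEME letter-probability file and the optional parameters map to FIMO flags. The following sketch shows one way this could look. It is not the library's implementation: write_meme and build_fimo_cmd are hypothetical helpers, the nsites= 20 and E= 0 header values are placeholders, and the command is only assembled, never executed.

```python
import tempfile
from pathlib import Path

import pandas as pd

MEME_AA = "ACDEFGHIKLMNPQRSTVWY"  # assumed MEME protein alphabet order

def write_meme(pwm, path, name="motif"):
    """Write a PWM (rows = positions, columns = amino acids) as a minimal MEME file."""
    header = [
        "MEME version 4",
        "",
        f"ALPHABET= {MEME_AA}",
        "",
        f"MOTIF {name}",
        f"letter-probability matrix: alength= {len(MEME_AA)} w= {len(pwm)} nsites= 20 E= 0",
    ]
    rows = [" ".join(f"{v:.6f}" for v in row) for row in pwm[list(MEME_AA)].to_numpy()]
    Path(path).write_text("\n".join(header + rows) + "\n")

def build_fimo_cmd(meme_path, fasta_path, pvalue_threshold=1e-4,
                   max_stored_scores=None, bg_file=None, motif_pseudo=None):
    """Assemble a fimo argument list; optional flags are added only when set."""
    cmd = ["fimo", "--text", "--thresh", str(pvalue_threshold)]
    if max_stored_scores is not None:
        cmd += ["--max-stored-scores", str(max_stored_scores)]
    if bg_file is not None:
        cmd += ["--bgfile", str(bg_file)]
    if motif_pseudo is not None:
        cmd += ["--motif-pseudo", str(motif_pseudo)]
    return cmd + [str(meme_path), str(fasta_path)]

pwm = pd.DataFrame(0.01, index=range(5), columns=list(MEME_AA))
pwm["A"] = 0.81

with tempfile.TemporaryDirectory() as tmp:
    meme_path = Path(tmp) / "motif.meme"
    write_meme(pwm, meme_path)
    meme_text = meme_path.read_text()
    cmd = build_fimo_cmd(meme_path, Path(tmp) / "candidates.fasta",
                         pvalue_threshold=1e-2, motif_pseudo=0.0,
                         max_stored_scores=1_000_000)

print(meme_text)
print(" ".join(cmd))
```

The resulting list could be handed to subprocess.run once the fimo binary and a candidate FASTA file exist; building it as a list rather than a shell string avoids quoting problems with file paths.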