aaanalysis.scan_motif
- class aaanalysis.scan_motif(df_seq=None, pos_col='pos', n=100, window_size=9, motif_pwm=None, motif_score_threshold=None, label_test=1, label_ref=0, role='Negative', output_mode='segments', max_stored_scores=None, bg_file=None, motif_pseudo=None)
Scan candidate proteins for windows matching a user-supplied PWM (Position Weight Matrix) using the FIMO CLI.
This is a CLI-based wrapper around FIMO [Bailey09], [Grant11] from the MEME Suite that mirrors AAWindowSampler.sample_motif_matched() with strict parity on the returned hit set: the same df_seq, motif_pwm, and motif_score_threshold yield the same hits and identical motif_score values. The output schema (including the motif_score column) matches AAWindowSampler.sample_motif_matched().
Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Rows are split into positive and candidate rows by pos_col; candidate rows feed the FIMO scan.
pos_col (str, default='pos') – Column with per-row 1-based positive positions; non-empty cells mark positive rows (excluded from the scan), empty / None / NaN cells mark candidate rows.
n (int, default=100) – Maximum number of motif-matched windows to return.
window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.
motif_pwm (pd.DataFrame) – Position-weight matrix of shape (window_size, 20) whose columns are the 20 canonical AA letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.
motif_score_threshold (float) – Score threshold (sum of per-position PWM values). Required.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled motif-matched rows.
role (str, default='Negative') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') – Output schema; same as AAWindowSampler.sample_motif_matched().
max_stored_scores (int, optional) – Maximum number of motif occurrences FIMO may store internally before truncating. FIMO’s default is 100 000; raise this only when scanning very large candidate sets and FIMO reports truncation.
bg_file (str or pathlib.Path, optional) – Path to a MEME-format background amino-acid frequency file. When omitted, FIMO uses its built-in protein background.
motif_pseudo (float, optional) – Pseudocount applied to the motif before scanning (FIMO’s default is 0.1). Pass 0.0 to disable smoothing.
- Returns:
df_hits – Scored motif hits, one row per matched window.
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If the fimo binary is not on PATH.
ValueError – If motif_pwm or motif_score_threshold is not provided, if bg_file is set but does not point to an existing file, or if df_seq contains no eligible candidate proteins (rows without test positions).
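The input contract described under Parameters can be illustrated with a small sketch. It is not taken from the library: the entries, sequences, and random PWM values are made up, the comma-separated format of the pos cells is an assumption, and CANONICAL_AA is a local stand-in for ut.LIST_CANONICAL_AA.

```python
import numpy as np
import pandas as pd

# Assumption: local stand-in for ut.LIST_CANONICAL_AA (20 canonical one-letter codes).
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")

# Toy input; entries and sequences are invented for illustration only.
df_seq = pd.DataFrame({
    "entry": ["P1", "P2", "P3", "P4"],
    "sequence": ["MKTAYIAKQRQISFVKSHFSRQ", "MSDNELKRLLAAGG",
                 "MAVLLKDEFGHIKQW", "MGGSWWTRDELLKA"],
    # Assumption: positives are stored as a string of 1-based positions.
    "pos": ["5,12", None, np.nan, ""],
})

# Non-empty pos cells mark positive rows; empty / None / NaN cells mark candidates.
is_candidate = df_seq["pos"].isna() | (df_seq["pos"].astype(str).str.strip() == "")
df_candidates = df_seq[is_candidate]

# A PWM must have shape (window_size, 20) with the canonical letters as columns.
window_size = 9
rng = np.random.default_rng(0)
motif_pwm = pd.DataFrame(rng.normal(size=(window_size, 20)), columns=CANONICAL_AA)

print(df_candidates["entry"].tolist())
print(motif_pwm.shape)
```

Only the rows in df_candidates would be eligible for scanning; if that frame were empty, the documented ValueError would apply.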
Notes
Candidate sequences are written to a temporary FASTA and passed to fimo via subprocess.
The PWM is written in MEME letter-probability format (column order remapped to ut.STR_MEME_PROTEIN_ALPHABET) and fimo runs in --text mode with --thresh 1.0 so every motif occurrence is reported.
Each FIMO hit is re-scored with the raw PWM-sum used by AAWindowSampler.sample_motif_matched(); only positions with score >= motif_score_threshold are kept.
Surviving hits are ranked by descending score (deterministic tiebreak by entry then 0-based center) and capped at n.
Protein-only: this wrapper passes the 20 canonical amino acids to MEME as the alphabet; gapped or non-protein alphabets are not supported.
AAWindowSampler’s class-level max_similarity_to_test / max_similarity_within_ref filters are not applied by this wrapper (it has no class state); leave those at their defaults for parity.
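The re-scoring, thresholding, ranking, and capping steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the wrapper's actual code: score_window and rank_hits are hypothetical helpers, the (entry, center, window) tuple layout and the center and window column names are invented, and the PWM values are toy numbers.

```python
import pandas as pd

def score_window(window, motif_pwm):
    """Raw PWM-sum: add the PWM value of each residue at its window position."""
    return float(sum(motif_pwm.loc[i, aa] for i, aa in enumerate(window)))

def rank_hits(hits, motif_pwm, motif_score_threshold, n):
    """Re-score hits, keep those at or above the threshold, rank, and cap at n.

    hits: iterable of (entry, center, window) tuples (hypothetical layout).
    """
    rows = [
        {"entry": e, "center": c, "window": w,
         "motif_score": score_window(w, motif_pwm)}
        for e, c, w in hits
    ]
    df = pd.DataFrame(rows, columns=["entry", "center", "window", "motif_score"])
    df = df[df["motif_score"] >= motif_score_threshold]
    # Descending score, then deterministic tiebreak by entry and 0-based center.
    df = df.sort_values(by=["motif_score", "entry", "center"],
                        ascending=[False, True, True])
    return df.head(n).reset_index(drop=True)

# Toy PWM for window_size=3: zero everywhere except three favoured residues.
AA = list("ACDEFGHIKLMNPQRSTVWY")
pwm = pd.DataFrame(0.0, index=range(3), columns=AA)
pwm.loc[0, "A"] = 1.0
pwm.loc[1, "C"] = 2.0
pwm.loc[2, "D"] = 3.0

hits = [("P2", 4, "ACD"), ("P1", 7, "ACE"), ("P1", 2, "ACE"), ("P3", 1, "GGG")]
df_hits = rank_hits(hits, pwm, motif_score_threshold=3.0, n=2)
print(df_hits)
```

In this toy run the two "ACE" hits tie at a score of 3.0, so the smaller center decides which one survives the cap at n=2.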
The wrapper sets --text, --thresh 1.0, and --no-qvalue unconditionally because they are required for the parity contract with AAWindowSampler.sample_motif_matched(). The other AAanalysis parameters above are passed through to FIMO as follows:
See also
MEME Suite documentation and FIMO manual.
AAWindowSampler.sample_motif_matched() for the pure-Python equivalent (no FIMO binary required).
Examples
scan_motif (aaanalysis[pro], requires the MEME Suite) scans sequences for a position-weight-matrix motif via MEME/FIMO and returns the scored hits. This notebook is illustrative; running it requires the MEME Suite on the PATH.

    import aaanalysis as aa
    # Requires aaanalysis[pro] + the MEME Suite (FIMO) installed.
    # df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
    # hits = aa.scan_motif(df_seq=df_seq, ...)
    # hits.head()
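For readers without the FIMO binary, the following sketch shows one plausible way to prepare the temporary FASTA and the fixed command-line flags named in the Notes, without running anything. The helpers write_fasta, build_fimo_command, and require_fimo are hypothetical, as are the file names and toy sequences; the wrapper's real internals may differ, and the optional pass-through parameters are deliberately left out.

```python
import shutil
import tempfile
from pathlib import Path

def write_fasta(records, path):
    """Write (entry, sequence) pairs as a plain FASTA file."""
    with open(path, "w") as handle:
        for entry, sequence in records:
            handle.write(f">{entry}\n{sequence}\n")

def build_fimo_command(motif_path, fasta_path):
    """Assemble the fimo argument list with the three fixed flags from the Notes."""
    return ["fimo", "--text", "--thresh", "1.0", "--no-qvalue",
            str(motif_path), str(fasta_path)]

def require_fimo():
    """Mirror the documented RuntimeError when fimo is not on PATH."""
    if shutil.which("fimo") is None:
        raise RuntimeError("fimo binary not found on PATH")

# Toy candidates; entries and sequences are invented for illustration.
tmp_dir = Path(tempfile.mkdtemp())
fasta_path = tmp_dir / "candidates.fasta"
write_fasta([("P2", "MSDNELKRLLAAGG"), ("P3", "MAVLLKDEFGHIKQW")], fasta_path)

# The motif file name is a placeholder; no MEME file is written in this sketch.
cmd = build_fimo_command(tmp_dir / "motif.meme", fasta_path)
print(" ".join(cmd))

# To actually scan, one would first call require_fimo() and then run, e.g.:
# subprocess.run(cmd, capture_output=True, text=True, check=True)
```

Keeping the command assembly separate from execution makes the fixed flags easy to inspect even on machines where the MEME Suite is not installed.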