aaanalysis.scan_motif

class aaanalysis.scan_motif(df_seq=None, pos_col='pos', n=100, window_size=9, motif_pwm=None, motif_score_threshold=None, label_test=1, label_ref=0, role='Negative', output_mode='segments', max_stored_scores=None, bg_file=None, motif_pseudo=None)

Scan candidate proteins for windows matching a user-supplied PWM (Position Weight Matrix) using the FIMO CLI.

This is a CLI-based wrapper around FIMO [Bailey09], [Grant11] from the MEME suite that mirrors AAWindowSampler.sample_motif_matched() with strict parity on the returned hit set: the same df_seq, motif_pwm, and motif_score_threshold yield the same hits and identical motif_score values. The output schema (including the motif_score column) matches AAWindowSampler.sample_motif_matched().

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Rows are split into positive and candidate rows by pos_col; candidate rows feed the FIMO scan.

  • pos_col (str, default='pos') – Column with per-row 1-based positive positions. Non-empty cells mark positive rows, which are excluded from the scan; empty, None, or NaN cells mark candidate rows.

  • n (int, default=100) – Maximum number of motif-matched windows to return.

  • window_size (int, default=9) – Window length; must equal the first dimension of motif_pwm.

  • motif_pwm (pd.DataFrame) – Position-weight matrix of shape (window_size, 20) whose columns are the 20 canonical AA letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required.

  • motif_score_threshold (float) – Score threshold (sum of per-position PWM values). Required.

  • label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.

  • label_ref (int or float, default=0) – Label assigned to sampled motif-matched rows.

  • role (str, default='Negative') – Role tag stored in the output’s role column.

  • output_mode ({'segments', 'sequences'}, default='segments') – Output schema; same as AAWindowSampler.sample_motif_matched().

  • max_stored_scores (int, optional) – Maximum number of motif occurrences FIMO may store internally before truncating. FIMO’s default is 100 000; raise this only when scanning very large candidate sets and FIMO reports truncation.

  • bg_file (str or pathlib.Path, optional) – Path to a MEME-format background amino-acid frequency file. When omitted, FIMO uses its built-in protein background.

  • motif_pseudo (float, optional) – Pseudocount applied to the motif before scanning (FIMO’s default is 0.1). Pass 0.0 to disable smoothing.

Returns:

df_hits – Scored motif hits, one row per matched window.

Return type:

pd.DataFrame

Raises:
  • RuntimeError – If the fimo binary is not on PATH.

  • ValueError – If motif_pwm or motif_score_threshold is not provided, if bg_file is set but does not point to an existing file, or if df_seq contains no eligible candidate proteins (rows without test positions).

Notes

  • Candidate sequences are written to a temporary FASTA and passed to fimo via subprocess.

  • The PWM is written in MEME letter-probability format (column order remapped to ut.STR_MEME_PROTEIN_ALPHABET) and fimo runs in --text mode with --thresh 1.0 so every motif occurrence is reported.

  • Each FIMO hit is re-scored with the raw PWM-sum used by AAWindowSampler.sample_motif_matched(); only positions with score >= motif_score_threshold are kept.

  • Surviving hits are ranked by descending score (deterministic tiebreak by entry then 0-based center) and capped at n.

  • Protein-only: this wrapper passes the 20 canonical amino acids to MEME as the alphabet; gapped or non-protein alphabets are not supported.

  • AAWindowSampler’s class-level max_similarity_to_test / max_similarity_within_ref filters are not applied by this wrapper (it has no class state); leave those at their defaults for parity.
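
The re-scoring, thresholding, ranking, and capping steps described in these notes can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the wrapper's actual implementation: the helper names score_windows and rank_hits, the center column name, and the choice of start + window_size // 2 as the 0-based center are all hypothetical, and the sketch scores every window directly instead of re-scoring hits reported by FIMO.

```python
import numpy as np
import pandas as pd

# The 20 canonical amino acids; the column order used here is an assumption.
AA = list("ACDEFGHIKLMNPQRSTVWY")


def score_windows(seq, pwm):
    """Return (start, score) for each full-length window, using a raw PWM-sum."""
    w = pwm.shape[0]
    pwm = pwm.reindex(columns=AA)
    scored = []
    for start in range(len(seq) - w + 1):
        window = seq[start:start + w]
        if any(aa not in AA for aa in window):
            continue  # skip windows containing non-canonical letters
        # Sum the PWM value of the observed residue at each window position.
        score = sum(pwm.iloc[i][aa] for i, aa in enumerate(window))
        scored.append((start, float(score)))
    return scored


def rank_hits(seqs, pwm, threshold, n):
    """Keep windows with score >= threshold, rank them, and cap at n rows."""
    w = pwm.shape[0]
    rows = []
    for entry, seq in seqs.items():
        for start, score in score_windows(seq, pwm):
            if score >= threshold:
                # Hypothetical center definition: 0-based middle of the window.
                rows.append({"entry": entry, "center": start + w // 2,
                             "motif_score": score})
    df = pd.DataFrame(rows, columns=["entry", "center", "motif_score"])
    # Descending score, then deterministic tiebreak by entry and center.
    df = df.sort_values(["motif_score", "entry", "center"],
                        ascending=[False, True, True])
    return df.head(n).reset_index(drop=True)
```

With a width-3 toy PWM that rewards A, C, D at positions 0, 1, 2, calling rank_hits({"P2": "ACDAC", "P1": "GACDG"}, pwm, threshold=1.0, n=10) keeps one window per protein, and the tie on score is broken by entry so that P1 is listed before P2.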

The wrapper sets --text, --thresh 1.0, and --no-qvalue unconditionally because they are required for the parity contract with AAWindowSampler.sample_motif_matched(). The remaining FIMO-related parameters above (max_stored_scores, bg_file, and motif_pseudo) are passed through to the corresponding FIMO options.

Examples

scan_motif (part of aaanalysis[pro]) scans sequences for a position-weight-matrix motif via MEME/FIMO and returns the scored hits. The example below is illustrative; running it requires the fimo binary from the MEME Suite on the PATH.

import aaanalysis as aa

# Requires aaanalysis[pro] + the MEME Suite (FIMO) installed.
# df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
# hits = aa.scan_motif(df_seq=df_seq, ...)
# hits.head()
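
The inputs described under Parameters can be prepared without FIMO installed. The sketch below builds a toy df_seq and a motif_pwm of shape (window_size, 20). All values are placeholders: the sequences are invented, storing positive positions as a Python list is an assumption about the pos column format, and the PWM holds row-normalized probabilities purely for illustration (the value scale expected by scan_motif is not specified here). The threshold in the commented call is likewise arbitrary.

```python
import numpy as np
import pandas as pd

# The 20 canonical amino acids; any column order is accepted per the docs.
AA = list("ACDEFGHIKLMNPQRSTVWY")
window_size = 9

# Toy input: P1 carries positive positions, P2 and P3 are candidate rows.
df_seq = pd.DataFrame({
    "entry": ["P1", "P2", "P3"],
    "sequence": ["MKTLLVAGLLAAGSA", "MAVLLGGACDEFKLM", "MSTNPKPQRKTKRNT"],
    "pos": [[5, 9], None, None],  # assumed cell format for 1-based positions
})

# Rows with an empty pos cell are the ones that would feed the scan.
# Only None/NaN are treated as empty in this sketch.
df_candidates = df_seq[df_seq["pos"].isna()]

# Placeholder PWM: uniform everywhere except a leucine preference at the
# central position (row 4); each row sums to 1.
motif_pwm = pd.DataFrame(np.full((window_size, 20), 1 / 20), columns=AA)
motif_pwm.loc[4, :] = 0.5 / 19
motif_pwm.loc[4, "L"] = 0.5

# With aaanalysis[pro] and FIMO installed, the call would look like this
# (threshold chosen arbitrarily for illustration):
# hits = aa.scan_motif(df_seq=df_seq, motif_pwm=motif_pwm,
#                      motif_score_threshold=0.5, window_size=window_size)
```

Keeping window_size and the first dimension of motif_pwm tied to one variable avoids the mismatch that the window_size parameter description warns about.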