SeqMut.scan

SeqMut.scan(df_seq, df_feat, region=None, to_aa=None, jmd_n_len=10, jmd_c_len=10)

Run an exhaustive single-position mutational scan and rank mutations by |ΔCPP|.

For every scannable position and every substitution, the change in the CPP feature vector is measured and aggregated into delta_cpp (the L1 magnitude Σ|ΔX|, i.e. the sum of absolute per-feature changes).
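The L1 aggregation can be illustrated with a minimal sketch. The helper name delta_cpp_l1 and the toy feature values are hypothetical and only show the arithmetic; they are not the library's internal implementation.

```python
def delta_cpp_l1(x_wt, x_mut):
    """Sum of absolute per-feature differences between two feature vectors."""
    return sum(abs(m - w) for w, m in zip(x_wt, x_mut))

# Toy feature values for a wild-type and a mutated sequence (made up for illustration)
x_wt = [0.50, 0.20, 0.75]
x_mut = [0.25, 0.20, 1.00]

# |0.25 - 0.50| + |0.20 - 0.20| + |1.00 - 0.75|
delta = delta_cpp_l1(x_wt, x_mut)
print(delta)
```

A mutation that leaves every feature unchanged yields 0, and larger values mean a larger total change across the features.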

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame with an entry column of unique protein identifiers, provided in the position-based format (columns sequence, tmd_start, tmd_stop). See SequenceFeature.get_df_parts() for the full df_seq format specification.

  • df_feat (pd.DataFrame) – CPP feature set (output of CPP.run()) defining which features ΔCPP is measured over.

  • region (str or list of int, optional) – Restrict the scan: None covers the full JMD-N + TMD + JMD-C span, a part name ('jmd_n' / 'tmd' / 'jmd_c') restricts to that part, and a list restricts to those 1-based positions.

  • to_aa (list of str, optional) – Substitution alphabet. If None, every canonical amino acid (except the wild-type residue) is tried at each position.

  • jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids.

  • jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids.
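How region, to_aa, jmd_n_len, and jmd_c_len together determine the scanned mutations can be sketched in plain Python. The function enumerate_mutations below is a hypothetical illustration, not the library's code. It assumes that tmd_start and tmd_stop are 1-based and inclusive, and that the JMD parts are clipped at the sequence ends.

```python
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_mutations(seq, tmd_start, tmd_stop, region=None, to_aa=None,
                        jmd_n_len=10, jmd_c_len=10):
    """List (pos, from_aa, to_aa, mutation, region) for single substitutions."""
    # 1-based inclusive bounds of each part (assumption), clipped to the sequence
    parts = {
        "jmd_n": (max(1, tmd_start - jmd_n_len), tmd_start - 1),
        "tmd": (tmd_start, tmd_stop),
        "jmd_c": (tmd_stop + 1, min(len(seq), tmd_stop + jmd_c_len)),
    }

    def part_of(pos):
        for name, (start, stop) in parts.items():
            if start <= pos <= stop:
                return name
        return None

    if region is None:                      # full JMD-N + TMD + JMD-C span
        positions = range(parts["jmd_n"][0], parts["jmd_c"][1] + 1)
    elif isinstance(region, str):           # a single part name
        start, stop = parts[region]
        positions = range(start, stop + 1)
    else:                                   # explicit list of 1-based positions
        positions = sorted(region)

    alphabet = CANONICAL_AA if to_aa is None else to_aa
    rows = []
    for pos in positions:
        wt = seq[pos - 1]
        for aa in alphabet:
            if aa != wt:                    # skip the wild-type residue
                rows.append((pos, wt, aa, f"{wt}{pos}{aa}", part_of(pos)))
    return rows

# Toy sequence: 5-residue JMD-N, 5-residue TMD (positions 6-10), 5-residue JMD-C
seq = "GSGSG" + "LLAVL" + "KRKRK"
tmd_rows = enumerate_mutations(seq, 6, 10, region="tmd", jmd_n_len=5, jmd_c_len=5)
pg_rows = enumerate_mutations(seq, 6, 10, region=[8], to_aa=["P", "G"])
```

With the default alphabet, each scanned position contributes 19 rows, so the size of the result grows linearly with the number of positions in the chosen region.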

Returns:

df_scan – Tidy mutation landscape with columns entry, pos, from_aa, to_aa, mutation, region, delta_cpp, and shift_score, sorted by descending delta_cpp. When a model is bound to this SeqMut, the model prediction-shift columns delta_pred (ΔP, percentage points), wt_pred and wt_pred_std are appended — this is the data behind the mutation-scan heatmap.

Return type:

pd.DataFrame, shape (n_mutations, 8), or (n_mutations, 11) when a model is bound and the three prediction-shift columns are appended

Examples

SeqMut measures the model-free change a mutation induces in a set of CPP features (ΔCPP). We first build a feature set with a small CPP run, then SeqMut.scan() enumerates every TMD substitution and ranks them by delta_cpp (the L1 magnitude of the feature change).

import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
split_kws = sf.get_split_kws()
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, verbose=False)
df_feat = cpp.run(labels=labels, n_filter=25)

seqmut = aa.SeqMut()
df_scan = seqmut.scan(df_seq=df_seq, df_feat=df_feat, region="tmd")
aa.display_df(df_scan, n_rows=10, show_shape=True)
DataFrame shape: (8740, 8)
     entry    pos   from_aa   to_aa   mutation   region   delta_cpp   shift_score
1    P16070   669   A         P       A669P      tmd      4.046420    -3.934420
2    P16070   665   A         P       A665P      tmd      3.975670    -3.863670
3    P09803   730   L         P       L730P      tmd      3.523250    -3.355250
4    Q03157   604   L         P       L604P      tmd      3.523250    -3.355250
5    P05556   748   L         P       L748P      tmd      3.523250    -3.355250
6    Q06481   713   L         P       L713P      tmd      3.523250    -3.355250
7    P05067   720   L         P       L720P      tmd      3.523250    -3.355250
8    P16070   669   A         G       A669G      tmd      3.422670    -3.422670
9    P70180   492   L         P       L492P      tmd      3.417250    -3.249250
10   P01135   114   L         P       L114P      tmd      3.417250    -3.249250
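Because df_scan is a tidy table, it can be reshaped with ordinary pandas operations. The following is a minimal sketch that rebuilds three rows of the output above by hand (so it runs without aaanalysis), pivots them into a position-by-substitution matrix of the kind a heatmap would display, and picks the strongest substitution per position. The variable names heat and top_per_pos are illustrative only.

```python
import pandas as pd

# Three rows copied from the example output above (entry P16070)
df_scan = pd.DataFrame({
    "entry": ["P16070", "P16070", "P16070"],
    "pos": [669, 665, 669],
    "to_aa": ["P", "P", "G"],
    "delta_cpp": [4.046420, 3.975670, 3.422670],
})

# Select one protein, then reshape to positions (rows) x substitutions (columns)
df_one = df_scan[df_scan["entry"] == "P16070"]
heat = df_one.pivot(index="pos", columns="to_aa", values="delta_cpp")

# Strongest substitution at each position
top_per_pos = df_one.loc[df_one.groupby("pos")["delta_cpp"].idxmax()]
print(heat)
print(top_per_pos)
```

Cells without a matching row (here position 665 with G, which is not among the three copied rows) come out as NaN in the pivoted matrix.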

region can be a part name ('jmd_n' / 'tmd' / 'jmd_c'), None (the full JMD-N + TMD + JMD-C span), or a list of 1-based positions; to_aa restricts the substitution alphabet.