SeqMut.scan

SeqMut.scan(df_seq, df_feat, region=None, to_aa=None, jmd_n_len=10, jmd_c_len=10)

Run an exhaustive single-position mutational scan and rank mutations by |ΔCPP|.

For every scannable position and every substitution, the change in the CPP feature vector (ΔX) is measured and aggregated into delta_cpp, the L1 magnitude Σ|ΔX| taken over the features in df_feat.
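The aggregation step can be sketched in a few lines. This is a minimal illustration only: the function name l1_delta and the toy feature values are hypothetical, and it does not show how SeqMut computes the per-feature values themselves.

```python
def l1_delta(x_wt, x_mut):
    """Sum of absolute per-feature changes between two feature vectors.

    Hypothetical helper: illustrates the L1 magnitude only, not the
    library's internal feature computation.
    """
    if len(x_wt) != len(x_mut):
        raise ValueError("feature vectors must have the same length")
    return sum(abs(m - w) for w, m in zip(x_wt, x_mut))


# Toy feature values for a wild-type and a mutated sequence (made up)
x_wt = [0.50, 0.20, 0.10]
x_mut = [0.10, 0.40, 0.10]
delta = l1_delta(x_wt, x_mut)  # |−0.4| + |0.2| + |0.0|
```

Because absolute values are summed, opposite-sign changes in different features do not cancel, so delta_cpp reflects the total size of the change rather than its direction.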

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers, in the position-based format (sequence, tmd_start, tmd_stop). See SequenceFeature.get_df_parts() for the full df_seq format specification.
df_feat (pd.DataFrame) – CPP feature set (output of CPP.run()) defining which features ΔCPP is measured over.
region (str or list of int, optional) – Restrict the scan: None covers the full JMD-N + TMD + JMD-C span, a part name ('jmd_n' / 'tmd' / 'jmd_c') restricts to that part, and a list restricts to those 1-based positions.
to_aa (list of str, optional) – Substitution alphabet. If None, every canonical amino acid (except the wild-type residue) is tried at each position.
jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids.
jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids.

Returns:

df_scan – Tidy mutation landscape with columns entry, pos, from_aa, to_aa, mutation, region, delta_cpp, and shift_score, sorted by descending delta_cpp. When a model is bound to this SeqMut, the prediction-shift columns delta_pred (ΔP, in percentage points), wt_pred, and wt_pred_std are appended. This is the data behind the mutation-scan heatmap.

Return type:

pd.DataFrame, shape (n_mutations, 8); three further columns are appended when a model is bound

Examples

SeqMut measures the model-free change a mutation induces in a set of CPP features (ΔCPP). We first build a feature set with a small CPP run; SeqMut.scan() then enumerates every TMD substitution and ranks them by delta_cpp (the L1 magnitude of the feature change).

import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
split_kws = sf.get_split_kws()
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, verbose=False)
df_feat = cpp.run(labels=labels, n_filter=25)

seqm = aa.SeqMut()
df_scan = seqm.scan(df_seq=df_seq, df_feat=df_feat, region="tmd")
aa.display_df(df_scan, n_rows=10, show_shape=True)

DataFrame shape: (8740, 8)

	entry	pos	from_aa	to_aa	mutation	region	delta_cpp	shift_score
1	P16070	669	A	P	A669P	tmd	4.046420	-3.934420
2	P16070	665	A	P	A665P	tmd	3.975670	-3.863670
3	P09803	730	L	P	L730P	tmd	3.523250	-3.355250
4	Q03157	604	L	P	L604P	tmd	3.523250	-3.355250
5	P05556	748	L	P	L748P	tmd	3.523250	-3.355250
6	Q06481	713	L	P	L713P	tmd	3.523250	-3.355250
7	P05067	720	L	P	L720P	tmd	3.523250	-3.355250
8	P16070	669	A	G	A669G	tmd	3.422670	-3.422670
9	P70180	492	L	P	L492P	tmd	3.417250	-3.249250
10	P01135	114	L	P	L114P	tmd	3.417250	-3.249250
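The enumeration behind such a scan can be sketched independently of the library. The helper below is hypothetical (enumerate_substitutions and CANONICAL_AA are illustrative names, not part of aaanalysis). It assumes 1-based positions and the mutation label format seen in the table (wild-type residue, position, new residue), and it assumes that the wild-type residue is skipped for a restricted to_aa alphabet as well.

```python
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def enumerate_substitutions(sequence, positions, to_aa=None):
    """List (pos, from_aa, to_aa, mutation) for every single substitution.

    Hypothetical sketch: positions are 1-based, and the wild-type
    residue is never proposed as its own substitution.
    """
    alphabet = list(CANONICAL_AA) if to_aa is None else list(to_aa)
    mutations = []
    for pos in positions:
        wt = sequence[pos - 1]  # convert 1-based position to 0-based index
        for new_aa in alphabet:
            if new_aa != wt:
                mutations.append((pos, wt, new_aa, f"{wt}{pos}{new_aa}"))
    return mutations


# Toy three-residue sequence: 3 positions x 19 alternatives each
full = enumerate_substitutions("ALV", positions=[1, 2, 3])
# Restricted alphabet: 4 candidates minus the wild type at each position
restricted = enumerate_substitutions("ALV", [1, 2, 3], to_aa=["A", "L", "V", "P"])
```

With the full alphabet each position contributes 19 rows, which is consistent with the 8740 rows shown above (8740 = 460 × 19).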

region can be a part name ('jmd_n' / 'tmd' / 'jmd_c'), None (the full JMD-N + TMD + JMD-C span), or a list of 1-based positions; to_aa restricts the substitution alphabet.

# to_aa restricts the substitution alphabet (here: a hydrophobic + proline set);
# jmd_n_len / jmd_c_len set the JMD lengths used to classify each position's region
df_scan_sub = seqm.scan(df_seq=df_seq, df_feat=df_feat,
                        region="tmd", to_aa=["A", "L", "V", "P"],
                        jmd_n_len=10, jmd_c_len=10)
aa.display_df(df_scan_sub, n_rows=10, show_shape=True)

DataFrame shape: (1593, 8)

	entry	pos	from_aa	to_aa	mutation	region	delta_cpp	shift_score
1	P16070	669	A	P	A669P	tmd	4.046420	-3.934420
2	P16070	665	A	P	A665P	tmd	3.975670	-3.863670
3	Q03157	604	L	P	L604P	tmd	3.523250	-3.355250
4	P09803	730	L	P	L730P	tmd	3.523250	-3.355250
5	Q06481	713	L	P	L713P	tmd	3.523250	-3.355250
6	P05556	748	L	P	L748P	tmd	3.523250	-3.355250
7	P05067	720	L	P	L720P	tmd	3.523250	-3.355250
8	P16234	542	L	P	L542P	tmd	3.417250	-3.249250
9	P70180	492	L	P	L492P	tmd	3.417250	-3.249250
10	P01135	114	L	P	L114P	tmd	3.417250	-3.249250
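How jmd_n_len and jmd_c_len can map a position to a region label is sketched below. This is an assumption-laden illustration, not the library's implementation: classify_position is a hypothetical helper, positions are taken as 1-based, tmd_stop is treated as inclusive, and clipping at the sequence ends is ignored.

```python
def classify_position(pos, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Return 'jmd_n', 'tmd', 'jmd_c', or None for a 1-based position.

    Hypothetical sketch: assumes tmd_start and tmd_stop are 1-based and
    inclusive, with JMD-N directly before and JMD-C directly after the TMD.
    """
    if tmd_start - jmd_n_len <= pos < tmd_start:
        return "jmd_n"
    if tmd_start <= pos <= tmd_stop:
        return "tmd"
    if tmd_stop < pos <= tmd_stop + jmd_c_len:
        return "jmd_c"
    return None  # outside the JMD-N + TMD + JMD-C span


# Toy protein with a TMD spanning positions 11..30
labels = [classify_position(p, tmd_start=11, tmd_stop=30) for p in (5, 11, 30, 31, 41)]
```

Under these assumptions, changing jmd_n_len or jmd_c_len widens or narrows the flanking windows, which changes both the positions covered by region=None and the region label assigned to each row.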