SeqMut.mutate

SeqMut.mutate(df_seq, mutations, df_feat=None, jmd_n_len=10, jmd_c_len=10)

Apply specific point mutations to sequences and (optionally) measure their ΔCPP.

Each row of mutations edits one residue of its entry’s sequence; the mutated sequence and a human-readable mutation label are always returned, and the feature-space change is added when a df_feat is supplied.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame with an entry column of unique protein identifiers, given in the position-based format (columns sequence, tmd_start, tmd_stop). See SequenceFeature.get_df_parts() for the full df_seq format specification.
mutations (pd.DataFrame, shape (n_mutations, >=3)) – Tidy mutation table with columns entry, pos (1-based position in the full sequence), and to_aa (target amino acid). from_aa is derived and checked.
df_feat (pd.DataFrame, optional) – CPP feature set (output of CPP.run()). If given, the per-mutation ΔCPP (delta_cpp) and shift_score toward the test-class profile are added; when a model is bound to this SeqMut, the model prediction-shift delta_pred is added too.
jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids.
jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids.

Returns:

df_mut – The mutations table augmented with from_aa, mutation ("<from><pos><to>"), sequence_mut (the mutated sequence), and — when df_feat is given — delta_cpp and shift_score (plus delta_pred when a model is bound).

Return type:

pd.DataFrame, shape (n_mutations, n_info)
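
The per-row columns described above (from_aa, mutation, sequence_mut) can be illustrated with a minimal, library-independent sketch. The helper apply_point_mutation and the toy sequence are hypothetical and only mirror the documented conventions (1-based pos, label "<from><pos><to>"); this is not the library's implementation.

```python
def apply_point_mutation(sequence: str, pos: int, to_aa: str) -> dict:
    """Apply one point mutation at a 1-based position and describe it.

    Hypothetical helper for illustration; not part of aaanalysis.
    """
    if not 1 <= pos <= len(sequence):
        raise ValueError(f"pos={pos} is outside the sequence (length {len(sequence)})")
    from_aa = sequence[pos - 1]  # 1-based pos -> 0-based string index
    sequence_mut = sequence[:pos - 1] + to_aa + sequence[pos:]
    return {
        "from_aa": from_aa,
        "mutation": f"{from_aa}{pos}{to_aa}",  # "<from><pos><to>" label
        "sequence_mut": sequence_mut,
    }


# Toy sequence (made up): replace the third residue with proline
row = apply_point_mutation("MKTLLV", pos=3, to_aa="P")
print(row)
```

The mutated sequence always has the same length as the original, since exactly one residue is substituted.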

Examples

SeqMut.mutate applies specific point mutations from a tidy mutations table (entry, pos, to_aa) and, when df_feat is given, reports each mutation’s ΔCPP and its shift toward the test-class profile.

import pandas as pd
import aaanalysis as aa
aa.options["verbose"] = False

# Load an example dataset and compute the CPP feature set that ΔCPP is measured against
df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
split_kws = sf.get_split_kws()
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, verbose=False)
df_feat = cpp.run(labels=labels, n_filter=25)

# A proline scan across the first 12 TMD positions of one protein.
entry = df_seq["entry"].iloc[0]
start = int(df_seq.set_index("entry").loc[entry, "tmd_start"])
mutations = pd.DataFrame({"entry": entry, "pos": range(start, start + 12), "to_aa": "P"})
seqmut = aa.SeqMut()
aa.display_df(seqmut.mutate(df_seq=df_seq, mutations=mutations, df_feat=df_feat), n_rows=10, show_shape=True)

DataFrame shape: (12, 8)

	entry	pos	to_aa	from_aa	mutation	sequence_mut	delta_cpp	shift_score
1	Q14802	37	P	L	L37P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.167000	0.167000
2	Q14802	38	P	Q	Q38P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
3	Q14802	39	P	V	V39P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.160000	-0.160000
4	Q14802	40	P	G	G40P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
5	Q14802	41	P	G	G41P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.181000	0.181000
6	Q14802	42	P	L	L42P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
7	Q14802	43	P	I	I43P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.179750	-0.179750
8	Q14802	44	P	C	C44P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.248320	0.248320
9	Q14802	45	P	A	A45P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
10	Q14802	46	P	G	G46P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
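
The tidy mutations table is not limited to a single target residue. As a sketch, the rows for a scan over several positions and several target residues can be generated with the standard library; build_scan_rows is a hypothetical helper, and the entry and start position are taken from the output above.

```python
from itertools import product


def build_scan_rows(entry, start, n_pos, to_aas):
    """Return one row per (position, target residue) pair; positions are 1-based.

    Hypothetical helper for illustration; not part of aaanalysis.
    """
    positions = range(start, start + n_pos)
    return [
        {"entry": entry, "pos": pos, "to_aa": to_aa}
        for pos, to_aa in product(positions, to_aas)
    ]


# Alanine and proline at the first three positions of the scan shown above
rows = build_scan_rows(entry="Q14802", start=37, n_pos=3, to_aas="AP")
for row in rows:
    print(row)
```

Wrapping the result in pd.DataFrame(rows) gives a table with the entry, pos, and to_aa columns that can be passed as mutations.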

jmd_n_len and jmd_c_len set the JMD-N / JMD-C lengths (in residues). They are used to place the mutated positions into the JMD-N / TMD / JMD-C parts and to build the CPP feature matrix over which ΔCPP is measured, so they must match the split geometry used to compute df_feat.

# jmd_n_len / jmd_c_len match the default 10/10 split geometry used for df_feat
df_mut = seqmut.mutate(df_seq=df_seq, mutations=mutations, df_feat=df_feat,
                       jmd_n_len=10, jmd_c_len=10)
aa.display_df(df_mut, n_rows=10, show_shape=True)

DataFrame shape: (12, 8)

	entry	pos	to_aa	from_aa	mutation	sequence_mut	delta_cpp	shift_score
1	Q14802	37	P	L	L37P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.167000	0.167000
2	Q14802	38	P	Q	Q38P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
3	Q14802	39	P	V	V39P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.160000	-0.160000
4	Q14802	40	P	G	G40P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
5	Q14802	41	P	G	G41P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.181000	0.181000
6	Q14802	42	P	L	L42P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
7	Q14802	43	P	I	I43P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.179750	-0.179750
8	Q14802	44	P	C	C44P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.248320	0.248320
9	Q14802	45	P	A	A45P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
10	Q14802	46	P	G	G46P	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.000000	0.000000
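
How jmd_n_len and jmd_c_len determine the part a mutated position falls into can be sketched as follows. This is an illustration under stated assumptions, not the library's code: tmd_start and tmd_stop are taken to be 1-based and inclusive, the JMDs are taken to flank the TMD directly, locate_part is a hypothetical helper, and tmd_stop=59 is a made-up value.

```python
def locate_part(pos, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Name the sequence part containing a 1-based position.

    Assumes tmd_start/tmd_stop are 1-based and inclusive, with JMD-N directly
    before and JMD-C directly after the TMD. Returns None outside all parts.
    Hypothetical helper for illustration; not part of aaanalysis.
    """
    if tmd_start <= pos <= tmd_stop:
        return "TMD"
    if tmd_start - jmd_n_len <= pos < tmd_start:
        return "JMD-N"
    if tmd_stop < pos <= tmd_stop + jmd_c_len:
        return "JMD-C"
    return None


# tmd_start=37 matches the scan above; tmd_stop=59 is a made-up value
for pos in (26, 27, 36, 37, 59, 60, 69, 70):
    print(pos, locate_part(pos, tmd_start=37, tmd_stop=59))
```

Under these assumptions, changing jmd_n_len or jmd_c_len moves the outer boundaries of the flanking parts, which is why the values passed to mutate need to agree with those used when df_feat was computed.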