SeqMut.combine

SeqMut.combine(df_seq, variants, df_feat, jmd_n_len=10, jmd_c_len=10)

Score combined (multi-mutation) variants by applying their mutations together.

Each variant groups several point mutations that are applied to the same sequence, yielding one combined sequence whose ΔCPP (and, with a bound model, prediction shift delta_pred) is measured once. This is how 2-3 mutations are combined, in contrast to SeqMut.mutate(), which scores every point mutation independently against the wild-type.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers, in the position-based format (sequence, tmd_start, tmd_stop). See SequenceFeature.get_df_parts() for the full df_seq format specification.
variants (pd.DataFrame, shape (n_mutations, >=4)) – Tidy table with columns entry, variant (a grouping id), pos (1-based) and to_aa. Rows sharing the same entry and variant are applied together as one combined variant; from_aa is derived and checked, and a variant must mutate distinct positions.
df_feat (pd.DataFrame) – CPP feature set (output of CPP.run()) defining the features over which ΔCPP is measured.
jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids.
jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids.
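
As an illustration of the tidy variants format described above, the following sketch builds such a table by enumerating all pairs from a list of candidate point mutations. The entry name, the candidate list and the way the variant grouping id is formed are hypothetical choices made for this example; only the four column names come from the parameter description.

```python
import itertools
import pandas as pd

# Hypothetical inputs: a placeholder entry and candidate (pos, to_aa) mutations.
# Positions are 1-based, as required for the `pos` column.
entry = "ENTRY_1"
candidates = [(20, "K"), (27, "P"), (27, "A"), (31, "L")]

rows = []
for combo in itertools.combinations(candidates, 2):
    positions = [pos for pos, _ in combo]
    # A variant must mutate distinct positions, so skip pairs hitting the same site
    if len(set(positions)) < len(positions):
        continue
    # Grouping id shared by all rows of this combined variant (naming is arbitrary)
    variant_id = "_".join(f"{pos}{to_aa}" for pos, to_aa in combo)
    for pos, to_aa in combo:
        rows.append({"entry": entry, "variant": variant_id, "pos": pos, "to_aa": to_aa})

variants = pd.DataFrame(rows, columns=["entry", "variant", "pos", "to_aa"])
print(variants)
```

Each pair contributes two rows that share one variant id, which is the grouping that the variants parameter expects. Replacing 2 with 3 in itertools.combinations would enumerate triple variants in the same way.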

Returns:

df_variant – One row per combined variant with the columns entry, variant (the '+'-joined single mutations, e.g. "R20K+K27P"), n_mut, sequence_mut, delta_cpp and shift_score, plus delta_pred when a model is bound. Rows are sorted by descending delta_pred (model bound) or shift_score (model-free).

Return type:

pd.DataFrame, shape (n_variants, 6), or (n_variants, 7) when a model is bound

Examples

Binding a fitted model makes SeqMut model-aware: each scored mutation carries delta_pred, the change in the model's prediction score (in percentage points) that it induces. SeqMut.combine() applies a named set of point mutations to the same sequence and scores each combined variant once, so 2-3 mutations are evaluated together (unlike SeqMut.mutate(), which scores each independently).

import itertools
import pandas as pd
import matplotlib.pyplot as plt
import aaanalysis as aa
aa.options["verbose"] = False

# Data, CPP features, and a fitted TreeModel that scores each sequence
df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
split_kws = sf.get_split_kws()
df_scales = aa.load_scales()
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales, verbose=False)
df_feat = cpp.run(labels=labels, n_filter=25)
X = sf.feature_matrix(features=list(df_feat["feature"]), df_parts=df_parts, df_scales=df_scales)
tm = aa.TreeModel().fit(X, labels=labels)
entry = df_seq["entry"].iloc[0]
ts = int(df_seq.set_index("entry").loc[entry, "tmd_start"])

seqm = aa.SeqMut(model=tm)
variants = pd.DataFrame({
    "entry": [entry] * 5,
    "variant": ["double", "double", "triple", "triple", "triple"],
    "pos": [ts, ts + 1, ts, ts + 2, ts + 4],
    "to_aa": ["A", "P", "A", "K", "L"],
})
df_variant = seqm.combine(df_seq=df_seq, variants=variants, df_feat=df_feat)
aa.display_df(df_variant, n_rows=10, show_shape=True)

DataFrame shape: (2, 7)

	entry	variant	n_mut	sequence_mut	delta_cpp	shift_score	delta_pred
1	Q14802	L37A+V39K+G41L	3	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.326320	-0.055680	4.500000
2	Q14802	L37A+Q38P	2	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.066000	0.066000	2.000000
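
The variant labels in the output (for example L37A+Q38P) combine the derived wild-type residue, the 1-based position and the new residue. The sketch below shows one way such a combined sequence and label could be produced with plain pandas. It is not the library's implementation: the helper apply_variant, the toy sequence and the ordering of the label by position are assumptions made for illustration.

```python
import pandas as pd

def apply_variant(sequence, mutations):
    """Apply (pos, to_aa) point mutations (1-based pos) to one sequence.

    Hypothetical helper: returns the mutated sequence and a '+'-joined label.
    """
    positions = [pos for pos, _ in mutations]
    if len(set(positions)) != len(positions):
        raise ValueError("a variant must mutate distinct positions")
    residues = list(sequence)
    labels = []
    for pos, to_aa in sorted(mutations):
        if not 1 <= pos <= len(sequence):
            raise ValueError(f"position {pos} is outside the sequence")
        from_aa = residues[pos - 1]  # derived from the wild-type sequence
        residues[pos - 1] = to_aa
        labels.append(f"{from_aa}{pos}{to_aa}")
    return "".join(residues), "+".join(labels)

# Toy data (hypothetical sequence and entry name)
wild_type = "MARLKKGLLV"
variants = pd.DataFrame({
    "entry": ["toy"] * 5,
    "variant": ["double", "double", "triple", "triple", "triple"],
    "pos": [3, 6, 3, 5, 8],
    "to_aa": ["K", "P", "K", "A", "A"],
})

# Rows sharing entry and variant are applied together as one combined variant
results = {}
for (entry, variant_id), group in variants.groupby(["entry", "variant"]):
    muts = list(zip(group["pos"], group["to_aa"]))
    results[variant_id] = apply_variant(wild_type, muts)

print(results["double"])  # ('MAKLKPGLLV', 'R3K+K6P')
print(results["triple"])  # ('MAKLAKGALV', 'R3K+K5A+L8A')
```

Because the positions within a variant are distinct, each from_aa can be read from the sequence before or after the other mutations are applied without changing the result.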

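The juxtamembrane domain (JMD) lengths jmd_n_len / jmd_c_len set how many flanking residues are used when rebuilding the sequence parts. The call below passes the default values (10 and 10) explicitly, so its result matches the one above.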

df_variant2 = seqm.combine(df_seq=df_seq, variants=variants, df_feat=df_feat,
                            jmd_n_len=10, jmd_c_len=10)
aa.display_df(df_variant2, n_rows=10, show_shape=True)

DataFrame shape: (2, 7)

	entry	variant	n_mut	sequence_mut	delta_cpp	shift_score	delta_pred
1	Q14802	L37A+V39K+G41L	3	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.326320	-0.055680	4.500000
2	Q14802	L37A+Q38P	2	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0.066000	0.066000	2.000000
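
To make the role of jmd_n_len and jmd_c_len more concrete, here is a minimal sketch of how a sequence in the position-based format might be cut into JMD-N, TMD and JMD-C. The helper split_parts is hypothetical, and it assumes that tmd_start and tmd_stop are 1-based and inclusive; the library's own part-building logic may differ.

```python
def split_parts(sequence, tmd_start, tmd_stop, jmd_n_len=10, jmd_c_len=10):
    """Slice a sequence into (JMD-N, TMD, JMD-C).

    Hypothetical helper; assumes tmd_start and tmd_stop are 1-based, inclusive.
    """
    start = tmd_start - 1  # 0-based index of the first TMD residue
    jmd_n = sequence[max(0, start - jmd_n_len):start]
    tmd = sequence[start:tmd_stop]
    jmd_c = sequence[tmd_stop:tmd_stop + jmd_c_len]
    return jmd_n, tmd, jmd_c

# Toy 30-residue sequence with the TMD at positions 11..20
seq = "AAAAACCCCCTTTTTTTTTTGGGGGHHHHH"

parts_10 = split_parts(seq, tmd_start=11, tmd_stop=20)
parts_5 = split_parts(seq, tmd_start=11, tmd_stop=20, jmd_n_len=5, jmd_c_len=5)

print(parts_10)  # ('AAAAACCCCC', 'TTTTTTTTTT', 'GGGGGHHHHH')
print(parts_5)   # ('CCCCC', 'TTTTTTTTTT', 'GGGGG')
```

Under this reading, shorter JMD lengths narrow the flanking windows, so a mutation that lies further from the TMD than the chosen length would fall outside the rebuilt parts.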