SeqMut.combine
- SeqMut.combine(df_seq, variants, df_feat, jmd_n_len=10, jmd_c_len=10)
Score combined (multi-mutation) variants by applying their mutations together.
Each variant groups several point mutations that are applied to the same sequence, yielding one combined sequence whose ΔCPP (and, with a bound
model, prediction shift delta_pred) is measured once. This is how 2-3 mutations are combined, in contrast to SeqMut.mutate(), which scores every point mutation independently against the wild type.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers, in the position-based format (sequence, tmd_start, tmd_stop). See SequenceFeature.get_df_parts() for the full df_seq format specification.
variants (pd.DataFrame, shape (n_mutations, >=4)) – Tidy table with columns entry, variant (a grouping id), pos (1-based) and to_aa. Rows sharing the same entry and variant are applied together as one combined variant; from_aa is derived and checked, and a variant must mutate distinct positions.
df_feat (pd.DataFrame) – CPP feature set (output of CPP.run()) defining the features over which ΔCPP is measured.
jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids.
jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids.
- Returns:
df_variant – One row per combined variant with entry, variant (the '+'-joined single mutations, e.g. "R20K+K27P"), n_mut, sequence_mut, delta_cpp and shift_score, plus delta_pred when a model is bound. Rows are sorted by descending delta_pred (model) or shift_score (model-free).
- Return type:
pd.DataFrame, shape (n_variants, 6), or (n_variants, 7) when a model is bound
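The variants format described above can be illustrated with a minimal pandas sketch. It is not the library's implementation: combine_mutations is a hypothetical helper, the toy sequence is made up, and sorting the label by position is an assumption. It only shows how rows sharing entry and variant could be applied together, with from_aa read from the wild-type sequence at the 1-based pos.

```python
import pandas as pd


def combine_mutations(sequence, df_mut):
    """Apply all rows of df_mut to sequence at once (hypothetical helper)."""
    # A combined variant must not hit the same position twice
    if df_mut["pos"].duplicated().any():
        raise ValueError("a variant must mutate distinct positions")
    residues = list(sequence)
    labels = []
    # Assumption: single-mutation labels are ordered by position
    for row in df_mut.sort_values("pos").itertuples(index=False):
        if not 1 <= row.pos <= len(sequence):
            raise ValueError(f"position {row.pos} is outside the sequence")
        from_aa = sequence[row.pos - 1]  # pos is 1-based; derive wild-type residue
        residues[row.pos - 1] = row.to_aa
        labels.append(f"{from_aa}{row.pos}{row.to_aa}")
    return "+".join(labels), "".join(residues)


# Toy data, made up for illustration only
toy_seq = {"P1": "MKTLLVAG"}
variants = pd.DataFrame({
    "entry": ["P1"] * 5,
    "variant": ["double", "double", "triple", "triple", "triple"],
    "pos": [2, 3, 2, 4, 6],
    "to_aa": ["A", "P", "A", "K", "L"],
})

rows = []
# Rows sharing the same entry and variant form one combined variant
for (entry, _variant_id), df_mut in variants.groupby(["entry", "variant"], sort=False):
    label, seq_mut = combine_mutations(toy_seq[entry], df_mut)
    rows.append({"entry": entry, "variant": label,
                 "n_mut": len(df_mut), "sequence_mut": seq_mut})

df_combined = pd.DataFrame(rows)
print(df_combined)
```

The sketch stops at the mutated sequence; the scoring columns (delta_cpp, shift_score, delta_pred) come from SeqMut.combine itself.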
Examples
Binding a fitted model makes SeqMut model-aware: each scored mutation carries delta_pred, the change of the model prediction score (percentage points) it induces. SeqMut.combine() applies a named set of point mutations to the same sequence and scores each combined variant once, so 2-3 mutations are evaluated together (unlike SeqMut.mutate(), which scores each independently).

    import itertools
    import pandas as pd
    import matplotlib.pyplot as plt
    import aaanalysis as aa
    aa.options["verbose"] = False

    # Data, CPP features, and a fitted TreeModel that scores each sequence
    df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
    labels = df_seq["label"].to_list()
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    split_kws = sf.get_split_kws()
    df_scales = aa.load_scales()
    cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales, verbose=False)
    df_feat = cpp.run(labels=labels, n_filter=25)
    X = sf.feature_matrix(features=list(df_feat["feature"]), df_parts=df_parts, df_scales=df_scales)
    tm = aa.TreeModel().fit(X, labels=labels)

    # Pick one protein and its TMD start position (1-based)
    entry = df_seq["entry"].iloc[0]
    ts = int(df_seq.set_index("entry").loc[entry, "tmd_start"])

    # Bind the model and define one double and one triple variant
    seqmut = aa.SeqMut(model=tm)
    variants = pd.DataFrame({
        "entry": [entry] * 5,
        "variant": ["double", "double", "triple", "triple", "triple"],
        "pos": [ts, ts + 1, ts, ts + 2, ts + 4],
        "to_aa": ["A", "P", "A", "K", "L"],
    })
    df_variant = seqmut.combine(df_seq=df_seq, variants=variants, df_feat=df_feat)
    aa.display_df(df_variant, n_rows=10, show_shape=True)
DataFrame shape: (2, 7)
   entry   variant         n_mut  sequence_mut                       delta_cpp  shift_score  delta_pred
1  Q14802  L37A+V39K+G41L  3      MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0.326320   -0.055680    4.000000
2  Q14802  L37A+Q38P       2      MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0.066000   0.066000     1.700000

The junction-membrane-domain lengths jmd_n_len / jmd_c_len set how many flanking residues are used when rebuilding the sequence parts. The call below passes the default values (10 and 10) explicitly, so its result matches the one above.

    df_variant2 = seqmut.combine(df_seq=df_seq, variants=variants, df_feat=df_feat,
                                 jmd_n_len=10, jmd_c_len=10)
    aa.display_df(df_variant2, n_rows=10, show_shape=True)
DataFrame shape: (2, 7)
   entry   variant         n_mut  sequence_mut                       delta_cpp  shift_score  delta_pred
1  Q14802  L37A+V39K+G41L  3      MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0.326320   -0.055680    4.000000
2  Q14802  L37A+Q38P       2      MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0.066000   0.066000     1.700000
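When more than a handful of combinations are of interest, the variants table can be generated rather than typed by hand. The following is a sketch under stated assumptions: build_variants is a hypothetical helper (not part of aaanalysis), and the variant ids it creates are arbitrary grouping labels. It enumerates every double and triple combination of a list of candidate point mutations with itertools.combinations, using the positions and substitutions from the example output above.

```python
import itertools
import pandas as pd


def build_variants(entry, singles, sizes=(2, 3)):
    """Enumerate combined variants from (pos, to_aa) pairs (hypothetical helper)."""
    rows = []
    for k in sizes:
        for combo in itertools.combinations(singles, k):
            positions = [pos for pos, _ in combo]
            if len(set(positions)) < k:
                continue  # skip combinations that mutate one position twice
            # Arbitrary grouping id; any label unique per combination would do
            variant_id = "+".join(f"{pos}{to_aa}" for pos, to_aa in combo)
            rows.extend(
                {"entry": entry, "variant": variant_id, "pos": pos, "to_aa": to_aa}
                for pos, to_aa in combo
            )
    return pd.DataFrame(rows, columns=["entry", "variant", "pos", "to_aa"])


# Candidate single mutations taken from the example output (1-based positions)
singles = [(37, "A"), (38, "P"), (39, "K"), (41, "L")]
variants_all = build_variants("Q14802", singles)
print(variants_all.head())
print(variants_all["variant"].nunique(), "combined variants")
```

The resulting table has the four documented columns (entry, variant, pos, to_aa), so it should be usable as the variants argument of SeqMut.combine in the same way as the hand-written table.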