SeqMut.suggest

SeqMut.suggest(df_seq, df_feat, n=10, region=None, to_aa=None, weight=None, jmd_n_len=10, jmd_c_len=10)

Suggest the top mutations that move a sequence toward the desired CPP / model outcome.

Without a bound model, mutations are ranked by shift_score = sum of sign(mean_dif) * ΔX over the CPP features (optionally weighted by a df_feat column). The score measures how strongly a mutation moves the features in the direction in which the test class differs from the reference class. With a bound model, the ranking switches to the model prediction shift delta_pred (the ML-guided objective), so the suggested mutations are those predicted to raise the target-class score most. This method is the single-objective design primitive; to combine several mutations into one variant, use SeqMut.combine().
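The ranking rule without a bound model can be sketched in plain Python. This is a minimal illustration of the formula as stated here, not the library's implementation: the helper names (sign, shift_score) are hypothetical, the feature values are made-up toy numbers, and any internal normalization of the weights is not covered in this section.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def shift_score(mean_dif, delta_x, weight=None):
    """Sum sign(mean_dif) * delta_x over features, optionally weighted per feature."""
    if weight is None:
        weight = [1.0] * len(mean_dif)  # all features contribute equally
    return sum(w * sign(m) * d for m, d, w in zip(mean_dif, delta_x, weight))

# Toy values for three CPP features (hypothetical, for illustration only)
mean_dif = [0.12, -0.08, 0.05]   # signed test-vs-reference difference per feature
delta_x = [0.30, -0.10, -0.20]   # feature change caused by one mutation
abs_auc = [0.20, 0.10, 0.05]     # optional per-feature weights

unweighted = shift_score(mean_dif, delta_x)
weighted = shift_score(mean_dif, delta_x, weight=abs_auc)
print(round(unweighted, 6), round(weighted, 6))
```

In this toy case the first two features move in the target direction and the third moves against it, so the third contributes negatively to the sum; weighting shrinks each contribution by its feature weight.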

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame with an entry column of unique protein identifiers, provided in the position-based format (sequence, tmd_start, tmd_stop). See SequenceFeature.get_df_parts() for the full df_seq format specification.
df_feat (pd.DataFrame) – CPP feature set (output of CPP.run()); its signed mean_dif defines the target direction.
n (int, default=10) – Number of top mutations to return.
region (str or list of int, optional) – Restrict the scan (see SeqMut.scan()).
to_aa (list of str, optional) – Substitution alphabet (see SeqMut.scan()).
weight (str, optional) – Name of the df_feat column ('feat_importance' or 'abs_auc') used to weight the shift score. If None, all features contribute equally. Ignored when a model is bound (the ranking then uses delta_pred).
jmd_n_len (int, default=10) – Length of JMD-N in number of amino acids.
jmd_c_len (int, default=10) – Length of JMD-C in number of amino acids.

Returns:

df_suggest – The top-n mutations sorted by descending shift_score, or by descending delta_pred when a model is bound (the table then also carries the model prediction-shift columns).

Return type:

pd.DataFrame, shape (n, 8)

Examples

SeqMut.suggest() returns the top mutations that move a sequence toward the test-class CPP profile, ranked by shift_score (sum of sign(mean_dif) * ΔX over the CPP features).

import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
split_kws = sf.get_split_kws()
cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, verbose=False)
df_feat = cpp.run(labels=labels, n_filter=25)

seqmut = aa.SeqMut()
aa.display_df(seqmut.suggest(df_seq=df_seq, df_feat=df_feat, n=10, region="tmd"), n_rows=10, show_shape=True)

DataFrame shape: (10, 8)

	entry	pos	from_aa	to_aa	mutation	region	delta_cpp	shift_score
1	Q8IUW5	74	G	A	G74A	tmd	3.415670	3.415670
2	P05556	744	G	A	G744A	tmd	3.415670	3.415670
3	Q14802	52	G	A	G52A	tmd	3.415660	3.415660
4	P53801	112	G	A	G112A	tmd	3.415660	3.415660
5	Q8IUW5	74	G	E	G74E	tmd	2.904170	2.904170
6	P05556	744	G	E	G744E	tmd	2.904170	2.904170
7	Q14802	52	G	E	G52E	tmd	2.904160	2.904160
8	P53801	112	G	E	G112E	tmd	2.904160	2.904160
9	Q8IUW5	78	C	A	C78A	tmd	2.859590	2.859590
10	P01135	118	C	A	C118A	tmd	2.859580	2.859580

to_aa restricts the substitution alphabet, weight scales the shift score by a df_feat column ('abs_auc' or 'feat_importance') so more discriminative features count more, and jmd_n_len / jmd_c_len set the JMD lengths of the split geometry.

# Weighted suggestion over a restricted alphabet on the default 10/10 split geometry
df_suggest = seqmut.suggest(df_seq=df_seq, df_feat=df_feat, n=10, region="tmd",
                            to_aa=["A", "L", "V", "P"], weight="abs_auc",
                            jmd_n_len=10, jmd_c_len=10)
aa.display_df(df_suggest, n_rows=10, show_shape=True)

DataFrame shape: (10, 8)

	entry	pos	from_aa	to_aa	mutation	region	delta_cpp	shift_score
1	Q8IUW5	74	G	A	G74A	tmd	3.415670	1.557450
2	P05556	744	G	A	G744A	tmd	3.415670	1.557450
3	P53801	112	G	A	G112A	tmd	3.415660	1.557446
4	Q14802	52	G	A	G52A	tmd	3.415660	1.557446
5	Q8IUW5	78	C	A	C78A	tmd	2.859590	1.300775
6	P01135	118	C	A	C118A	tmd	2.859580	1.300771
7	Q969W9	60	C	A	C60A	tmd	2.859580	1.300771
8	P53801	116	C	A	C116A	tmd	2.859580	1.300771
9	P53801	112	G	L	G112L	tmd	2.801250	1.278101
10	P05556	744	G	L	G744L	tmd	2.801250	1.278101
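Because the top-n table can list several mutations for the same entry, a common follow-up is to keep only the best-ranked mutation per protein. The sketch below does this with standard pandas calls; it rebuilds a small DataFrame from the first five rows of the weighted output above (only three of the eight columns) instead of calling SeqMut, so it is an illustration of post-processing and not part of the SeqMut API.

```python
import pandas as pd

# First five rows of the weighted output table above (subset of columns)
df_suggest = pd.DataFrame({
    "entry": ["Q8IUW5", "P05556", "P53801", "Q14802", "Q8IUW5"],
    "mutation": ["G74A", "G744A", "G112A", "G52A", "C78A"],
    "shift_score": [1.557450, 1.557450, 1.557446, 1.557446, 1.300775],
})

# Sort by score (stable, so tied rows keep their order), then keep the first row per entry
df_best = (
    df_suggest.sort_values("shift_score", ascending=False, kind="stable")
    .drop_duplicates(subset="entry", keep="first")
    .reset_index(drop=True)
)
print(df_best)
```

Here the second Q8IUW5 row (C78A) is dropped because G74A already ranks higher for that entry.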