SeqOpt: Optimizing sequences by directed evolution

Protein engineering vs. de novo protein design are two distinct paradigms. De novo design builds new proteins from the ground up rather than repurposing an existing one, typically via a structure-first pipeline — RFdiffusion (backbone) → ProteinMPNN (sequence) → AlphaFold (validation) (Yang et al., 2026). Protein engineering instead optimizes an existing protein by iterative mutation and selection (directed evolution); the machine-learning-guided flavour learns a fitness model to prioritize which variants to make (Yang et al., 2019; Wittmann et al., 2021).

SeqOpt (pro) is that ML-guided directed evolution as a multi-objective search: it mutates one wild-type and returns the Pareto front of variants (NSGA-II; a model-bound SeqMut is the fitness engine), read with SeqOptPlot. Here we engineer a “super substrate” for gamma-secretase (GSEC): take a non-substrate transmembrane domain and mutate it to maximize the predicted substrate probability with as few mutations as possible.

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import aaanalysis as aa
aa.options["verbose"] = False
aa.plot_settings()

1. Data and a substrate classifier

Load the gamma-secretase dataset and the bundled interpretable CPP feature set, build the feature matrix, and train a simple RandomForest substrate classifier.

df_feat = aa.load_features(name="DOM_GSEC")
df_seq  = aa.load_dataset(name="DOM_GSEC", n=50)
labels  = df_seq["label"].to_list()        # 1 = GSEC substrate
aa.display_df(df_seq, n_rows=5, show_shape=True)
DataFrame shape: (100, 8)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 Q14802 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0 37 59 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS
2 Q86UE4 MAARSWQDELAQQAE...SPKQIKKKKKARRET 0 50 72 LGLEPKRYPG WVILVGTGALGLLLLFLLGYGWA AACAGARKKR
3 Q969W9 MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL 0 41 63 FQSMEITELE FVQIIIIVVVMMVMVVVITCLLS HYKLSARSFI
4 P53801 MAPGVARGPTPYWRL...GLFKEENPYARFENN 0 97 119 RWGVCWVNFE ALIITMSVVGGTLLLGIAICCCC CCRRKRSRKP
5 Q8IUW5 MAPRALPGSAVLAAA...EVPATPVKRERSGTE 0 59 81 NDTGNGHPEY IAYALVPVFFIMGLFGVLICHLL KKKGYRCTTE
sf = aa.SequenceFeature()
X = np.asarray(sf.feature_matrix(features=df_feat["feature"],
                                 df_parts=sf.get_df_parts(df_seq=df_seq),
                                 df_scales=aa.load_scales()), dtype=float)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
print("training accuracy:", round(model.score(X, labels), 3))
training accuracy: 1.0
/Users/stephanbreimann/Programming/1Packages/wt-seqopt-deap/aaanalysis/feature_engineering/_backend/cpp_run.py:163: UserWarning: CPP is using the Python kernel fallback — the compiled Cython extension is not available in this install. Output is bit-exact with the Cython path but ~2x slower. Reinstall via pip install --force-reinstall aaanalysis to fetch a prebuilt wheel.
  warnings.warn(

2. The design task

Pick a non-substrate as the wild-type. Two objectives: maximize the predicted substrate probability shift (delta_pred) and minimize the number of mutations (n_mut).

wt = df_seq[df_seq["label"] == 0].iloc[[0]].reset_index(drop=True)
p_wt = model.predict_proba(X[df_seq["label"].values == 0][:1])[0, 1]
print("wild-type:", wt["entry"].iloc[0], "| P(substrate) =", round(float(p_wt), 3))
objectives = [("substrate", "max", "delta_pred"), ("parsimony", "min", "n_mut")]
wild-type: Q14802 | P(substrate) = 0.11

3. Run the optimizer

mode="importance" guides mutations by the static feature importance (fast, deterministic); run returns the Pareto front df_pareto (objective columns + non-dominated rank + crowding).

seqopt = aa.SeqOpt(mode="importance", model=model, target_class=1, random_state=42)
df_pareto = seqopt.run(df_seq=wt, df_feat=df_feat, objectives=objectives,
                       pop_size=40, n_gen=25, n_mut_max=6, region="tmd")
aa.display_df(df_pareto, n_rows=10, show_shape=True)
DataFrame shape: (7, 8)
  entry variant n_mut sequence_mut substrate parsimony rank crowding
1 Q14802 0 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0.000000 0.000000 0 inf
2 Q14802 C49V+I54C+I55R+S58R 4 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 28.000000 4.000000 0 inf
3 Q14802 S58R 1 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 14.000000 1.000000 0 0.696429
4 Q14802 C49M+S58R 2 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 25.000000 2.000000 0 0.482143
5 Q14802 C49V+I55R+S58R 3 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 27.000000 3.000000 0 0.160714
6 Q14802 Q38W+C49M+S58R 3 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 27.000000 3.000000 0 0.142857
7 Q14802 C49L+I55R+S58R 3 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 27.000000 3.000000 0 0.000000
df_eval = seqopt.eval(df_pareto=df_pareto)
aa.display_df(df_eval, n_rows=10, show_shape=True)
DataFrame shape: (1, 3)
  hypervolume n_front spread
1 66.000000 7 0.577352

4. Read the trade-off front

Each rank=0 variant is a non-dominated trade-off between substrate gain and mutation count.

aa.SeqOptPlot().pareto_front(df_pareto=df_pareto, x="substrate", y="parsimony")
plt.tight_layout(); plt.show()
../_images/tutorial7_protein_engineering_1_output_11_0.png

5. Did it converge?

The convergence panels show the dominated hypervolume rising, the front spread, and each objective’s best/mean band across generations.

aa.SeqOptPlot().convergence(history=seqopt.history_)
plt.tight_layout(); plt.show()
../_images/tutorial7_protein_engineering_2_output_13_0.png

6. Which mutations won, and how were they built up?

The mutation map shows substitution enrichment across the front (position x amino acid); the genealogy shows the mutational lineage from wild-type to the designed variants.

aa.SeqOptPlot().mutation_map(df_pareto=df_pareto)
plt.tight_layout(); plt.show()
../_images/tutorial7_protein_engineering_3_output_15_0.png
aa.SeqOptPlot().genealogy(df_pareto=df_pareto, front_only=False)
plt.tight_layout(); plt.show()
../_images/tutorial7_protein_engineering_4_output_16_0.png

7. More objectives

Add a third objective — keep the feature profile close to natural (delta_cpp) — and read it with a 3-D front and parallel coordinates. Objectives can also be any callable(sequence) -> float (a scikit/torch model, or a sequence-level tool / web API).

objectives3 = [("substrate", "max", "delta_pred"),
               ("stability", "min", "delta_cpp"),
               ("parsimony", "min", "n_mut")]
df3 = seqopt.run(df_seq=wt, df_feat=df_feat, objectives=objectives3,
                 pop_size=40, n_gen=25, n_mut_max=6, region="tmd")
aa.SeqOptPlot().parallel_coordinates(df_pareto=df3,
    objectives=["substrate", "stability", "parsimony"])
plt.tight_layout(); plt.show()
../_images/tutorial7_protein_engineering_5_output_18_0.png

8. SHAP-guided design

The headline mode="impact" refits a ShapModel every generation (fuzzy labeling) to mutate the residues SHAP deems most important — pass the labeled reference set.

df_shap = aa.SeqOpt(mode="impact", model=model, target_class=1, df_seq_ref=df_seq,
                    labels=labels, random_state=0).run(
    df_seq=wt, df_feat=df_feat, objectives=objectives, pop_size=12, n_gen=4,
    n_mut_max=4, region="tmd")
aa.display_df(df_shap, n_rows=10, show_shape=True)
DataFrame shape: (9, 8)
  entry variant n_mut sequence_mut substrate parsimony rank crowding
1 Q14802 0 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0.000000 0.000000 0 inf
2 Q14802 C49E+S58R 2 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 21.000000 2.000000 0 inf
3 Q14802 S58F 1 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 8.000000 1.000000 0 0.559524
4 Q14802 V56M 1 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 8.000000 1.000000 0 0.440476
5 Q14802 C49E 1 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 8.000000 1.000000 0 0.000000
6 Q14802 C49E+S58R+A59W 3 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 20.000000 3.000000 1 inf
7 Q14802 A50G 1 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0.000000 1.000000 1 inf
8 Q14802 A50G+S58R 2 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 14.000000 2.000000 1 1.000000
9 Q14802 C49Y+A50G+S58R+A59W 4 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 20.000000 4.000000 2 inf

Summary

SeqOpt turned a non-substrate into predicted GSEC substrates along a Pareto front of substrate-gain vs. mutation-count, and SeqOptPlot exposed the trade-offs, convergence, the enriched mutations and their lineage. The evolutionary machinery is a pure-Python re-implementation of NSGA-II (DEAP-equivalent); the design signal comes from your model — any predict_proba classifier or callable(sequence) predictor.