Scientific References

AAanalysis Algorithms

[Breimann24a]

Breimann and Frishman (2024a), AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales, Bioinformatics Advances.

[Breimann24b]

Breimann et al. (2024b), AAontology: An ontology of amino acid scales for interpretable machine learning, Journal of Molecular Biology.

[Breimann25]

Breimann and Kamp et al. (2025), Charting γ-secretase substrates by explainable AI, Nature Communications.

[Chen18]

Chen et al. (2018), iFeature: a Python package and web server for features extraction and selection from protein and peptide sequences, Bioinformatics.

Sequence Algorithms

[Li06]

Li W., Godzik A. (2006), Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics.

[Steinegger17]

Steinegger M., Söding J. (2017), MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets, Nature Biotechnology.

[Bailey09]

Bailey T.L., Boden M., Buske F.A., Frith M., Grant C.E., Clementi L., Ren J., Li W.W., Noble W.S. (2009), MEME SUITE: tools for motif discovery and searching, Nucleic Acids Research.

[Grant11]

Grant C.E., Bailey T.L., Noble W.S. (2011), FIMO: scanning for occurrences of a given motif, Bioinformatics.

[Tareen20]

Tareen A., Kinney J.B. (2020), Logomaker: beautiful sequence logos in Python, Bioinformatics.

Structure Algorithms

[Kabsch83]

Kabsch W., Sander C. (1983), Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers.

[Touw15]

Touw W.G., Baakman C., Black J., te Beek T.A.H., Krieger E., Joosten R.P., Vriend G. (2015), A series of PDB-related databanks for everyday needs, Nucleic Acids Research.

[Sanner96]

Sanner M.F., Olson A.J., Spehner J.-C. (1996), Reduced surface: an efficient way to compute molecular surfaces, Biopolymers.

[Jumper21]

Jumper J., Evans R., Pritzel A. et al. (2021), Highly accurate protein structure prediction with AlphaFold, Nature.

[Varadi22]

Varadi M., Anyango S., Deshpande M. et al. (2022), AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research.

[Lau23]

Lau A.M., Kandathil S.M., Jones D.T. (2023), Merizo: a rapid and accurate protein domain segmentation method using invariant point attention, Nature Communications.

[Wells24]

Wells J., Hawkins-Hooker A., Bordin N., Sillitoe I., Paige B., Orengo C.A. (2024), Chainsaw: protein domain segmentation with fully convolutional neural networks, Bioinformatics.

[Verwimp25]

Verwimp S., Lavigne R., Lood C., van Noort V. (2025), AFragmenter: schema-free, tuneable protein domain segmentation for AlphaFold protein structures, Bioinformatics.

Machine Learning

[Hastie09]

Hastie, Tibshirani, and Friedman (2009), The Elements of Statistical Learning, Springer.

[MilliganCooper88]

Milligan and Cooper (1988), A study of standardization of variables in cluster analysis, Journal of Classification.

[Eisen98]

Eisen et al. (1998), Cluster analysis and display of genome-wide expression patterns, PNAS.

Prediction Reliability and Uncertainty

[Huellermeier21]

Hüllermeier and Waegeman (2021), Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods, Machine Learning.

[Guo17]

Guo, Pleiss, Sun, and Weinberger (2017), On calibration of modern neural networks, ICML.

[Sahigara12]

Sahigara, Mansouri, Ballabio, Mauri, Consonni, and Todeschini (2012), Comparison of different approaches to define the applicability domain of QSAR models, Molecules.

[Angelopoulos23]

Angelopoulos and Bates (2023), A gentle introduction to conformal prediction and distribution-free uncertainty quantification, Foundations and Trends in Machine Learning.

Positive-Unlabeled Learning

[ElkanNoto08]

Elkan and Noto (2008), Learning classifiers from only positive and unlabeled data, KDD.

[BekkerDavis20]

Bekker and Davis (2020), Learning from positive and unlabeled data: a survey, Machine Learning.

Explainable AI

[Lundberg20]

Lundberg et al. (2020), From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence.

Protein Design and Engineering

[Deb02]

Deb et al. (2002), A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Transactions on Evolutionary Computation.

[Yang19]

Yang, Wu, and Arnold (2019), Machine-learning-guided directed evolution for protein engineering, Nature Methods.

[Wittmann21]

Wittmann, Johnston, Wu, and Arnold (2021), Advances in machine learning for directed evolution, Current Opinion in Structural Biology.

[Dauparas22]

Dauparas et al. (2022), Robust deep learning-based protein sequence design using ProteinMPNN, Science.

[Watson23]

Watson et al. (2023), De novo design of protein structure and function with RFdiffusion, Nature.

[Yang26]

Yang et al. (2026), The past, present and future of de novo protein design, Nature.

Datasets and Benchmarks

[Cheng06]

Cheng et al. (2006), Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching, Proteins: Structure, Function, and Bioinformatics.

[Kawashima08]

Kawashima et al. (2008), AAindex: Amino acid index database, progress report 2008, Nucleic Acids Research.

[Magnan09]

Magnan, Randall, and Baldi (2009), SOLpro: Accurate sequence-based prediction of protein solubility, Bioinformatics.

[Galiez16]

Galiez et al. (2016), VIRALpro: A tool to identify viral capsid and tail sequences, Bioinformatics.

[Song18]

Song et al. (2018), PROSPERous: High-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy, Bioinformatics.

[Shen19]

Shen et al. (2019), Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC, Journal of Theoretical Biology.

[Tang20]

Tang et al. (2020), IDP-Seq2Seq: Identification of intrinsically disordered regions based on sequence to sequence learning, Bioinformatics.

[Teng21]

Teng et al. (2021), ReRF-Pred: Predicting amyloidogenic regions of proteins based on pseudo amino acid composition and tripeptide composition, BMC Bioinformatics.

[Yang21]

Yang et al. (2021), Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information, Neural Computing and Applications.

Sampling Strategies

[Boyd10Cascleave]

Boyd et al. (2010), Cascleave: towards more accurate prediction of caspase substrate cleavage sites, Bioinformatics.

[Song12]

Song et al. (2012), PROSPER: an integrated feature-based tool for predicting protease substrate cleavage sites, PLoS ONE.

[Fu14ScreenCap3]

Fu et al. (2014), ScreenCap3: improving prediction of caspase-3 cleavage sites using experimentally verified non-cleavage sites, Proteomics.

[Rawlings16]

Rawlings et al. (2016), Peptidase specificity from the substrate cleavage collection in the MEROPS database and a tool to measure cleavage site conservation, Biochimie.

[Li20Procleave]

Li et al. (2020), Procleave: predicting protease-specific substrate cleavage sites by combining sequence and structural information, Genomics, Proteomics & Bioinformatics.

[LiuDeber99]

Liu L.-P., Deber C.M. (1999), Combining hydrophobicity and helicity: a novel approach to membrane protein structure prediction, Bioorganic & Medicinal Chemistry.

Use Cases

[Breimann25] showcases charting γ-secretase substrates by explainable AI.