References
AAanalysis Algorithms
Breimann and Frishman (2024a), AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales, Bioinformatics Advances.
Breimann et al. (2024b), AAontology: An ontology of amino acid scales for interpretable machine learning, Journal of Molecular Biology.
Breimann and Kamp et al. (2025), Charting γ-secretase substrates by explainable AI, Nature Communications.
Sequence Algorithms
Li W., Godzik A. (2006), Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics.
Steinegger M., Söding J. (2017), MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets, Nature Biotechnology.
Bailey T.L., Boden M., Buske F.A., Frith M., Grant C.E., Clementi L., Ren J., Li W.W., Noble W.S. (2009), MEME SUITE: tools for motif discovery and searching, Nucleic Acids Research.
Grant C.E., Bailey T.L., Noble W.S. (2011), FIMO: scanning for occurrences of a given motif, Bioinformatics.
Tareen A., Kinney J.B. (2020), Logomaker: beautiful sequence logos in Python, Bioinformatics.
Structure Algorithms
Kabsch W., Sander C. (1983), Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers.
Touw W.G., Baakman C., Black J., te Beek T.A.H., Krieger E., Joosten R.P., Vriend G. (2015), A series of PDB-related databanks for everyday needs, Nucleic Acids Research.
Sanner M.F., Olson A.J., Spehner J.-C. (1996), Reduced surface: an efficient way to compute molecular surfaces, Biopolymers.
Jumper J., Evans R., Pritzel A. et al. (2021), Highly accurate protein structure prediction with AlphaFold, Nature.
Varadi M., Anyango S., Deshpande M. et al. (2022), AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research.
Lau A.M., Kandathil S.M., Jones D.T. (2023), Merizo: a rapid and accurate protein domain segmentation method using invariant point attention, Nature Communications.
Wells J., Hawkins-Hooker A., Bordin N., Sillitoe I., Paige B., Orengo C.A. (2024), Chainsaw: protein domain segmentation with fully convolutional neural networks, Bioinformatics.
Verwimp S., Lavigne R., Lood C., van Noort V. (2025), AFragmenter: schema-free, tuneable protein domain segmentation for AlphaFold protein structures, Bioinformatics.
Machine Learning
Hastie, Tibshirani, and Friedman (2009), The Elements of Statistical Learning, Springer.
Milligan and Cooper (1988), A study of standardization of variables in cluster analysis, Journal of Classification.
Eisen et al. (1998), Cluster analysis and display of genome-wide expression patterns, PNAS.
Positive-Unlabeled Learning
Elkan and Noto (2008), Learning classifiers from only positive and unlabeled data, KDD.
Bekker and Davis (2020), Learning from positive and unlabeled data: a survey, Machine Learning.
Explainable AI
Lundberg et al. (2020), From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence.
Datasets and Benchmarks
Cheng et al. (2006), Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching, Proteins: Structure, Function, and Bioinformatics.
Kawashima et al. (2008), AAindex: Amino acid index database, progress report 2008, Nucleic Acids Research.
Magnan, Randall, and Baldi (2009), SOLpro: Accurate sequence-based prediction of protein solubility, Bioinformatics.
Galiez et al. (2016), VIRALpro: A tool to identify viral capsid and tail sequences, Bioinformatics.
Song et al. (2018), PROSPERous: High-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy, Bioinformatics.
Shen et al. (2019), Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC, Journal of Theoretical Biology.
Tang et al. (2020), IDP-Seq2Seq: Identification of intrinsically disordered regions based on sequence to sequence learning, Bioinformatics.
Teng et al. (2021), ReRF-Pred: Predicting amyloidogenic regions of proteins based on pseudo amino acid composition and tripeptide composition, BMC Bioinformatics.
Yang et al. (2021), Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information, Neural Computing and Applications.
Sampling Strategies
Boyd et al. (2010), Cascleave: towards more accurate prediction of caspase substrate cleavage sites, Bioinformatics.
Song et al. (2012), PROSPER: an integrated feature-based tool for predicting protease substrate cleavage sites, PLoS ONE.
Fu et al. (2014), ScreenCap3: improving prediction of caspase-3 cleavage sites using experimentally verified non-cleavage sites, Proteomics.
Rawlings et al. (2016), Peptidase specificity from the substrate cleavage collection in the MEROPS database and a tool to measure cleavage site conservation, Biochimie.
Li et al. (2020), Procleave: predicting protease-specific substrate cleavage sites by combining sequence and structural information, Genomics, Proteomics & Bioinformatics.
Liu L.-P., Deber C.M. (1999), Combining hydrophobicity and helicity: a novel approach to membrane protein structure prediction, Bioorganic & Medicinal Chemistry.