References

AAanalysis Algorithms

[Breimann24a]

Breimann and Frishman (2024a), AAclust: k-optimized clustering for selecting redundancy-reduced sets of amino acid scales, Bioinformatics Advances.

[Breimann24b]

Breimann et al. (2024b), AAontology: An ontology of amino acid scales for interpretable machine learning, Journal of Molecular Biology.

[Breimann25a]

Breimann and Kamp et al. (2025), Charting γ-secretase substrates by explainable AI, Nature Communications.

Sequence Algorithms

[Li06]

Li W., Godzik A. (2006), Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Bioinformatics.

[Steinegger17]

Steinegger M., Söding J. (2017), MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets, Nature Biotechnology.

Machine Learning

[Hastie09]

Hastie, Tibshirani, and Friedman (2009), The Elements of Statistical Learning, Springer.

Positive-Unlabeled Learning

[BekkerDavis20]

Bekker and Davis (2020), Learning from positive and unlabeled data: a survey, Machine Learning.

Explainable AI

[Lundberg20]

Lundberg et al. (2020), From local explanations to global understanding with explainable AI for trees, Nature Machine Intelligence.

Datasets and Benchmarks

[Cheng06]

Cheng et al. (2006), Large-scale prediction of disulphide bridges using kernel methods, two-dimensional recursive neural networks, and weighted graph matching, Proteins: Structure, Function, and Bioinformatics.

[Kawashima08]

Kawashima et al. (2008), AAindex: Amino acid index database, progress report 2008, Nucleic Acids Research.

[Magnan09]

Magnan, Randall, and Baldi (2009), SOLpro: Accurate sequence-based prediction of protein solubility, Bioinformatics.

[Galiez16]

Galiez et al. (2016), VIRALpro: A tool to identify viral capsid and tail sequences, Bioinformatics.

[Song18]

Song et al. (2018), PROSPERous: High-throughput prediction of substrate cleavage sites for 90 proteases with improved accuracy, Bioinformatics.

[Shen19]

Shen et al. (2019), Identification of protein subcellular localization via integrating evolutionary and physicochemical information into Chou’s general PseAAC, Journal of Theoretical Biology.

[Tang20]

Tang et al. (2020), IDP-Seq2Seq: Identification of intrinsically disordered regions based on sequence to sequence learning, Bioinformatics.

[Teng21]

Teng et al. (2021), ReRF-Pred: Predicting amyloidogenic regions of proteins based on pseudo amino acid composition and tripeptide composition, BMC Bioinformatics.

[Yang21]

Yang et al. (2021), Granular multiple kernel learning for identifying RNA-binding protein residues via integrating sequence and structure information, Neural Computing and Applications.