Python packages only. AAanalysis (cheat-sheet colors, center) sits between bioinformatics I/O and the general ML, XAI, causal, and design stack (category colors, outside). Package marks use each project's brand colors; the center carries the real AAanalysis logo.
The honest question for any new tool is whether it reinvents the wheel. AAanalysis does not reinvent Biopython (bioinformatics I/O), scikit-learn (general ML), SHAP (model explanation), or PyTorch / ESM (deep models and embeddings). It sits between them, consuming their outputs and feeding the next layer. The one place it genuinely overlaps is with classical protein-descriptor libraries such as iFeature, iFeatureOmega, propy3, and PyBioMed.
But it is a different axle on the same wheel. Descriptor libraries map a sequence to a large catalogue of generic descriptors and stop. AAanalysis instead frames a biological task (residue, domain, or protein level), constructs a test-versus-reference comparison, and discovers Part × Split × Scale features through Comparative Physicochemical Profiling (CPP). Each feature states where on the sequence a signal sits (Part), how that region is read (Split), and which physicochemical property it captures (Scale). AAanalysis then ranks these features contrastively and explains them at single-residue resolution.
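To make the Part × Split × Scale idea concrete, here is a minimal toy sketch in plain Python. It is not the AAanalysis API: the function names (`get_part`, `get_segment`, `feature_value`, `contrast`), the part boundaries, and the example sequences are all hypothetical, and the Kyte-Doolittle hydropathy values merely stand in for one physicochemical scale.

```python
from statistics import mean

# Kyte-Doolittle hydropathy values, used here as one example "scale".
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def get_part(seq, start, stop):
    """Part: the region of interest (0-based start, exclusive stop)."""
    return seq[start:stop]


def get_segment(part, i, n):
    """Split: cut the part into n roughly equal segments, return the i-th (1-based)."""
    size = len(part) / n
    return part[round((i - 1) * size):round(i * size)]


def feature_value(seq, part_bounds, segment, scale):
    """One Part x Split x Scale value: mean scale value over the chosen segment."""
    seg = get_segment(get_part(seq, *part_bounds), *segment)
    return mean(scale[aa] for aa in seg)


def contrast(test_seqs, ref_seqs, part_bounds, segment, scale):
    """Test-versus-reference contrast: difference of group means for one feature."""
    test = mean(feature_value(s, part_bounds, segment, scale) for s in test_seqs)
    ref = mean(feature_value(s, part_bounds, segment, scale) for s in ref_seqs)
    return test - ref


# Hypothetical toy data: the test group is hydrophobic in residues 5-7.
test_seqs = ["AAAAILLLVV", "GGGGLLIIVV"]
ref_seqs = ["AAAAKKDDEE", "GGGGRRDDKK"]
delta = contrast(test_seqs, ref_seqs, part_bounds=(4, 10), segment=(1, 2), scale=HYDROPATHY)
```

A large positive `delta` says that the first half of the chosen part is more hydrophobic in the test group than in the reference group, which is the kind of statement a single CPP-style feature is meant to carry.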
So the claim is deliberately narrow and defensible. AAanalysis is not broader than iFeature or PyBioMed; it is more interpretable, task-aware, position-resolved, and small-data-aware (PU learning, scarce negatives), with built-in biological plots and SHAP-based attribution. It is small but sharp — it cannot win on breadth against the ecosystem giants, only on the interpretability axis that matters for protein prediction.
Two relationships keep the map honest. A complement sits upstream or downstream and exchanges data; a comparison occupies the same functional role and is a benchmark candidate. Most of the ecosystem is complementary — only descriptor libraries are genuine comparisons. Downstream, AAanalysis is positioned to feed ML and optimization and to make XAI-evaluation (Quantus / OpenXAI), causal (DoWhy / EconML), uncertainty (MAPIE) and design (RFdiffusion / ProteinMPNN / PyRosetta) layers biologically readable; these are shown as candidate / future integrations, not current core. See the project-scale figure below for why "small but sharp" is the right framing.
FASTA/PDB I/O and UniProt/NCBI access. Consumed as df_seq; AAanalysis builds thin adapters, not a parser.
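A thin adapter of this kind can be very small. The sketch below parses FASTA text into records that could be handed to `pandas.DataFrame` to form a sequence table. The function name `fasta_to_records` is hypothetical, and the column names `entry` and `sequence` are an assumption about the `df_seq` layout, so check them against the AAanalysis documentation. For real files, a dedicated parser such as Biopython's is the better choice.

```python
def fasta_to_records(text):
    """Parse FASTA text into a list of {'entry', 'sequence'} dicts."""
    records = []
    entry, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if entry is not None:
                records.append({"entry": entry, "sequence": "".join(chunks)})
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            entry = tokens[0] if tokens else ""
            chunks = []
        elif entry is not None:
            chunks.append(line)
    if entry is not None:
        records.append({"entry": entry, "sequence": "".join(chunks)})
    return records


records = fasta_to_records(">P1 example protein\nMKT\nAAL\n>P2\nGGS\n")
```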
Embeddings and structure become position-aware pseudo-scales via run_num. Named physicochemical scales stay the most directly interpretable layer.
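One way to picture a pseudo-scale: treat every dimension of a per-residue numeric matrix (for example an embedding) as if it were a scale, and average it over a sequence part. The sketch below illustrates only that idea. It is not how `run_num` is implemented, and `part_profile` and the toy matrix are hypothetical.

```python
from statistics import mean


def part_profile(per_residue, start, stop):
    """Average each numeric dimension over residues start..stop (0-based, stop exclusive)."""
    window = per_residue[start:stop]
    n_dims = len(window[0])
    return [mean(row[d] for row in window) for d in range(n_dims)]


# Hypothetical 4-residue sequence with a 2-dimensional per-residue representation.
per_residue = [[0.0, 1.0], [2.0, 1.0], [4.0, 3.0], [6.0, 5.0]]
profile = part_profile(per_residue, 2, 4)  # last two residues -> [5.0, 4.0]
```

The resulting values are position-aware, but a dimension such as "embedding dim 0" has no physicochemical name, which is why named scales remain easier to interpret.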
The only genuine competitors. They enumerate descriptor vectors; AAanalysis discovers task-aware, position-aware, explanation-ready features. The honest axis to benchmark on.
Task framing, test-vs-reference construction, Part×Split×Scale feature discovery, PU learning for missing negatives, biological explanation and ΔCPP design steering.
Train on the AAanalysis feature matrix. Target: sklearn-compatible transformers so CPP features drop into standard pipelines.
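As a rough illustration of what sklearn compatibility asks for, the sketch below follows the scikit-learn `fit` / `transform` convention without importing scikit-learn. The class `MeanScaleTransformer` is hypothetical and is not part of AAanalysis. In a real package one would normally inherit from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin` instead of hand-writing `get_params`, `set_params` and `fit_transform`.

```python
class MeanScaleTransformer:
    """Toy transformer: maps each sequence to its mean value on one scale."""

    def __init__(self, scale=None, default=0.0):
        # Following the scikit-learn convention, __init__ only stores parameters.
        self.scale = scale
        self.default = default

    def get_params(self, deep=True):
        return {"scale": self.scale, "default": self.default}

    def set_params(self, **params):
        for key, value in params.items():
            setattr(self, key, value)
        return self

    def fit(self, X, y=None):
        # Stateless in this toy case: nothing is learned from the data.
        return self

    def transform(self, X):
        scale = self.scale or {}
        return [
            [sum(scale.get(aa, self.default) for aa in seq) / max(len(seq), 1)]
            for seq in X
        ]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


charge = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}
features = MeanScaleTransformer(scale=charge).fit_transform(["KKAA", "DEAA"])
```

Because the object exposes `fit` and `transform` and returns one row per input sequence, it has the shape a pipeline step is expected to have.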
Generic attribution becomes biologically readable on CPP features: "JMD-C charge pattern raised substrate prediction" rather than "feature_17 +0.21".
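The readability gain comes from the feature identifier carrying its own meaning. The sketch below assumes identifiers of the form `PART-SPLIT-SCALE`, which is an assumption about the naming scheme, and uses a hypothetical scale ID and helper name (`describe_attribution`) to turn an attribution value into a sentence.

```python
def describe_attribution(feature_id, impact, scale_names):
    """Render one attribution as a readable sentence.

    feature_id is assumed to look like 'PART-SPLIT-SCALE'.
    """
    part, split, scale = feature_id.split("-", 2)
    name = scale_names.get(scale, scale)  # fall back to the raw scale ID
    direction = "raised" if impact > 0 else "lowered"
    return f"{part} {split} {name} {direction} the prediction by {abs(impact):.2f}"


# Hypothetical scale ID mapped to a human-readable property name.
scale_names = {"SCALE_CHARGE": "net charge"}
sentence = describe_attribution("JMD_C-Pattern(C,1,4)-SCALE_CHARGE", 0.21, scale_names)
# sentence == "JMD_C Pattern(C,1,4) net charge raised the prediction by 0.21"
```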
Lets AAanalysis show its CPP and ShapModel attributions are faithful and stable, not just plausible. The difference between a convincing plot and a trustworthy explanation.
CPP finds what distinguishes two groups; causal tools test what drives the outcome. Turns correlated discriminative features into refutable causal hypotheses.
Tune CPP/model settings and run multi-objective design. AAanalysis exposes objective functions and explains why a candidate improved (ΔCPP), rather than replacing the optimizers.
MS data access and peptide mass/charge/pI. Relevant to the flyability use case: predict and explain sequence determinants downstream of MS workflows.
AAanalysis owns the protein-specific protocols: homology-aware splits, same-protein leakage checks, shuffled-label controls, feature stability, per-protein AP.
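A homology-aware split can be sketched as a group split: whole sequence clusters go to either train or test, so related proteins never sit on both sides. The helper `group_split` and the cluster assignments below are hypothetical, and the clustering itself (for example by sequence identity) is assumed to have been done elsewhere. scikit-learn offers the same idea through `GroupKFold` and `GroupShuffleSplit`.

```python
import random


def group_split(entries, cluster_of, test_fraction=0.25, seed=0):
    """Assign whole clusters to train or test so clusters never straddle the split."""
    clusters = sorted(set(cluster_of[e] for e in entries))
    rng = random.Random(seed)
    rng.shuffle(clusters)
    n_test = max(1, round(len(clusters) * test_fraction))
    test_clusters = set(clusters[:n_test])
    train = [e for e in entries if cluster_of[e] not in test_clusters]
    test = [e for e in entries if cluster_of[e] in test_clusters]
    return train, test


# Hypothetical proteins and precomputed homology clusters.
entries = ["P1", "P2", "P3", "P4", "P5", "P6"]
cluster_of = {"P1": "c1", "P2": "c1", "P3": "c2", "P4": "c3", "P5": "c3", "P6": "c4"}
train, test = group_split(entries, cluster_of)
```

Splitting at the cluster level is what prevents same-protein or near-duplicate leakage from inflating test scores.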
| # | Category | Relationship | Python packages | What AAanalysis adds / why it complements |
|---|---|---|---|---|
| 1 | Biological data & I/O | upstream | Biopython, Biotite, bioservices, gget | File/database/structure plumbing; records consumed as df_seq. |
| 2 | Protein representations | upstream | fair-esm, transformers (ProtT5), bio-embeddings, Bio.PDB | Makes embeddings/structure position-aware via run_num; flags lower interpretability of raw dims. |
| 3 | Protein feature descriptors | comparison | iFeature, propy3, PyBioMed | The benchmark axis. Task-aware, position-aware, explanation-ready CPP features vs broad enumeration. |
| 4 | AAanalysis | core | CPP, SequenceFeature, AAclust, dPULearn, TreeModel, ShapModel, CPPPlot | The interpretable middle layer: framing, reference construction, feature discovery, PU learning, explanation, design. |
| 5 | ML / DL models | downstream | scikit-learn, XGBoost, LightGBM, PyTorch | Train on the feature matrix; target is sklearn-pipeline compatibility. |
| 6 | Explainability (XAI) | downstream | SHAP, Captum, LIME, DiCE | Makes attribution biologically meaningful; separates group-level importance from per-sample impact. |
| 8 | XAI evaluation | downstream | Quantus, OpenXAI | Scores faithfulness/robustness of CPP & ShapModel explanations — trustworthy, not just plausible. |
| 9 | Causal inference | downstream | DoWhy, EconML (PyWhy) | Turns correlated discriminative features into refutable causal hypotheses about drivers. |
| 7 | Optimization & design | downstream | Optuna, pymoo, DEAP | Exposes objective functions; design as interpretable scoring, filtering, and ΔCPP steering. |
| – | Proteomics / MS | side branch | pyteomics, pyopenms | Optional peptide/flyability workflows downstream of MS data access. |
| – | Model validation | cross-cutting | sklearn.metrics, MLflow, MAPIE | AAanalysis owns protein-specific protocols; uses generic tooling for metrics and tracking. |
Note on icons: the AAanalysis logo is the real brand mark, extracted from the v1.1 cheat sheet and embedded as an image. Other package marks are recognizable brand-colored emblems recreated in SVG (Biopython, scikit-learn, PyTorch, Hugging Face, SHAP, Meta, Microsoft) or brand-colored tiles for niche scientific packages. The package set is curated to the most central per category; less-central ones (scikit-bio, pysam, pydssp, CatBoost, JAX, Alibi, InterpretML, Ray Tune, Nevergrad, peptides.py) were trimmed for clarity.
Approximate GitHub stars and order-of-magnitude source LOC, grouped by ecosystem role (downstream split into ML/DL · XAI · optimization · causal). Numbers are indicative (≈2025) — run cloc and check GitHub for exact values before publishing. Takeaway: AAanalysis is small but sharp, with descriptor libraries as its only direct comparators.
Bottom line. AAanalysis's innovation is not that it computes protein features; iFeature, propy3, PyBioMed, AAindex tools and PLM-embedding workflows already do that. Its innovation is being a task-aware, contrastive, biologically interpretable protein-feature discovery layer. Its biggest gain is not "more descriptors" but turning small, messy protein datasets into residue-/region-aware, test-versus-reference explanations that biologists can read and ML pipelines can use.
Where it is strongest: small-data protein prediction (~20–500 positives); missing-negative settings (dPULearn turns unlabeled backgrounds into usable references without pretending they are true negatives); the test-vs-reference contrast that mirrors how biologists think (substrates vs non-substrates, detected vs undetected peptides, PTM sites vs non-sites); biological readability of XAI (not “feature_137 is important” but “JMD-C × pattern × charge raises substrate prediction”); and emerging ΔCPP design steering.
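The missing-negative setting can be illustrated with a deliberately simplified heuristic: rank unlabeled samples by their distance from the positives' centroid in feature space and keep the farthest ones as reliable negatives. This is not the dPULearn algorithm, only a sketch of the general positive-unlabeled idea, and `pick_reliable_negatives` and the toy vectors are hypothetical.

```python
import math


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pick_reliable_negatives(positives, unlabeled, n_neg):
    """Return indices of the n_neg unlabeled samples farthest from the positive centroid."""
    center = centroid(positives)
    ranked = sorted(
        range(len(unlabeled)),
        key=lambda i: math.dist(unlabeled[i], center),
        reverse=True,
    )
    return sorted(ranked[:n_neg])


# Hypothetical 2-dimensional feature vectors.
positives = [[1.0, 1.0], [1.2, 0.8]]
unlabeled = [[1.1, 0.9], [5.0, 5.0], [0.9, 1.0], [-4.0, -3.0]]
negatives = pick_reliable_negatives(positives, unlabeled, n_neg=2)  # -> [1, 3]
```

Unlabeled samples close to the positives (indices 0 and 2 here) stay unlabeled rather than being treated as true negatives.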
Where it should not compete: descriptor breadth (iFeature / PyBioMed), generic ML (scikit-learn / XGBoost / LightGBM), deep learning (PyTorch / ESM / ProtT5), general XAI theory (SHAP / Captum / Alibi / Quantus), and MS processing (pyOpenMS / pyteomics). It integrates with these rather than replacing them — small but sharp.
“Why not just wrap iFeature / PyBioMed and apply CPP?” This is a good idea, but as an optional adapter layer, not the core identity. AAanalysis should consume external descriptors and make them task-aware, contrastive, ranked, validated and explainable, rather than re-implement thousands of descriptors. Key nuance: classical descriptors are usually global (one vector per sequence), so to keep AAanalysis’s edge they should be computed per part (Part × Descriptor) or on sliding windows (position × descriptor), with strong filtering to avoid feature explosion. A fitting name is Contrastive Descriptor Profiling. This turns competitors into benchmarks and optional inputs while preserving the unique identity. Best treated as a v1.2 candidate, not core v1.1.
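The sliding-window variant (position × descriptor) could look roughly like the sketch below. It evaluates a global descriptor on every window and then contrasts test and reference profiles position by position. The helper names are hypothetical, net charge stands in for an external descriptor, and the sequences are assumed to be equal-length and aligned.

```python
from statistics import mean

CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def net_charge(fragment):
    """A simple global descriptor: summed residue charge (uncharged residues count 0)."""
    return sum(CHARGE.get(aa, 0.0) for aa in fragment)


def window_profile(seq, size, descriptor):
    """Evaluate the descriptor on every window of the given size along the sequence."""
    return [descriptor(seq[i:i + size]) for i in range(len(seq) - size + 1)]


def contrast_profile(test_seqs, ref_seqs, size, descriptor):
    """Per-position difference of mean descriptor values (test minus reference)."""
    test = [window_profile(s, size, descriptor) for s in test_seqs]
    ref = [window_profile(s, size, descriptor) for s in ref_seqs]
    return [mean(t) - mean(r) for t, r in zip(zip(*test), zip(*ref))]


profile = window_profile("KKAADD", 2, net_charge)  # [2.0, 1.0, 0.0, -1.0, -2.0]
diff = contrast_profile(["KKAADD"], ["AAAAKK"], 2, net_charge)
```

With many descriptors and window sizes the number of position × descriptor features grows quickly, which is why the filtering mentioned above matters.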
Strongest next steps: a formal “AAanalysis vs descriptor libraries” benchmark (AAC, k-mers, AAindex, iFeature, propy3, PyBioMed, ESM+classifier vs CPP / CPP+dPULearn / CPP+PLM pseudo-scales) on shared datasets and one protocol; sklearn-compatible transformers (CPPTransformer, NumericalCPPTransformer); a protein-specific validation suite (homology-aware splits, shuffled-label controls, feature stability, per-protein AP, PU sanity checks); ecosystem adapter recipes; and keeping design (AAMut / SeqMut) labelled emerging until experimentally validated.
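Of the validation checks listed, the shuffled-label control is the easiest to sketch: recompute a statistic under randomly permuted labels and see how often chance matches the observed value. The statistic here is a plain difference of group means, and `group_gap` and `shuffled_label_control` are hypothetical names. In a real suite the statistic would be a cross-validated model score.

```python
import random
from statistics import mean


def group_gap(values, labels):
    """Difference of means between label-1 and label-0 samples."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    return mean(pos) - mean(neg)


def shuffled_label_control(values, labels, n_perm=200, seed=0):
    """Return the observed gap and a one-sided permutation p-value."""
    rng = random.Random(seed)
    observed = group_gap(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # label counts are preserved, only the assignment changes
        if group_gap(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical feature values for five positives and five negatives.
values = [10.0, 11.0, 12.0, 13.0, 14.0, 0.0, 1.0, 2.0, 3.0, 4.0]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
observed, p_value = shuffled_label_control(values, labels)
```

If shuffled labels reach the observed value often, the apparent signal is likely an artifact of the data or the protocol rather than of biology.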