The AAanalysis Ecosystem

Python packages only. AAanalysis (cheat-sheet colors, center) sits between bioinformatics I/O and the general ML, XAI, causal, and design stack (category colors, outside). Package marks use each project's brand colors; the center carries the real AAanalysis logo.

Positioning brief

The honest question for any new tool is whether it reinvents the wheel. AAanalysis does not reinvent Biopython (bioinformatics I/O), scikit-learn (general ML), SHAP (model explanation), or PyTorch / ESM (deep models and embeddings) — it sits between them, consuming their outputs and feeding the next layer. The one place it genuinely overlaps is classical protein-descriptor libraries such as iFeature, iFeatureOmega, propy3, and PyBioMed.

But it is a different axle on the same wheel. Descriptor libraries map a sequence to a large catalogue of generic descriptors and stop. AAanalysis instead frames a biological task (residue, domain, or protein level), constructs a test-versus-reference comparison, and discovers Part × Split × Scale features through Comparative Physicochemical Profiling (CPP): each feature states where on the sequence a signal sits, how that region is read, and which physicochemical property it captures — then ranks them contrastively and explains them at single-residue resolution.
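To make the Part × Split × Scale idea concrete, here is a minimal toy sketch in plain Python. It is not the AAanalysis implementation: the `segment` and `feature_value` helpers, the `CHARGE` scale, and the example sequences are illustrative assumptions. It only shows how one feature combines a sequence part, a way of reading it (a segment), and a physicochemical scale, and how test and reference groups are then contrasted.

```python
from statistics import mean

# Hypothetical toy scale: charge per residue (illustrative, not a named published scale).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def segment(part: str, i: int, n: int) -> str:
    """Return the i-th of n roughly equal segments of a sequence part (1-based)."""
    if not 1 <= i <= n:
        raise ValueError("segment index out of range")
    start = (i - 1) * len(part) // n
    stop = i * len(part) // n
    return part[start:stop]


def feature_value(part: str, i: int, n: int, scale: dict) -> float:
    """Average a scale over one segment: one Part x Split x Scale feature value."""
    seg = segment(part, i, n)
    return mean(scale.get(aa, 0.0) for aa in seg)


# Assumed example parts (the same region cut from each protein) for two groups.
test_parts = ["KKRLLAEDLL", "RKKLLSDELL"]
ref_parts = ["LLSLLAGGLL", "AKLLLSGDLL"]

# Feature: part = the given region, split = first of two segments, scale = charge.
test_vals = [feature_value(p, 1, 2, CHARGE) for p in test_parts]
ref_vals = [feature_value(p, 1, 2, CHARGE) for p in ref_parts]

# Contrastive score: difference of group means (test minus reference).
mean_dif = mean(test_vals) - mean(ref_vals)
print(f"test={mean(test_vals):.2f} ref={mean(ref_vals):.2f} mean_dif={mean_dif:.2f}")
```

Ranking many such features by their test-versus-reference difference is the contrastive step; the real method adds further split types, many scales, and filtering that this sketch leaves out.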

So the claim is deliberately narrow and defensible. AAanalysis is not broader than iFeature or PyBioMed; it is more interpretable, task-aware, position-resolved, and small-data-aware (PU learning, scarce negatives), with built-in biological plots and SHAP-based attribution. It is small but sharp — it cannot win on breadth against the ecosystem giants, only on the interpretability axis that matters for protein prediction.

Two relationships keep the map honest. A complement sits upstream or downstream and exchanges data; a comparison occupies the same functional role and is a benchmark candidate. Most of the ecosystem is complementary — only descriptor libraries are genuine comparisons. Downstream, AAanalysis is positioned to feed ML and optimization and to make XAI-evaluation (Quantus / OpenXAI), causal (DoWhy / EconML), uncertainty (MAPIE) and design (RFdiffusion / ProteinMPNN / PyRosetta) layers biologically readable; these are shown as candidate / future integrations, not current core. See the project-scale figure below for why "small but sharp" is the right framing.

Upstream, single-cell / spatial omics (Scanpy / scverse / AnnData) and proteomics (pyteomics / pyOpenMS) outputs are consumed through thin optional adapters: a marker / differentially-expressed gene or protein set enters as df_seq, and AAanalysis adds the protein-sequence interpretation layer that those workflows stop short of — without reimplementing single-cell or MS analysis. The end-to-end expression→sequence-signature enrichment orchestration (resolve gene→protein, run CPP, narrate the signature) is a downstream application that lives in ProtXplain, not in core.
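A thin adapter of this kind could look roughly like the sketch below. Everything here is an assumption for illustration: the `upstream_hits` structure, the `to_seq_records` helper, and the `entry` / `sequence` / `label` column names are hypothetical placeholders, not a documented AAanalysis interface. The sketch assumes sequences have already been resolved upstream and only shapes the records.

```python
# Hypothetical upstream output: proteins with sequences and a marker flag.
upstream_hits = [
    {"protein_id": "P00001", "sequence": "MKTLLLAVAVGG", "is_marker": True},
    {"protein_id": "P00002", "sequence": "MSDEEKLLRRAA", "is_marker": False},
    {"protein_id": "P00003", "sequence": "", "is_marker": True},  # sequence missing
]


def to_seq_records(hits):
    """Shape upstream hits into df_seq-like rows; report entries without a sequence."""
    rows, skipped = [], []
    for hit in hits:
        if not hit["sequence"]:
            skipped.append(hit["protein_id"])  # track instead of silently dropping
            continue
        rows.append({
            "entry": hit["protein_id"],
            "sequence": hit["sequence"],
            "label": int(hit["is_marker"]),  # 1 = test group, 0 = reference group
        })
    return rows, skipped


rows, skipped = to_seq_records(upstream_hits)
print(rows)
print("skipped:", skipped)
```

With pandas installed, `pandas.DataFrame(rows)` would turn these rows into a table; the adapter itself stays free of any single-cell or MS logic.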

Master ecosystem diagram

Categories & packages (brand icons)

1 · Biological data & I/O upstream

Biopython · Biotite · bioservices · gget

FASTA/PDB I/O and UniProt/NCBI access. Consumed as df_seq; AAanalysis builds thin adapters, not a parser.

2 · Protein representations upstream

fair-esm · transformers (ProtT5) · bio-embeddings · Bio.PDB

Embeddings and structure become position-aware pseudo-scales via run_num. Named physicochemical scales stay the most directly interpretable layer.

3 · Protein feature descriptors comparison

iFeature · propy3 · PyBioMed

The only genuine competitors. They enumerate descriptor vectors; AAanalysis discovers task-aware, position-aware, explanation-ready features. The honest axis to benchmark on.

4 · AAanalysis core layer

CPP · SequenceFeature · AAclust · dPULearn · TreeModel · ShapModel · CPPPlot

Task framing, test-vs-reference construction, Part×Split×Scale feature discovery, PU learning for missing negatives, biological explanation and ΔCPP design steering.

5 · ML / DL models downstream

scikit-learn · XGBoost · LightGBM · PyTorch

Train on the AAanalysis feature matrix. Target: sklearn-compatible transformers so CPP features drop into standard pipelines.

6 · Explainability (XAI) downstream

SHAP · Captum · LIME · DiCE

Methods covered: PDP · SHAP · Integrated Gradients · LIME · Grad-CAM · PFI · Saliency · LRP · Anchors · ALE · DeepLIFT · DiCE · SmoothGrad · CAM · CEM · Shapley values

Generic attribution becomes biologically readable on CPP features: "JMD-C charge pattern raised substrate prediction" rather than "feature_17 +0.21".
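The translation from an opaque attribution to a readable statement can be sketched as below. This is a hedged illustration only: it assumes feature identifiers follow a `PART-SPLIT-SCALE` pattern, and the `describe_feature` helper, the `SCALE_X` identifier, and the `scale_names` lookup are hypothetical names introduced for the example.

```python
def describe_feature(feature_id: str, shap_value: float, scale_names: dict) -> str:
    """Turn an assumed 'PART-SPLIT-SCALE' feature id plus an attribution into a sentence."""
    part, split, scale_id = feature_id.split("-", 2)
    part_label = part.replace("_", "-")  # e.g. JMD_C -> JMD-C
    scale_label = scale_names.get(scale_id, scale_id)  # fall back to the raw id
    if shap_value > 0:
        direction = "raised"
    elif shap_value < 0:
        direction = "lowered"
    else:
        direction = "did not change"
    return (f"{part_label} x {split} x {scale_label} "
            f"{direction} the prediction ({shap_value:+.2f})")


# Hypothetical lookup from scale id to a human-readable property name.
scale_names = {"SCALE_X": "charge"}
print(describe_feature("JMD_C-Pattern(C,4,8)-SCALE_X", 0.21, scale_names))
```

The attribution value itself is unchanged; only the feature name carries the biological meaning, which is why interpretable feature construction has to happen before the explainer runs.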

8 · XAI evaluation downstream

Quantus · OpenXAI

Scores explanation quality: faithfulness · robustness · localization · complexity · randomization checks

Lets AAanalysis show its CPP and ShapModel attributions are faithful and stable, not just plausible. The difference between a convincing plot and a trustworthy explanation.

9 · Causal inference downstream

DoWhy · EconML

PyWhy family: causal graphs · identification · effect estimation · refutation / sensitivity

CPP finds what distinguishes two groups; causal tools test what drives the outcome. Turns correlated discriminative features into refutable causal hypotheses.

7 · Optimization & design downstream

Optuna · pymoo · DEAP

Tune CPP/model settings and run multi-objective design. AAanalysis exposes objective functions and explains why a candidate improved (ΔCPP), rather than replacing the optimizers.

Side branch · Proteomics / MS optional

pyteomics · pyopenms

MS data access and peptide mass/charge/pI. Relevant to the flyability use case: predict and explain sequence determinants downstream of MS workflows.

Side branch · Single-cell / spatial omics optional

AnnData · Scanpy / scverse · MuData

Marker / differentially-expressed gene & protein sets enter as df_seq via thin [omics] adapters (from_anndata / to_anndata); AAanalysis adds the protein-sequence interpretation layer. The expression→sequence-signature enrichment workflow itself is downstream (ProtXplain), not core.

Cross-cutting · Model validation infra

sklearn.metrics · MLflow · MAPIE

AAanalysis owns the protein-specific protocols: homology-aware splits, same-protein leakage checks, shuffled-label controls, feature stability, per-protein AP.
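As one example of such a protocol, a homology-aware split can be sketched in a few lines. The sketch assumes that homology clusters were computed beforehand by an external clustering tool and arrive as a protein-to-cluster mapping; the `cluster_aware_folds` helper and the example identifiers are hypothetical, not part of AAanalysis.

```python
from collections import defaultdict


def cluster_aware_folds(cluster_of: dict, n_folds: int) -> dict:
    """Assign every protein to a fold so that no homology cluster spans two folds."""
    members = defaultdict(list)
    for protein, cluster in cluster_of.items():
        members[cluster].append(protein)

    fold_sizes = [0] * n_folds
    fold_of = {}
    # Greedy balancing: place the largest clusters first into the emptiest fold.
    for cluster in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        target = fold_sizes.index(min(fold_sizes))
        for protein in members[cluster]:
            fold_of[protein] = target
        fold_sizes[target] += len(members[cluster])
    return fold_of


# Assumed example: six proteins in three precomputed homology clusters.
cluster_of = {"P1": "c1", "P2": "c1", "P3": "c2", "P4": "c3", "P5": "c3", "P6": "c3"}
fold_of = cluster_aware_folds(cluster_of, n_folds=2)
print(fold_of)
```

Because whole clusters move together, close homologs of a test protein can never appear in its training fold. scikit-learn's `GroupKFold` gives the same guarantee when cluster identifiers are passed as `groups`.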

Comparison matrix

Note on icons: the AAanalysis logo is the real brand mark, extracted from the v1.1 cheat sheet and embedded as an image. Other package marks are recognizable brand-colored emblems recreated in SVG (Biopython, scikit-learn, PyTorch, Hugging Face, SHAP, Meta, Microsoft) or brand-colored tiles for niche scientific packages. The package set is curated to the most central per category; less-central ones (scikit-bio, pysam, pydssp, CatBoost, JAX, Alibi, InterpretML, Ray Tune, Nevergrad, peptides.py) were trimmed for clarity.

Project scale (indicative)

Python packages only. Marks use each project's brand colors; the AAanalysis core uses cheat-sheet colors (charcoal + teal) and the real logo. Only category 3 is a direct comparison.
#	Category	Relationship	Python packages	What AAanalysis adds / why it complements
1	Biological data & I/O	upstream	Biopython, Biotite, bioservices, gget	File/database/structure plumbing; records consumed as `df_seq`.
2	Protein representations	upstream	fair-esm, transformers (ProtT5), bio-embeddings, Bio.PDB	Makes embeddings/structure position-aware via `run_num`; flags lower interpretability of raw dims.
3	Protein feature descriptors	comparison	iFeature, propy3, PyBioMed	The benchmark axis. Task-aware, position-aware, explanation-ready CPP features vs broad enumeration.
4	AAanalysis	core	CPP, SequenceFeature, AAclust, dPULearn, TreeModel, ShapModel, CPPPlot	The interpretable middle layer: framing, reference construction, feature discovery, PU learning, explanation, design.
5	ML / DL models	downstream	scikit-learn, XGBoost, LightGBM, PyTorch	Train on the feature matrix; target is sklearn-pipeline compatibility.
6	Explainability (XAI)	downstream	SHAP, Captum, LIME, DiCE	Makes attribution biologically meaningful; separates group-level importance from per-sample impact.
8	XAI evaluation	downstream	Quantus, OpenXAI	Scores faithfulness/robustness of CPP & ShapModel explanations — trustworthy, not just plausible.
9	Causal inference	downstream	DoWhy, EconML (PyWhy)	Turns correlated discriminative features into refutable causal hypotheses about drivers.
7	Optimization & design	downstream	Optuna, pymoo, DEAP	Exposes objective functions; design as interpretable scoring, filtering, and ΔCPP steering.
–	Proteomics / MS	side branch	pyteomics, pyopenms	Optional peptide/flyability workflows downstream of MS data access.
–	Single-cell / spatial omics	side branch	AnnData, Scanpy / scverse, MuData	Consume marker / DE gene & protein sets as `df_seq` via optional `[omics]` adapters; the enrichment orchestration is downstream (ProtXplain).
–	Model validation	cross-cutting	sklearn.metrics, MLflow, MAPIE	AAanalysis owns protein-specific protocols; uses generic tooling for metrics and tracking.

Approximate GitHub stars and order-of-magnitude source LOC, grouped by ecosystem role (downstream split into ML/DL · XAI · optimization · causal). Numbers are indicative (≈2025) — run cloc and check GitHub for exact values before publishing. Takeaway: AAanalysis is small but sharp, with descriptor libraries as its only direct comparators.

Strategic summary — innovation & positioning

Bottom line. AAanalysis is not innovative because it computes protein features — iFeature, propy3, PyBioMed, AAindex tools and PLM-embedding workflows already do that. It is innovative as a task-aware, contrastive, biologically interpretable protein-feature discovery layer. Its biggest gain is not "more descriptors" but turning small, messy protein datasets into residue-/region-aware, test-versus-reference explanations that biologists can read and ML pipelines can use.

Where it is strongest: small-data protein prediction (~20–500 positives); missing-negative settings (dPULearn turns unlabeled backgrounds into usable references without pretending they are true negatives); the test-vs-reference contrast that mirrors how biologists think (substrates vs non-substrates, detected vs undetected peptides, PTM sites vs non-sites); biological readability of XAI (not “feature_137 is important” but “JMD-C × pattern × charge raises substrate prediction”); and emerging ΔCPP design steering.

Where it should not compete: descriptor breadth (iFeature / PyBioMed), generic ML (scikit-learn / XGBoost / LightGBM), deep learning (PyTorch / ESM / ProtT5), general XAI theory (SHAP / Captum / Alibi / Quantus), MS processing (pyOpenMS / pyteomics), and single-cell / spatial omics analysis (Scanpy / scverse). It integrates with these rather than replacing them — consuming their gene/protein sets and adding interpretable protein-sequence signatures, while the expression→signature enrichment orchestration stays downstream (ProtXplain) — small but sharp.

“Why not just wrap iFeature / PyBioMed and apply CPP?” That is a good idea, but as an optional adapter layer, not the core identity. AAanalysis should consume external descriptors and make them task-aware, contrastive, ranked, validated and explainable, rather than re-implement thousands of descriptors. Key nuance: classical descriptors are usually global (one vector per sequence), so to keep AAanalysis’s edge they should be computed per part (Part × Descriptor) or on sliding windows (position × descriptor), with strong filtering to avoid feature explosion. A fitting name is Contrastive Descriptor Profiling; this turns competitors into benchmarks and optional inputs while preserving the unique identity. Best treated as a v1.2 candidate, not core v1.1.
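The difference between a global descriptor and its per-part or sliding-window variants can be shown with a toy descriptor. The sketch below uses a simple net-charge function as a stand-in for any external descriptor; the `per_part` and `sliding` helpers and the part names are hypothetical, chosen only to illustrate the Part × Descriptor and position × descriptor layouts.

```python
# Toy descriptor: net charge of a sequence (stand-in for any external descriptor).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def net_charge(seq: str) -> int:
    """Sum of residue charges; uncharged residues count as 0."""
    return sum(CHARGE.get(aa, 0) for aa in seq)


def per_part(parts: dict, descriptor) -> dict:
    """Part x Descriptor: one descriptor value per named sequence part."""
    return {name: descriptor(seq) for name, seq in parts.items()}


def sliding(seq: str, size: int, descriptor, step: int = 1) -> list:
    """Position x descriptor: (start index, value) for each window of a given size."""
    return [(start, descriptor(seq[start:start + size]))
            for start in range(0, len(seq) - size + 1, step)]


# Assumed example parts of one protein.
parts = {"jmd_n": "KKRSA", "tmd": "LLLAVVGGLL", "jmd_c": "DEESK"}
full_sequence = "".join(parts.values())

print("global:", net_charge(full_sequence))  # one value for the whole sequence
print("per part:", per_part(parts, net_charge))
print("windows:", sliding("KKLLDE", 3, net_charge))
```

In this example the global value is close to zero and hides that the two flanking parts carry opposite charge, which is the positional signal a contrastive comparison would need. Each added part or window multiplies the feature count, hence the need for strong filtering.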

Strongest next steps: a formal “AAanalysis vs descriptor libraries” benchmark (AAC, k-mers, AAindex, iFeature, propy3, PyBioMed, ESM+classifier vs CPP / CPP+dPULearn / CPP+PLM pseudo-scales) on shared datasets and one protocol; sklearn-compatible transformers (CPPTransformer, NumericalCPPTransformer); a protein-specific validation suite (homology-aware splits, shuffled-label controls, feature stability, per-protein AP, PU sanity checks); ecosystem adapter recipes; and keeping design (AAMut / SeqMut) labelled emerging until experimentally validated.