AAanalysis

Ecosystem

Python packages only. AAanalysis (cheat-sheet colors, center) sits between bioinformatics I/O and the general ML, XAI, causal, and design stack (category colors, outside). Package marks use each project's brand colors; the center carries the real AAanalysis logo.

Positioning brief

The honest question for any new tool is whether it reinvents the wheel. AAanalysis does not reinvent Biopython (bioinformatics I/O), scikit-learn (general ML), SHAP (model explanation), or PyTorch / ESM (deep models and embeddings); it sits between them, consuming their outputs and feeding the next layer. The one place it genuinely overlaps is with classical protein-descriptor libraries such as iFeature, iFeatureOmega, propy3, and PyBioMed.

But it is a different axle on the same wheel. Descriptor libraries map a sequence to a large catalogue of generic descriptors and stop. AAanalysis instead frames a biological task (residue, domain, or protein level), constructs a test-versus-reference comparison, and discovers Part × Split × Scale features through Comparative Physicochemical Profiling (CPP): each feature states where on the sequence a signal sits, how that region is read, and which physicochemical property it captures — then ranks them contrastively and explains them at single-residue resolution.
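The Part × Split × Scale idea can be illustrated with a toy sketch. This is not the AAanalysis API: the helper names (`split_segment`, `feature_value`, `contrast`) and the example sequences are hypothetical, the part is assumed to be given as a (start, stop) slice, and the Kyte-Doolittle hydropathy index stands in for one physicochemical scale.

```python
from statistics import mean

# Kyte-Doolittle hydropathy index, used here as one example "scale".
KD_HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def split_segment(part: str, i_th: int, n_split: int) -> str:
    """Return the i-th (1-based) of n_split roughly equal consecutive segments."""
    if not 1 <= i_th <= n_split:
        raise ValueError("i_th must be between 1 and n_split")
    start = (i_th - 1) * len(part) // n_split
    stop = i_th * len(part) // n_split
    return part[start:stop]


def feature_value(seq, part, i_th, n_split, scale):
    """Part (where) x Split (how it is read) x Scale (which property)."""
    start, stop = part                      # hypothetical part definition
    segment = split_segment(seq[start:stop], i_th, n_split)
    # Raises statistics.StatisticsError if the segment is empty.
    return mean(scale[aa] for aa in segment)


def contrast(test_seqs, ref_seqs, **feature):
    """Mean feature value in the test group minus the reference group."""
    test_mean = mean(feature_value(s, **feature) for s in test_seqs)
    ref_mean = mean(feature_value(s, **feature) for s in ref_seqs)
    return test_mean - ref_mean


# Toy comparison: is the second half of each sequence more hydrophobic
# in the test group than in the reference group?
diff = contrast(
    ["KKKKLLLL"], ["KKKKKKKK"],
    part=(4, 8), i_th=1, n_split=1, scale=KD_HYDROPATHY,
)
```

Ranking many such (part, split, scale) triples by their test-versus-reference difference is the contrastive step the paragraph describes; the real library may use different statistics and filtering.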

So the claim is deliberately narrow and defensible. AAanalysis is not broader than iFeature or PyBioMed; it is more interpretable, task-aware, position-resolved, and small-data-aware (PU learning, scarce negatives), with built-in biological plots and SHAP-based attribution. It is small but sharp — it cannot win on breadth against the ecosystem giants, only on the interpretability axis that matters for protein prediction.

Two relationships keep the map honest. A complement sits upstream or downstream and exchanges data; a comparison occupies the same functional role and is a benchmark candidate. Most of the ecosystem is complementary — only descriptor libraries are genuine comparisons. Downstream, AAanalysis is positioned to feed ML and optimization and to make XAI-evaluation (Quantus / OpenXAI), causal (DoWhy / EconML), uncertainty (MAPIE) and design (RFdiffusion / ProteinMPNN / PyRosetta) layers biologically readable; these are shown as candidate / future integrations, not current core. See the project-scale figure below for why "small but sharp" is the right framing.

Master ecosystem diagram

Figure: The AAanalysis Ecosystem. Python packages only · the interpretable protein-feature & workflow layer between bioinformatics I/O and ML / XAI / causal / design.

UPSTREAM COMPLEMENTS · data in

1 · Biological data, I/O & fetch: Biopython · Biotite · bioservices · gget · UniProt · scikit-bio. FASTA / PDB I/O · UniProt · NCBI · PDB · gget / bioservices.

2 · Protein representations: fair-esm · transformers · bio-embeddings · AlphaFold DB · Bio.PDB · ProtT5. Embeddings · structure · PTM / site → pseudo-scales.

4 · CORE LAYER · interpretable protein-feature layer (v1.1)
sequence → physicochemical scales → interpretable features → explainable ML
CPP · Part × Split × Scale · AAontology · numerical CPP (run_num)
SequenceFeature · AAclust · dPULearn · TreeModel · ShapModel
CPPPlot · ΔCPP design steering (AAMut · SeqMut · emerging)
where on the sequence × how to read it × which physicochemical property

3 · Feature engineering (input + benchmark): iFeature · propy3 · PyBioMed, versus AAanalysis (single-residue · contrastive · explained).

10 · Proteomics / MS (optional): pyteomics · pyopenms · AlphaPept. CPP → detectability · flyability · ionization · charge · retention.

DOWNSTREAM COMPLEMENTS · consume the interpretable feature matrix, explanations & objectives

5 · ML / DL models: scikit-learn · XGBoost · LightGBM · PyTorch. Compatible · trained on matrix X.

6 · Optimization: Optuna · pymoo · DEAP · BoTorch / Ax. Candidate · ΔCPP objectives.

7 · Protein design: PyRosetta · ProteinMPNN · RFdiffusion · ESM-IF. Candidate · score / steer ΔCPP.

9 · XAI evaluation: Quantus · OpenXAI. Faithfulness · robustness · localization · complexity. Candidate · benchmark explanations.

8 · Explainability (XAI) · method taxonomy (status is either implemented in AAanalysis or candidate to adopt from other packages)

Method category | Aim | Method examples | Status
Feature attribution | local & global importance | SHAP · PFI · PDP · ICE · ALE · LIME · LRP · IntGrad · Grad-CAM | ShapModel
Example-based | representative & counterfactual | MMD-CRITIC · Wachter · DiCE · CEM | to adopt
Rule-extraction | logical IF-THEN rules | ANCHOR · RULE-FIT · LORE · TREPAN | to adopt
Neural methods | explain DL internals | LRP · IntGrad · Grad-CAM · TCAV · GNNExplainer · XGNN | Captum

Non-post-hoc / complementary trust layers:

Method category | Aim | Method examples | Status
Surrogate models | distill black-box | train white-box on black-box predictions | to adopt
Uncertainty estimation | prediction confidence | conformal prediction · MAPIE | MAPIE
Causal modelling | cause → effect | DoWhy · EconML (PyWhy) | DoWhy / EconML

Already in AAanalysis: ShapModel (SHAP) & TreeModel importance. The rest are candidates to adopt; CPP makes them biologically readable.

CROSS-CUTTING · Model validation (protein-specific protocols): homology-aware splits · same/different-protein splits · shuffled-label controls · feature stability · per-protein AP · PU-label sanity checks. Tracking: MLflow.

Legend, relationship to AAanalysis: upstream complement (data in) · downstream complement (models · XAI · causal · design) · direct comparison / benchmark · optional side branch.

Legend, maturity: implemented · candidate / future · optional bridge.

Core message: AAanalysis is the interpretable middle band. It complements the stack around it, and competes only with classical descriptor libraries, on interpretability and task-awareness rather than breadth.

Categories & packages (brand icons)

1 · Biological data & I/O upstream

Biopython · Biotite · bioservices · gget

FASTA/PDB I/O and UniProt/NCBI access. Consumed as df_seq; AAanalysis builds thin adapters, not a parser.

2 · Protein representations upstream

fair-esm · transformers (ProtT5) · bio-embeddings · Bio.PDB

Embeddings and structure become position-aware pseudo-scales via run_num. Named physicochemical scales stay the most directly interpretable layer.
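A rough sketch of what "position-aware pseudo-scales" could mean in practice: each dimension of a per-residue numeric matrix (for example an embedding) is treated like a scale and averaged over a sequence region. The function name and input layout are assumptions for illustration; this is not the signature of run_num, which the text does not specify.

```python
from statistics import mean


def pseudo_scale_features(per_residue, start, stop):
    """Average every numeric dimension over residues [start, stop).

    per_residue: list of rows, one row of floats per residue
                 (hypothetical layout: sequence length x n_dims).
    Returns one value per dimension, i.e. one feature per pseudo-scale.
    """
    window = per_residue[start:stop]
    if not window:
        raise ValueError("region is empty")
    n_dims = len(window[0])
    return [mean(row[d] for row in window) for d in range(n_dims)]


# Three residues, two embedding dimensions (made-up numbers).
toy_embedding = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
region_features = pseudo_scale_features(toy_embedding, start=1, stop=3)
```

The resulting features are located on the sequence, but a raw embedding dimension has no physical meaning, which is why named physicochemical scales remain easier to interpret.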

3 · Protein feature descriptors comparison

iFeature · propy3 · PyBioMed

The only genuine competitors. They enumerate descriptor vectors; AAanalysis discovers task-aware, position-aware, explanation-ready features. The honest axis to benchmark on.

4 · AAanalysis core layer

CPP · SequenceFeature · AAclust · dPULearn · TreeModel · ShapModel · CPPPlot

Task framing, test-vs-reference construction, Part×Split×Scale feature discovery, PU learning for missing negatives, biological explanation and ΔCPP design steering.

5 · ML / DL models downstream

scikit-learn · XGBoost · LightGBM · PyTorch

Train on the AAanalysis feature matrix. Target: sklearn-compatible transformers so CPP features drop into standard pipelines.
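The target of sklearn-compatible transformers can be pictured with a duck-typed sketch of the fit / transform contract. `ToyCPPTransformer` is hypothetical and not part of AAanalysis; the regions, the charge scale, and the stateless `fit` are simplifying assumptions. For real pipelines, inheriting from `sklearn.base.BaseEstimator` and `sklearn.base.TransformerMixin` is the usual route.

```python
from statistics import mean


class ToyCPPTransformer:
    """Hypothetical sequence-to-feature transformer with fit/transform."""

    def __init__(self, regions, scale):
        self.regions = regions    # list of (start, stop) slices, assumed given
        self.scale = scale        # dict: amino acid -> numeric value

    def fit(self, X, y=None):
        # Stateless in this sketch; a real version would use y to select
        # discriminative features during fitting.
        self.n_features_out_ = len(self.regions)
        return self

    def transform(self, X):
        # One row per sequence, one column per region.
        return [
            [mean(self.scale.get(aa, 0.0) for aa in seq[a:b])
             for a, b in self.regions]
            for seq in X
        ]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


charge = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}
transformer = ToyCPPTransformer(regions=[(0, 2), (2, 4)], scale=charge)
X_feat = transformer.fit_transform(["KKDD", "KDAA"])
```

Keeping feature selection inside `fit` matters for validation: selection is then repeated within each cross-validation fold instead of leaking information from held-out proteins.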

6 · Explainability (XAI) downstream

SHAP · Captum · LIME · DiCE

Methods covered: PDP · SHAP · Integrated Gradients · LIME · Grad-CAM · PFI · Saliency · LRP · Anchors · ALE · DeepLIFT · DiCE · SmoothGrad · CAM · CEM · Shapley values

Generic attribution becomes biologically readable on CPP features: "JMD-C charge pattern raised substrate prediction" rather than "feature_17 +0.21".
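As a small illustration of that readability gain, the sketch below renders an attribution value as a sentence once a feature carries its part, split, and scale. The `Feature` tuple and `describe` helper are hypothetical, and the attribution value is a made-up number, not a result.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    part: str         # where on the sequence, e.g. "JMD-C"
    split: str        # how the region is read, e.g. "pattern"
    scale_name: str   # which property, e.g. "charge"


def describe(feature, attribution, target="substrate prediction"):
    """Turn a signed attribution into a sentence a biologist can read."""
    if attribution > 0:
        direction = "raised"
    elif attribution < 0:
        direction = "lowered"
    else:
        direction = "did not change"
    return (f"{feature.part} {feature.split} {feature.scale_name} "
            f"{direction} {target} ({attribution:+.2f})")


message = describe(Feature("JMD-C", "pattern", "charge"), 0.21)
```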

8 · XAI evaluation downstream

Quantus · OpenXAI

Scores explanation quality: faithfulness · robustness · localization · complexity · randomization checks

Lets AAanalysis show its CPP and ShapModel attributions are faithful and stable, not just plausible. The difference between a convincing plot and a trustworthy explanation.

9 · Causal inference downstream

DoWhy · EconML

PyWhy family: causal graphs · identification · effect estimation · refutation / sensitivity

CPP finds what distinguishes two groups; causal tools test what drives the outcome. Turns correlated discriminative features into refutable causal hypotheses.

7 · Optimization & design downstream

Optuna · pymoo · DEAP

Tune CPP/model settings and run multi-objective design. AAanalysis exposes objective functions and explains why a candidate improved (ΔCPP), rather than replacing the optimizers.
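The text does not define ΔCPP formally. One plausible reading, sketched here as an assumption, is the change in a feature value between an original sequence and a mutated candidate. All helper names are hypothetical, and a simple charge table stands in for a physicochemical scale.

```python
from statistics import mean

# Simplified side-chain charge; every other residue counts as 0.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def region_value(seq, start, stop, scale):
    """Mean scale value over residues [start, stop)."""
    return mean(scale.get(aa, 0.0) for aa in seq[start:stop])


def mutate(seq, pos, new_aa):
    """Return seq with the residue at 0-based position pos replaced."""
    return seq[:pos] + new_aa + seq[pos + 1:]


def delta_feature(seq, pos, new_aa, start, stop, scale):
    """Feature value of the mutant minus that of the original (assumed ΔCPP)."""
    mutant = mutate(seq, pos, new_aa)
    return (region_value(mutant, start, stop, scale)
            - region_value(seq, start, stop, scale))


# Replacing D with K inside the scored region raises its mean charge.
gain = delta_feature("AAAADDAA", pos=4, new_aa="K",
                     start=4, stop=8, scale=CHARGE)
```

An optimizer such as Optuna or pymoo could treat a function like `delta_feature` as one objective, while the per-feature deltas explain which region and property changed.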

Side branch · Proteomics / MS optional

pyteomics · pyopenms

MS data access and peptide mass/charge/pI. Relevant to the flyability use case: predict and explain sequence determinants downstream of MS workflows.

Cross-cutting · Model validation infra

sklearn.metrics · MLflow · MAPIE

AAanalysis owns the protein-specific protocols: homology-aware splits, same-protein leakage checks, shuffled-label controls, feature stability, per-protein AP.
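Two of these protocols are easy to sketch with the standard library: a group-aware split that keeps every sample from one protein (or homology cluster) on the same side, and a shuffled-label control. The group IDs are assumed to be precomputed, for example by sequence clustering; function names are hypothetical. scikit-learn's `GroupKFold` and `GroupShuffleSplit` provide maintained versions of the first idea.

```python
import random


def group_split(groups, test_fraction=0.25, seed=0):
    """Split sample indices so that no group appears in both train and test."""
    unique = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


def shuffled_labels(labels, seed=0):
    """Permuted copy of the labels; a model trained on these should score
    near chance, otherwise the pipeline is leaking information."""
    rng = random.Random(seed)
    permuted = list(labels)
    rng.shuffle(permuted)
    return permuted


# Six residue windows drawn from four proteins (toy IDs).
groups = ["P1", "P1", "P2", "P3", "P3", "P4"]
train_idx, test_idx = group_split(groups)
control = shuffled_labels([1, 1, 0, 0, 1, 0])
```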

Comparison matrix

# | Category | Relationship | Python packages | What AAanalysis adds / why it complements
1 | Biological data & I/O | upstream | Biopython, Biotite, bioservices, gget | File/database/structure plumbing; records consumed as df_seq.
2 | Protein representations | upstream | fair-esm, transformers (ProtT5), bio-embeddings, Bio.PDB | Makes embeddings/structure position-aware via run_num; flags lower interpretability of raw dims.
3 | Protein feature descriptors | comparison | iFeature, propy3, PyBioMed | The benchmark axis. Task-aware, position-aware, explanation-ready CPP features vs broad enumeration.
4 | AAanalysis | core | CPP, SequenceFeature, AAclust, dPULearn, TreeModel, ShapModel, CPPPlot | The interpretable middle layer: framing, reference construction, feature discovery, PU learning, explanation, design.
5 | ML / DL models | downstream | scikit-learn, XGBoost, LightGBM, PyTorch | Train on the feature matrix; target is sklearn-pipeline compatibility.
6 | Explainability (XAI) | downstream | SHAP, Captum, LIME, DiCE | Makes attribution biologically meaningful; separates group-level importance from per-sample impact.
8 | XAI evaluation | downstream | Quantus, OpenXAI | Scores faithfulness/robustness of CPP & ShapModel explanations: trustworthy, not just plausible.
9 | Causal inference | downstream | DoWhy, EconML (PyWhy) | Turns correlated discriminative features into refutable causal hypotheses about drivers.
7 | Optimization & design | downstream | Optuna, pymoo, DEAP | Exposes objective functions; design as interpretable scoring, filtering, and ΔCPP steering.
- | Proteomics / MS | side branch | pyteomics, pyopenms | Optional peptide/flyability workflows downstream of MS data access.
- | Model validation | cross-cutting | sklearn.metrics, MLflow, MAPIE | AAanalysis owns protein-specific protocols; uses generic tooling for metrics and tracking.

Python packages only. Marks use each project's brand colors; the AAanalysis core uses cheat-sheet colors (charcoal + teal) and the real logo. Only category 3 is a direct comparison.

One-line positioning. AAanalysis turns protein sequences and per-residue numerical features into interpretable, task-aware CPP features for small-data protein prediction, explanation, and design steering. It complements Biopython, scikit-learn, SHAP, Quantus, DoWhy, and the optimizers, and compares directly only with descriptor libraries such as iFeature and propy3.

Note on icons: the AAanalysis logo is the real brand mark, extracted from the v1.1 cheat sheet and embedded as an image. Other package marks are recognizable brand-colored emblems recreated in SVG (Biopython, scikit-learn, PyTorch, Hugging Face, SHAP, Meta, Microsoft) or brand-colored tiles for niche scientific packages. The package set is curated to the most central per category; less-central ones (scikit-bio, pysam, pydssp, CatBoost, JAX, Alibi, InterpretML, Ray Tune, Nevergrad, peptides.py) were trimmed for clarity.

Project scale (indicative)

Approximate GitHub stars and order-of-magnitude source LOC, grouped by ecosystem role (downstream split into ML/DL · XAI · optimization · causal). Numbers are indicative (≈2025) — run cloc and check GitHub for exact values before publishing. Takeaway: AAanalysis is small but sharp, with descriptor libraries as its only direct comparators.

Figure: Project scale. Indicative GitHub stars (log scale, approx.) and source LOC, grouped by ecosystem role. Order-of-magnitude values (≈2025); run cloc / check GitHub for exact values.

Package | GitHub stars | Source LOC

Core layer · interpretable protein-feature workflow
AAanalysis | ~80 | ~30k

Alternatives · classical descriptor libraries (direct comparison)
iFeature | ~300 | ~15k
iFeatureOmega | ~120 | ~20k
propy3 | ~120 | ~10k
PyBioMed | ~280 | ~50k

Upstream · data, I/O & protein representations
Biopython | ~4.5k | ~80k
scikit-bio | ~950 | ~30k
fair-esm | ~3.6k | ~5k
transformers | ~135k | ~500k
AlphaFold | ~13k | ~30k

Downstream · ML / DL models
scikit-learn | ~61k | ~250k
PyTorch | ~85k | ~3,000k
XGBoost | ~26k | ~70k
LightGBM | ~17k | ~40k

Downstream · Explainability (XAI) & evaluation
SHAP | ~23k | ~50k
Captum | ~5k | ~30k
Quantus | ~600 | ~15k

Downstream · Optimization & design
Optuna | ~11k | ~40k
DEAP | ~6k | ~15k
Nevergrad | ~4k | ~40k
pymoo | ~2.4k | ~30k

Downstream · Causal inference
DoWhy | ~7k | ~30k
EconML | ~4k | ~30k

Optional side branch · proteomics / MS
pyteomics | ~120 | ~20k
pyOpenMS | ~500 | ~300k
AlphaPept | ~1.5k | ~30k

AAanalysis is small but sharp: it doesn't match Biopython in breadth, scikit-learn / PyTorch in ML, or SHAP in general XAI. Its only direct comparators are descriptor libraries (rose); there it competes on interpretability & task-awareness, not descriptor count.

Strategic summary — innovation & positioning

Bottom line. AAanalysis is not innovative because it computes protein features — iFeature, propy3, PyBioMed, AAindex tools and PLM-embedding workflows already do that. It is innovative as a task-aware, contrastive, biologically interpretable protein-feature discovery layer. Its biggest gain is not "more descriptors" but turning small, messy protein datasets into residue-/region-aware, test-versus-reference explanations that biologists can read and ML pipelines can use.

Where it is strongest: small-data protein prediction (~20–500 positives); missing-negative settings (dPULearn turns unlabeled backgrounds into usable references without pretending they are true negatives); the test-vs-reference contrast that mirrors how biologists think (substrates vs non-substrates, detected vs undetected peptides, PTM sites vs non-sites); biological readability of XAI (not “feature_137 is important” but “JMD-C × pattern × charge raises substrate prediction”); and emerging ΔCPP design steering.
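To make the missing-negative setting concrete, here is a deliberately simple stand-in heuristic: from an unlabeled pool, pick the samples farthest from the centroid of the known positives and use them as a reference group. This is an illustrative assumption, not the dPULearn algorithm, and the feature vectors are made up.

```python
from math import dist


def centroid(rows):
    """Column-wise mean of equally long feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pick_reference(positives, unlabeled, n_ref):
    """Indices of the n_ref unlabeled samples farthest from the positives.

    These are treated as a reference group, not as verified negatives.
    """
    center = centroid(positives)
    ranked = sorted(range(len(unlabeled)),
                    key=lambda i: dist(unlabeled[i], center),
                    reverse=True)
    return sorted(ranked[:n_ref])


positives = [[1.0, 1.0], [1.2, 0.8]]
unlabeled = [[1.1, 0.9], [5.0, 5.0], [0.9, 1.1], [-4.0, -3.0]]
reference_idx = pick_reference(positives, unlabeled, n_ref=2)
```

Unlabeled samples that sit close to the positives stay unlabeled, which keeps likely hidden positives out of the reference group.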

Where it should not compete: descriptor breadth (iFeature / PyBioMed), generic ML (scikit-learn / XGBoost / LightGBM), deep learning (PyTorch / ESM / ProtT5), general XAI theory (SHAP / Captum / Alibi / Quantus), and MS processing (pyOpenMS / pyteomics). It integrates with these rather than replacing them — small but sharp.

“Why not just wrap iFeature / PyBioMed and apply CPP?” That is a good idea, but as an optional adapter layer, not the core identity. AAanalysis should consume external descriptors and make them task-aware, contrastive, ranked, validated and explainable, rather than re-implement thousands of descriptors. Key nuance: classical descriptors are usually global (one vector per sequence), so to keep AAanalysis’s edge they should be computed per part (Part × Descriptor) or on sliding windows (position × descriptor), with strong filtering to avoid feature explosion. A fitting name is Contrastive Descriptor Profiling; this turns competitors into benchmarks and optional inputs while preserving the unique identity. Best treated as a v1.2 candidate, not core v1.1.
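The position × descriptor variant can be sketched with the simplest classical descriptor, amino acid composition, computed on sliding windows instead of once per sequence. The function name and parameters are hypothetical; an adapter for iFeature or PyBioMed descriptors would follow the same shape but is not shown.

```python
def window_composition(seq, window, step=1,
                       alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Amino acid composition per sliding window.

    Returns a list of (start, vector) pairs, where vector holds the
    fraction of each alphabet letter within seq[start:start + window].
    """
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        rows.append((start, [chunk.count(aa) / window for aa in alphabet]))
    return rows


# Two-letter alphabet keeps the toy output small.
profile = window_composition("AACC", window=2, alphabet="AC")
```

Each window yields a full descriptor vector, so the feature count grows with sequence length times descriptor size; this is the feature explosion that the filtering step has to control.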

Strongest next steps: a formal “AAanalysis vs descriptor libraries” benchmark (AAC, k-mers, AAindex, iFeature, propy3, PyBioMed, ESM+classifier vs CPP / CPP+dPULearn / CPP+PLM pseudo-scales) on shared datasets and one protocol; sklearn-compatible transformers (CPPTransformer, NumericalCPPTransformer); a protein-specific validation suite (homology-aware splits, shuffled-label controls, feature stability, per-protein AP, PU sanity checks); ecosystem adapter recipes; and keeping design (AAMut / SeqMut) labelled emerging until experimentally validated.

One line. AAanalysis is most innovative as a bridge: the interpretable, contrastive, task-aware protein-feature layer that makes protein-prediction explanations actionable for biology — which region, which residue pattern, and which physicochemical / pseudo-scale property distinguishes a functional group or drives a prediction.