AAanalysis — decision map

Decision Map

v1.1 dev-preview · stable on PyPI: 1.0.3

Legend: decision / question · option · AAanalysis box: class / method / function · data (df_*, X) · explanation · planned (post v1.1)

%%{init: {"flowchart": {"htmlLabels": true, "curve": "basis", "nodeSpacing": 26, "rankSpacing": 30, "padding": 8, "wrappingWidth": 360}}}%%
flowchart TB
  START(["START
What is your goal?"]):::start

  subgraph EX["① Explore
Sequences & groups"]
    QX{"What do you
want to explore?"}:::dec
    AAM["AAMut
explore substitutions"]:::fn
    Q1{"One or two
groups to analyze?"}:::dec
    AAL["AAlogoPlot.single_logo
one group · conservation"]:::fn
    AALC["AAlogoPlot.multi_logo
CPP.run
compare groups"]:::fn
    QREP{"How to find
representatives?"}:::dec
    REPs["filter_seq
cluster by similarity"]:::fn
    REPn["AAclust.select_proteins
numerical representatives"]:::fn
  end

  subgraph PR["② Build
Engineer features · predict & explain"]
    SAMP{"What is your
prediction task?"}:::dec
    Sc["binary classification (1/0)"]:::rep
    Sr["regression → threshold
SequenceFeature.get_labels_quantile
SequenceFeature.get_labels_tiered"]:::rep
    Sm["multi-class
SequenceFeature.get_labels_ovr
SequenceFeature.get_labels_ovo"]:::rep
    REFQ{"What is your
reference group? (label 0)"}:::dec
    AWS["AAWindowSampler
build reference windows"]:::fn
    DPU["dPULearn
mine reliable negatives → 0"]:::fn
    DSEQ["df_seq + labels"]:::df
    FE(["CPP
feature engineering
→ df_feat"]):::core
    EXQ{"What kind of
explanation?"}:::dec
    EXLVL{"Which level of
explanation?"}:::dec
    CPPP["CPPPlot
feature_map · profile · ranking"]:::fn
    FM["SequenceFeature.feature_matrix
→ X"]:::fn
    MODEL["TreeModel
scikit-learn · PyTorch"]:::fn
    TM["TreeModel
global feature importance"]:::fn
    SM["ShapModel
local feature impact"]:::fn
    XAIP["rule extraction · concept-based ·
example-based · causal modeling · uncertainty
(planned · post v1.1)"]:::plan
  end

  subgraph EN["③ Optimize
Protein engineering"]
    Q4{"Have a fitted
predictor?"}:::dec
    Q5{"Model-guided or
model-free?"}:::dec
    SO["SeqOpt
directed evolution"]:::fn
    BM["build a model first (→ ②)"]:::rep
    SMu["SeqMut
ΔCPP mutation scan
optional model → Δprediction"]:::fn
  end

  START -->|explore| QX
  START -->|predict| SAMP
  START -->|optimize| Q4

  QX -->|mutations| AAM
  QX -->|"sequences / datasets"| Q1
  QX -->|"representative proteins"| QREP
  Q1 -->|one| AAL
  Q1 -->|two| AALC
  QREP -->|sequence| REPs
  QREP -->|numerical| REPn

  SAMP -->|classification| Sc
  SAMP -->|regression| Sr
  SAMP -->|multi-class| Sm
  Sc -->|"labels"| REFQ
  Sr -->|"labels"| REFQ
  Sm -->|"labels"| REFQ
  REFQ -->|"have negatives"| DSEQ
  REFQ -->|"positives + unlabeled"| DPU
  REFQ -.->|"residue · need decoys"| AWS
  DPU --> DSEQ
  AWS --> DSEQ
  DSEQ --> FE
  FE -->|"predict"| FM
  FM -->|"fit model"| MODEL
  FE -->|"explain"| EXQ
  EXQ -->|"feature attribution"| EXLVL
  EXQ -.->|"others (later)"| XAIP
  EXLVL -->|"global"| TM
  EXLVL -->|"local"| SM
  TM --> CPPP
  SM --> CPPP

  Q4 -->|yes| SO
  Q4 -->|no| Q5
  Q5 -->|"model-guided"| BM
  Q5 -->|"model-free"| SMu
  BM -.->|then| SO

  classDef start fill:#67A1D0,stroke:#3f7bb0,stroke-width:2px,color:#ffffff
  classDef dec fill:#fdebcf,stroke:#e0902c,stroke-width:1.8px,color:#7a4a00
  classDef core fill:#dbe8f6,stroke:#67A1D0,stroke-width:3.5px,color:#333131,font-weight:bold
  classDef fn fill:#ffffff,stroke:#C9C9C9,color:#333131
  classDef df fill:#ffffff,stroke:#C9C9C9,color:#333131,font-weight:bold
  classDef rep fill:#ffffff,stroke:#C9C9C9,color:#333131
  classDef plan fill:#f1f3f5,stroke:#9a9a9a,stroke-dasharray:5 4,color:#9a9a9a
  classDef spacer fill:transparent,stroke:transparent,color:transparent
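
The "prediction task" branch of the map turns regression and multi-class targets into the 1/0 labels that CPP compares. The sketch below illustrates that idea in plain Python. The function names `quantile_threshold_labels` and `one_vs_rest_labels` and the `top_fraction` parameter are hypothetical and chosen for illustration; they are not the signatures of `SequenceFeature.get_labels_quantile` or `SequenceFeature.get_labels_ovr`, which may behave differently.

```python
def quantile_threshold_labels(values, top_fraction=0.25):
    """Label roughly the top `top_fraction` of values as 1 (test group), the rest as 0.

    Illustrative assumption: ties at the cutoff are all labeled 1,
    so slightly more than the requested fraction can end up in the test group.
    """
    if not 0 < top_fraction < 1:
        raise ValueError("top_fraction must be between 0 and 1 (exclusive)")
    n_top = max(1, round(len(values) * top_fraction))
    cutoff = sorted(values, reverse=True)[n_top - 1]
    return [1 if v >= cutoff else 0 for v in values]


def one_vs_rest_labels(classes):
    """Build one binary label list per class: that class is 1, all others are 0."""
    return {
        name: [1 if c == name else 0 for c in classes]
        for name in sorted(set(classes))
    }


# Regression target -> binary labels (top 25 % become the test group)
activity = [0.1, 0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.5]
binary_labels = quantile_threshold_labels(activity, top_fraction=0.25)

# Multi-class target -> one binary task per class
ovr_labels = one_vs_rest_labels(["a", "b", "c", "a"])
```

Each resulting label list can then play the role of "labels" next to df_seq in the map, with 1 as the test group and 0 as the reference group.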

CPP Feature engineering (Parts × Splits × Scales): How AAanalysis identifies interpretable physicochemical features

%%{init: {"flowchart": {"htmlLabels": true, "curve": "basis", "nodeSpacing": 26, "rankSpacing": 30, "padding": 10, "wrappingWidth": 360}}}%%
flowchart TB
  %% ---- input router · same df_seq as the main map ----
  SEQ["df_seq + labels"]:::df
  INQ{"What are your
input data / settings?"}:::dec
  SEQ --> INQ

  %% ===== column 1 · SCALES (sequence path · the value source) =====
  LS["load_scales"]:::fn
  SCQ{"Which Scale?
(physicochemical representation)"}:::dec
  SALL["load_scales()
All AAontology scales"]:::rep
  SSEL["load_scales(top_explain_n=10)
Interpretable scale sets"]:::rep
  SAAC["AAclust
reduce redundancy"]:::fn
  SCALES["df_scales"]:::df
  INQ -->|"AA scales"| LS
  LS --> SCQ
  SCQ -->|"all"| SALL
  SCQ -->|"curated"| SSEL
  SCQ -->|"custom
redundancy reduced"| SAAC
  SALL --> SCALES
  SSEL --> SCALES
  SAAC --> SCALES

  %% ===== column 2 · PARTS (sequence path) =====
  GDP["SequenceFeature.get_df_parts"]:::fn
  UNIT{"Which Part?
(biological sequence region)"}:::dec
  Uwin["TMD window
(fixed TMD length)"]:::rep
  Utmd["TMD + JMD-N / JMD-C
(flexible TMD length)"]:::rep
  Uwhole["the whole chain"]:::rep
  PARTS["df_parts"]:::df
  INQ -->|"sequences"| GDP
  GDP --> UNIT
  UNIT -->|"residue"| Uwin
  UNIT -->|"domain"| Utmd
  UNIT -->|"protein"| Uwhole
  Uwin --> PARTS
  Utmd --> PARTS
  Uwhole --> PARTS

  %% ===== column 3 · SPLITS (apply to BOTH df_parts and dict_num_parts) =====
  GSK["SequenceFeature.get_split_kws"]:::fn
  SPQ{"Which Split?
(segments & periodic patterns)"}:::dec
  SPC["Segment (n_split_max=1)"]:::rep
  SPP["Segment (n_split_max>1) ·
Pattern · PeriodicPattern"]:::rep
  SPLITS["split_kws"]:::df
  INQ -->|"splitting"| GSK
  GSK --> SPQ
  SPQ -->|"composition-like"| SPC
  SPQ -->|"position-resolved"| SPP
  SPC --> SPLITS
  SPP --> SPLITS

  %% ===== column 4 · numerical values (alternative to SCALES; only feeds run_num) =====
  PREP["EmbeddingPreprocessor
StructurePreprocessor
AnnotationPreprocessor
→ dict_num"]:::fn
  NFP["NumericalFeature.get_parts
→ dict_num_parts"]:::fn
  INQ -->|"numerical residue values"| PREP
  PREP --> NFP

  %% ---- CPP cores ----
  CPP(["  CPP.run  
Part × Split × Scale
→ df_feat"]):::algo
  CPPN(["  CPP.run_num  
Part × Split
→ df_feat"]):::algo
  SCALES -->|"Scale"| CPP
  PARTS -->|"Part"| CPP
  SPLITS -->|"Split"| CPP
  NFP -->|"Part"| CPPN
  SPLITS -->|"Split"| CPPN

  classDef dec fill:#fdebcf,stroke:#e0902c,stroke-width:1.8px,color:#7a4a00
  classDef algo fill:#ffffff,stroke:#333131,stroke-width:2.5px,color:#333131,font-weight:bold
  classDef fn fill:#ffffff,stroke:#C9C9C9,color:#333131
  classDef df fill:#ffffff,stroke:#C9C9C9,color:#333131,font-weight:bold
  classDef rep fill:#ffffff,stroke:#C9C9C9,color:#333131
  classDef spacer fill:transparent,stroke:transparent,color:transparent
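
To make the Part × Split × Scale combination in the diagram above concrete, here is a minimal plain-Python sketch of a single Segment feature. It is not the AAanalysis implementation: the scale values in `TOY_SCALE` are made up for illustration (they are not an AAontology scale), the helper names are hypothetical, and the rounding rule for segment boundaries is an assumption.

```python
from statistics import mean

# Made-up scale values for illustration only (not an AAontology scale)
TOY_SCALE = {"A": 0.5, "G": 0.25, "K": 0.0, "L": 1.0}


def segment(part, i_th, n_split):
    """Return the i_th (1-based) of n_split contiguous segments of a sequence part."""
    bounds = [round(k * len(part) / n_split) for k in range(n_split + 1)]
    return part[bounds[i_th - 1]:bounds[i_th]]


def feature_value(part, i_th, n_split, scale):
    """Part x Split x Scale: average the scale values over one segment of one part."""
    return mean(scale[aa] for aa in segment(part, i_th, n_split))


def mean_difference(parts, labels, i_th, n_split, scale):
    """Compare a feature between the test group (1) and the reference group (0)."""
    values = [feature_value(p, i_th, n_split, scale) for p in parts]
    test = [v for v, y in zip(values, labels) if y == 1]
    ref = [v for v, y in zip(values, labels) if y == 0]
    return mean(test) - mean(ref)


parts = ["LLLLKKKK", "LLLAKKKG", "KKKKLLLL", "GKKKALLL"]
labels = [1, 1, 0, 0]

# One segment covering the whole part: composition-like, position is lost
whole = feature_value("LLLLKKKK", 1, 1, TOY_SCALE)
# First of two segments: position-resolved
first_half = feature_value("LLLLKKKK", 1, 2, TOY_SCALE)
diff = mean_difference(parts, labels, 1, 2, TOY_SCALE)
```

With a single segment, "LLLLKKKK" and "KKKKLLLL" yield the same value, which is why the diagram calls n_split_max=1 composition-like. Splitting into two segments separates them, which is the position-resolved case.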

CPP needs two inputs: features (Parts × Splits × Scales, built in the CPP feature-engineering diagram above) and labels (test group = 1 vs. reference group = 0).

Sampling builds the label-0 reference group. Use your own negatives, AAWindowSampler decoys (residue level), or dPULearn, a deterministic PU-learning (ML) step that mines reliable negatives from positives plus an unlabeled pool by PCA or distance on a feature matrix X (embeddings or features). dPULearn produces labels and is independent of CPP.

Splits apply to both df_parts and dict_num_parts. Only Scales are sequence-only: numerical inputs already carry their values, so they skip Scales and go through CPP.run_num.

A prediction task has two axes: kind of task (classification / regression / multi-class) × kind of data unit (residue window / domain (TMD) / whole protein).

Explain: CPPPlot visualizes the CPP group signature straight from df_feat (no model needed); TreeModel (global importance) and ShapModel (local SHAP) need a model fitted on X.

Colors: blue = class / method / function (cheat-sheet style); dashed = dev-preview / planned (post v1.1).

Plot pairs: every analysis class has a mirror *Plot class.
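
The idea of mining reliable negatives from positives plus an unlabeled pool can be sketched with a simple distance rule. This is a simplified illustration, not dPULearn itself: it ranks unlabeled samples by Euclidean distance to the centroid of the positives and relabels the farthest ones as 0, with no PCA step. The function name `mine_reliable_negatives` and the label encoding (1 = positive, 2 = unlabeled) are assumptions made for this example.

```python
import math


def mine_reliable_negatives(X, labels, n_neg, label_pos=1, label_unl=2):
    """Relabel the n_neg unlabeled samples farthest from the positive centroid as 0.

    Assumed encoding: label_pos marks positives, label_unl marks unlabeled samples.
    Ties in distance are broken by sample index, so the result is deterministic.
    """
    pos = [x for x, y in zip(X, labels) if y == label_pos]
    unl_idx = [i for i, y in enumerate(labels) if y == label_unl]
    if not pos:
        raise ValueError("at least one positive sample is required")
    if n_neg > len(unl_idx):
        raise ValueError("n_neg exceeds the number of unlabeled samples")

    # Centroid of the positive group, computed feature by feature
    centroid = [sum(col) / len(pos) for col in zip(*pos)]

    # Farthest unlabeled samples first
    ranked = sorted(unl_idx, key=lambda i: (-math.dist(X[i], centroid), i))

    new_labels = list(labels)
    for i in ranked[:n_neg]:
        new_labels[i] = 0
    return new_labels


X = [[0, 0], [0, 2], [5, 5], [1, 1], [9, 9], [0, 1]]
labels = [1, 1, 2, 2, 2, 2]
mined = mine_reliable_negatives(X, labels, n_neg=2)
```

Because the output is only a new label list, the step stays independent of CPP: the mined 0 labels can be combined with the positives and handed on together with df_seq.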