Release notes
Version 1.1
v1.1.0 (Unreleased)
This release substantially expands the feature-engineering surface: a unified feature-preprocessor family (embedding / structure / annotation sources), a numerical mode for CPP, a configuration-sweep wrapper, sequence-window sampling, and a suite of site-localization metrics and plotting helpers.
Added
Data Handling
EmbeddingPreprocessor: Instance-based class for per-residue protein language model (PLM) embeddings. The primary encode method normalizes raw embeddings into a [0, 1] per-residue dict_num (method='minmax' | 'quantile' | 'sigmoid') ready for CPP.run_num; the secondary build_scales / build_cat pair collapses them into pseudo-scales / pseudo-categories for CPP.run. The fetch_embeddings method (aaanalysis[embed]) downloads a curated PLM (ESM-2, ESM-1b, ProtT5, ProstT5) from the Hugging Face Hub and computes per-protein (mode='protein', mean/max/cls pooling) or per-residue (mode='residue') embeddings, with a hardware-aware size guard; the pool_embeddings helper reduces per-residue arrays to per-protein vectors. A new [embed] install extra isolates the heavy torch / transformers dependencies (see ADR-0029).
StructurePreprocessor (aaanalysis[pro]): Converts PDB / CIF / AlphaFold files (and AlphaFold PAE sidecars) into [0, 1]-normalized per-residue numerical tensors. Methods: get_dssp, encode_dssp, encode_pdb, encode_pae, get_domains, encode_domains, build_scales, build_cat.
AnnotationPreprocessor (aaanalysis[pro]): Fetches from UniProt (or ingests user / predictor labels) per-residue PTM and functional-site annotations and encodes them into per-residue tensors. Methods: fetch_uniprot, ingest, register_feature, encode, build_scales, build_cat, to_df_seq.
combine_dict_nums: Concatenates multiple per-residue tensors (embeddings / structure / annotation) along the feature axis to build a combined CPP.run_num input.
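The [0, 1] normalization and the feature-axis concatenation described for these preprocessors can be illustrated without the package. The following is a minimal sketch in plain Python, not the EmbeddingPreprocessor or combine_dict_nums implementation; the helper names minmax_per_feature and concat_features and the toy layout (entry mapped to a list of per-residue rows) are assumptions made for illustration only.

```python
def minmax_per_feature(rows):
    """Scale each feature column of a residues-by-features table into [0, 1]."""
    n_feat = len(rows[0])
    lows = [min(r[j] for r in rows) for j in range(n_feat)]
    highs = [max(r[j] for r in rows) for j in range(n_feat)]
    scaled = []
    for r in rows:
        scaled.append([
            # A constant column carries no information; map it to 0.0 (assumption).
            0.0 if highs[j] == lows[j] else (r[j] - lows[j]) / (highs[j] - lows[j])
            for j in range(n_feat)
        ])
    return scaled


def concat_features(*dicts):
    """Concatenate per-residue tables entry by entry along the feature axis."""
    combined = {}
    for entry in dicts[0]:
        tables = [d[entry] for d in dicts]
        n_res = len(tables[0])
        if any(len(t) != n_res for t in tables):
            raise ValueError(f"residue count mismatch for entry {entry!r}")
        combined[entry] = [[v for t in tables for v in t[i]] for i in range(n_res)]
    return combined


# Toy data: one hypothetical protein "P1" with three residues.
raw_embedding = {"P1": [[-2.0, 10.0], [0.0, 20.0], [2.0, 30.0]]}
annotation = {"P1": [[0.0], [1.0], [0.0]]}  # already in [0, 1]

embedding = {k: minmax_per_feature(v) for k, v in raw_embedding.items()}
combined = concat_features(embedding, annotation)
print(combined["P1"])  # [[0.0, 0.0, 0.0], [0.5, 0.5, 1.0], [1.0, 1.0, 0.0]]
```

Each source is normalized on its own before concatenation, so no single source dominates the combined value range.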
Feature Engineering
CPPGrid: Tool-style wrapper (run + eval) that runs a grid sweep of CPP configurations in one call, parallelized across configurations. Configurations that differ only in n_filter are collapsed into a single CPP run, with the remaining configurations served as exact head(n) slices. run also stores list_df_feat_ / df_params_; eval(sort_by=...) scores the configurations (by avg_ABS_AUC by default) and returns them best-first.
CPP.run_num: New numerical-mode method whose per-residue value source is a pre-sliced numerical tensor (dict_num_parts) rather than an amino-acid → scale lookup, enabling embedding / structure / annotation features through the same pipeline and output schema as CPP.run.
CPP.simplify candidate_search='fast': New opt-in heuristic that caps the candidate scales evaluated per feature, for a large speed-up on big scale pools (mainly the greedy strategy). The default candidate_search='exact' is unchanged and reproduces the previous result exactly; 'fast' is statistically equivalent: the kept-feature set may differ but stays within a documented quality band (kept-feature Jaccard ≥ 0.95 and Δavg ABS_AUC ≤ 0.005 vs exact on the canonical data).
SequenceFeature.get_labels_ovr / get_labels_ovo: Convert multi-class labels into binary label sets for CPP, either one-vs-rest (K full-length arrays, all samples kept) or one-vs-one (per class-pair). The row-dropping get_labels_ovo takes the value source (df_parts and/or dict_num_parts) and returns each pair’s row-matched copy ready for CPP.run / CPP.run_num.
SequenceFeature.get_labels_quantile / get_labels_tiered: Discretize a continuous target into binary labels for regression-style CPP, using either a single quantile cut (all samples kept) or a fixed positive set swept against stepwise-lowered negative cuts, with get_labels_tiered returning each tier’s row-matched df_parts / dict_num_parts subset.
SequenceFeature.get_df_parts_from_windows: Assemble a reference df_parts from per-part window sets (e.g. AAWindowSampler.sample_synthetic output), so each sequence part can be generated with its own recipe.
SequenceFeature.get_seq_kws: Return one protein’s {jmd_n_seq, tmd_seq, jmd_c_seq} as a ready-to-splat seq_kws dict (selected by entry or position), with the parts taken from df_parts so the residues stay bound to the feature geometry (no JMD-length argument; df_seq is cross-checked). Removes the manual get_df_parts slicing glue when passing a per-protein sequence to CPPPlot.profile / CPPPlot.feature_map (e.g. sample-level SHAP plots).
SequenceFeature.get_feature_descriptions: Build one standardized, human-readable sentence per PART-SPLIT-SCALE feature id, combining the sequence region, the split (e.g. "segment 2 of 4"), and the AAontology scale name, category, and subcategory. Complements the compact get_feature_names label; the description is additive (the 'feature' id is unchanged) and can be assigned to an optional 'feature_description' df_feat column.
AAclust.select_scales: Convenience wrapper around AAclust.fit that takes an amino acid scales DataFrame (rows = amino acids, columns = scale IDs) and returns the redundancy-reduced subset of its columns (one medoid scale per cluster) directly, collapsing the manual transpose, names bookkeeping, and medoid-name indexing into a single call ready for CPP.
AAclust.select_proteins: Protein-level redundancy reduction over a pre-pooled per-protein feature matrix (X: CPP features, pooled embeddings, or structural / DSSP-derived features). Clusters the proteins and selects one representative (medoid) per cluster, annotating df_seq with cluster / is_representative / dist_to_rep (return_data='annotated' | 'filtered' | 'both'). This is the numerical counterpart to sequence-identity reduction via filter_seq.
AAclustPlot.centers / AAclustPlot.medoids accept df_scales: Both methods gain a df_scales argument (the amino acid scales DataFrame, mutually exclusive with X) that is transposed to the feature matrix internally, so aac_plot.centers(df_scales=df_scales, labels=aac.labels_) replaces the manual centers(np.array(df_scales).T, labels=aac.labels_). This removes the transpose ambiguity: pass scales via df_scales (transposed for you) and proteins / embeddings / CPP features via X (a samples-by-features matrix, used as-is); never transpose manually. The explicit X signature is unchanged.
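The difference between the one-vs-rest and one-vs-one label conversions can be shown with a short plain-Python sketch. This is not the SequenceFeature implementation: the function names, the return layout, and the choice of the first class in each pair as the positive class are assumptions for illustration. It shows why one-vs-one drops rows, and therefore needs the matching rows of the value source, while one-vs-rest keeps every sample.

```python
from itertools import combinations


def labels_one_vs_rest(labels):
    """Return {class: binary list}; every sample is kept in every set."""
    classes = sorted(set(labels))
    return {c: [1 if y == c else 0 for y in labels] for c in classes}


def labels_one_vs_one(labels):
    """Return {(a, b): (row_indices, binary list)}; other classes are dropped."""
    classes = sorted(set(labels))
    out = {}
    for a, b in combinations(classes, 2):
        # Keep only rows of the two classes; these indices select the
        # row-matched subset of any per-sample table.
        rows = [i for i, y in enumerate(labels) if y in (a, b)]
        out[(a, b)] = (rows, [1 if labels[i] == a else 0 for i in rows])
    return out


labels = [0, 1, 2, 1, 0]
ovr = labels_one_vs_rest(labels)
ovo = labels_one_vs_one(labels)
print(ovr[1])       # [0, 1, 0, 1, 0]
print(ovo[(0, 2)])  # ([0, 2, 4], [1, 0, 1])
```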
Explainable AI
ShapModel, accession-based interface (aaanalysis[pro]): fit accepts entry-keyed soft labels via fuzzy_labels={'P05067': 0.6} together with df_seq, overriding the matching entries in labels and enabling fuzzy labeling without a manual row-index lookup or array mutation. add_feat_impact and add_sample_mean_dif additionally accept df_seq and a new samples parameter that takes either row position(s) or entry name(s), resolved to the matching row(s) (column names default to the accession). The array labels + fuzzy_labeling=True path is unchanged; the former sample_positions parameter remains as a deprecated alias for samples (emits a DeprecationWarning; removed in 1.2.0).
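The entry-keyed override replaces the manual lookup that was needed before. A minimal sketch of that lookup in plain Python follows; it is not ShapModel code. The helper name apply_fuzzy_labels is hypothetical, and every entry name other than P05067 is invented for the example.

```python
def apply_fuzzy_labels(entries, labels, fuzzy_labels):
    """Return a copy of labels with entry-keyed soft labels applied."""
    position = {entry: i for i, entry in enumerate(entries)}
    unknown = [e for e in fuzzy_labels if e not in position]
    if unknown:
        raise KeyError(f"entries not found: {unknown}")
    out = [float(y) for y in labels]  # copy, so the input is not mutated
    for entry, value in fuzzy_labels.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"soft label for {entry!r} must lie in [0, 1]")
        out[position[entry]] = float(value)
    return out


entries = ["P05067", "P12345", "Q99999"]  # stand-in for the df_seq entry column
labels = [1, 0, 1]
soft = apply_fuzzy_labels(entries, labels, {"P05067": 0.6})
print(soft)  # [0.6, 0.0, 1.0]
```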
Sequence Analysis
AAWindowSampler: Samples fixed-length sequence windows for PU-learning and hard-negative-mining workflows (sample_same_protein, sample_different_protein, sample_motif_matched, sample_synthetic).
scan_motif (aaanalysis[pro]): Scans candidate proteins for statistically significant PWM occurrences via MEME/FIMO (selection by match p-value against a background model), complementing the pure-Python AAWindowSampler.sample_motif_matched PWM-sum sampler.
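For readers unfamiliar with PWM-sum scoring, the idea is to slide a window over the sequence and add up the position-specific weight of each residue. The sketch below is a generic illustration in plain Python, not the AAWindowSampler.sample_motif_matched implementation; the PWM values, the toy sequence, and the rule that an unlisted residue contributes 0.0 are assumptions.

```python
def pwm_sum_scores(sequence, pwm):
    """Score every window of len(pwm) by summing position-specific weights."""
    width = len(pwm)
    scores = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Residues missing from a PWM column contribute 0.0 (assumption).
        scores.append(sum(pwm[i].get(aa, 0.0) for i, aa in enumerate(window)))
    return scores


# Hypothetical three-column PWM: one dict of residue weights per position.
pwm = [{"A": 1.0, "G": 0.5}, {"L": 2.0}, {"K": 1.5, "R": 1.0}]
seq = "MALKGLR"

scores = pwm_sum_scores(seq, pwm)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores)                        # [0.0, 4.5, 0.0, 0.0, 3.5]
print(best, seq[best:best + len(pwm)])  # 1 ALK
```

A sum-based score ranks windows but carries no significance estimate, which is the gap the p-value-based scan_motif route is described as filling.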
Metrics
comp_per_protein_ap: Per-protein average precision for site-localization ranking, with an optional tolerance=±k variant for positional jitter.
comp_detection_metrics: Recall / precision / F1 / MCC at a fixed score threshold, pooled across per-residue predictions.
comp_bootstrap_ci: Seeded percentile confidence interval over a per-protein metric vector for small-N uncertainty reporting. Returns a dict {'mean', 'ci_low', 'ci_high'}.
comp_smooth_scores: Peak-preserving (max(smoothed, raw)), NaN-aware smoothing of per-residue score tracks.
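The peak-preserving rule max(smoothed, raw) can be sketched in a few lines. The example below assumes a centered moving average as the smoother, which the notes do not specify, and uses a hypothetical function name; it is an illustration of the rule, not the comp_smooth_scores implementation.

```python
import math


def smooth_scores(scores, window=3):
    """Moving-average smoothing that never lowers a raw score.

    NaN positions stay NaN and are ignored when averaging neighbors.
    """
    half = window // 2
    out = []
    for i, raw in enumerate(scores):
        if math.isnan(raw):
            out.append(math.nan)
            continue
        neighbors = [
            s for s in scores[max(0, i - half): i + half + 1] if not math.isnan(s)
        ]
        smoothed = sum(neighbors) / len(neighbors)
        out.append(max(smoothed, raw))  # peak-preserving step
    return out


track = [0.0, 0.0, 0.9, 0.0, math.nan, 0.3]
result = smooth_scores(track)
print(result)
```

The peak at index 2 keeps its raw value of 0.9, its neighbors are lifted by the average, and the NaN position passes through untouched.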
Plotting
plot_rank: Standalone per-protein max-score-vs-rank scatter with group coloring and optional threshold lines (pairs with the new aa.metrics functions).
Package
aa.__version__: The installed package version is now exposed as a top-level attribute via importlib.metadata.
CHANGELOG.md + deprecation policy: A root CHANGELOG.md (Keep a Changelog format) now gives a terse, developer-facing index alongside these narrative notes. The project adopts strict semantic versioning: from v1.x onward, any rename or removal of a public symbol ships at least one minor release carrying a DeprecationWarning first. A deprecated(reason, version_removed) decorator helper (internal, aaanalysis.utils) marks such symbols and prepends a deprecation note to their docstring. See the Versioning and Deprecation Policy in CONTRIBUTING.rst.
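A decorator with the deprecated(reason, version_removed) signature could look roughly like the sketch below. This shows the general pattern only. The internal helper in aaanalysis.utils may differ, and the note wording and the old_name / new_name functions are hypothetical.

```python
import functools
import warnings


def deprecated(reason, version_removed):
    """Mark a callable as deprecated (illustrative sketch only)."""
    def decorator(func):
        note = f"Deprecated: {reason} Scheduled for removal in {version_removed}."

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not at this wrapper.
            warnings.warn(note, DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)

        # Prepend the deprecation note to the wrapped docstring.
        wrapper.__doc__ = f"{note}\n\n{func.__doc__ or ''}"
        return wrapper
    return decorator


@deprecated(reason="Use new_name instead.", version_removed="1.2.0")
def old_name(x):
    """Return x doubled."""
    return 2 * x


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_name(21)

print(value, caught[0].category.__name__)  # 42 DeprecationWarning
```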
Documentation
Prediction tasks concept-overview page (Usage Principles): maps a biological question to the right AAanalysis workflow via a task table keyed on unit of comparison and reference construction (not biological scale alone), covering the residue / domain / protein levels plus the determinant-discovery, design/engineering, and relational-boundary rows. The front door to the Protocols catalog; taxonomy recorded in ADR-0022.
A minimal CPP analysis tutorial (tutorial0_minimal): the shortest end-to-end loop (load a dataset, run CPP, read out the signature), paired with the new concept page.
Changed
CPP performance work: The Cython feature-matrix kernel, macOS-safe threaded n_jobs, scale / AA-index caching, and scale / sample batching land in this release, replacing the hour-long, low-CPU CPP runs seen on 1.0.3 and earlier. Users on ≤1.0.3 should upgrade rather than debug a performance pathology that is already fixed.
CPP Cython-fallback notice: When the compiled extension is missing and CPP falls back to the ~2× slower pure-Python kernel, the one-time notice is now a UserWarning instead of an easily-missed INFO print, so it surfaces even with aa.options['verbose'] = False.
SequenceFeature.feature_matrix: New batch= parameter accepts a list of df_parts and builds them in a single Cython pass, returning a list of feature matrices; this is faster than per-call construction for many small part tables.
SequenceFeature.get_df_parts / NumericalFeature.get_parts: New pos-anchor input mode (tmd_len=) explodes each 1-based anchor in the pos column into one three-part (jmd_n / tmd / jmd_c) row, identified by entry_win.
SequenceFeature.get_df_parts: Several-fold faster on large inputs: the per-row DataFrame.apply driver was replaced with a vectorized iteration over the raw column arrays. The output (parts, column order, index, values) is unchanged.
CPP / feature-engineering same-output speedups: Three byte-identical optimizations. SequenceFeature.prune_by_correlation / NumericalFeature.filter_correlation vectorize the inner correlation-triangle comparison while preserving the greedy, order-dependent skip (the selected mask is unchanged). CPP.simplify’s redundancy reduction replaces a per-pair double pandas lookup into the scale-correlation table with a numpy view built once, keeping the sequential greedy tie-break (kept set and order unchanged). The greedy swap loop drops a per-candidate full-matrix copy in favor of a single mutated-column save/restore (memory only; scored matrix and selected set unchanged). CPP.simplify’s per-feature candidate ranking (_eligible_candidates_) replaces the Python scan over every pool scale with a numpy filter and a stable lexsort rank, and hoists the per-scale interpretability array once across the whole call; the ranked candidate list (values, tie order, and float dtype) is byte-identical.
n_jobs: Unified parallelism convention across CPP / CPPGrid (1 serial, -1 all cores, N>1 exactly N, None optimized), with an options['n_jobs'] global override.
CPPPlot.feature: Now titles the plot with the feature’s human-readable description (from SequenceFeature.get_feature_descriptions), line-wrapped via the new show_title (default True) and title_wrap_width (default 45) parameters. A subsequent plt.title(...) still overrides it; feature_map and ranking are unchanged.
Docstring discoverability: Surfaced previously implicit API contracts at the docstrings users actually read (no behavior change). CPP.run_num / NumericalFeature.get_parts now state the get_parts → run_num call order and the [0, 1] normalization contract (and what breaks if unnormalized); the [pro] classes / functions (ShapModel, StructurePreprocessor, AnnotationPreprocessor, comp_seq_sim, filter_seq, scan_motif) carry a [pro] install marker in their summary; and SeqMut cross-links the canonical df_seq format spec (SequenceFeature.get_df_parts).
Performance (same output): Several internal hotspots were vectorized or parallelized without changing results. AAWindowSampler redundancy / similarity filtering now compares amino-acid windows with vectorized NumPy operations (identical keep/drop decisions; ~30x faster at scale), AAclust sample-to-medoid correlation distances are computed in one pass, and the per-feature Kullback-Leibler divergence (used by dPULearn.eval with comp_kld=True) is parallelized over features and honors options['n_jobs']. Public APIs and outputs are unchanged. A further pass vectorizes AAWindowSampler window-sampling internals: candidate-center band filtering (~40x faster at scale) and per-window PWM scoring in sample_motif_matched (~12x), again with identical results. SequencePreprocessor.encode_one_hot is also vectorized (~3x), with a byte-identical feature matrix. StructurePreprocessor.encode_pdb CA-CA contact counts (contact_count_8A / contact_count_12A) use a vectorized per-residue distance computation (~50x at scale) with byte-identical counts. encode_pdb additionally caches the per-(target, atom) global sequence alignment that its encoders otherwise re-run ~26 times per entry (chain pick plus each per-feature value mapping); the first optimal alignment is deterministic, so cached and recomputed encoder output are byte-identical (~12x off the repeated-alignment overhead). Three further StructurePreprocessor hotspots are sped up with identical output: encode_pdb’s disulfide encoder replaces its O(n^2) SG-SG double loop with a vectorized pairwise-distance computation (same 2.5 Å inclusive boundary, equidistant-tie handling and nearest-partner pick; 16-48x); the three pLDDT encoders (plddt / plddt_disorder / plddt_tier) now share one per-residue pLDDT read + alignment per entry instead of each re-walking the structure (~2.2x); and get_dssp reuses a single session-scoped PairwiseAligner across all entries (and across the chain-pick, mismatch-count and feature-align steps) rather than constructing three fresh aligners per entry. Identity fractions and alignments are byte-identical in every case. Completing this sharing, encode_pdb now collects the structure’s amino-acid chains and picks the best-matching chain once per entry and threads that result through every requested encoder, rather than re-walking the structure (and rebuilding each chain’s atom sequence) once per feature, so up to ~13 redundant walks collapse to one. The pick is a deterministic function of the structure and target sequence, so the shared and per-encoder results are byte-identical. A further “Batch 6” pass replaces two more hotspots in place with byte-identical implementations: AAMut.comp_substitution_impact now accumulates the per-(from, to) delta columns and builds a single DataFrame instead of concatenating hundreds of one-pair frames (byte-identical table, ~19x); and SequencePreprocessor.get_sliding_aa_window inlines a strided window slice rather than re-padding the sequence string on every position (byte-identical window list, ~1.8x).
Performance benchmark + regression guard (developer tooling): A committed pytest-benchmark suite (tests/benchmarks/) micro-benchmarks the hot public entry points (CPP.run / CPP.run_num, AAclust.fit, SequenceFeature.feature_matrix, AAWindowSampler sampling, dPULearn.fit, TreeModel.fit, and StructurePreprocessor.encode_pdb) on small bundled fixtures. A baseline-comparison helper (.github/scripts/check_perf_regression.py) flags any path slower than a generous 1.5x threshold, wired as a non-gating nightly job (perf_nightly.yml). Opt-in via the new [bench] install extra; it never touches the blocking matrix. No effect on the public API.
Pooled, optionally concurrent web fetches: StructurePreprocessor.fetch_alphafold and AnnotationPreprocessor.fetch_uniprot now route every request through a pooled requests.Session (one per worker thread) rather than opening a fresh connection per request, and accept a new max_workers parameter for threaded bulk fetching. Concurrency is off by default (max_workers=None or 1 keeps the unchanged sequential path) because parallel requests to AlphaFold DB / UniProt risk HTTP-429 throttling; when enabled, results are reassembled in input order, so the returned status table / df_annot and the on-disk files are byte-identical regardless of worker count.
dPULearn.fit: Flexible, package-consistent label handling via label_pos / label_unl / label_neg markers. Pass standard {0, 1} labels directly with label_unl=0 (0 = unlabeled, 1 = positive), or any positive / unlabeled / negative encoding. Only unlabeled samples are candidates: pre-labeled negatives (label_neg) are kept and never re-selected. The negative count is specified one of two ways (exactly one): the new n_neg (the total number of negatives wanted, so dPULearn identifies n_neg minus the pre-labeled negatives), or the existing n_unl_to_neg (the number identified directly from the unlabeled pool, for direct control). Output labels always use the package convention (1 = positive, 0 = negative, 2 = unlabeled); the recommended input encoding is unchanged.
Numerical-equivalence tolerance policy (developer-facing): A new policy (ADR-0032, summarized in CONTRIBUTING.rst) defines three tiers of acceptable output change for performance optimizations: T1 byte-identical (default), T2 numerically-equivalent (allclose(atol=1e-10, rtol=0) plus identical discrete decisions), and T3 statistically-equivalent (documented quality metric within an agreed band), together with the evidence + pinned regression anchor each tier requires. It unblocks previously-excluded algorithmic optimizations (e.g. AAclust binary-search k), each landing as its own tier-declared PR. No user-facing behavior changes in this release.
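The T1 and T2 checks of the tolerance policy can be sketched as a small classifier. The code below is an illustration under stated assumptions, not the project's regression tooling: classify_tier is a hypothetical name, exact list equality stands in for a byte-level comparison, and math.isclose with rel_tol=0 is used as a pure-Python equivalent of an allclose(atol=1e-10, rtol=0) test. T3 is left out because it depends on a documented quality metric.

```python
import math


def classify_tier(reference, candidate, decisions_ref, decisions_new, atol=1e-10):
    """Return 'T1', 'T2', or None for a pair of numeric outputs."""
    if len(reference) != len(candidate):
        return None
    same_decisions = decisions_ref == decisions_new
    if reference == candidate and same_decisions:
        return "T1"  # identical values (stand-in for a byte-level comparison)
    close = all(
        math.isclose(r, c, rel_tol=0.0, abs_tol=atol)
        for r, c in zip(reference, candidate)
    )
    if close and same_decisions:
        return "T2"  # numerically equivalent, discrete decisions unchanged
    return None  # would need T3 evidence, or is a regression


ref = [0.25, 0.5]
flags = [True, False]  # e.g. which features were kept
tier_same = classify_tier(ref, [0.25, 0.5], flags, flags)
tier_close = classify_tier(ref, [0.25 + 1e-12, 0.5], flags, flags)
tier_far = classify_tier(ref, [0.26, 0.5], flags, flags)
tier_flip = classify_tier(ref, [0.25 + 1e-12, 0.5], flags, [False, False])
print(tier_same, tier_close, tier_far, tier_flip)  # T1 T2 None None
```

The last case shows why T2 requires identical discrete decisions: values within tolerance can still flip a selection.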
Version 1.0 (Stable Version)
v1.0.3 (2026-04-06)
Added
AAlogo: New class for amino acid logo visualization.
AAlogoPlot: New plotting class for AAlogo visualizations.
Changed
Python Support: Dropped Python 3.9 (end-of-life) and added Python 3.13 and 3.14 support. Supported versions are now 3.10, 3.11, 3.12, 3.13, and 3.14.
Dependency Management: Migrated from requirements.txt files to a single pyproject.toml as the source of truth for all dependencies. Introduced structured dependency extras: aaanalysis[pro], aaanalysis[docs], and aaanalysis[dev].
Package Manager: Added full uv support alongside existing pip and Poetry compatibility.
CI/CD: Updated all GitHub Actions workflows to reflect the new Python version matrix and consolidated dependency installation via extras.
Other
Documentation: Updated the ReadTheDocs configuration to install dependencies directly from pyproject.toml via the aaanalysis[docs] extra.
Cleanup: Removed legacy requirements.txt, docs/requirements_dev.txt, and docs/requirements_wo_pro.txt files.
v1.0.2 (2025-06-17)
Improved
Faster CPP Pipeline: Major performance boost in CPP.run() through optimized generation and filtering of part-split-scale combinations. Depending on the number of scales, CPP.run() is now 3–5× faster on standard hardware.
Feature Map Enhancement: CPP.feature_map() now includes a top bar plot showing cumulative feature importance per residue, improving interpretability. This visualization is also included in the CPP profile output.
Fixed
StructurePreprocessor.fetch_alphafold: Resolve download URLs through the AlphaFold API instead of a hardcoded file version. AlphaFold DB renamed its files v4→v6, which had silently broken every fetch (all entries returned alphafold_ok=False); the fetch now tracks the current version automatically. Added a network-marked live test (tests/integration/) so an upstream API/version change is caught instead of slipping past the mocked unit tests.
General Bug Fixes: Minor fixes related to dependency resolution and edge-case behavior.
Documentation: Removed inconsistencies in documentation for selected functions and plotting options.
Other
Branding: Introduced updated logo and favicon (legacy version preserved under docs/source/_artwork/logos/legacy/).
Landing Page Visual: Added a main conceptual sketch to the documentation landing page illustrating the core CPP idea — comparing two sequence sets to derive their critical difference, the physicochemical signature.
v1.0.1 (2025-01-29)
Improved
Pro Feature Accessibility: Improved integration of aaanalysis[pro] features in IDEs. Clicking on a pro feature now directs users to its exact class implementation instead of the main __init__.py file.
Import Error Handling: Improved error handling for missing dependencies in the aaanalysis[pro] version. If dependencies are installed but errors occur during import, users now receive the original import error messages.
Fixed
Feature Map Plot: Resolved a potential mismatch in subcategory ordering between heatmap and bar plot in aa.cpp_plot().featuremap(). Previously, subcategories with nearly identical names (e.g., “α-helix (C-term)” and “α-helix (C-term, out)”) could appear in an inconsistent order.
General Bug Fixes: Minor bug fixes to improve overall stability and functionality.
Other
Dependencies: All dependencies have been updated to ensure compatibility with the latest versions, including full support for numpy>=2.0.0.
v1.0.0 (2024-07-01)
Added
SequencePreprocessor: A utility data preprocessing class (data handling module).
comp_seq_sim: A function for computing pairwise sequence similarity (data handling module).
filter_seq: A function for redundancy-reduction of sequences (data handling module).
options: Juxta Middle Domain (JMD) length can now be globally adjusted using the jmd_n/c_len options.
Changed
ShapModel: The ShapExplainer class has been renamed to ShapModel for consistency with the TreeModel class and to avoid confusion with the ShapExplainer models from the SHAP package.
Dependencies: Biopython is now a required dependency only for the aaanalysis[pro] version.
Module Renaming: The Perturbation module has been renamed to Protein Design module to better reflect its broad functionality.
Fixed
Multiprocessing: Now supported directly at the script level, outside of any functions or classes, in the top-level of the script (global namespace).
Version 0.1 (Beta Version)
v0.1.5 (2024-04-18)
Added
Code of Conduct: Introduced a Code of Conduct to foster a welcoming and inclusive community environment. We encourage all contributors to review the Code of Conduct to understand the expectations and responsibilities when participating in the project.
Changed
License Update: Transitioned the project license from MIT to BSD-3-Clause to better align with our project’s community engagement and protection goals. This change affects how the software can be used and redistributed.
Fixed
Multiprocessing: Replaced native multiprocessing with the joblib module for CPP and internal feature matrix creation. This change prevents a RuntimeError that occurred when the main function was not explicitly used.
Other
Dependencies: Updated the seaborn dependency to version 0.13.2 or higher to resolve the legend argument error present in versions earlier than 0.13.
v0.1.4 (2024-04-09)
Added
Installation Options: Introduced separate installation profiles for the core and professional versions. The core version has reduced dependencies to enhance installation robustness, installable using pip install aaanalysis. The professional version, designed for advanced usage, includes packages required for our explainable AI module such as SHAP, installable using pip install aaanalysis[pro].
Changed
API Improvements: General improvement of API for consistency and higher user-friendliness.
Fixed
General Issues: Fixed various API issues related to check functions.
Other
Python Dependency: Updated the Python version compatibility from <= 3.10 to <= 3.12.
v0.1.3 (2024-02-09)
Added
TreeModel: Wrapper class of tree-based models for Monte Carlo estimates of predictions and feature importance. See TreeModel.
ShapExplainer: A wrapper for SHAP (SHapley Additive exPlanations) explainers to obtain Monte Carlo estimates for feature impact. See ShapExplainer.
NumericalFeature: Utility feature engineering class to process and filter numerical data structures. See NumericalFeature.
load_features: Utility function to load feature sets for protein benchmarking datasets. See load_features.
Changed
API Improvements: General improvement of API for consistency and higher user-friendliness.
Fixed
Interface: Replaced the internal documentation decorator with hard-coded documentation for better IDE responsiveness.
General Issues: Fixed various API issues related to check functions.
v0.1.2 (2023-11-06)
Added
CPPPlot: Plotting class for CPP features. See CPPPlot.
dPULearnPlot: Plotting class for results of negative identifications by dPULearn. See dPULearnPlot.
AAclustPlot: Plotting class for AAclust clustering results. See AAclustPlot.
Options: Set system-level settings by a dictionary-like interface (similar to pandas). See options.
Plotting functions: Extension of plotting utility functions.
Changed
API Improvements: General improvement of API.
Fixed
API Improvements: General improvement of API (Application Programming Interface).
Other
Python Dependency: Supports Python versions 3.9 and 3.10.
v0.1.1 (2023-09-11)
Test release of the first beta version.
v0.1.0 (2023-09-11)
First release of the beta version including CPP, dPULearn, and AAclust algorithms as well as the SequenceFeature utility class and data loading functions load_dataset and load_scales.