API

This Application Programming Interface (API) is the public interface for the objects and functions of our AAanalysis Python toolkit, which can be imported with:

import aaanalysis as aa

You can then access all methods and objects via the aa alias, such as aa.load_dataset.

Data Handling

load_dataset([name, n, random, ...])

Load protein benchmarking datasets.

load_scales([name, just_aaindex, ...])

Load amino acid scales or their classification (AAontology).

load_features([name])

Load feature sets for protein benchmarking datasets.

read_fasta(file_path[, col_id, col_seq, ...])

Read a FASTA file into a DataFrame.

to_fasta([df_seq, file_path, col_id, ...])

Write a sequence DataFrame to a FASTA file.
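
To illustrate the kind of round trip that read_fasta and to_fasta describe, here is a minimal standard-library sketch. It is not the AAanalysis implementation: the helper names parse_fasta and format_fasta, the record keys "entry" and "sequence", and the 60-character line width are assumptions chosen for illustration, and plain dictionaries stand in for DataFrame rows.

```python
from io import StringIO


def parse_fasta(handle):
    """Parse FASTA text into a list of {"entry": ..., "sequence": ...} records.

    Hypothetical helper for illustration; key names are assumptions.
    """
    records, entry, chunks = [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if entry is not None:
                records.append({"entry": entry, "sequence": "".join(chunks)})
            header = line[1:].split()
            entry = header[0] if header else ""  # keep only the identifier
            chunks = []
        elif entry is not None:
            chunks.append(line)  # sequence lines may be wrapped
    if entry is not None:
        records.append({"entry": entry, "sequence": "".join(chunks)})
    return records


def format_fasta(records, width=60):
    """Serialize records back to FASTA text, wrapping sequences at `width`."""
    lines = []
    for rec in records:
        lines.append(">" + rec["entry"])
        seq = rec["sequence"]
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


fasta_text = ">P1 example protein\nMKTAYIAK\nQRQISFVK\n>P2\nGAVLI\n"
records = parse_fasta(StringIO(fasta_text))
round_trip = parse_fasta(StringIO(format_fasta(records)))
```

Each record corresponds to one row of a sequence table; only the identifier (the first token of the header) is kept here, so any free-text description after it is dropped on the round trip.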

SequencePreprocessor()

Preprocessing class for representing protein sequences as numeric inputs [Breimann25].

StructurePreprocessor([verbose])

Preprocessing class ([pro], requires aaanalysis[pro]) for protein structure features (PDB / CIF / AlphaFold).

EmbeddingPreprocessor([verbose])

Preprocessing class for protein language model (PLM) embeddings.

AnnotationPreprocessor([verbose])

Preprocessing class ([pro], requires aaanalysis[pro]) for per-residue post-translational modification (PTM) / functional-site annotations.

combine_dict_nums([dict_nums])

Concatenate multiple per-residue dict_num inputs along the D axis.

Sequence Analysis

AAlogo([logo_type])

Amino Acid logo (AAlogo) class for computing sequence logo matrices and conservation scores.
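
As a rough illustration of what a logo matrix and a conservation score can look like, the sketch below computes per-position residue frequencies and the information content in bits for a set of equal-length sequences. It does not use AAlogo; the function names and the choice of information content (log2 of the alphabet size minus the Shannon entropy) as the conservation measure are assumptions.

```python
import math
from collections import Counter


def position_frequencies(seqs):
    """Return one {residue: frequency} dict per position (hypothetical helper)."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [
        {aa: count / len(seqs) for aa, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


def information_content(freqs, alphabet_size=20):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(alphabet_size) - entropy


aligned = ["ACDE", "ACDF", "ACEF", "AGEF"]  # toy example data
freq_matrix = position_frequencies(aligned)
conservation = [information_content(column) for column in freq_matrix]
```

A fully conserved position reaches the maximum of log2(20) bits, while a position split evenly between two residues loses exactly one bit.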

AAlogoPlot([logo_type, jmd_n_len, ...])

Amino Acid logo Plot (AAlogoPlot) class for visualizing sequence logos.

AAWindowSampler([verbose, random_state, ...])

Utility class for sampling amino-acid windows / segments from full protein sequences.
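
The idea of cutting a window out of a full protein sequence can be sketched as follows. This is not the AAWindowSampler API: extract_window, its 1-based pos argument, the flank size, and the "-" padding character are all assumptions made for illustration.

```python
def extract_window(seq, pos, flank=3, pad="-"):
    """Return the residue at 1-based `pos` with `flank` residues on each side.

    Hypothetical helper; windows that run past a terminus are padded with
    `pad` so that every window has length 2 * flank + 1.
    """
    if not 1 <= pos <= len(seq):
        raise ValueError("pos must be within the sequence")
    idx = pos - 1
    left = seq[max(0, idx - flank):idx]
    right = seq[idx + 1:idx + 1 + flank]
    return (
        pad * (flank - len(left)) + left + seq[idx]
        + right + pad * (flank - len(right))
    )


seq = "MKTAYIAKQR"  # toy sequence
windows = [extract_window(seq, pos) for pos in (2, 5, 10)]
```

Padding keeps all windows the same length, which is convenient when windows are later stacked into a fixed-width numeric representation.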

comp_seq_sim([seq1, seq2, df_seq])

Compute pairwise similarity between two or more sequences ([pro], requires aaanalysis[pro]).

filter_seq([df_seq, method, ...])

Redundancy reduction of sequences using clustering-based algorithms ([pro], requires aaanalysis[pro]).

scan_motif([df_seq, pos_col, n, ...])

Scan candidate proteins for statistically significant Position Weight Matrix (PWM) occurrences using FIMO ([pro], requires aaanalysis[pro]).

Feature Engineering

AAclust([model_class, model_kwargs, ...])

Amino Acid clustering (AAclust) class: A k-optimized clustering wrapper for selecting redundancy-reduced sets of numerical scales [Breimann24a].

AAclustPlot([model_class, model_kwargs, ...])

Plotting class for AAclust (Amino Acid clustering) results, providing dimensionality-reduction scatter plots, correlation heatmaps, and clustering evaluation charts [Breimann24a].

SequenceFeature([verbose])

Utility feature engineering class using sequences to create CPP feature components (Parts, Splits, and Scales) and data structures [Breimann25].

NumericalFeature()

Utility feature engineering class to process and filter numerical data structures, such as amino acid scales or a feature matrix.

CPP([df_parts, split_kws, df_scales, ...])

Comparative Physicochemical Profiling (CPP) class to create and filter features that are most discriminant between two sets of sequences [Breimann25].

CPPGrid([df_seq, labels, dict_num, ...])

Grid-style sweep over Comparative Physicochemical Profiling (CPP) configurations (Tool) [Breimann25].

CPPPlot([df_scales, df_cat, jmd_n_len, ...])

Plotting class for CPP (Comparative Physicochemical Profiling) results [Breimann25].

PU Learning

dPULearn([model_kwargs, verbose, random_state])

Deterministic Positive-Unlabeled Learning (dPULearn) class for identifying reliable negatives from unlabeled data [Breimann25].

dPULearnPlot()

Plotting class for dPULearn (deterministic Positive-Unlabeled Learning) results [Breimann25].

Explainable AI

TreeModel([list_model_classes, ...])

Tree Model class: A wrapper for tree-based models to obtain Monte Carlo estimates of feature importance and predictions [Breimann25].

ShapModel([explainer_class, ...])

SHAP Model class ([pro], requires aaanalysis[pro]): A wrapper for SHAP (SHapley Additive exPlanations) [Lundberg20] explainers to obtain Monte Carlo estimates for feature impact [Breimann25].

Protein Design

AAMut([verbose, df_scales])

Amino Acid Mutator (AAMut) class for analyzing the physicochemical impact of amino acid substitutions on property scales [Breimann24a].

AAMutPlot([verbose, df_scales])

Plotting class for AAMut (Amino Acid Mutator) results [Breimann24a].

SeqMut([verbose, df_scales])

Sequence Mutator (SeqMut) class for CPP-guided sequence mutation and ΔCPP analysis [Breimann24a].

SeqMutPlot([verbose])

Plotting class for SeqMut (Sequence Mutator) results [Breimann24a].

Utility Functions

comp_auc_adjusted([X, labels, label_test, ...])

Compute an adjusted Area Under the Curve (AUC), with values in [-0.5, 0.5], for assessing the similarity between two groups.
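
The stated range [-0.5, 0.5] is consistent with shifting a standard AUC by 0.5 so that zero means the two groups are indistinguishable. The sketch below assumes exactly that adjustment and computes the AUC as the probability that a test value exceeds a reference value (ties count as one half); whether comp_auc_adjusted is implemented this way is not stated here, and auc_adjusted is a hypothetical name.

```python
def auc_adjusted(test_values, ref_values):
    """AUC minus 0.5 for one feature (assumed adjustment, for illustration).

    The AUC is computed as the fraction of (test, reference) pairs in which
    the test value is larger, counting ties as half a win.
    """
    wins = 0.0
    for t in test_values:
        for r in ref_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(test_values) * len(ref_values)) - 0.5


higher = auc_adjusted([3, 4, 5], [0, 1, 2])  # test group always larger
lower = auc_adjusted([0, 1, 2], [3, 4, 5])   # test group always smaller
same = auc_adjusted([1, 2], [1, 2])          # identical groups
```

Under this convention the sign indicates the direction of the difference and the magnitude its strength.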

comp_bic_score([X, labels])

Compute an adjusted Bayesian Information Criterion (BIC), with values in (-∞, ∞), for assessing clustering quality.

comp_bootstrap_ci([values, n_rounds, ci, seed])

Compute a percentile bootstrap Confidence Interval (CI) of the mean.
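
A percentile bootstrap CI of the mean can be sketched with the standard library alone. The parameter names mirror the signature listed above, but the defaults, the interpretation of ci as a percentage, and the nearest-rank percentile rule are assumptions; this is not the comp_bootstrap_ci implementation.

```python
import random
import statistics


def bootstrap_ci_mean(values, n_rounds=1000, ci=95, seed=0):
    """Percentile bootstrap CI of the mean (illustrative sketch).

    Assumes `ci` is given in percent; defaults are hypothetical.
    """
    rng = random.Random(seed)  # local generator for reproducibility
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_rounds)
    )
    alpha = (100 - ci) / 200  # tail probability on each side
    lower = means[int(alpha * (n_rounds - 1))]
    upper = means[int(round((1 - alpha) * (n_rounds - 1)))]
    return lower, upper


data = [2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 3.0, 2.2]  # toy example data
ci_low, ci_high = bootstrap_ci_mean(data)
```

Fixing the seed makes the interval reproducible, which matters when the CI is reported alongside other deterministic results.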

comp_detection_metrics([list_scores, ...])

Compute pooled detection metrics at a fixed score threshold.

comp_kld([X, labels, label_test, label_ref])

Compute the Kullback-Leibler Divergence (KLD), with values in [0, ∞), for assessing the similarity between two groups.
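
For reference, the definition of the KLD for two discrete distributions is sketched below. How comp_kld estimates the distributions of the test and reference groups from X is not described here, so this only illustrates the quantity itself and its [0, ∞) range; kl_divergence is a hypothetical helper.

```python
import math


def kl_divergence(p, q):
    """KL divergence D(p || q) in nats for two discrete distributions.

    Illustrative only; terms with p_i == 0 contribute nothing, and
    q_i == 0 with p_i > 0 gives an infinite divergence.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log(p_i / q_i)
    return total


identical = kl_divergence([0.5, 0.5], [0.5, 0.5])
peaked = kl_divergence([1.0, 0.0], [0.5, 0.5])
unbounded = kl_divergence([0.5, 0.5], [1.0, 0.0])
```

The divergence is zero only for identical distributions and is not symmetric in its two arguments.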

comp_per_protein_ap([list_scores, ...])

Compute per-protein average precision (AP) for windowed site prediction.
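
Per-protein average precision can be sketched as computing AP separately for each protein's scored positions. The inputs below (one list of scores and one list of 0/1 labels per protein) are assumptions for illustration, as is the choice to return NaN for proteins without positives; this is not the comp_per_protein_ap implementation.

```python
def average_precision(scores, labels):
    """Average precision for one protein (illustrative sketch).

    Ranks positions by descending score and averages the precision
    observed at each true positive.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        return float("nan")  # AP is undefined without positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


# Toy data: two proteins, each with per-window scores and 0/1 site labels.
list_scores = [[0.9, 0.8, 0.1], [0.2, 0.7, 0.4]]
list_labels = [[1, 0, 1], [0, 1, 0]]
ap_per_protein = [
    average_precision(s, y) for s, y in zip(list_scores, list_labels)
]
```

Evaluating each protein separately prevents long proteins with many windows from dominating the overall score.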

comp_smooth_scores([scores, method, window, ...])

Smooth a per-residue score vector with a NaN-aware, peak-preserving kernel.
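
The NaN-aware part of such smoothing can be sketched as a moving average that skips missing values inside each window. This sketch uses a plain mean and makes no attempt at peak preservation, and the function name and odd-window requirement are assumptions; it is not the comp_smooth_scores kernel.

```python
import math


def smooth_nan_aware(scores, window=3):
    """Centered moving average that ignores NaN values (illustrative).

    Positions whose whole window is NaN stay NaN; windows are truncated
    at the sequence ends.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        neighbors = [
            s for s in scores[max(0, i - half):i + half + 1]
            if not math.isnan(s)
        ]
        smoothed.append(
            sum(neighbors) / len(neighbors) if neighbors else math.nan
        )
    return smoothed


raw = [1.0, math.nan, 3.0, 5.0]  # toy per-residue scores with a gap
smoothed = smooth_nan_aware(raw, window=3)
```

Averaging only over the available neighbors fills isolated gaps without letting NaN values propagate along the sequence.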

display_df([df, max_width_pct, max_height, ...])

Display a DataFrame with a specific style as HTML output in Jupyter notebooks.

options

A class for managing system-level settings for AAanalysis.

plot_gcfs([option])

Get the current font size (or axes linewidth).

plot_get_cdict([name])

Get color dictionaries specified for AAanalysis.

plot_get_clist([n_colors])

Get a manually curated list of 2 to 9 colors, or a 'husl' palette for more than 9 colors.

plot_get_cmap([name, n_colors, facecolor_dark])

Get colormaps specified for AAanalysis.

plot_legend([ax, dict_color, list_cat, ...])

Set an independently customizable plot legend.

plot_rank([df_rank, col_score, col_group, ...])

Plot a per-protein rank scatter: max-score-per-protein sorted by score, colored by group.

plot_settings([font_scale, font, ...])

Configure general plot settings.