aaanalysis.EmbeddingPreprocessor.build_scales

EmbeddingPreprocessor.build_scales(df_seq=None, dict_num=None, return_std=False)

Build pseudo-scales by context-free averaging of per-residue embeddings.

For each canonical amino acid a and each embedding dimension d, the pseudo-scale entry is the mean of dict_num[entry][i, d] over all (entry, i) pairs in df_seq whose sequence has residue a at position i. Non-canonical residues are skipped; amino acids absent from the corpus get NaN rows.
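The averaging rule above can be sketched in plain NumPy and pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name sketch_build_scales and the toy data are hypothetical, and the column names entry and sequence follow the df_seq description below.

```python
import numpy as np
import pandas as pd

CANONICAL_AAS = "ACDEFGHIKLMNPQRSTVWY"

def sketch_build_scales(df_seq, dict_num):
    """Hypothetical helper: context-free per-AA mean of per-residue embeddings."""
    n_dims = next(iter(dict_num.values())).shape[1]
    sums = {res: np.zeros(n_dims) for res in CANONICAL_AAS}
    counts = {res: 0 for res in CANONICAL_AAS}
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        emb = dict_num[entry]
        for i, res in enumerate(seq):
            if res in sums:  # non-canonical residues (e.g. "X") are skipped
                sums[res] += emb[i]
                counts[res] += 1
    # Amino acids that never occur get a NaN row.
    rows = [sums[res] / counts[res] if counts[res] else np.full(n_dims, np.nan)
            for res in CANONICAL_AAS]
    columns = [f"dim_{d}" for d in range(n_dims)]
    return pd.DataFrame(rows, index=list(CANONICAL_AAS), columns=columns)

# Toy corpus (assumed data): two short sequences with 2-dimensional embeddings.
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACA", "CXA"]})
dict_num = {
    "P1": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    "P2": np.array([[7.0, 10.0], [0.0, 0.0], [9.0, 10.0]]),
}
df_demo = sketch_build_scales(df_seq, dict_num)
print(df_demo.loc[["A", "C", "D"]])
```

In this toy corpus, A is averaged over three positions, C over two, the X residue contributes nothing, and D (absent) yields a NaN row.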

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts over which embedding dimensions are averaged.

  • dict_num (dict[str, np.ndarray]) – Mapping from entry to a per-residue embedding array of shape (L, D) where L is the protein length and D is the embedding dimensionality. Every entry in df_seq must be a key; all arrays must share the same D. Same shape contract as the dict_num consumed by CPP.run_num().

  • return_std (bool, default=False) – If True, also return per-AA population standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.
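The shape contract between df_seq and dict_num can be checked before calling build_scales. The sketch below is a hypothetical pre-flight helper (check_dict_num is not part of aaanalysis); it assumes only what the parameter descriptions state: every entry is a key, each array has one row per residue, and all arrays share the same D.

```python
import numpy as np
import pandas as pd

def check_dict_num(df_seq, dict_num):
    """Hypothetical check of the dict_num contract; returns the shared D."""
    dims = set()
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            raise KeyError(f"entry {entry!r} is missing from dict_num")
        emb = np.asarray(dict_num[entry])
        # Each array must be 2-D with one row per residue: shape (L, D).
        if emb.ndim != 2 or emb.shape[0] != len(seq):
            raise ValueError(
                f"{entry!r}: expected shape ({len(seq)}, D), got {emb.shape}")
        dims.add(emb.shape[1])
    if len(dims) != 1:
        raise ValueError(f"embedding dimensionality differs: {sorted(dims)}")
    return dims.pop()

# Toy inputs (assumed data) that satisfy the contract with D = 8.
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACDE", "GHK"]})
dict_num = {"P1": np.zeros((4, 8)), "P2": np.zeros((3, 8))}
n_dims = check_dict_num(df_seq, dict_num)
print(n_dims)
```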

Returns:

  • df_scales (pd.DataFrame, shape (20, D)) – Pseudo-scale DataFrame. Rows are the 20 canonical amino acids in alphabetical order (ACDEFGHIKLMNPQRSTVWY); columns are dimension labels (dim_0, dim_1, …, dim_{D-1}). Cells are context-free per-AA means of embedding values. Drop-in for the df_scales argument of CPP.__init__().

  • df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations, returned only when return_std=True. Same index and columns as df_scales.
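The population standard deviation mentioned for return_std corresponds to NumPy's default ddof=0. The following sketch uses hypothetical toy rows for a single amino acid to show why one occurrence gives std=0 under that convention; it illustrates the statistic only and makes no claim about how the library computes it internally.

```python
import numpy as np

# Hypothetical embedding rows collected for one amino acid (D = 2).
rows_many = np.array([[1.0, 2.0], [3.0, 6.0]])  # two occurrences
rows_once = np.array([[4.0, 5.0]])              # a single occurrence

# np.std uses ddof=0 by default, i.e. the population standard deviation.
std_many = rows_many.std(axis=0)
std_once = rows_once.std(axis=0)

# For contrast: the sample standard deviation (ddof=1) is larger.
std_sample = rows_many.std(axis=0, ddof=1)

print(std_many, std_once, std_sample)
```

With a single observation the deviation from the mean is zero, so the population std is 0 in every dimension, matching the behaviour described for return_std.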

Warning

UserWarning

Pseudo-scales depend on the content of df_seq. The same embedding model applied to a different protein corpus produces a different pseudo-scale DataFrame.
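The corpus dependence can be illustrated with a toy example. The sketch below assumes a hypothetical, deliberately simple context-dependent "model" (toy_embed) in which each residue's 1-dimensional embedding depends on its left neighbour; it is not a real embedding model, only a stand-in to show that the same model averaged over different corpora gives different per-residue means.

```python
import numpy as np

def toy_embed(seq):
    """Hypothetical model: value = own letter index + 0.1 * left-neighbour index."""
    idx = [ord(c) - ord("A") for c in seq]
    left = [0] + idx[:-1]  # the first residue has no left neighbour
    return np.array([[own + 0.1 * prev] for own, prev in zip(idx, left)])

def mean_for(res, corpus):
    """Context-free mean of the toy embedding for residue `res` over a corpus."""
    vals = [toy_embed(seq)[i, 0]
            for seq in corpus
            for i, c in enumerate(seq) if c == res]
    return float(np.mean(vals))

# Same model, same residue, two different corpora (assumed toy data).
mean_a = mean_for("C", ["AC", "CA"])
mean_b = mean_for("C", ["DC", "EC"])
print(mean_a, mean_b)
```

The two means differ because the contexts in which C occurs differ between the corpora.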

See also

  • build_cat(): derive a two-level pseudo-category table from this output.

  • encode(): the primary per-residue path (raw embeddings to a [0, 1] dict_num).

Examples

EmbeddingPreprocessor.build_scales collapses per-residue embeddings into context-free per-amino-acid pseudo-scales (a df_scales of shape (20, D)) for the scale-based CPP.run() path. Rows are the 20 canonical amino acids; columns are embedding dimensions.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
dict_num = {e: rng.normal(size=(len(s), 8))
            for e, s in zip(df_seq["entry"], df_seq["sequence"])}

ep = aa.EmbeddingPreprocessor()
df_scales = ep.build_scales(df_seq=df_seq, dict_num=dict_num)
df_scales.head()
UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
      dim_0     dim_1     dim_2     dim_3     dim_4     dim_5     dim_6     dim_7
A -0.020013  0.018147  0.005299 -0.032596  0.003972  0.039789  0.028859  0.031639
C  0.062436 -0.069334 -0.126105  0.143739 -0.011604  0.112091  0.067743 -0.114120
D  0.018843  0.022702 -0.036626 -0.036090  0.007288  0.005026  0.096982  0.010235
E -0.028588 -0.003467 -0.027357  0.028050  0.012907  0.014690 -0.002806 -0.098708
F  0.061095 -0.052086 -0.024316 -0.078165 -0.006445  0.045729  0.085096 -0.072524