StructurePreprocessor.build_scales

StructurePreprocessor.build_scales(df_seq, dict_num, features, return_std=False, dim_names_override=None)

Build df_scales by context-free per-amino-acid (AA) averaging of the encoded corpus.

Mirrors EmbeddingPreprocessor.build_scales(): for each canonical amino acid a and each of the D dimensions d, the pseudo-scale entry is the mean of dict_num[entry][i, d] over all (entry, i) pairs where residue i of that entry's sequence in df_seq equals a. Non-canonical residues are skipped; AAs absent from the corpus get NaN rows.
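
The averaging rule above can be sketched in plain NumPy. This is an illustrative re-implementation, not the library's code: per_aa_means is a hypothetical helper, df_seq is simplified to a plain {entry: sequence} dict, and the alphabetical row order is an assumption (the library uses ut.LIST_CANONICAL_AA).

```python
import numpy as np

# Assumed row order for this sketch; the library uses ut.LIST_CANONICAL_AA.
CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"


def per_aa_means(sequences, dict_num):
    """Context-free per-AA means (hypothetical helper).

    sequences: {entry: str}, a stand-in for df_seq's entry/sequence columns.
    dict_num:  {entry: (L_entry, D) ndarray}.
    Returns a (20, D) array whose rows follow CANONICAL_AA.
    """
    aa_index = {aa: k for k, aa in enumerate(CANONICAL_AA)}
    n_dims = next(iter(dict_num.values())).shape[1]
    sums = np.zeros((20, n_dims))
    counts = np.zeros(20)
    for entry, seq in sequences.items():
        values = dict_num[entry]
        for i, residue in enumerate(seq):
            k = aa_index.get(residue)
            if k is None:  # non-canonical residue: skipped
                continue
            sums[k] += values[i]
            counts[k] += 1
    means = np.full((20, n_dims), np.nan)  # absent AAs keep NaN rows
    present = counts > 0
    means[present] = sums[present] / counts[present][:, None]
    return means


# Two toy entries; 'X' is non-canonical and does not contribute.
sequences = {"P1": "ACA", "P2": "XC"}
dict_num = {
    "P1": np.array([[0.2, 1.0], [0.4, 0.0], [0.6, 0.0]]),
    "P2": np.array([[0.9, 0.9], [0.8, 1.0]]),
}
means = per_aa_means(sequences, dict_num)
```

Here the row for A averages the two A positions of P1, the row for C pools one position from each entry, and all other rows remain NaN.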

This is the dataset-dependent step. The values feed CPP.run_num()’s redundancy filter (df_scales.corr() arm); a meaningful corpus is required to make max_cor discriminative. Compute pseudo-scales once on a fixed reference corpus and reuse for cross-dataset comparability.
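
To illustrate why the pseudo-scale values matter for a correlation-based redundancy check, the sketch below flags pairs of scale columns whose absolute Pearson correlation exceeds a threshold. It is not the implementation used by CPP.run_num(): redundant_pairs is a hypothetical helper, the threshold value 0.7 is chosen only for the example, and dropping every row that contains a NaN is a simplifying assumption.

```python
import numpy as np


def redundant_pairs(scales, max_cor=0.7):
    """Return column index pairs (j, k) with |Pearson r| > max_cor.

    scales: (20, D) array of per-AA means. Rows containing NaN
    (AAs absent from the corpus) are dropped before correlating.
    """
    complete = scales[~np.isnan(scales).any(axis=1)]
    cor = np.corrcoef(complete, rowvar=False)  # (D, D) correlation matrix
    n_dims = scales.shape[1]
    return [(j, k)
            for j in range(n_dims)
            for k in range(j + 1, n_dims)
            if abs(cor[j, k]) > max_cor]


# Toy scales: column 1 is a mirror image of column 0 (r = -1),
# column 2 alternates 0/1 and is nearly uncorrelated with both.
col0 = np.arange(20) / 19
scales = np.column_stack([col0, 1 - col0, np.arange(20) % 2])
pairs = redundant_pairs(scales, max_cor=0.7)
```

With only 20 rows per column, the correlations are sensitive to how each row was estimated, which is one reason to compute the pseudo-scales on a fixed, sufficiently large reference corpus.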

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.

  • dict_num (dict[str, np.ndarray]) – Combined per-residue tensors {entry: (L_entry, D_total) ndarray} — typically the output of aaanalysis.combine_dict_nums(). Every entry in df_seq must be a key; per entry, L_entry == len(sequence); D_total must equal sum(REGISTRY[f]['num_dims'] for f in features) (i.e. the encoder outputs in feature-key order).

  • features (list of str) – Feature keys from the StructurePreprocessor registry in the same order as the dict_num D-axis layout. Used to name the D dimensions of the output and to validate D_total.

  • return_std (bool, default=False) – If True, also return per-AA standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.

  • dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D_total. None uses the registry default names.
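
The consistency requirements on dict_num and features can be sketched as a pre-flight check. Everything here is illustrative: REGISTRY below is a made-up stand-in with hypothetical feature keys, check_inputs is not a library function, df_seq is simplified to an {entry: sequence} dict, and raising ValueError on a length mismatch is an assumption.

```python
import numpy as np

# Hypothetical registry; real feature keys and num_dims come from the library.
REGISTRY = {"feat_a": {"num_dims": 1}, "feat_b": {"num_dims": 3}}


def check_inputs(sequences, dict_num, features, registry=REGISTRY):
    """Validate inputs and return the expected D_total."""
    unknown = [f for f in features if f not in registry]
    if unknown:
        raise ValueError(f"Invalid feature keys: {unknown}")
    d_total = sum(registry[f]["num_dims"] for f in features)
    for entry, seq in sequences.items():
        if entry not in dict_num:
            raise ValueError(f"Entry '{entry}' missing from dict_num")
        arr = dict_num[entry]
        if arr.ndim != 2 or arr.shape[0] != len(seq):
            raise ValueError(f"Entry '{entry}': L_entry != len(sequence)")
        if arr.shape[1] != d_total:
            raise ValueError(
                f"Entry '{entry}': D={arr.shape[1]}, expected {d_total}")
    return d_total


sequences = {"P1": "ACD"}
dict_num = {"P1": np.zeros((3, 4))}
d_total = check_inputs(sequences, dict_num, ["feat_a", "feat_b"])
```

Running such a check before the averaging step surfaces layout problems (for example, features listed in a different order than the D axis) as early errors rather than as silently mislabeled columns.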

Returns:

  • df_scales (pd.DataFrame, shape (20, D_total)) – Pseudo-scale DataFrame. Rows are the 20 canonical AAs (ut.LIST_CANONICAL_AA); columns are dim names. Cells are context-free per-AA means of normalized encoder outputs (each in [0, 1]); NaN where the AA is absent from the corpus.

  • df_stds (pd.DataFrame, shape (20, D_total)) – Per-AA standard deviations, returned only when return_std=True.
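
The NaN and std=0 conventions described for df_scales and df_stds can be reproduced with a population standard deviation. This is a sketch under assumptions: per_aa_mean_std is a hypothetical helper for a single sequence, and ddof=0 is inferred from the statement that singly occurring AAs receive std=0; the library may compute this differently.

```python
import numpy as np


def per_aa_mean_std(sequence, values, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Per-AA means and standard deviations for one sequence.

    sequence: str of length L; values: (L, D) ndarray.
    Returns two (20, D) arrays; AAs absent from the sequence stay NaN.
    """
    residues = np.array(list(sequence))
    n_dims = values.shape[1]
    means = np.full((len(alphabet), n_dims), np.nan)
    stds = np.full((len(alphabet), n_dims), np.nan)
    for k, aa in enumerate(alphabet):
        mask = residues == aa
        if mask.any():
            means[k] = values[mask].mean(axis=0)
            # ddof=0 (NumPy default): a single observation yields 0.0
            stds[k] = values[mask].std(axis=0)
    return means, stds


values = np.array([[0.1], [0.3], [0.5]])
means, stds = per_aa_mean_std("AAC", values)
```

In this toy input, A occurs twice and gets a non-zero standard deviation, C occurs once and gets 0, and the remaining 18 rows are NaN in both arrays.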

Raises:

ValueError – On missing df_seq / dict_num, mismatched D, missing entries, or invalid feature keys.

Warning

UserWarning – Pseudo-scales depend on the content of df_seq and dict_num.

Examples

build_scales collapses a per-residue structure dict_num into context-free per-amino-acid means, returned as a df_scales of shape (20, D). These pseudo-scales keep the redundancy correlation filter of CPP.run_num active and can also feed the scale-based CPP.run. The example below substitutes a small random dict_num with two feature dimensions for real encoder output.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa

aa.options['verbose'] = False
# Suppress warnings (including the corpus-dependence UserWarning) for clean output.
warnings.filterwarnings('ignore')

stp = aa.StructurePreprocessor(verbose=False)
# One toy entry: each of the 20 canonical AAs once, then the first 10 repeated.
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

rng = np.random.default_rng(0)
# In practice this dict_num comes from encode_pdb / encode_pae / encode_dssp.
# Here: random values in [0, 1) with shape (len(sequence), 2) per entry.
dict_num = {e: rng.random((len(s), 2))
            for e, s in zip(df_seq['entry'], df_seq['sequence'])}
# Feature order must match the column order of the dict_num arrays.
df_scales = stp.build_scales(df_seq=df_seq, dict_num=dict_num,
                             features=['bfactor', 'plddt'])
df_scales
    bfactor     plddt
A  0.604246  0.295828
C  0.317637  0.177219
D  0.602445  0.901515
E  0.416897  0.676342
F  0.313820  0.883858
G  0.801476  0.121054
H  0.866944  0.046077
I  0.532886  0.162968
K  0.656759  0.668893
L  0.265177  0.237354
M  0.028320  0.124283
N  0.670624  0.647190
P  0.615385  0.383678
Q  0.997210  0.980835
R  0.685542  0.650459
S  0.688447  0.388921
T  0.135097  0.721488
V  0.525354  0.310242
W  0.485835  0.889488
Y  0.934044  0.357795

Further parameters. StructurePreprocessor.build_scales also accepts return_std (if True, also return per-AA standard deviations in a second DataFrame of the same shape) and dim_names_override (replacement names for the D columns; length must equal D_total).