StructurePreprocessor.build_scales

StructurePreprocessor.build_scales(df_seq, dict_num, features, return_std=False, dim_names_override=None)

Build df_scales by context-free per-amino acid (AA) averaging of the encoded corpus.

Mirrors EmbeddingPreprocessor.build_scales(): for each canonical amino acid a and each dimension d along the D axis, the pseudo-scale entry is the mean of dict_num[entry][i, d] over all (entry, i) pairs where residue i of that entry's sequence in df_seq equals a. Non-canonical residues are skipped; AAs absent from the corpus get NaN rows.
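The averaging rule above can be sketched in plain NumPy and pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name per_aa_means, the toy entry P1, and the CANONICAL_AA list (assumed to match ut.LIST_CANONICAL_AA in content and order) are all hypothetical.

```python
import numpy as np
import pandas as pd

# Assumption: same 20 letters as ut.LIST_CANONICAL_AA (order may differ there).
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")


def per_aa_means(df_seq, dict_num, dim_names):
    """Context-free per-AA mean of per-residue values (illustrative only)."""
    n_dims = len(dim_names)
    sums = {aa: np.zeros(n_dims) for aa in CANONICAL_AA}
    counts = {aa: 0 for aa in CANONICAL_AA}
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        arr = dict_num[entry]  # shape (len(seq), n_dims)
        for i, aa in enumerate(seq):
            if aa in sums:  # non-canonical residues are skipped
                sums[aa] += arr[i]
                counts[aa] += 1
    # AAs that never occur get a NaN row
    rows = [sums[aa] / counts[aa] if counts[aa] else np.full(n_dims, np.nan)
            for aa in CANONICAL_AA]
    return pd.DataFrame(rows, index=CANONICAL_AA, columns=dim_names)


# Toy corpus: 'X' is non-canonical and is ignored
df_seq_toy = pd.DataFrame({"entry": ["P1"], "sequence": ["AAXC"]})
dict_num_toy = {"P1": np.array([[0.2, 0.4],
                                [0.6, 0.8],
                                [0.9, 0.9],
                                [0.5, 0.1]])}
df_toy = per_aa_means(df_seq_toy, dict_num_toy, ["dim_0", "dim_1"])
```

In this toy case the row for A averages its two occurrences, C keeps its single value, and all other canonical AAs are NaN because they do not occur in the corpus.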

This is the dataset-dependent step: the resulting values change with the corpus supplied. They feed the redundancy filter of CPP.run_num() (its df_scales.corr() branch), so a meaningful corpus is required for max_cor to be discriminative. Compute pseudo-scales once on a fixed reference corpus and reuse them for cross-dataset comparability.
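To see why the corpus matters for a correlation-based filter, the sketch below computes pairwise correlations between pseudo-scale columns and flags pairs above a threshold. It only illustrates the general idea; how CPP.run_num() actually applies max_cor is not shown here, and the column names, the threshold value, and the use of absolute correlation are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.random(20)

# Hypothetical pseudo-scales: dim_b is a linear copy of dim_a, dim_c is unrelated
df_scales_toy = pd.DataFrame(
    {"dim_a": base,
     "dim_b": base * 0.5 + 0.1,
     "dim_c": rng.random(20)},
    index=list("ACDEFGHIKLMNPQRSTVWY"),
)

max_cor = 0.7  # assumed threshold, for illustration only
corr = df_scales_toy.corr().abs()  # pairwise Pearson correlation between columns

# Collect each column pair once (upper triangle) when it exceeds the threshold
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > max_cor
]
```

Pseudo-scales built from a tiny or unrepresentative corpus can produce spurious correlations between columns, which is one reason to compute them on a fixed reference corpus.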

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.
dict_num (dict[str, np.ndarray]) – Combined per-residue tensors {entry: (L_entry, D_total) ndarray} — typically the output of aaanalysis.combine_dict_nums(). Every entry in df_seq must be a key; per entry, L_entry == len(sequence); D_total must equal sum(REGISTRY[f]['num_dims'] for f in features) (i.e. the encoder outputs in feature-key order).
features (list of str) – Feature keys from the StructurePreprocessor registry in the same order as the dict_num D-axis layout. Used to name the D dimensions of the output and to validate D_total.
return_std (bool, default=False) – If True, also return per-AA standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.
dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D_total. None uses the registry default names.
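The constraints on dict_num listed above can be checked before calling build_scales. The following is a hedged sketch of such a pre-check: check_inputs and num_dims_per_feature are hypothetical names, and the one-dimension-per-feature mapping is an assumption standing in for the registry's num_dims values.

```python
import numpy as np
import pandas as pd


def check_inputs(df_seq, dict_num, num_dims_per_feature):
    """Raise ValueError if dict_num does not match df_seq (illustrative only)."""
    d_expected = sum(num_dims_per_feature.values())
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            raise ValueError(f"Entry '{entry}' is missing from dict_num")
        arr = dict_num[entry]
        if arr.ndim != 2 or arr.shape[0] != len(seq):
            raise ValueError(
                f"Entry '{entry}': expected {len(seq)} rows, got shape {arr.shape}")
        if arr.shape[1] != d_expected:
            raise ValueError(
                f"Entry '{entry}': expected D_total={d_expected}, got {arr.shape[1]}")
    return d_expected


# Assumption: each feature contributes one dimension
num_dims = {"bfactor": 1, "plddt": 1}
df_seq_ok = pd.DataFrame({"entry": ["P1"], "sequence": ["ACDE"]})
dict_num_ok = {"P1": np.zeros((4, 2))}
d_total = check_inputs(df_seq_ok, dict_num_ok, num_dims)
```

Running a check like this up front gives a clearer error location than a failure deep inside the averaging step.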

Returns:

df_scales (pd.DataFrame, shape (20, D_total)) – Pseudo-scale DataFrame. Rows are the 20 canonical AAs (ut.LIST_CANONICAL_AA); columns are dim names. Cells are context-free per-AA means of normalized encoder outputs (each in [0, 1]); NaN where the AA is absent from the corpus.
df_stds (pd.DataFrame, shape (20, D_total)) – Per-AA standard deviations, returned only when return_std=True.
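The documented behavior of df_stds (std=0 for an AA seen exactly once) is consistent with a population standard deviation (ddof=0). This is an inference from the description, not a confirmed implementation detail. The sketch below shows the difference using pandas only.

```python
import pandas as pd

values_once = pd.Series([0.42])       # AA observed a single time
values_twice = pd.Series([0.2, 0.6])  # AA observed twice

std_once_population = values_once.std(ddof=0)    # 0.0: matches the documented std=0
std_once_sample = values_once.std(ddof=1)        # NaN: sample std is undefined for n=1
std_twice_population = values_twice.std(ddof=0)  # 0.2: both values lie 0.2 from the mean 0.4
```

With a sample standard deviation (ddof=1), single-occurrence AAs would be NaN and could not be told apart from AAs absent from the corpus.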

Raises:

ValueError – If df_seq or dict_num is missing, D_total does not match the requested features, an entry in df_seq is absent from dict_num, or a feature key is invalid.

Warning

UserWarning: Pseudo-scales depend on the content of df_seq + dict_num.

Examples

build_scales collapses a per-residue structure dict_num into context-free per-amino-acid means, yielding a df_scales of shape (20, D). This keeps the correlation-based redundancy filter of CPP.run_num active and also feeds the scale-based CPP.run. Here a small random dict_num with two feature dimensions stands in for real encoder output.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

strp = aa.StructurePreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

rng = np.random.default_rng(0)
# In practice this dict_num comes from encode_pdb / encode_pae / encode_dssp.
dict_num = {e: rng.random((len(s), 2))
            for e, s in zip(df_seq['entry'], df_seq['sequence'])}
df_scales = strp.build_scales(df_seq=df_seq, dict_num=dict_num,
                             features=['bfactor', 'plddt'])
df_scales

	bfactor	plddt
A	0.604246	0.295828
C	0.317637	0.177219
D	0.602445	0.901515
E	0.416897	0.676342
F	0.313820	0.883858
G	0.801476	0.121054
H	0.866944	0.046077
I	0.532886	0.162968
K	0.656759	0.668893
L	0.265177	0.237354
M	0.028320	0.124283
N	0.670624	0.647190
P	0.615385	0.383678
Q	0.997210	0.980835
R	0.685542	0.650459
S	0.688447	0.388921
T	0.135097	0.721488
V	0.525354	0.310242
W	0.485835	0.889488
Y	0.934044	0.357795

Further parameters. With return_std=True, StructurePreprocessor.build_scales also returns per-AA standard deviations in a second DataFrame of the same shape. dim_names_override replaces the names of the D columns; its length must equal D_total.

# Further parameters: ``return_std=True`` also returns per-AA standard
# deviations, and ``dim_names_override`` renames the D columns.
df_scales_named, df_scales_std = strp.build_scales(
    df_seq=df_seq, dict_num=dict_num, features=['bfactor', 'plddt'],
    return_std=True, dim_names_override=['bfactor_d', 'plddt_d'])
aa.display_df(df_scales_std, n_rows=10, show_shape=True)

DataFrame shape: (20, 2)

	bfactor_d	plddt_d
A	0.032716	0.026041
C	0.276663	0.160692
D	0.210826	0.011241
E	0.189739	0.053155
F	0.229805	0.051214
G	0.014378	0.118315
H	0.009540	0.012491
I	0.196769	0.012688
K	0.206420	0.127432
L	0.034535	0.185333