StructurePreprocessor.build_scales
- StructurePreprocessor.build_scales(df_seq, dict_num, features, return_std=False, dim_names_override=None)
Build df_scales by context-free per-amino-acid (AA) averaging of the encoded corpus.

Mirrors EmbeddingPreprocessor.build_scales(): for each canonical amino acid a and each D dimension d, the pseudo-scale entry is the mean of dict_num[entry][i, d] over all (entry, i) pairs where df_seq[sequence][entry][i] == a. Non-canonical residues are skipped; AAs absent from the corpus get NaN rows.

This is the dataset-dependent step. The values feed CPP.run_num()’s redundancy filter (the df_scales.corr() arm); a meaningful corpus is required to make max_cor discriminative. Compute pseudo-scales once on a fixed reference corpus and reuse them for cross-dataset comparability.

Added in version 1.1.0.
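The averaging described above can be sketched in plain NumPy and pandas. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name sketch_build_scales, the hard-coded canonical alphabet, and the toy entries P1 and P2 are hypothetical, and input validation is omitted.

```python
import numpy as np
import pandas as pd

# Assumption: the 20 canonical amino acids as one-letter codes.
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")


def sketch_build_scales(df_seq, dict_num, dim_names):
    """Context-free per-AA means of per-residue tensors (illustrative only)."""
    n_dims = len(dim_names)
    sums = {res: np.zeros(n_dims) for res in CANONICAL_AA}
    counts = {res: 0 for res in CANONICAL_AA}
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        arr = dict_num[entry]  # expected shape: (len(seq), n_dims)
        for i, res in enumerate(seq):
            if res in sums:  # non-canonical residues are skipped
                sums[res] += arr[i]
                counts[res] += 1
    # AAs that never occur get a NaN row instead of a mean.
    rows = [
        sums[res] / counts[res] if counts[res] else np.full(n_dims, np.nan)
        for res in CANONICAL_AA
    ]
    return pd.DataFrame(rows, index=CANONICAL_AA, columns=dim_names)


# Toy corpus: two entries, two D dimensions; 'X' is non-canonical.
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACA", "CXD"]})
dict_num = {
    "P1": np.array([[0.2, 1.0], [0.4, 0.0], [0.6, 0.5]]),
    "P2": np.array([[0.8, 1.0], [0.9, 0.9], [0.1, 0.3]]),
}
df_scales = sketch_build_scales(df_seq, dict_num, ["dim_a", "dim_b"])
```

In this toy corpus, A is averaged over its two positions in P1, C over one position in each entry, and the X residue in P2 contributes to no row.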
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.
dict_num (dict[str, np.ndarray]) – Combined per-residue tensors {entry: (L_entry, D_total) ndarray}, typically the output of aaanalysis.combine_dict_nums(). Every entry in df_seq must be a key; per entry, L_entry == len(sequence); D_total must equal sum(REGISTRY[f]['num_dims'] for f in features) (i.e. the encoder outputs in feature-key order).
features (list of str) – Feature keys from the StructurePreprocessor registry, in the same order as the dict_num D-axis layout. Used to name the D dimensions of the output and to validate D_total.
return_std (bool, default=False) – If True, also return per-AA standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.
dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D_total. None uses the registry default names.
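The shape constraints on df_seq, dict_num, and features can be checked before calling the method. The sketch below is one way to do that and is not the library's own validation: the REGISTRY contents, the feature keys feat_a and feat_b, and the helper check_inputs are hypothetical stand-ins for the StructurePreprocessor registry.

```python
import numpy as np
import pandas as pd

# Hypothetical registry: the real feature keys and their num_dims live in the library.
REGISTRY = {"feat_a": {"num_dims": 1}, "feat_b": {"num_dims": 3}}


def check_inputs(df_seq, dict_num, features):
    """Raise ValueError if the inputs violate the documented constraints."""
    unknown = [f for f in features if f not in REGISTRY]
    if unknown:
        raise ValueError(f"Invalid feature keys: {unknown}")
    # D_total is the sum of per-feature dimensions, in feature-key order.
    d_total = sum(REGISTRY[f]["num_dims"] for f in features)
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            raise ValueError(f"Entry '{entry}' is missing from dict_num")
        arr = dict_num[entry]
        expected = (len(seq), d_total)
        if arr.shape != expected:
            raise ValueError(
                f"Entry '{entry}': expected shape {expected}, got {arr.shape}"
            )
    return d_total


df_seq = pd.DataFrame({"entry": ["P1"], "sequence": ["ACDE"]})

# Valid: 4 residues, 1 + 3 = 4 dimensions.
d_total = check_inputs(df_seq, {"P1": np.zeros((4, 4))}, ["feat_a", "feat_b"])

# Invalid: the D axis has 3 columns where 4 are required.
try:
    check_inputs(df_seq, {"P1": np.zeros((4, 3))}, ["feat_a", "feat_b"])
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```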
- Returns:
df_scales (pd.DataFrame, shape (20, D_total)) – Pseudo-scale DataFrame. Rows are the 20 canonical AAs (ut.LIST_CANONICAL_AA); columns are dim names. Cells are context-free per-AA means of normalized encoder outputs (each in [0, 1]); NaN where the AA is absent from the corpus.
df_stds (pd.DataFrame, shape (20, D_total)) – Per-AA standard deviations, returned only when return_std=True.
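The documented edge cases for the standard deviations (0 for an AA seen exactly once, NaN for an AA never seen) are what a population standard deviation (ddof=0) produces. Whether the library computes it this way internally is an assumption; the sketch below only reproduces the documented behavior on toy values.

```python
import numpy as np

# Toy values of one D dimension, grouped by amino acid.
values_by_aa = {
    "A": np.array([0.2, 0.6]),  # two occurrences
    "M": np.array([0.3]),       # exactly one occurrence
    "W": np.array([]),          # absent from the corpus
}


def per_aa_std(values):
    """Population std (ddof=0); NaN when there are no observations."""
    if values.size == 0:
        return float("nan")
    return float(np.std(values, ddof=0))


stds = {res: per_aa_std(v) for res, v in values_by_aa.items()}
```

With a sample standard deviation (ddof=1) a single occurrence would yield NaN rather than 0, so the two cases would no longer be distinguishable in df_stds.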
- Raises:
ValueError – On missing df_seq / dict_num, mismatched D, missing entries, or invalid feature keys.
Warning
- UserWarning
Pseudo-scales depend on the content of df_seq + dict_num.
Examples
build_scales collapses a per-residue structure dict_num into context-free per-amino-acid means (a df_scales of shape (20, D)), which keeps CPP.run_num’s redundancy correlation gate live and feeds the scale-based CPP.run. Here a small random dict_num stands in for two feature dimensions.

    import warnings
    from pathlib import Path

    import numpy as np
    import pandas as pd

    import aaanalysis as aa
    import aaanalysis.utils as ut

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
    stp = aa.StructurePreprocessor(verbose=False)

    df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                           'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
    rng = np.random.default_rng(0)
    # In practice this dict_num comes from encode_pdb / encode_pae / encode_dssp.
    dict_num = {e: rng.random((len(s), 2))
                for e, s in zip(df_seq['entry'], df_seq['sequence'])}
    df_scales = stp.build_scales(df_seq=df_seq, dict_num=dict_num,
                                 features=['bfactor', 'plddt'])
    df_scales
        bfactor     plddt
    A  0.604246  0.295828
    C  0.317637  0.177219
    D  0.602445  0.901515
    E  0.416897  0.676342
    F  0.313820  0.883858
    G  0.801476  0.121054
    H  0.866944  0.046077
    I  0.532886  0.162968
    K  0.656759  0.668893
    L  0.265177  0.237354
    M  0.028320  0.124283
    N  0.670624  0.647190
    P  0.615385  0.383678
    Q  0.997210  0.980835
    R  0.685542  0.650459
    S  0.688447  0.388921
    T  0.135097  0.721488
    V  0.525354  0.310242
    W  0.485835  0.889488
    Y  0.934044  0.357795

Further parameters. StructurePreprocessor.build_scales also accepts:
- return_std – If True, also return per-AA standard deviations in a second DataFrame of the same shape.
- dim_names_override – Replacement names for the D columns; length must equal D_total.
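To make the role of the correlation gate concrete, the sketch below shows one way a max_cor threshold could act on the columns of a df_scales-like table. It is an illustration only and not how CPP.run_num implements its filter: the greedy strategy, the helper filter_redundant, the column names, and the threshold value 0.7 are all assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.random(20)

# Toy pseudo-scales: 20 AA rows, three D dimensions.
df_scales = pd.DataFrame(
    {
        "dim_a": base,
        "dim_b": base * 0.5 + 0.1,  # linear transform of dim_a, so fully redundant
        "dim_c": rng.random(20),    # independent random draw
    },
    index=list("ACDEFGHIKLMNPQRSTVWY"),
)


def filter_redundant(df, max_cor=0.7):
    """Greedily keep columns whose |Pearson r| to every kept column is <= max_cor."""
    cor = df.corr().abs()  # pandas drops NaN rows pairwise when correlating
    kept = []
    for col in df.columns:
        if all(cor.loc[col, k] <= max_cor for k in kept):
            kept.append(col)
    return kept


kept = filter_redundant(df_scales, max_cor=0.7)
```

Here dim_b is dropped because it is a linear transform of dim_a. Pseudo-scales built from a very small or unrepresentative corpus give correlations that reflect that corpus rather than the features, which is why a fixed reference corpus is recommended.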