aaanalysis.EmbeddingPreprocessor.build_scales
- EmbeddingPreprocessor.build_scales(df_seq=None, dict_num=None, return_std=False)
Build pseudo-scales by context-free averaging of per-residue embeddings.
For each canonical amino acid a and each embedding dimension d, the pseudo-scale entry is the mean of embeddings[entry][i, d] over all (entry, i) pairs where seq[i] == a, taken over the input df_seq. Non-canonical residues are skipped; AAs absent from the corpus get NaN rows.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts over which embedding dimensions are averaged.
dict_num (dict[str, np.ndarray]) – Mapping from entry to a per-residue embedding array of shape (L, D), where L is the protein length and D is the embedding dimensionality. Every entry in df_seq must be a key; all arrays must share the same D. Same shape contract as the dict_num consumed by CPP.run_num().
return_std (bool, default=False) – If True, also return per-AA population standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.
- Returns:
df_scales (pd.DataFrame, shape (20, D)) – Pseudo-scale DataFrame. Rows are the 20 canonical amino acids in alphabetical order (ACDEFGHIKLMNPQRSTVWY); columns are dimension labels (dim_0, dim_1, …, dim_{D-1}). Cells are context-free per-AA means of embedding values. Drop-in for the df_scales argument of CPP.__init__().
df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations, returned only when return_std=True. Same index and columns as df_scales.
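The standard-deviation conventions stated for return_std (population estimator, std=0 for a single occurrence) are consistent with NumPy's default ddof=0. The short check below is only an illustration with made-up arrays; it does not call aaanalysis and does not claim to show how the library computes df_stds.

```python
import numpy as np

# One occurrence of a residue with D = 2 (values are made up for illustration).
single = np.array([[0.3, -1.2]])
pop_std_single = np.std(single, axis=0)  # population std (ddof=0) of one sample

# Two occurrences of a residue with D = 2.
several = np.array([[1.0, 2.0], [3.0, 6.0]])
pop_std = np.std(several, axis=0)             # divides by n
sample_std = np.std(several, axis=0, ddof=1)  # divides by n - 1, shown for contrast

print(pop_std_single)
print(pop_std)
print(sample_std)
```

With the sample estimator (ddof=1) a single occurrence would have an undefined spread, which is one reason a population estimator is the more convenient convention here.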
Warning
- UserWarning
Pseudo-scales depend on the content of df_seq. The same embedding model applied to a different protein corpus produces a different pseudo-scale DataFrame.
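To make the corpus dependence concrete, the sketch below averages a toy contextual embedding per amino acid over two small corpora. The embedding rule (a residue's value depends on its left neighbour) and the names toy_embed and mean_for are hypothetical and introduced only for this illustration; no aaanalysis call is involved.

```python
import numpy as np


def toy_embed(seq):
    """Hypothetical 1-D contextual embedding: own code plus half the left neighbour's code."""
    codes = np.array([ord(c) - ord("A") for c in seq], dtype=float)
    left = np.concatenate([[0.0], codes[:-1]])  # first residue has no left neighbour
    return (codes + 0.5 * left).reshape(-1, 1)  # shape (L, 1)


def mean_for(residue, corpus):
    """Average the embedding of `residue` over every position where it occurs."""
    values = [
        toy_embed(seq)[i, 0]
        for seq in corpus
        for i, c in enumerate(seq)
        if c == residue
    ]
    return float(np.mean(values))


corpus_a = ["ACA", "CA"]  # "A" often follows "C"
corpus_b = ["AAA", "GA"]  # "A" follows "A" or "G"

mean_a = mean_for("A", corpus_a)
mean_b = mean_for("A", corpus_b)
print(mean_a, mean_b)
```

The same embedding rule yields a different mean for "A" in each corpus because the residue appears in different contexts, which is the effect the warning describes.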
See also
build_cat(): derive a two-level pseudo-category table from this output.
encode(): the primary per-residue path (raw embeddings to a [0, 1] dict_num).
Examples
EmbeddingPreprocessor.build_scales collapses per-residue embeddings into context-free per-amino-acid pseudo-scales (a df_scales of shape (20, D)) for the scale-based CPP.run path. Rows are the 20 canonical amino acids; columns are embedding dimensions.

    import numpy as np
    import aaanalysis as aa
    aa.options["verbose"] = False

    df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
    rng = np.random.default_rng(0)
    dict_num = {e: rng.normal(size=(len(s), 8))
                for e, s in zip(df_seq["entry"], df_seq["sequence"])}

    ep = aa.EmbeddingPreprocessor()
    df_scales = ep.build_scales(df_seq=df_seq, dict_num=dict_num)
    df_scales.head()
/tmp/claude-501/ipykernel_85601/800189660.py:11: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
  df_scales = ep.build_scales(df_seq=df_seq, dict_num=dict_num)
       dim_0      dim_1      dim_2      dim_3      dim_4      dim_5      dim_6      dim_7
A  -0.020013   0.018147   0.005299  -0.032596   0.003972   0.039789   0.028859   0.031639
C   0.062436  -0.069334  -0.126105   0.143739  -0.011604   0.112091   0.067743  -0.114120
D   0.018843   0.022702  -0.036626  -0.036090   0.007288   0.005026   0.096982   0.010235
E  -0.028588  -0.003467  -0.027357   0.028050   0.012907   0.014690  -0.002806  -0.098708
F   0.061095  -0.052086  -0.024316  -0.078165  -0.006445   0.045729   0.085096  -0.072524
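The context-free averaging that build_scales is documented to perform can be sketched in plain NumPy and pandas. This is a minimal illustration of the documented contract, not the library's implementation; build_scales_sketch, CANONICAL_AA, and the toy data are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical amino acids, alphabetical


def build_scales_sketch(df_seq, dict_num, return_std=False):
    """Hypothetical re-implementation of context-free per-AA averaging."""
    # Collect, for each canonical amino acid, every embedding row observed for it.
    rows = {res: [] for res in CANONICAL_AA}
    n_dims = None
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        emb = np.asarray(dict_num[entry], dtype=float)  # expected shape (L, D)
        if emb.shape[0] != len(seq):
            raise ValueError(f"Length mismatch for entry {entry!r}")
        if n_dims is None:
            n_dims = emb.shape[1]
        elif emb.shape[1] != n_dims:
            raise ValueError("All embeddings must share the same D")
        for i, res in enumerate(seq):
            if res in rows:  # non-canonical residues are skipped
                rows[res].append(emb[i])

    cols = [f"dim_{d}" for d in range(n_dims)]
    index = list(CANONICAL_AA)
    nan_row = np.full(n_dims, np.nan)  # used for AAs absent from the corpus
    means = [np.mean(rows[r], axis=0) if rows[r] else nan_row for r in index]
    df_scales = pd.DataFrame(means, index=index, columns=cols)
    if not return_std:
        return df_scales
    # Population standard deviation (ddof=0): a single occurrence gives 0.
    stds = [np.std(rows[r], axis=0) if rows[r] else nan_row for r in index]
    return df_scales, pd.DataFrame(stds, index=index, columns=cols)


# Toy corpus with D = 2; "X" is non-canonical and is ignored.
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACA", "XC"]})
dict_num = {
    "P1": np.array([[1.0, 2.0], [0.0, 4.0], [3.0, 6.0]]),
    "P2": np.array([[9.0, 9.0], [2.0, 0.0]]),
}
df_scales, df_stds = build_scales_sketch(df_seq, dict_num, return_std=True)
print(df_scales.loc[["A", "C", "D"]])
```

In the toy corpus only A and C occur, so every other row, including D, is NaN in both returned DataFrames.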