EmbeddingPreprocessor.build_scales

EmbeddingPreprocessor.build_scales(df_seq, dict_num, return_std=False)

Build pseudo-scales by context-free averaging of per-residue embeddings.

For each canonical amino acid (AA) a and each embedding dimension d, the pseudo-scale entry is the mean of dict_num[entry][i, d] over all (entry, i) pairs in df_seq where the sequence of that entry has residue a at position i. Non-canonical residues are skipped; AAs absent from the corpus get NaN rows.
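The averaging rule above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name sketch_build_scales and the toy inputs are hypothetical, and sequences are passed as a plain dict rather than as df_seq.

```python
import numpy as np

CANONICAL_AA = "ACDEFGHIKLMNPQRSTVWY"  # row order used in this sketch


def sketch_build_scales(sequences, embeddings):
    """Hypothetical helper: context-free per-AA means of per-residue embeddings.

    sequences:  dict mapping entry -> sequence string
    embeddings: dict mapping entry -> array of shape (L, D)
    Returns an array of shape (20, D); AAs never observed get NaN rows.
    """
    n_dims = next(iter(embeddings.values())).shape[1]
    aa_to_row = {a: i for i, a in enumerate(CANONICAL_AA)}
    sums = np.zeros((len(CANONICAL_AA), n_dims))
    counts = np.zeros(len(CANONICAL_AA), dtype=int)
    for entry, seq in sequences.items():
        emb = np.asarray(embeddings[entry], dtype=float)
        for pos, residue in enumerate(seq):
            row = aa_to_row.get(residue)
            if row is None:
                continue  # non-canonical residue: skipped
            sums[row] += emb[pos]
            counts[row] += 1
    means = np.full_like(sums, np.nan)
    seen = counts > 0
    means[seen] = sums[seen] / counts[seen, None]
    return means


# Toy corpus: "X" is non-canonical and is ignored.
sequences = {"P1": "ACA", "P2": "CX"}
embeddings = {
    "P1": np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    "P2": np.array([[7.0, 8.0], [0.0, 0.0]]),
}
scales = sketch_build_scales(sequences, embeddings)
# Row "A" averages [1, 2] and [5, 6]; row "C" averages [3, 4] and [7, 8];
# all other rows stay NaN because those AAs never occur.
```

The running sum and count per AA keep memory at O(20 x D) regardless of corpus size, which is one reasonable design choice for large embedding sets.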

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts over which embedding dimensions are averaged.
dict_num (dict[str, np.ndarray]) – Mapping from entry to a per-residue embedding array of shape (L, D) where L is the protein length and D is the embedding dimensionality. Every entry in df_seq must be a key; all arrays must share the same D. Same shape contract as the dict_num consumed by CPP.run_num().
return_std (bool, default=False) – If True, also return per-AA population standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.
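The edge cases described for return_std follow from the definition of a population standard deviation (divisor N rather than N - 1). The sketch below illustrates this with NumPy; the helper population_std is hypothetical and only assumes that the values for one AA are stacked into an array of shape (n_occurrences, D).

```python
import numpy as np


def population_std(values):
    """Hypothetical helper: per-dimension population std (ddof=0).

    values: array of shape (n_occurrences, D) for a single amino acid.
    Returns NaN for every dimension if the AA never occurs.
    """
    values = np.asarray(values, dtype=float)
    if values.shape[0] == 0:
        return np.full(values.shape[1], np.nan)
    return values.std(axis=0, ddof=0)


std_many = population_std([[1.0, 2.0], [3.0, 6.0]])  # two occurrences
std_once = population_std([[5.0, 5.0]])              # single occurrence
std_none = population_std(np.empty((0, 2)))          # absent from corpus
```

With ddof=0 a single observation has zero deviation from its own mean, so the result is 0 rather than undefined; a sample standard deviation (ddof=1) would instead be NaN in that case.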

Returns:

df_scales (pd.DataFrame, shape (20, D)) – Pseudo-scale DataFrame. Rows are the 20 canonical amino acids in alphabetical order (ACDEFGHIKLMNPQRSTVWY); columns are dimension labels (dim_0, dim_1, …, dim_{D-1}). Cells are context-free per-AA means of embedding values. Drop-in for the df_scales argument of CPP.__init__().
df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations, returned only when return_std=True. Same index and columns as df_scales.

Warning

UserWarning: Pseudo-scales depend on the content of df_seq. The same embedding model applied to a different protein corpus produces a different pseudo-scale DataFrame.
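A toy example can make this corpus dependence concrete. The sketch below uses a deliberately artificial one-dimensional "model" (toy_embed, a hypothetical stand-in whose output is the residue position) so that the same residue receives different values in different contexts; it is not related to any real embedding model.

```python
import numpy as np


def toy_embed(seq):
    # Hypothetical context-dependent "model": one dimension, value = position index.
    return np.arange(len(seq), dtype=float).reshape(-1, 1)


def mean_for_residue(corpus, residue):
    """corpus: list of (sequence, (L, D) array) pairs."""
    rows = [emb[i] for seq, emb in corpus
            for i, r in enumerate(seq) if r == residue]
    return np.mean(rows, axis=0)


corpus_a = [(s, toy_embed(s)) for s in ["AGG", "AGA"]]
corpus_b = [(s, toy_embed(s)) for s in ["GGA", "GAA"]]

mean_a = mean_for_residue(corpus_a, "A")  # "A" at positions 0, 0, 2
mean_b = mean_for_residue(corpus_b, "A")  # "A" at positions 2, 1, 2
```

The same toy model yields a different average for "A" on each corpus, which is why a fixed reference corpus is needed when pseudo-scales are to be compared across datasets.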

See also

build_cat(): derive a two-level pseudo-category table from this output.
encode(): the primary per-residue path (raw embeddings to a [0, 1] dict_num).

Examples

EmbeddingPreprocessor.build_scales collapses per-residue embeddings into context-free per-amino-acid pseudo-scales (a df_scales of shape (20, D)) for the scale-based CPP.run() path. Rows are the 20 canonical amino acids; columns are embedding dimensions.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
dict_num = {e: rng.normal(size=(len(s), 8))
            for e, s in zip(df_seq["entry"], df_seq["sequence"])}

embp = aa.EmbeddingPreprocessor()
df_scales = embp.build_scales(df_seq=df_seq, dict_num=dict_num)
df_scales.head()

UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.

	dim_0	dim_1	dim_2	dim_3	dim_4	dim_5	dim_6	dim_7
A	-0.020013	0.018147	0.005299	-0.032596	0.003972	0.039789	0.028859	0.031639
C	0.062436	-0.069334	-0.126105	0.143739	-0.011604	0.112091	0.067743	-0.114120
D	0.018843	0.022702	-0.036626	-0.036090	0.007288	0.005026	0.096982	0.010235
E	-0.028588	-0.003467	-0.027357	0.028050	0.012907	0.014690	-0.002806	-0.098708
F	0.061095	-0.052086	-0.024316	-0.078165	-0.006445	0.045729	0.085096	-0.072524

Further parameters. Set return_std=True to also return the per-AA population standard deviations in a second DataFrame with the same index and columns as df_scales.

# Further parameter: ``return_std=True`` also returns per-AA population std devs.
df_scales_m, df_scales_s = embp.build_scales(df_seq=df_seq, dict_num=dict_num,
                                             return_std=True)
aa.display_df(df_scales_s, n_rows=10, show_shape=True)

DataFrame shape: (20, 8)

UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.

	dim_0	dim_1	dim_2	dim_3	dim_4	dim_5	dim_6	dim_7
A	0.988525	1.024820	0.989249	1.033074	1.005914	0.996319	1.008786	1.072368
C	0.944552	0.976138	0.931113	0.982619	1.043496	0.993911	0.962259	0.986483
D	0.996387	0.955538	0.985784	1.005916	0.979827	1.003738	1.016229	1.003530
E	0.991671	0.990279	1.038752	1.024295	0.978761	0.995130	0.984609	1.005054
F	0.945352	1.038595	1.049072	1.022857	0.978598	0.940124	0.951373	0.965100
G	1.013297	1.013405	1.041377	0.987687	1.016611	1.046612	0.987234	0.993304
H	0.996179	1.039378	1.020457	1.033986	0.943797	1.026756	0.994459	0.958460
I	1.037601	0.967379	1.039496	0.976098	1.008541	1.057195	1.026528	1.000749
K	0.962600	0.971309	0.968286	1.045759	1.022654	0.971736	0.965333	0.991673
L	1.057927	0.974958	1.016053	1.015421	1.002918	0.990676	1.000186	0.982037