EmbeddingPreprocessor.encode

EmbeddingPreprocessor.encode(df_seq, embeddings, method='minmax', clip=(1.0, 99.0), return_df=False)

Encode raw per-residue protein language model (PLM) embeddings into a [0, 1]-normalized dict_num.

Raw PLM embeddings (ESM, ProtT5, …) are unbounded floats, whereas CPP.run_num() expects per-residue values in [0, 1] (the same normalization convention as StructurePreprocessor and AnnotationPreprocessor). encode fits one normalizer per embedding dimension over the whole corpus (all residues of all proteins in df_seq) and applies it to every entry, returning a dict_num that feeds straight into NumericalFeature.get_parts() → CPP.run_num(). The fitted parameters are stored on the instance (self.norm_params_) so the identical transform can be reproduced.
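The corpus-wide fit described above can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the toy embeddings, the variable names, and the guard for constant dimensions are assumptions made for illustration, and only the 'minmax' case is shown.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for PLM output: three "proteins" of different length, D = 4.
embeddings = {f"P{i}": rng.normal(size=(length, 4))
              for i, length in enumerate([5, 8, 6])}

# Fit: pool all residues of all proteins, then take one min/max per dimension.
corpus = np.vstack(list(embeddings.values()))        # shape (sum of L, D)
dim_min = corpus.min(axis=0)
dim_max = corpus.max(axis=0)
# Assumed guard: a constant dimension would otherwise divide by zero.
span = np.where(dim_max > dim_min, dim_max - dim_min, 1.0)

# Apply: the same per-dimension parameters are used for every entry.
dict_num = {entry: (arr - dim_min) / span for entry, arr in embeddings.items()}

pooled = np.vstack(list(dict_num.values()))
print(pooled.min(axis=0), pooled.max(axis=0))
```

Because the minimum and maximum are taken over the pooled corpus, each dimension spans the full [0, 1] range across all proteins, while a single protein need not reach either bound.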

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. It defines which entries are encoded and is used to check that each embedding array’s length matches its sequence.
embeddings (dict[str, np.ndarray]) – Mapping from entry to a raw per-residue embedding array of shape (L, D) where L is the protein length and D is the embedding dimensionality. Every entry in df_seq must be a key; all arrays must share the same D. You compute these externally with your PLM of choice — AAanalysis does not run the model.
method ({'minmax', 'quantile', 'sigmoid'}, default='minmax') – Per-dimension normalization to [0, 1]. 'minmax' linearly rescales each dim between its corpus min and max; 'quantile' does the same between robust percentiles (see clip) so outlier residues do not crush the range; 'sigmoid' z-scores each dim and applies a logistic squash.
clip (tuple of float, default=(1.0, 99.0)) – Lower / upper percentiles used only when method='quantile'.
return_df (bool, default=False) – If True, also return an echo of df_seq as a second element.
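To make the difference between the three method options concrete, here is a hedged sketch of each normalizer applied to a single embedding dimension. The function names are hypothetical, and clipping values that fall outside the percentile range to 0 or 1 is an assumption about how 'quantile' keeps its output bounded; the library's actual implementation may differ in such details.

```python
import numpy as np

def minmax(x):
    # Linear rescale between the corpus min and max.
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo)

def quantile(x, clip=(1.0, 99.0)):
    # Linear rescale between two percentiles; values beyond them are
    # clipped to 0 or 1 (assumption, see lead-in).
    lo, hi = np.percentile(x, clip)
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)

def sigmoid(x):
    # z-score, then logistic squash into (0, 1).
    z = (x - x.mean()) / x.std()
    return 1.0 / (1.0 + np.exp(-z))

# One embedding dimension pooled over the corpus, with a single extreme residue.
rng = np.random.default_rng(0)
x = np.append(rng.normal(size=999), 50.0)

for name, fn in [("minmax", minmax), ("quantile", quantile), ("sigmoid", sigmoid)]:
    y = fn(x)
    print(name, round(float(np.median(y)), 3))
```

With 'minmax', the single outlier stretches the range so that the bulk of the values is compressed toward 0, whereas 'quantile' keeps the median near the middle of the interval.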

Returns:

dict_num (dict[str, np.ndarray]) – {entry: (L, D) ndarray} with all values in [0, 1], same shape as embeddings. Stack with other per-residue tensors via aaanalysis.combine_dict_nums(), slice with NumericalFeature.get_parts(), then run CPP.run_num().
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq.
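Before passing a dict_num downstream, it can be useful to confirm the documented guarantees (same shape as the input, all values in [0, 1]). The helper below is a hypothetical sketch, not part of AAanalysis, and the toy dict_num is produced with a plain logistic function rather than with encode.

```python
import numpy as np

def check_dict_num(dict_num, embeddings):
    """Hypothetical helper: raise ValueError if a dict_num breaks its contract."""
    for entry, arr in dict_num.items():
        raw = embeddings[entry]
        if arr.shape != raw.shape:
            raise ValueError(f"{entry}: shape {arr.shape} differs from {raw.shape}")
        if not np.isfinite(arr).all():
            raise ValueError(f"{entry}: contains NaN or inf")
        if arr.min() < 0.0 or arr.max() > 1.0:
            raise ValueError(f"{entry}: values outside [0, 1]")
    return True

# Toy data standing in for real encode() input and output.
rng = np.random.default_rng(0)
embeddings = {"P1": rng.normal(size=(5, 4)), "P2": rng.normal(size=(7, 4))}
dict_num = {e: 1.0 / (1.0 + np.exp(-a)) for e, a in embeddings.items()}

print(check_dict_num(dict_num, embeddings))
```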

Notes

The normalizer is fit over the supplied corpus, so it is dataset-dependent: the same embeddings normalized against a different df_seq yield different values. For reproducible cross-dataset comparison, fit once on a fixed reference corpus.
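The dataset dependence can be seen in a small NumPy sketch. It uses hypothetical fit_minmax / apply_minmax helpers rather than the AAanalysis API, and the clipping of out-of-range values is an assumption made so that the output stays in [0, 1].

```python
import numpy as np

def fit_minmax(arrays):
    """Per-dimension min and span over all residues of all given arrays."""
    corpus = np.vstack(arrays)
    lo = corpus.min(axis=0)
    span = corpus.max(axis=0) - lo
    return lo, np.where(span > 0, span, 1.0)

def apply_minmax(arr, lo, span):
    # Values outside the fitted range are clipped so the result stays in [0, 1].
    return np.clip((arr - lo) / span, 0.0, 1.0)

rng = np.random.default_rng(0)
protein = rng.normal(size=(6, 3))                           # the entry of interest
corpus_a = [protein, rng.normal(size=(40, 3))]              # one dataset
corpus_b = [protein, rng.normal(scale=3.0, size=(40, 3))]   # a wider dataset

# Same raw values, normalized against two different corpora.
norm_a = apply_minmax(protein, *fit_minmax(corpus_a))
norm_b = apply_minmax(protein, *fit_minmax(corpus_b))
print(np.allclose(norm_a, norm_b))

# Fitting once on a fixed reference and reusing the parameters
# reproduces the same normalized values.
ref_params = fit_minmax(corpus_a)
again = apply_minmax(protein, *ref_params)
print(np.allclose(norm_a, again))
```

The first comparison differs because the two corpora have different per-dimension ranges; the second matches because the parameters come from the same reference corpus.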

See also

build_scales(): the secondary context-free amino acid (AA)-scale path (for CPP.run).
CPP.run_num(): the per-residue consumer of the returned dict_num.
aaanalysis.combine_dict_nums(): stitch several dict_nums together.

Examples

EmbeddingPreprocessor.encode turns raw per-residue protein language model (PLM) embeddings into a [0, 1]-normalized dict_num ready for CPP.run_num(). You compute the embeddings externally with your model of choice (ESM, ProtT5, …); here a small random tensor stands in for real model output. The normalizer is fit per embedding dimension over the whole corpus.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
# Replace this with real PLM output: {entry: (L, D) array}.
embeddings = {e: rng.normal(size=(len(s), 8))
              for e, s in zip(df_seq["entry"], df_seq["sequence"])}

embp = aa.EmbeddingPreprocessor()
dict_num = embp.encode(df_seq=df_seq, embeddings=embeddings, method="minmax")
arr = next(iter(dict_num.values()))
print(arr.shape, round(float(arr.min()), 3), round(float(arr.max()), 3))

(87, 8) 0.0 0.913

Every array is now in [0, 1]. Slice it with NumericalFeature.get_parts() and feed the parts to CPP.run_num(), or combine several sources first with aaanalysis.combine_dict_nums().

# Further parameters: ``clip`` sets the lower / upper percentiles used by the
# 'quantile' normalizer (it is used only when method='quantile'), and
# ``return_df=True`` also returns an echo of df_seq alongside the dict_num.
dict_num_clip, df_encoded = embp.encode(df_seq=df_seq, embeddings=embeddings,
                                        method="quantile", clip=(2.0, 98.0),
                                        return_df=True)
aa.display_df(df_encoded, n_rows=10, show_shape=True)

DataFrame shape: (20, 9)

	entry	gene	sequence	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	FXYD3	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MTDH	MAARSWQDELAQQAE...SPKQIKKKKKARRET	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	PMEPA1	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P53801	PTTG1IP	MAPGVARGPTPYWRL...GLFKEENPYARFENN	97	119	RWGVCWVNFE	ALIITMSVVGGTLLLGIAICCCC	CCRRKRSRKP
5	Q8IUW5	RELL1	MAPRALPGSAVLAAA...EVPATPVKRERSGTE	59	81	NDTGNGHPEY	IAYALVPVFFIMGLFGVLICHLL	KKKGYRCTTE
6	P01135	TGFA	MVPSAGQLALFALGI...LLKGRTACCHSETVV	99	121	AVVAASQKKQ	AITALVVVSIVALAVLIITCVLI	HCCQVRKHCE
7	O43914	TYROBP	MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK	42	64	DCSCSTVSPG	VLAGIVMGDLVLTVLIALAVYFL	GRLVPRGRGA
8	P05556	ITGB1	MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK	729	751	ENPECPTGPD	IIPIVAGVVAGIVLIGLALLLIW	KLLMIIHDRR
9	P16234	PDGFRA	MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL	527	549	VAPTLRSELT	VAAAVLVLLVIVIISLIVLVVIW	KQKPRYEIRW
10	P50895	BCAM	MEPPDAPAQARGAPR...SGGARGGSGGFGDEC	549	571	TVSPQTSQAG	VAVMAVAVSVGLLLLVVAVFYCV	RRKGGPCCRQ