EmbeddingPreprocessor

class EmbeddingPreprocessor(verbose=True)

Bases: object

Preprocessing class for protein language model (PLM) embeddings.

The primary, position-preserving path is encode(), which turns raw per-residue embeddings into the [0, 1]-normalized dict_num consumed by CPP.run_num(). A secondary, scale-based path produces df_scales and df_cat via build_scales() and build_cat() for use with CPP.run().

Added in version 1.1.0.

norm_params_

Per-dimension normalization parameters fitted by encode(); set after the first encode call so the identical transform can be reproduced.

Type: dict

Parameters: verbose (bool)

Methods

`build_cat`(df_scales[, df_stds, cat_min_th, ...])	Build a two-level pseudo-category table by clustering pseudo-scales via AAclust.
`build_scales`(df_seq, dict_num[, return_std])	Build pseudo-scales by context-free averaging of per-residue embeddings.
`encode`(df_seq, embeddings[, method, clip, ...])	Encode raw per-residue protein language model (PLM) embeddings into a `[0, 1]`-normalized `dict_num`.
`fetch_embeddings`(df_seq[, mode, model, ...])	Fetch and compute protein language model (PLM) embeddings for every entry.
`pool_embeddings`(embeddings[, pooling, df_seq])	Pool per-residue embeddings into one vector per protein.

__init__(verbose=True)

Parameters: verbose (bool, default=True) – If True, verbose outputs are enabled.

See also

StructurePreprocessor: the structure-side analog.
AnnotationPreprocessor: the annotation-side analog.
aaanalysis.combine_dict_nums(): stitch multiple dict_nums.
CPP.run_num(): the downstream consumer.

Examples

EmbeddingPreprocessor.encode() turns raw per-residue protein language model (PLM) embeddings into a [0, 1]-normalized dict_num ready for CPP.run_num(). You compute the embeddings externally with your model of choice (ESM, ProtT5, …); here, a small random array per entry stands in for real model output. The normalizer is fit per embedding dimension over the whole corpus.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
# Replace this with real PLM output: {entry: (L, D) array}.
embeddings = {e: rng.normal(size=(len(s), 8))
              for e, s in zip(df_seq["entry"], df_seq["sequence"])}

embp = aa.EmbeddingPreprocessor()
dict_num = embp.encode(df_seq=df_seq, embeddings=embeddings, method="minmax")
arr = next(iter(dict_num.values()))
print(arr.shape, round(float(arr.min()), 3), round(float(arr.max()), 3))

(87, 8) 0.0 0.913

Every array is now in [0, 1]. Slice it with NumericalFeature.get_parts() and feed the result to CPP.run_num(), or first combine several sources with aaanalysis.combine_dict_nums().
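To make "fit per embedding dimension over the whole corpus" concrete, the following is a minimal NumPy-only sketch of corpus-wide min-max scaling. It is not the library's implementation: the helpers fit_minmax and apply_minmax and the params dictionary layout are hypothetical names chosen for illustration, and the random corpus is a stand-in. Because the bounds come from all entries together, a single entry's maximum can fall below 1, which is consistent with the output shown above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in corpus: {entry: (L, D) array} with varying lengths L.
embeddings = {f"P{i}": rng.normal(size=(length, 8))
              for i, length in enumerate([50, 87, 120])}


def fit_minmax(embeddings):
    """Fit per-dimension min and max over all residues of all entries."""
    stacked = np.concatenate(list(embeddings.values()), axis=0)  # (sum of L, D)
    return {"min": stacked.min(axis=0), "max": stacked.max(axis=0)}


def apply_minmax(embeddings, params):
    """Rescale every array to [0, 1] using the fitted per-dimension bounds."""
    span = params["max"] - params["min"]
    span = np.where(span == 0, 1.0, span)  # guard against constant dimensions
    return {entry: np.clip((x - params["min"]) / span, 0.0, 1.0)
            for entry, x in embeddings.items()}


# Fit once, then reuse the same parameters for any later data
# (assumption: this mirrors the role the docs describe for norm_params_).
params = fit_minmax(embeddings)
dict_num_sketch = apply_minmax(embeddings, params)

for entry, arr in dict_num_sketch.items():
    print(entry, arr.shape, float(arr.min()), float(arr.max()))
```

Storing the fitted bounds separately from the transform is what allows the identical scaling to be applied to new sequences later.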

# Further parameters: ``clip`` sets the percentile bounds used by the 'minmax' /
# 'quantile' normalizers, and ``return_df=True`` also returns a per-entry status
# frame alongside the dict_num.
dict_num_clip, df_encoded = embp.encode(df_seq=df_seq, embeddings=embeddings,
                                        method="minmax", clip=(2.0, 98.0),
                                        return_df=True)
aa.display_df(df_encoded, n_rows=10, show_shape=True)

DataFrame shape: (20, 9)

	entry	gene	sequence	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	FXYD3	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MTDH	MAARSWQDELAQQAE...SPKQIKKKKKARRET	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	PMEPA1	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P53801	PTTG1IP	MAPGVARGPTPYWRL...GLFKEENPYARFENN	97	119	RWGVCWVNFE	ALIITMSVVGGTLLLGIAICCCC	CCRRKRSRKP
5	Q8IUW5	RELL1	MAPRALPGSAVLAAA...EVPATPVKRERSGTE	59	81	NDTGNGHPEY	IAYALVPVFFIMGLFGVLICHLL	KKKGYRCTTE
6	P01135	TGFA	MVPSAGQLALFALGI...LLKGRTACCHSETVV	99	121	AVVAASQKKQ	AITALVVVSIVALAVLIITCVLI	HCCQVRKHCE
7	O43914	TYROBP	MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK	42	64	DCSCSTVSPG	VLAGIVMGDLVLTVLIALAVYFL	GRLVPRGRGA
8	P05556	ITGB1	MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK	729	751	ENPECPTGPD	IIPIVAGVVAGIVLIGLALLLIW	KLLMIIHDRR
9	P16234	PDGFRA	MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL	527	549	VAPTLRSELT	VAAAVLVLLVIVIISLIVLVVIW	KQKPRYEIRW
10	P50895	BCAM	MEPPDAPAQARGAPR...SGGARGGSGGFGDEC	549	571	TVSPQTSQAG	VAVMAVAVSVGLLLLVVAVFYCV	RRKGGPCCRQ
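The clip parameter used above is described as setting percentile bounds for the normalizer. One plausible reading, sketched here with NumPy only, is that the per-dimension lower and upper percentiles replace the raw minimum and maximum, and values outside them saturate at 0 and 1. This is an assumption about the behavior, not the library's code; fit_percentile_bounds and scale_with_bounds are hypothetical helper names, and the random corpus is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in corpus: {entry: (L, D) array}.
corpus = {f"P{i}": rng.normal(size=(length, 8))
          for i, length in enumerate([60, 90, 110])}


def fit_percentile_bounds(embeddings, clip=(2.0, 98.0)):
    """Per-dimension lower/upper percentile bounds over the whole corpus."""
    stacked = np.concatenate(list(embeddings.values()), axis=0)  # (sum of L, D)
    lower, upper = np.percentile(stacked, clip, axis=0)  # each of shape (D,)
    return lower, upper


def scale_with_bounds(embeddings, lower, upper):
    """Min-max scale with the given bounds; outliers saturate at 0 or 1."""
    span = np.where(upper - lower == 0, 1.0, upper - lower)
    return {entry: np.clip((x - lower) / span, 0.0, 1.0)
            for entry, x in embeddings.items()}


lower, upper = fit_percentile_bounds(corpus, clip=(2.0, 98.0))
clipped = scale_with_bounds(corpus, lower, upper)

# Share of values that were saturated at either end of the range.
all_values = np.concatenate(list(clipped.values()), axis=0)
saturated = float(np.mean((all_values == 0.0) | (all_values == 1.0)))
print(f"saturated fraction: {saturated:.3f}")
```

Compared with plain min-max scaling, percentile bounds keep a few extreme residues from compressing the rest of the distribution into a narrow band, at the cost of flattening those extremes.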