aaanalysis.EmbeddingPreprocessor

class aaanalysis.EmbeddingPreprocessor(verbose=True)

Bases: object

Preprocessing class for protein language model (PLM) embeddings [Breimann25a].

Converts raw per-residue embeddings into the [0, 1]-normalized dict_num consumed by CPP.run_num(). This is the primary, position-preserving path, provided by encode(). A secondary, scale-based path builds df_scales and df_cat via build_scales() and build_cat() for use with CPP.run().

Added in version 1.1.0.

norm_params_

Per-dimension normalization parameters fitted by encode(); set after the first encode call so the identical transform can be reproduced.

Type:

dict

Parameters:

verbose (bool)

Methods

build_cat([df_scales, df_stds, cat_min_th, ...])

Build a two-level pseudo-category table by clustering pseudo-scales via AAclust.

build_scales([df_seq, dict_num, return_std])

Build pseudo-scales by context-free averaging of per-residue embeddings.

encode([df_seq, embeddings, method, clip, ...])

Encode raw per-residue PLM embeddings into a [0, 1]-normalized dict_num.
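
As a rough illustration of the context-free averaging named in the build_scales() summary, the sketch below averages embedding vectors over every occurrence of each amino acid, ignoring position and neighbors. This is one plausible reading of the summary, not the library's implementation; the helper average_by_residue and the toy inputs are hypothetical.

```python
import numpy as np

def average_by_residue(sequences, embeddings):
    """Average embedding vectors over all occurrences of each residue type.

    Hypothetical helper: an assumption about what "context-free averaging"
    means, not the code behind build_scales().
    """
    sums, counts = {}, {}
    for entry, seq in sequences.items():
        emb = np.asarray(embeddings[entry], dtype=float)
        if emb.shape[0] != len(seq):
            raise ValueError(f"{entry}: embedding length does not match sequence")
        # Pool vectors by residue letter, regardless of position or context.
        for res, vec in zip(seq, emb):
            sums[res] = sums.get(res, 0.0) + vec
            counts[res] = counts.get(res, 0) + 1
    # One D-dimensional mean vector per residue type.
    return {res: sums[res] / counts[res] for res in sorted(sums)}

# Toy data: two short sequences with 2-dimensional embeddings.
sequences = {"P1": "ACA", "P2": "CC"}
embeddings = {
    "P1": np.array([[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]]),
    "P2": np.array([[4.0, 0.0], [6.0, 4.0]]),
}
pseudo_scales = average_by_residue(sequences, embeddings)
```

Under this reading, each embedding dimension yields one value per amino acid, which is the shape a scale table needs.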

__init__(verbose=True)
Parameters:

verbose (bool, default=True) – If True, verbose outputs are enabled.

Examples

EmbeddingPreprocessor.encode() turns raw per-residue protein language model (PLM) embeddings into a [0, 1]-normalized dict_num ready for CPP.run_num(). You compute the embeddings externally with your model of choice (ESM, ProtT5, …); here a small random tensor stands in for real model output. The normalizer is fit per embedding dimension over the whole corpus.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
# Replace this with real PLM output: {entry: (L, D) array}.
embeddings = {e: rng.normal(size=(len(s), 8))
              for e, s in zip(df_seq["entry"], df_seq["sequence"])}

ep = aa.EmbeddingPreprocessor()
dict_num = ep.encode(df_seq=df_seq, embeddings=embeddings, method="minmax")
arr = next(iter(dict_num.values()))
print(arr.shape, round(float(arr.min()), 3), round(float(arr.max()), 3))
(87, 8) 0.0 0.913

Every array is now in [0, 1]. Slice it with NumericalFeature.get_parts() and feed CPP.run_num(), or combine several sources first with aaanalysis.combine_dict_nums().
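
To make "fit per embedding dimension over the whole corpus" concrete, here is a minimal NumPy sketch of corpus-wide min-max scaling. It is an illustration under stated assumptions, not the code behind encode(): the helpers fit_minmax and apply_minmax, the layout of the params dict, and the clipping behavior are all hypothetical.

```python
import numpy as np

def fit_minmax(embeddings):
    """Fit one (min, max) pair per embedding dimension over all entries."""
    # Stack every residue of every entry into one (sum of L, D) matrix.
    stacked = np.concatenate(list(embeddings.values()), axis=0)
    return {"min": stacked.min(axis=0), "max": stacked.max(axis=0)}

def apply_minmax(arr, params, clip=True):
    """Scale an (L, D) array with previously fitted per-dimension parameters."""
    span = params["max"] - params["min"]
    span = np.where(span == 0, 1.0, span)  # avoid division by zero
    scaled = (arr - params["min"]) / span
    # Assumption: clipping keeps unseen data inside [0, 1].
    return np.clip(scaled, 0.0, 1.0) if clip else scaled

rng = np.random.default_rng(0)
embeddings = {"P1": rng.normal(size=(5, 3)), "P2": rng.normal(size=(7, 3))}

params = fit_minmax(embeddings)  # fitted once, over the whole corpus
normalized = {e: apply_minmax(a, params) for e, a in embeddings.items()}

# Reusing the same parameters reproduces the identical transform on new data.
new_arr = apply_minmax(rng.normal(size=(4, 3)), params)
```

Because the minimum and maximum are taken over the corpus rather than per entry, a single array need not reach 1.0, as in the example output above, where the first entry peaks at 0.913.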