aaanalysis.EmbeddingPreprocessor.encode

EmbeddingPreprocessor.encode(df_seq=None, embeddings=None, method='minmax', clip=(1.0, 99.0), return_df=False)

Encode raw per-residue PLM embeddings into a [0, 1]-normalized dict_num.

Raw PLM embeddings (ESM, ProtT5, …) are unbounded floats, whereas CPP.run_num() expects per-residue values in [0, 1] (the same normalization convention as StructurePreprocessor and AnnotationPreprocessor). encode fits one normalizer per embedding dimension over the whole corpus (all residues of all proteins in df_seq) and applies it to every entry, returning a dict_num that feeds straight into NumericalFeature.get_parts() → CPP.run_num(). The fitted parameters are stored on the instance (self.norm_params_) so the identical transform can be reproduced.
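The corpus-wide fit described above can be pictured with plain NumPy. This is only a sketch of the idea for the 'minmax' case, not the library's implementation; the toy entries P1 and P2 and all variable names are made up for illustration.

```python
import numpy as np

# Toy corpus (hypothetical data): two "proteins" of length 3 and 2, D = 2.
embeddings = {
    "P1": np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]),
    "P2": np.array([[-4.0, 50.0], [1.0, 0.0]]),
}

# Fit: stack all residues of all proteins into one (sum of L, D) matrix
# and take the minimum and maximum of each embedding dimension.
stacked = np.vstack(list(embeddings.values()))
dim_min = stacked.min(axis=0)
dim_max = stacked.max(axis=0)
# Guard against constant dimensions to avoid division by zero.
span = np.where(dim_max > dim_min, dim_max - dim_min, 1.0)

# Apply: the same per-dimension parameters are used for every entry.
dict_num = {entry: (arr - dim_min) / span for entry, arr in embeddings.items()}

print(dim_min, dim_max)
print(dict_num["P1"])
```

Because the minimum and maximum come from the whole corpus, a single protein need not span the full range: here the second dimension of P1 tops out at 0.6, since the corpus maximum for that dimension sits in P2.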

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Defines which entries are encoded and validates that each embedding array’s length matches its sequence.

  • embeddings (dict[str, np.ndarray]) – Mapping from entry to a raw per-residue embedding array of shape (L, D) where L is the protein length and D is the embedding dimensionality. Every entry in df_seq must be a key; all arrays must share the same D. You compute these externally with your PLM of choice — AAanalysis does not run the model.

  • method ({'minmax', 'quantile', 'sigmoid'}, default='minmax') – Per-dimension normalization to [0, 1]. 'minmax' linearly rescales each dim between its corpus min and max; 'quantile' does the same between robust percentiles (see clip) so outlier residues do not crush the range; 'sigmoid' z-scores each dim and applies a logistic squash.

  • clip (tuple of float, default=(1.0, 99.0)) – Lower / upper percentiles used only when method='quantile'.

  • return_df (bool, default=False) – If True, also return an echo of df_seq as a second element.
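To make the difference between the three method options concrete, here is a minimal NumPy sketch of what each one might compute on a stacked (n_residues, D) matrix. The helper normalize_columns is hypothetical and not part of AAanalysis; in particular, clipping the 'quantile' result to [0, 1] and using the population standard deviation for 'sigmoid' are assumptions.

```python
import numpy as np

def normalize_columns(x, method="minmax", clip=(1.0, 99.0)):
    """Hypothetical helper: map each column of x (n_residues, D) into [0, 1]."""
    x = np.asarray(x, dtype=float)
    if method == "minmax":
        lo, hi = x.min(axis=0), x.max(axis=0)
    elif method == "quantile":
        # Robust bounds: lower / upper percentile of each column.
        lo, hi = np.percentile(x, clip, axis=0)
    elif method == "sigmoid":
        # z-score each column, then apply the logistic function.
        mean, std = x.mean(axis=0), x.std(axis=0)
        z = (x - mean) / np.where(std > 0, std, 1.0)
        return 1.0 / (1.0 + np.exp(-z))
    else:
        raise ValueError(f"unknown method: {method!r}")
    span = np.where(hi > lo, hi - lo, 1.0)
    # Assumption: values outside [lo, hi] are clipped to the interval ends.
    return np.clip((x - lo) / span, 0.0, 1.0)

# One dimension with 999 well-behaved residues and a single extreme outlier.
x = np.concatenate([np.linspace(0.0, 1.0, 999), [1000.0]]).reshape(-1, 1)
for method in ("minmax", "quantile", "sigmoid"):
    out = normalize_columns(x, method=method)
    print(method, float(np.median(out)))
```

With 'minmax' the single outlier defines the upper bound, so the median residue ends up very close to 0; with 'quantile' the bounds ignore the outlier and the median lands near the middle of the range.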

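Returns:
  • dict_num (dict[str, np.ndarray]) – Mapping from entry to the [0, 1]-normalized per-residue array of shape (L, D). If return_df=True, df_seq is returned as a second element.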

Notes

  • The normalizer is fit over the supplied corpus, so it is dataset-dependent: the same embeddings normalized against a different df_seq yield different values. For reproducible cross-dataset comparison, fit once on a fixed reference corpus.
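The dataset dependence can be demonstrated with a few lines of NumPy. This is an illustrative sketch with random data and hypothetical helper names (fit_minmax, apply_minmax); it does not call AAanalysis and only mimics the 'minmax' case.

```python
import numpy as np

rng = np.random.default_rng(0)
protein = rng.normal(size=(50, 4))  # the same raw embedding in both corpora
corpus_a = {"P": protein, "A": rng.normal(size=(80, 4))}
corpus_b = {"P": protein, "B": rng.normal(loc=3.0, scale=2.0, size=(80, 4))}

def fit_minmax(corpus):
    """Per-dimension min and max over all residues of all entries."""
    stacked = np.vstack(list(corpus.values()))
    return stacked.min(axis=0), stacked.max(axis=0)

def apply_minmax(arr, params):
    """Rescale arr with previously fitted (min, max) parameters."""
    lo, hi = params
    return (arr - lo) / (hi - lo)

# Same protein, normalized against two different corpora: values differ.
norm_a = apply_minmax(protein, fit_minmax(corpus_a))
norm_b = apply_minmax(protein, fit_minmax(corpus_b))
print(float(np.abs(norm_a - norm_b).max()))

# Fit once on a fixed reference corpus and reuse the parameters elsewhere.
ref_params = fit_minmax(corpus_a)
norm_b_ref = apply_minmax(corpus_b["B"], ref_params)
print(float(norm_b_ref.min()), float(norm_b_ref.max()))
```

In this plain sketch nothing clips the result, so residues that fall outside the reference range map outside [0, 1]; how encode treats that case is not stated in the text above.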

See also

Examples

EmbeddingPreprocessor.encode turns raw per-residue protein-language-model (PLM) embeddings into a [0, 1]-normalized dict_num ready for CPP.run_num(). You compute the embeddings externally with your model of choice (ESM, ProtT5, …); here a small random array per protein stands in for them. The normalizer is fit per embedding dimension over the whole corpus.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
# Replace this with real PLM output: {entry: (L, D) array}.
embeddings = {e: rng.normal(size=(len(s), 8))
              for e, s in zip(df_seq["entry"], df_seq["sequence"])}

ep = aa.EmbeddingPreprocessor()
dict_num = ep.encode(df_seq=df_seq, embeddings=embeddings, method="minmax")
arr = next(iter(dict_num.values()))
print(arr.shape, round(float(arr.min()), 3), round(float(arr.max()), 3))
(87, 8) 0.0 0.913

Every array is now in [0, 1]. Slice it with NumericalFeature.get_parts() and feed the result to CPP.run_num(), or combine several sources first with aaanalysis.combine_dict_nums().
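Before passing a dict_num downstream, it can be worth confirming the properties stated above: one row per residue, a shared dimensionality D, and values in [0, 1]. The following check is a hypothetical helper written for this example, not an AAanalysis function.

```python
import numpy as np

def check_dict_num(dict_num, sequences):
    """Hypothetical helper: validate a normalized dict_num.

    sequences maps entry -> sequence string. Returns the shared D.
    """
    dims = {arr.shape[1] for arr in dict_num.values()}
    if len(dims) != 1:
        raise ValueError(f"arrays do not share one dimensionality: {sorted(dims)}")
    for entry, seq in sequences.items():
        if entry not in dict_num:
            raise ValueError(f"missing entry: {entry}")
        arr = dict_num[entry]
        if arr.shape[0] != len(seq):
            raise ValueError(f"{entry}: {arr.shape[0]} rows for {len(seq)} residues")
        if arr.min() < 0.0 or arr.max() > 1.0:
            raise ValueError(f"{entry}: values outside [0, 1]")
    return dims.pop()

# Toy input (made-up entries) that satisfies all three conditions.
toy_sequences = {"P1": "MKV", "P2": "AG"}
toy_dict_num = {"P1": np.full((3, 2), 0.5), "P2": np.zeros((2, 2))}
print(check_dict_num(toy_dict_num, toy_sequences))
```

For the example above, the sequences mapping could be built with dict(zip(df_seq["entry"], df_seq["sequence"])).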