aaanalysis.EmbeddingPreprocessor
- class aaanalysis.EmbeddingPreprocessor(verbose=True)
Bases: object
Preprocessing class for protein language model (PLM) embeddings [Breimann25a].
Turns raw per-residue embeddings into the [0, 1]-normalized dict_num consumed by CPP.run_num() (the primary, position-preserving path via encode()), with a secondary scale-based path (df_scales / df_cat via build_scales() / build_cat()) for CPP.run().
Added in version 1.1.0.
- norm_params_
Per-dimension normalization parameters fitted by encode(); set after the first encode() call so the identical transform can be reproduced.
- Type:
- Parameters:
verbose (bool)
Methods
build_cat([df_scales, df_stds, cat_min_th, ...]) – Build a two-level pseudo-category table by clustering pseudo-scales via AAclust.
build_scales([df_seq, dict_num, return_std]) – Build pseudo-scales by context-free averaging of per-residue embeddings.
encode([df_seq, embeddings, method, clip, ...]) – Encode raw per-residue PLM embeddings into a [0, 1]-normalized dict_num.
- __init__(verbose=True)
- Parameters:
verbose (bool, default=True) – If True, verbose outputs are enabled.
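The build_scales() summary above describes pseudo-scales as a context-free average of per-residue embeddings. The idea can be sketched in plain NumPy: every occurrence of an amino acid, regardless of its neighbours, contributes its embedding vector to one running mean per residue type. This is a minimal illustration under stated assumptions, not the library's implementation; `average_by_residue` and the toy inputs are hypothetical, and the result is a plain dict rather than the df_scales table that build_scales() returns.

```python
import numpy as np

def average_by_residue(sequences, embeddings):
    """Average embedding vectors over every occurrence of each amino acid.

    Hypothetical helper for illustration only.
    sequences:  {entry: sequence string}
    embeddings: {entry: (L, D) array}, where row i belongs to sequence[i]
    """
    sums, counts = {}, {}
    for entry, seq in sequences.items():
        emb = embeddings[entry]
        for letter, vec in zip(seq, emb):
            # Position and neighbours are ignored: only the residue type matters.
            sums[letter] = sums.get(letter, 0.0) + vec
            counts[letter] = counts.get(letter, 0) + 1
    return {letter: sums[letter] / counts[letter] for letter in sorted(sums)}

# Toy input: two short sequences with 3-dimensional random "embeddings".
sequences = {"P1": "ACDA", "P2": "CCD"}
rng = np.random.default_rng(0)
embeddings = {e: rng.normal(size=(len(s), 3)) for e, s in sequences.items()}

pseudo = average_by_residue(sequences, embeddings)
for letter, vec in pseudo.items():
    print(letter, np.round(vec, 3))
```

Each residue type ends up with one D-dimensional vector, so every embedding dimension can be read as one scale over the amino acid alphabet.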
See also
StructurePreprocessor: the structure-side analog.
AnnotationPreprocessor: the annotation-side analog.
aaanalysis.combine_dict_nums(): stitch multiple dict_nums.
CPP.run_num(): the downstream consumer.
Examples
EmbeddingPreprocessor.encode() turns raw per-residue protein-language-model (PLM) embeddings into a [0, 1]-normalized dict_num ready for CPP.run_num(). You compute the embeddings externally with your model of choice (ESM, ProtT5, …); here a small random tensor stands in for them. The normalizer is fit per embedding dimension over the whole corpus.

    import numpy as np
    import aaanalysis as aa

    aa.options["verbose"] = False
    df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
    rng = np.random.default_rng(0)

    # Replace this with real PLM output: {entry: (L, D) array}.
    embeddings = {
        e: rng.normal(size=(len(s), 8))
        for e, s in zip(df_seq["entry"], df_seq["sequence"])
    }

    ep = aa.EmbeddingPreprocessor()
    dict_num = ep.encode(df_seq=df_seq, embeddings=embeddings, method="minmax")

    # Inspect the first encoded array: shape, minimum, and maximum.
    arr = next(iter(dict_num.values()))
    print(arr.shape, round(float(arr.min()), 3), round(float(arr.max()), 3))

    (87, 8) 0.0 0.913
Every array is now in [0, 1]. Slice it with NumericalFeature.get_parts() and feed the result to CPP.run_num(), or combine several sources first with aaanalysis.combine_dict_nums().
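To make the "fit per embedding dimension over the whole corpus" step concrete, the following is a minimal NumPy sketch of corpus-wide min-max scaling with stored parameters that can be reapplied later. It assumes plain min-max scaling and is not the library's implementation; `fit_minmax`, `apply_minmax`, and the layout of the parameter dict are hypothetical names chosen for illustration and say nothing about how norm_params_ is actually structured.

```python
import numpy as np

def fit_minmax(embeddings):
    """Fit one (min, max) pair per embedding dimension over all residues.

    Hypothetical helper for illustration only.
    embeddings: {entry: (L, D) array}
    """
    stacked = np.concatenate(list(embeddings.values()), axis=0)  # (sum of L, D)
    return {"min": stacked.min(axis=0), "max": stacked.max(axis=0)}

def apply_minmax(arr, params, clip=True):
    """Scale an (L, D) array with previously fitted parameters."""
    span = params["max"] - params["min"]
    span = np.where(span == 0, 1.0, span)  # guard against constant dimensions
    scaled = (arr - params["min"]) / span
    # Unseen data can fall outside the fitted range, so clip it back to [0, 1].
    return np.clip(scaled, 0.0, 1.0) if clip else scaled

rng = np.random.default_rng(0)
# Toy corpus: five sequences of varying length with 8-dimensional embeddings.
train = {f"P{i}": rng.normal(size=(int(rng.integers(20, 40)), 8)) for i in range(5)}

params = fit_minmax(train)
train_norm = {k: apply_minmax(v, params) for k, v in train.items()}

# Reusing the same parameters reproduces the identical transform on new data.
new = rng.normal(size=(30, 8))
new_norm = apply_minmax(new, params)
print(new_norm.shape, float(new_norm.min()) >= 0.0, float(new_norm.max()) <= 1.0)
```

Because the minimum and maximum are taken over all sequences together, an individual array need not reach 1.0 in any dimension, which is consistent with the 0.913 maximum printed in the example above.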