aaanalysis.EmbeddingPreprocessor.encode
- EmbeddingPreprocessor.encode(df_seq=None, embeddings=None, method='minmax', clip=(1.0, 99.0), return_df=False)
Encode raw per-residue PLM embeddings into a [0, 1]-normalized dict_num.

Raw PLM embeddings (ESM, ProtT5, …) are unbounded floats, whereas CPP.run_num() expects per-residue values in [0, 1] (the same normalization convention as StructurePreprocessor and AnnotationPreprocessor). encode fits one normalizer per embedding dimension over the whole corpus (all residues of all proteins in df_seq) and applies it to every entry, returning a dict_num that feeds straight into NumericalFeature.get_parts() → CPP.run_num(). The fitted parameters are stored on the instance (self.norm_params_) so the identical transform can be reproduced.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. It defines which entries are encoded and is used to validate that each embedding array's length matches its sequence.
embeddings (dict[str, np.ndarray]) – Mapping from entry to a raw per-residue embedding array of shape (L, D), where L is the protein length and D is the embedding dimensionality. Every entry in df_seq must be a key; all arrays must share the same D. You compute these externally with your PLM of choice; AAanalysis does not run the model.
method ({'minmax', 'quantile', 'sigmoid'}, default='minmax') – Per-dimension normalization to [0, 1]. 'minmax' linearly rescales each dimension between its corpus min and max; 'quantile' does the same between robust percentiles (see clip) so outlier residues do not crush the range; 'sigmoid' z-scores each dimension and applies a logistic squash.
clip (tuple of float, default=(1.0, 99.0)) – Lower and upper percentiles, used only when method='quantile'.
return_df (bool, default=False) – If True, also return an echo of df_seq as a second element.
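The three method options can be illustrated with plain NumPy. The following is a minimal sketch of per-dimension normalization over a stacked corpus, not the AAanalysis implementation: the helper names fit_normalizer and apply_normalizer are hypothetical, and clipping out-of-range values to [0, 1] for 'quantile' is an assumption about how values beyond the percentiles are handled.

```python
import numpy as np


def fit_normalizer(embeddings, method="minmax", clip=(1.0, 99.0)):
    """Fit per-dimension parameters over all residues of all proteins (hypothetical helper)."""
    # Stack every (L, D) array into one (sum of L, D) corpus matrix.
    stacked = np.concatenate(list(embeddings.values()), axis=0)
    if method == "minmax":
        return {"method": method, "lo": stacked.min(axis=0), "hi": stacked.max(axis=0)}
    if method == "quantile":
        # Robust bounds: one lower and one upper percentile per dimension.
        lo, hi = np.percentile(stacked, clip, axis=0)
        return {"method": method, "lo": lo, "hi": hi}
    if method == "sigmoid":
        return {"method": method, "mean": stacked.mean(axis=0), "std": stacked.std(axis=0)}
    raise ValueError(f"unknown method: {method!r}")


def apply_normalizer(arr, params):
    """Apply fitted parameters to one (L, D) array (hypothetical helper)."""
    if params["method"] == "sigmoid":
        # z-score each dimension, then squash with the logistic function.
        std = np.where(params["std"] > 0, params["std"], 1.0)
        z = (arr - params["mean"]) / std
        return 1.0 / (1.0 + np.exp(-z))
    # Linear rescale between the fitted bounds; guard against constant dimensions.
    span = params["hi"] - params["lo"]
    span = np.where(span > 0, span, 1.0)
    # Assumption: values outside the fitted bounds are clipped into [0, 1].
    return np.clip((arr - params["lo"]) / span, 0.0, 1.0)


# Toy corpus standing in for real PLM output: {entry: (L, D) array}.
rng = np.random.default_rng(0)
toy = {"P1": rng.normal(size=(5, 3)), "P2": rng.normal(size=(7, 3))}

results = {}
for method in ("minmax", "quantile", "sigmoid"):
    params = fit_normalizer(toy, method=method)
    results[method] = {entry: apply_normalizer(arr, params) for entry, arr in toy.items()}
```

Fitting on the stacked corpus rather than per protein is what makes values comparable across entries: the same raw value in a given dimension maps to the same normalized value in every protein.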
- Returns:
dict_num (dict[str, np.ndarray]) – {entry: (L, D) ndarray} with all values in [0, 1], same shape as embeddings. Stack with other per-residue tensors via aaanalysis.combine_dict_nums(), slice with NumericalFeature.get_parts(), then run CPP.run_num().
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq.
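The return contract described above (same keys and shapes as the input, all values in [0, 1]) is easy to verify before passing the result downstream. The check below is an illustrative sketch; check_dict_num is a hypothetical helper, not part of AAanalysis, and the toy arrays stand in for real encode output.

```python
import numpy as np


def check_dict_num(dict_num, embeddings):
    """Raise ValueError if dict_num breaks the documented contract (hypothetical helper)."""
    for entry, raw in embeddings.items():
        if entry not in dict_num:
            raise ValueError(f"missing entry: {entry}")
        out = dict_num[entry]
        if out.shape != raw.shape:
            raise ValueError(f"{entry}: shape {out.shape} != {raw.shape}")
        if out.min() < 0.0 or out.max() > 1.0:
            raise ValueError(f"{entry}: values outside [0, 1]")
    return True


# Toy stand-ins for raw embeddings and a normalized result.
raw = {"P1": np.array([[-2.0, 4.0], [0.0, 8.0]])}
good = {"P1": np.array([[0.0, 0.0], [1.0, 1.0]])}
bad = {"P1": np.array([[0.0, 0.0], [1.5, 1.0]])}

ok = check_dict_num(good, raw)
try:
    check_dict_num(bad, raw)
    bad_rejected = False
except ValueError:
    bad_rejected = True
```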
Notes
The normalizer is fit over the supplied corpus, so it is dataset-dependent: the same embeddings normalized against a different df_seq yield different values. For reproducible cross-dataset comparison, fit once on a fixed reference corpus.
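The dataset dependence can be seen with a small NumPy sketch. It uses hypothetical helpers (fit_minmax, apply_minmax) rather than the AAanalysis API, and it assumes that values falling outside the reference range are clipped to [0, 1] when the reference parameters are reused on new data.

```python
import numpy as np


def fit_minmax(corpus):
    """Per-dimension min and max over all residues of a corpus (hypothetical helper)."""
    stacked = np.concatenate(list(corpus.values()), axis=0)
    return stacked.min(axis=0), stacked.max(axis=0)


def apply_minmax(arr, lo, hi):
    """Rescale with fixed bounds; clipping out-of-range values is an assumption."""
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.clip((arr - lo) / span, 0.0, 1.0)


rng = np.random.default_rng(1)
protein = rng.normal(size=(6, 4))

# The same protein embedded in two different corpora.
corpus_a = {"P1": protein, "P2": rng.normal(size=(8, 4))}
corpus_b = {"P1": protein, "P3": rng.normal(loc=10.0, scale=3.0, size=(8, 4))}

norm_a = apply_minmax(protein, *fit_minmax(corpus_a))
norm_b = apply_minmax(protein, *fit_minmax(corpus_b))
differs = not np.allclose(norm_a, norm_b)  # same input, different normalized values

# Fit once on a fixed reference corpus, then reuse the parameters for new data.
ref_lo, ref_hi = fit_minmax(corpus_a)
new_protein = rng.normal(size=(5, 4))
norm_new = apply_minmax(new_protein, ref_lo, ref_hi)
```

Storing the reference bounds and reusing them keeps values comparable across datasets, at the cost of clipping residues that fall outside the range seen in the reference corpus.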
See also
build_scales(): the secondary context-free AA-scale path (for CPP.run).
CPP.run_num(): the per-residue consumer of the returned dict_num.
aaanalysis.combine_dict_nums(): stitch several dict_nums together.
Examples
EmbeddingPreprocessor.encode turns raw per-residue protein-language-model (PLM) embeddings into a [0, 1]-normalized dict_num ready for CPP.run_num(). You compute the embeddings externally with your model of choice (ESM, ProtT5, …); here a small random tensor stands in for them. The normalizer is fit per embedding dimension over the whole corpus.

```python
import numpy as np
import aaanalysis as aa

aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
# Replace this with real PLM output: {entry: (L, D) array}.
embeddings = {e: rng.normal(size=(len(s), 8))
              for e, s in zip(df_seq["entry"], df_seq["sequence"])}
ep = aa.EmbeddingPreprocessor()
dict_num = ep.encode(df_seq=df_seq, embeddings=embeddings, method="minmax")
# Inspect the first encoded array: shape, minimum, and maximum.
arr = next(iter(dict_num.values()))
print(arr.shape, round(float(arr.min()), 3), round(float(arr.max()), 3))
```
(87, 8) 0.0 0.913
All values in every array now lie in [0, 1]. Slice the arrays with NumericalFeature.get_parts() and feed the result to CPP.run_num(), or combine several sources first with aaanalysis.combine_dict_nums().