EmbeddingPreprocessor.fetch_embeddings

EmbeddingPreprocessor.fetch_embeddings(df_seq, mode='protein', model='esm2_t12_35M', pooling='mean', source='auto', batch_size=8, device='auto', max_length=None, layer=-1, allow_oversized=False, on_failure='nan', return_df=False)

Fetch and compute protein language model (PLM) embeddings for every entry.

Downloads a curated model (ESM-2, ESM-1b, ProtT5, ProstT5) from the Hugging Face Hub and computes its embeddings, returning either one vector per protein (mode='protein') or a per-residue array per protein (mode='residue'). The per-residue output is the raw, unbounded {entry: (L, D)} mapping that encode() normalizes into the dict_num consumed by CPP.run_num(); the per-protein output is a redundancy-free feature matrix ready for AAclust.select_proteins() or TreeModel. Embeddings are returned raw — normalization is encode()’s job. Requires the embed extra (pip install 'aaanalysis[embed]'); the heavy dependencies are imported lazily, so the rest of the class works without them.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Output rows are aligned to df_seq.
mode ({'protein', 'residue'}, default='protein') – 'protein' returns one pooled vector per protein; 'residue' returns the per-residue (L, D) array per protein (feeds encode()).
model (str, default='esm2_t12_35M') – Registry key of the PLM to use — one of 'esm2_t6_8M', 'esm2_t12_35M', 'esm2_t30_150M', 'esm2_t33_650M', 'esm2_t36_3B', 'esm1b', 'prott5_xl_u50', 'prostt5'. See the Notes table for each model’s size, embedding dimension, and whether it runs on a typical laptop CPU. An unknown key raises a ValueError listing the valid options.
pooling ({'mean', 'max', 'cls'}, default='mean') – Residue→protein reduction for mode='protein'. 'cls' uses the model’s leading token and is only valid for models that have one (ESM, not ProtT5).
source ({'auto', 'compute'}, default='auto') – Acquisition path. Both currently compute locally; 'uniprot' (direct fetch of precomputed embeddings) is reserved for a future release.
batch_size (int, default=8) – Number of sequences embedded per forward pass.
device ({'auto', 'cpu', 'cuda', 'mps'}, default='auto') – Compute device; 'auto' picks CUDA, then Apple MPS, else CPU.
max_length (int, optional) – Truncate sequences to this many residues. Defaults to the model’s own cap (e.g. 1022 for ESM-1b); longer sequences are truncated with a warning.
layer (int, default=-1) – Hidden layer to read out; -1 is the last layer.
allow_oversized (bool, default=False) – If False, a RuntimeWarning (with a smaller-model suggestion) is emitted when the estimated memory footprint of the model exceeds the detected device memory; the model still runs. True suppresses the warning.
on_failure ({'nan', 'drop', 'raise'}, default='nan') – Policy for entries that fail to embed: 'nan' keeps a NaN row/array and marks it not-ok; 'drop' removes it; 'raise' raises RuntimeError.
return_df (bool, default=False) – If True, also return an echo of df_seq with a boolean embeddings_ok column.
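
The three pooling options can be pictured with plain NumPy. The sketch below is illustrative only: it assumes a hypothetical token-level array whose first row is the leading (CLS-like) token and whose remaining rows are residues, and the helper name pool_tokens is not part of aaanalysis. The library may implement the reduction differently.

```python
import numpy as np

def pool_tokens(tokens, pooling="mean"):
    """Reduce a (1 + L, D) token array to a single (D,) vector.

    Assumption: row 0 is the model's leading (CLS-like) token and
    rows 1..L are the residues. This layout is illustrative only.
    """
    residues = tokens[1:]
    if pooling == "mean":
        return residues.mean(axis=0)
    if pooling == "max":
        return residues.max(axis=0)
    if pooling == "cls":
        return tokens[0]
    raise ValueError(f"pooling must be 'mean', 'max', or 'cls', got {pooling!r}")

# Stand-in data: one leading token followed by 3 residues, D = 2.
tokens = np.array([[9.0, 9.0],
                   [1.0, 4.0],
                   [2.0, 5.0],
                   [3.0, 0.0]])
pooled = {p: pool_tokens(tokens, p) for p in ("mean", "max", "cls")}
```

Mean and max reduce over the residue rows only, while 'cls' simply reads out the leading token, which is why it is unavailable for models without one.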

Returns:

embeddings (np.ndarray or dict) – np.ndarray of shape (n_samples, D) row-aligned to df_seq (mode='protein'), or {entry: (L, D)} of raw per-residue arrays (mode='residue').
df_seq_out (pd.DataFrame) – Returned only when return_df=True: an echo of df_seq plus a boolean embeddings_ok column.
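
With on_failure='nan' and return_df=True, failed entries show up as NaN rows in the matrix and as False in the embeddings_ok column. A minimal sketch of filtering them out before a downstream model follows. It uses hand-made stand-in arrays rather than calling fetch_embeddings, and the plain boolean array ok stands in for the embeddings_ok column.

```python
import numpy as np

# Stand-ins for the two return values (not produced by the library here):
# X mimics an (n_samples, D) matrix in which the second entry failed to embed.
X = np.array([[0.1, 0.2, 0.3],
              [np.nan, np.nan, np.nan],
              [0.4, 0.5, 0.6]])
entries = ["P1", "P2", "P3"]
# ok mimics df_seq_out["embeddings_ok"].to_numpy()
ok = np.array([True, False, True])

# Sanity check: the mask should agree with the NaN rows.
assert np.array_equal(ok, ~np.isnan(X).any(axis=1))

# Keep only successfully embedded proteins, preserving row alignment.
X_clean = X[ok]
entries_clean = [e for e, keep in zip(entries, ok) if keep]
```

Filtering the matrix and the identifiers with the same mask keeps rows and entries aligned, which matters for any later join against labels.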

Notes

Available models. Footprints are inference floors (real peak grows with batch_size × sequence length); Local marks models that run comfortably on a typical 16 GB laptop CPU. ESM-2 spans accuracy/speed trade-offs; ProtT5/ProstT5 are stronger but heavier; ProstT5 is structure-aware (trained on 3Di tokens).

model	params	dim	~RAM (CPU)	Local?	best for
`esm2_t6_8M`	8 M	320	~0.3 GB	yes	laptops, large corpora, quick tests
`esm2_t12_35M`	35 M	480	~0.5 GB	yes	default; best size/quality on CPU
`esm2_t30_150M`	150 M	640	~1.5 GB	yes	richer residue features, still CPU-fine
`esm2_t33_650M`	650 M	1280	~3 GB	slow	strong; comfortable with a GPU
`esm2_t36_3B`	3 B	2560	~12 GB	GPU	highest quality; needs a ≥12 GB GPU
`esm1b`	650 M	1280	~3 GB	slow	ESM-1b parity; 1022-residue cap
`prott5_xl_u50`	1.2 B	1024	~5 GB	GPU	ProtT5; matches UniProt’s embeddings
`prostt5`	1.2 B	1024	~5 GB	GPU	structure-aware (3Di) embeddings

The larger ESM-2 models and both T5 models are slow on CPU and may exhaust memory; fetch_embeddings emits a RuntimeWarning suggesting a smaller model when the estimated footprint exceeds the detected device memory (override with allow_oversized=True, lower batch_size, or select a GPU via device).
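
One way to act on the table before calling fetch_embeddings is to pick the largest ESM-2 model that fits a memory budget. The sketch below copies the approximate RAM floors from the table; the helper name largest_model_within and the headroom factor are hypothetical and are not how the library estimates footprints.

```python
# Approximate CPU RAM floors (GB) copied from the table above.
MODEL_RAM_GB = {
    "esm2_t6_8M": 0.3,
    "esm2_t12_35M": 0.5,
    "esm2_t30_150M": 1.5,
    "esm2_t33_650M": 3.0,
    "esm2_t36_3B": 12.0,
}

def largest_model_within(budget_gb, headroom=2.0):
    """Return the largest ESM-2 key whose floor times headroom fits the budget.

    headroom is a hypothetical safety factor, since real peak memory
    grows with batch_size and sequence length.
    """
    fitting = [(ram, name) for name, ram in MODEL_RAM_GB.items()
               if ram * headroom <= budget_gb]
    if not fitting:
        raise ValueError(f"no model fits within {budget_gb} GB")
    # Tuples sort by RAM first, so max() picks the largest fitting model.
    return max(fitting)[1]

choice_laptop = largest_model_within(4.0)
choice_gpu = largest_model_within(24.0)
```

The returned key can be passed as the model argument; lowering batch_size remains the simpler lever when a model only just exceeds the budget.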

Compute locally vs. fetch precomputed (UniProt). fetch_embeddings computes embeddings on your machine, which works for any sequence — mutants, designs, or non-model organisms not in any database. UniProt separately publishes precomputed ProtT5 per-protein embeddings for UniProtKB/Swiss-Prot and selected reference proteomes; when your proteins are covered and ProtT5 is acceptable, downloading those (currently a bulk per-proteome file indexed by accession) avoids local compute entirely. Prefer the precomputed route for large, fully-covered Swiss-Prot sets on a CPU-only machine; compute here when proteins are novel/mutated, when you need a non-ProtT5 model (e.g. ESM-2/ProstT5), or when you want per-residue output for encode() → CPP.run_num(). A source='uniprot' path for the precomputed route is reserved for a future release.
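
If you take the precomputed route, the downloaded vectors still have to be arranged in the row order of df_seq. The sketch below assumes the vectors are already loaded into a dictionary keyed by accession (loading the file is not shown); align_to_entries is a hypothetical helper, and missing accessions become NaN rows to mirror the on_failure='nan' convention.

```python
import numpy as np

def align_to_entries(entries, precomputed, dim):
    """Stack precomputed vectors in the order of `entries`.

    Entries missing from `precomputed` become NaN rows; the returned
    boolean mask marks which rows were found.
    """
    X = np.full((len(entries), dim), np.nan)
    found = np.zeros(len(entries), dtype=bool)
    for i, entry in enumerate(entries):
        vec = precomputed.get(entry)
        if vec is not None:
            X[i] = vec
            found[i] = True
    return X, found

# Toy stand-ins: 4-dimensional vectors instead of ProtT5's 1024.
precomputed = {"P05067": np.ones(4), "Q14802": np.zeros(4)}
entries = ["Q14802", "NOVEL_1", "P05067"]  # e.g. df_seq["entry"].tolist()
X, found = align_to_entries(entries, precomputed, dim=4)
```

Entries flagged False in found are the ones that would need local computation, for example mutants or designs absent from the precomputed set.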

Embedding extraction is deterministic (eval mode), so no random_state / seed is needed.
Returned embeddings are raw (unbounded) floats; pass mode='residue' output to encode() before CPP.run_num().
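
To see why raw per-residue arrays need scaling, here is a sketch of one possible [0, 1] normalization: min-max scaling per embedding dimension over all residues of all proteins. This scheme is an assumption made for illustration; encode() may normalize differently, and minmax_scale_residues is a hypothetical name.

```python
import numpy as np

def minmax_scale_residues(emb):
    """Scale every (L, D) array to [0, 1] per embedding dimension.

    Minima and maxima are taken over all residues of all proteins.
    This is an assumed scheme for illustration, not encode() itself.
    """
    stacked = np.vstack(list(emb.values()))
    lo = stacked.min(axis=0)
    span = stacked.max(axis=0) - lo
    span[span == 0] = 1.0  # constant dimensions map to 0 instead of dividing by zero
    return {entry: (arr - lo) / span for entry, arr in emb.items()}

# Stand-in per-residue arrays with D = 2; the second dimension is constant.
emb = {
    "A": np.array([[-2.0, 10.0], [0.0, 10.0]]),
    "B": np.array([[2.0, 10.0], [1.0, 10.0], [-1.0, 10.0]]),
}
scaled = minmax_scale_residues(emb)
```

Each array keeps its (L, D) shape, so the scaled mapping has the same structure as the raw one.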

See also

EmbeddingPreprocessor.pool_embeddings(): pool per-residue arrays into per-protein vectors explicitly.
EmbeddingPreprocessor.encode(): normalize per-residue embeddings to [0, 1].
AAclust.select_proteins(): cluster per-protein embeddings into representatives.

Raises:

ValueError – On invalid arguments (unknown model, 'cls' pooling on a model without a CLS token, a pre-existing embeddings_ok column, …).
ImportError – If the embed extra (torch / transformers) is not installed.
RuntimeError – On an embedding failure under on_failure='raise'.

Examples

EmbeddingPreprocessor.fetch_embeddings downloads a protein language model (ESM-2, ESM-1b, ProtT5, ProstT5) from the Hugging Face Hub and computes its embeddings: one mean/max/cls-pooled vector per protein (mode='protein') or a per-residue (L, D) array per protein (mode='residue'). It requires the embed extra (pip install 'aaanalysis[embed]'); the heavy dependencies are imported lazily, so the rest of the class works without them.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

# A small dataset; the smallest model keeps this fast (weights downloaded once).
df_seq = aa.load_dataset(name="DOM_GSEC", n=2)

ep = aa.EmbeddingPreprocessor()
# Per-protein embeddings: one mean-pooled vector per protein.
X, df_out = ep.fetch_embeddings(df_seq, mode="protein", model="esm2_t6_8M",
                                pooling="mean", return_df=True)
print("per-protein matrix:", X.shape)
aa.display_df(df_out, n_rows=10, show_shape=True)

per-protein matrix: (4, 320)
DataFrame shape: (4, 9)

	entry	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c	embeddings_ok
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS	True
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	0	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR	True
3	P05067	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	701	723	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH	True
4	P14925	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	868	890	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD	True

Per-residue embeddings are raw, unbounded floats; normalize them with EmbeddingPreprocessor.encode() before CPP.run_num(). The per-protein matrix above feeds AAclust.select_proteins() or a classifier directly.

# Per-residue embeddings: one (L, D) array per protein.
emb = ep.fetch_embeddings(df_seq, mode="residue", model="esm2_t6_8M")
for entry, arr in emb.items():
    print(entry, "->", arr.shape)

# Normalize to [0, 1] for CPP.run_num.
dict_num = ep.encode(df_seq=df_seq, embeddings=emb)
print("encoded entries:", len(dict_num))

Q14802 -> (87, 320)
Q86UE4 -> (582, 320)
P05067 -> (770, 320)
P14925 -> (976, 320)
encoded entries: 4

Key parameters: model selects the PLM (esm2_t6_8M … esm2_t36_3B, esm1b, prott5_xl_u50, prostt5); pooling is 'mean' / 'max' / 'cls' ('cls' only for models with a leading token, i.e. ESM not ProtT5); device is 'auto' / 'cpu' / 'cuda' / 'mps'; batch_size and max_length control throughput and truncation; on_failure ('nan' / 'drop' / 'raise') handles per-entry failures; and allow_oversized bypasses the hardware-size warning. The EmbeddingPreprocessor.pool_embeddings() helper pools per-residue arrays explicitly.

# Max-pool the per-residue arrays into a per-protein matrix.
X_max = ep.pool_embeddings(emb, pooling="max", df_seq=df_seq)
print("max-pooled matrix:", X_max.shape)

max-pooled matrix: (4, 320)