EmbeddingPreprocessor.fetch_embeddings
- EmbeddingPreprocessor.fetch_embeddings(df_seq=None, mode='protein', model='esm2_t12_35M', pooling='mean', source='auto', batch_size=8, device='auto', max_length=None, layer=-1, allow_oversized=False, on_failure='nan', return_df=False)
Fetch and compute protein language model (PLM) embeddings for every entry.
Downloads a curated model (ESM-2, ESM-1b, ProtT5, ProstT5) from the Hugging Face Hub and computes its embeddings, returning either one vector per protein (mode='protein') or a per-residue array per protein (mode='residue'). The per-residue output is the raw, unbounded {entry: (L, D)} mapping that encode() normalizes into the dict_num consumed by CPP.run_num(); the per-protein output is a redundancy-free feature matrix ready for AAclust.select_proteins() or TreeModel. Embeddings are returned raw; normalization is encode()’s job. Requires the embed extra (pip install 'aaanalysis[embed]'); the heavy dependencies are imported lazily, so the rest of the class works without them.

Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Output rows are aligned to df_seq.
mode ({'protein', 'residue'}, default='protein') – 'protein' returns one pooled vector per protein; 'residue' returns the per-residue (L, D) array per protein (feeds encode()).
model (str, default='esm2_t12_35M') – Registry key of the PLM to use, one of 'esm2_t6_8M', 'esm2_t12_35M', 'esm2_t30_150M', 'esm2_t33_650M', 'esm2_t36_3B', 'esm1b', 'prott5_xl_u50', 'prostt5'. See the Notes table for each model’s size, embedding dimension, and whether it runs on a typical laptop CPU. An unknown key raises a ValueError listing the valid options.
pooling ({'mean', 'max', 'cls'}, default='mean') – Residue→protein reduction for mode='protein'. 'cls' uses the model’s leading token and is only valid for models that have one (ESM, not ProtT5).
source ({'auto', 'compute'}, default='auto') – Acquisition path. Both currently compute locally; 'uniprot' (direct fetch of precomputed embeddings) is reserved for a future release.
batch_size (int, default=8) – Number of sequences embedded per forward pass.
device ({'auto', 'cpu', 'cuda', 'mps'}, default='auto') – Compute device; 'auto' picks CUDA, then Apple MPS, else CPU.
max_length (int, optional) – Truncate sequences to this many residues. Defaults to the model’s own cap (e.g. 1022 for ESM-1b); longer sequences are truncated with a warning.
layer (int, default=-1) – Hidden layer to read out; -1 is the last layer.
allow_oversized (bool, default=False) – If False, a model whose estimated memory footprint exceeds the detected device memory emits a RuntimeWarning (with a smaller-model suggestion) but still runs. True suppresses the guard.
on_failure ({'nan', 'drop', 'raise'}, default='nan') – Policy for entries that fail to embed: 'nan' keeps a NaN row/array and marks it not-ok; 'drop' removes it; 'raise' raises RuntimeError.
return_df (bool, default=False) – If True, also return an echo of df_seq with a boolean embeddings_ok column.
- Returns:
embeddings (np.ndarray or dict) – np.ndarray of shape (n_samples, D) row-aligned to df_seq (mode='protein'), or {entry: (L, D)} of raw per-residue arrays (mode='residue').
df_seq_out (pd.DataFrame) – Returned only when return_df=True: an echo of df_seq plus a boolean embeddings_ok column.
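The three on_failure policies and the embeddings_ok flag can be illustrated with a minimal sketch. It assumes that a failed entry is represented as None before the policy is applied; the helper apply_on_failure is hypothetical and is not part of aaanalysis, whose internal handling may differ.

```python
import numpy as np


def apply_on_failure(entries, vectors, dim, on_failure="nan"):
    """Hypothetical helper: apply a failure policy to per-protein vectors.

    vectors holds one 1-D array per entry, or None where embedding failed
    (an assumption made for this sketch).
    Returns (entries_out, X, ok) with X of shape (n_kept, dim).
    """
    if on_failure not in ("nan", "drop", "raise"):
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    ok = [vec is not None for vec in vectors]
    failed = [entry for entry, good in zip(entries, ok) if not good]
    if failed and on_failure == "raise":
        raise RuntimeError(f"Embedding failed for entries: {failed}")
    if on_failure == "drop":
        # Remove failed entries entirely; everything that remains is ok.
        entries = [entry for entry, good in zip(entries, ok) if good]
        vectors = [vec for vec in vectors if vec is not None]
        ok = [True] * len(entries)
    # Under 'nan', failed entries keep their row position as an all-NaN row.
    rows = [np.full(dim, np.nan) if vec is None else np.asarray(vec, dtype=float)
            for vec in vectors]
    X = np.vstack(rows) if rows else np.empty((0, dim))
    return list(entries), X, ok


entries = ["P1", "P2", "P3"]
vectors = [np.ones(4), None, np.zeros(4)]  # P2 failed to embed
for policy in ("nan", "drop"):
    kept, X, ok = apply_on_failure(entries, vectors, dim=4, on_failure=policy)
    print(policy, kept, X.shape, ok)
```

With 'nan' the matrix stays row-aligned to the input and the ok flags mark which rows are usable; with 'drop' the alignment to the original table is lost, so the returned entry list is needed to map rows back.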
Notes
Available models. Footprints are inference floors (real peak grows with batch_size × sequence length); Local marks models that run comfortably on a typical 16 GB laptop CPU. ESM-2 spans accuracy/speed trade-offs; ProtT5/ProstT5 are stronger but heavier; ProstT5 is structure-aware (trained on 3Di tokens).

==============  ======  ====  ==========  ======  =======================================
model           params  dim   ~RAM (CPU)  Local?  best for
==============  ======  ====  ==========  ======  =======================================
esm2_t6_8M      8 M     320   ~0.3 GB     yes     laptops, large corpora, quick tests
esm2_t12_35M    35 M    480   ~0.5 GB     yes     default; best size/quality on CPU
esm2_t30_150M   150 M   640   ~1.5 GB     yes     richer residue features, still CPU-fine
esm2_t33_650M   650 M   1280  ~3 GB       slow    strong; comfortable with a GPU
esm2_t36_3B     3 B     2560  ~12 GB      GPU     highest quality; needs a ≥12 GB GPU
esm1b           650 M   1280  ~3 GB       slow    ESM-1b parity; 1022-residue cap
prott5_xl_u50   1.2 B   1024  ~5 GB       GPU     ProtT5; matches UniProt’s embeddings
prostt5         1.2 B   1024  ~5 GB       GPU     structure-aware (3Di) embeddings
==============  ======  ====  ==========  ======  =======================================
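As a rough sanity check on the footprint column, the memory needed just to hold the weights can be computed from the nominal parameter counts. The sketch below assumes 32-bit floats (4 bytes per parameter) and 1 GB = 1e9 bytes; both are assumptions of this illustration, and how fetch_embeddings actually estimates the footprint is not stated here.

```python
# Nominal parameter counts and footprints copied from the table above.
NOMINAL_PARAMS = {
    "esm2_t6_8M": 8_000_000,
    "esm2_t12_35M": 35_000_000,
    "esm2_t30_150M": 150_000_000,
    "esm2_t33_650M": 650_000_000,
    "esm2_t36_3B": 3_000_000_000,
    "esm1b": 650_000_000,
    "prott5_xl_u50": 1_200_000_000,
    "prostt5": 1_200_000_000,
}
TABLE_RAM_GB = {
    "esm2_t6_8M": 0.3,
    "esm2_t12_35M": 0.5,
    "esm2_t30_150M": 1.5,
    "esm2_t33_650M": 3.0,
    "esm2_t36_3B": 12.0,
    "esm1b": 3.0,
    "prott5_xl_u50": 5.0,
    "prostt5": 5.0,
}


def weights_only_gb(n_params, bytes_per_param=4):
    """Memory for the weights alone, assuming fp32 storage (an assumption)."""
    return n_params * bytes_per_param / 1e9


for name, n_params in NOMINAL_PARAMS.items():
    print(f"{name:<14} weights ~{weights_only_gb(n_params):5.2f} GB"
          f"   table ~{TABLE_RAM_GB[name]} GB")
```

Under these assumptions every table value is at or above the weights-only figure, which is consistent with reading the column as a floor that also covers some runtime overhead.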
The larger ESM-2 models and both T5 models are slow on CPU and may exhaust memory; fetch_embeddings emits a RuntimeWarning suggesting a smaller model when the estimated footprint exceeds the detected device memory (override with allow_oversized=True, lower batch_size, or select a GPU via device).

Compute locally vs. fetch precomputed (UniProt). fetch_embeddings computes embeddings on your machine, which works for any sequence: mutants, designs, or non-model organisms not in any database. UniProt separately publishes precomputed ProtT5 per-protein embeddings for UniProtKB/Swiss-Prot and selected reference proteomes; when your proteins are covered and ProtT5 is acceptable, downloading those (currently a bulk per-proteome file indexed by accession) avoids local compute entirely. Prefer the precomputed route for large, fully-covered Swiss-Prot sets on a CPU-only machine; compute here when proteins are novel/mutated, when you need a non-ProtT5 model (e.g. ESM-2/ProstT5), or when you want per-residue output for encode() → CPP.run_num(). A source='uniprot' path for the precomputed route is reserved for a future release.

Embedding extraction is deterministic (eval mode), so no random_state/seed is needed.

Returned embeddings are raw (unbounded) floats; pass mode='residue' output to encode() before CPP.run_num().
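The oversized-model guard described in these notes warns but does not stop the run. A minimal sketch of that behaviour follows; check_footprint and its arguments are hypothetical names for illustration, and the estimate and the available memory are passed in as plain numbers rather than detected.

```python
import warnings


def check_footprint(model, est_gb, available_gb, allow_oversized=False, smaller=None):
    """Hypothetical guard: warn when the estimated footprint exceeds memory.

    Returns True if a warning was emitted. Execution continues either way,
    mirroring the 'warns but still runs' behaviour described above.
    """
    if allow_oversized or est_gb <= available_gb:
        return False
    hint = f" Consider a smaller model such as '{smaller}'." if smaller else ""
    warnings.warn(
        f"Model '{model}' needs an estimated {est_gb:.1f} GB but only "
        f"{available_gb:.1f} GB was detected.{hint}",
        RuntimeWarning,
        stacklevel=2,
    )
    return True


# Capture the warning so the demo can print its message.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    check_footprint("esm2_t36_3B", est_gb=12.0, available_gb=8.0,
                    smaller="esm2_t33_650M")
for w in caught:
    print(w.category.__name__, "-", w.message)
```

Because the guard only warns, callers who want a hard stop can promote RuntimeWarning to an error with the standard warnings filters.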
See also
EmbeddingPreprocessor.pool_embeddings(): pool per-residue arrays into per-protein vectors explicitly.
EmbeddingPreprocessor.encode(): normalize per-residue embeddings to [0, 1].
AAclust.select_proteins(): cluster per-protein embeddings into representatives.
- Raises:
ValueError – On invalid arguments (unknown model, 'cls' pooling on a model without a CLS token, a pre-existing embeddings_ok column, …).
ImportError – If the embed extra (torch/transformers) is not installed.
RuntimeError – On an embedding failure under on_failure='raise'.
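The argument checks listed above can be pictured with a small validation sketch. The function validate_args and the constants below are hypothetical and only mirror the documented rules; the set of models without a CLS token contains only ProtT5 because that is the only one the text names, and how prostt5 is treated is not stated here.

```python
VALID_MODELS = (
    "esm2_t6_8M", "esm2_t12_35M", "esm2_t30_150M", "esm2_t33_650M",
    "esm2_t36_3B", "esm1b", "prott5_xl_u50", "prostt5",
)
VALID_POOLING = ("mean", "max", "cls")
# Assumption: only ProtT5 is listed, following the parameter description.
MODELS_WITHOUT_CLS = {"prott5_xl_u50"}


def validate_args(model, pooling="mean", columns=()):
    """Hypothetical pre-flight checks mirroring the documented ValueErrors."""
    if model not in VALID_MODELS:
        raise ValueError(
            f"Unknown model {model!r}. Valid options: {', '.join(VALID_MODELS)}"
        )
    if pooling not in VALID_POOLING:
        raise ValueError(f"pooling must be one of {VALID_POOLING}, got {pooling!r}")
    if pooling == "cls" and model in MODELS_WITHOUT_CLS:
        raise ValueError(f"'cls' pooling is not available for {model!r} (no CLS token)")
    if "embeddings_ok" in columns:
        raise ValueError("df_seq already contains an 'embeddings_ok' column")


validate_args("esm2_t12_35M", pooling="mean", columns=("entry", "sequence"))
try:
    validate_args("esm3_large")  # not a registry key
except ValueError as err:
    print(err)
```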
Examples
EmbeddingPreprocessor.fetch_embeddings downloads a protein language model (ESM-2, ESM-1b, ProtT5, ProstT5) from the Hugging Face Hub and computes its embeddings: one mean/max/cls-pooled vector per protein (mode='protein') or a per-residue (L, D) array per protein (mode='residue'). It requires the embed extra (pip install 'aaanalysis[embed]'); the heavy dependencies are imported lazily, so the rest of the class works without them.

    import numpy as np
    import aaanalysis as aa
    aa.options["verbose"] = False

    # A small dataset; the smallest model keeps this fast (weights downloaded once).
    df_seq = aa.load_dataset(name="DOM_GSEC", n=2)
    ep = aa.EmbeddingPreprocessor()

    # Per-protein embeddings: one mean-pooled vector per protein.
    X, df_out = ep.fetch_embeddings(df_seq, mode="protein", model="esm2_t6_8M",
                                    pooling="mean", return_df=True)
    print("per-protein matrix:", X.shape)
    aa.display_df(df_out, n_rows=10, show_shape=True)
    per-protein matrix: (4, 320)
    DataFrame shape: (4, 9)

       entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c       embeddings_ok
    1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS  True
    2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR  True
    3  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH  True
    4  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD  True

Per-residue embeddings are raw, unbounded floats; normalize them with EmbeddingPreprocessor.encode() before CPP.run_num(). The per-protein matrix above feeds AAclust.select_proteins() or a classifier directly.

    # Per-residue embeddings: one (L, D) array per protein.
    emb = ep.fetch_embeddings(df_seq, mode="residue", model="esm2_t6_8M")
    for entry, arr in emb.items():
        print(entry, "->", arr.shape)

    # Normalize to [0, 1] for CPP.run_num.
    dict_num = ep.encode(df_seq=df_seq, embeddings=emb)
    print("encoded entries:", len(dict_num))
    Q14802 -> (87, 320)
    Q86UE4 -> (582, 320)
    P05067 -> (770, 320)
    P14925 -> (976, 320)
    encoded entries: 4
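To make the idea of mapping unbounded per-residue values into [0, 1] concrete, here is a sketch of plain min-max scaling per embedding dimension. This is only one common scheme, shown on synthetic data; whether encode() uses it, or some other normalization, is not stated here, and minmax_per_dimension is a hypothetical helper.

```python
import numpy as np


def minmax_per_dimension(emb):
    """Scale every embedding dimension to [0, 1] across all residues.

    emb maps entry -> (L, D) array. The minimum and maximum are taken per
    dimension over the residues of all proteins together (an assumption).
    """
    stacked = np.vstack(list(emb.values()))
    lo = stacked.min(axis=0)
    span = stacked.max(axis=0) - lo
    span[span == 0] = 1.0  # constant dimensions map to 0 instead of dividing by 0
    return {entry: (arr - lo) / span for entry, arr in emb.items()}


# Synthetic stand-in for raw per-residue embeddings (not real model output).
rng = np.random.default_rng(0)
emb_demo = {"P1": rng.normal(size=(5, 4)), "P2": rng.normal(size=(7, 4))}
scaled = minmax_per_dimension(emb_demo)
for entry, arr in scaled.items():
    print(entry, arr.shape, float(arr.min()), float(arr.max()))
```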
Key parameters: model selects the PLM (esm2_t6_8M … esm2_t33_650M, prott5_xl_u50, prostt5); pooling is 'mean'/'max'/'cls' ('cls' only for models with a leading token, i.e. ESM not ProtT5); device is 'auto'/'cpu'/'cuda'/'mps'; batch_size and max_length control throughput and truncation; on_failure ('nan'/'drop'/'raise') handles per-entry failures; and allow_oversized bypasses the hardware-size warning. The EmbeddingPreprocessor.pool_embeddings() helper pools per-residue arrays explicitly.

    # Max-pool the per-residue arrays into a per-protein matrix.
    X_max = ep.pool_embeddings(emb, pooling="max", df_seq=df_seq)
    print("max-pooled matrix:", X_max.shape)

    max-pooled matrix: (4, 320)
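What mean and max pooling do to an (L, D) array can be shown with NumPy alone. The sketch below is an illustration, not the library's implementation: pool and pool_all are hypothetical helpers, and 'cls' is left out because it relies on the model's leading token, which this sketch assumes is not part of the per-residue (L, D) arrays.

```python
import numpy as np


def pool(arr, pooling="mean"):
    """Reduce one (L, D) per-residue array to a single (D,) vector."""
    if pooling == "mean":
        return arr.mean(axis=0)  # average over residues
    if pooling == "max":
        return arr.max(axis=0)   # per-dimension maximum over residues
    raise ValueError(f"pooling must be 'mean' or 'max' in this sketch, got {pooling!r}")


def pool_all(emb, entries, pooling="mean"):
    """Stack pooled vectors into an (n_samples, D) matrix in the order of entries."""
    return np.vstack([pool(emb[entry], pooling) for entry in entries])


# Tiny hand-made arrays: two proteins with 2 and 3 residues, D = 2.
emb_toy = {
    "P1": np.array([[1.0, 4.0], [3.0, 2.0]]),
    "P2": np.array([[0.0, 0.0], [6.0, 3.0], [3.0, 9.0]]),
}
entries_toy = ["P1", "P2"]
print(pool_all(emb_toy, entries_toy, pooling="mean"))
print(pool_all(emb_toy, entries_toy, pooling="max"))
```

Passing the entry order explicitly keeps the rows aligned to the sequence table, which is the role df_seq plays in the pool_embeddings call above.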