EmbeddingPreprocessor.fetch_embeddings

EmbeddingPreprocessor.fetch_embeddings(df_seq=None, mode='protein', model='esm2_t12_35M', pooling='mean', source='auto', batch_size=8, device='auto', max_length=None, layer=-1, allow_oversized=False, on_failure='nan', return_df=False)

Fetch and compute protein language model (PLM) embeddings for every entry.

Downloads a curated model (ESM-2, ESM-1b, ProtT5, ProstT5) from the Hugging Face Hub and computes its embeddings, returning either one vector per protein (mode='protein') or a per-residue array per protein (mode='residue'). The per-residue output is the raw, unbounded {entry: (L, D)} mapping that encode() normalizes into the dict_num consumed by CPP.run_num(); the per-protein output is a redundancy-free feature matrix ready for AAclust.select_proteins() or TreeModel. Embeddings are returned raw — normalization is encode()’s job. Requires the embed extra (pip install 'aaanalysis[embed]'); the heavy dependencies are imported lazily, so the rest of the class works without them.

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Output rows are aligned to df_seq.

  • mode ({'protein', 'residue'}, default='protein') – 'protein' returns one pooled vector per protein; 'residue' returns the per-residue (L, D) array per protein (feeds encode()).

  • model (str, default='esm2_t12_35M') – Registry key of the PLM to use — one of 'esm2_t6_8M', 'esm2_t12_35M', 'esm2_t30_150M', 'esm2_t33_650M', 'esm2_t36_3B', 'esm1b', 'prott5_xl_u50', 'prostt5'. See the Notes table for each model’s size, embedding dimension, and whether it runs on a typical laptop CPU. An unknown key raises a ValueError listing the valid options.

  • pooling ({'mean', 'max', 'cls'}, default='mean') – Residue→protein reduction for mode='protein'. 'cls' uses the model’s leading token and is only valid for models that have one (ESM, not ProtT5).

  • source ({'auto', 'compute'}, default='auto') – Acquisition path. Both currently compute locally; 'uniprot' (direct fetch of precomputed embeddings) is reserved for a future release.

  • batch_size (int, default=8) – Number of sequences embedded per forward pass.

  • device ({'auto', 'cpu', 'cuda', 'mps'}, default='auto') – Compute device; 'auto' picks CUDA, then Apple MPS, else CPU.

  • max_length (int, optional) – Truncate sequences to this many residues. Defaults to the model’s own cap (e.g. 1022 for ESM-1b); longer sequences are truncated with a warning.

  • layer (int, default=-1) – Hidden layer to read out; -1 is the last layer.

  • allow_oversized (bool, default=False) – If False, a model whose estimated memory footprint exceeds the detected device memory triggers a RuntimeWarning (with a smaller-model suggestion) but still runs. If True, the check is skipped and no warning is emitted.

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – Policy for entries that fail to embed: 'nan' keeps a NaN row/array and marks it not-ok; 'drop' removes it; 'raise' raises RuntimeError.

  • return_df (bool, default=False) – If True, also return an echo of df_seq with a boolean embeddings_ok column.
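The interplay of max_length and batch_size can be pictured with a small stand-alone sketch. It is not the library's implementation: the helper names truncate_sequences and iter_batches are hypothetical, and the cap of 1022 is simply the ESM-1b figure quoted above.

```python
import warnings


def truncate_sequences(sequences, max_length):
    """Cut each sequence to at most max_length residues, warning once if any was cut."""
    n_cut = sum(len(seq) > max_length for seq in sequences)
    if n_cut:
        warnings.warn(
            f"{n_cut} sequence(s) longer than {max_length} residues were truncated",
            UserWarning,
        )
    return [seq[:max_length] for seq in sequences]


def iter_batches(items, batch_size):
    """Yield consecutive chunks of at most batch_size items (one chunk per forward pass)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Toy input: 19 short sequences and one that exceeds the assumed cap.
sequences = ["MKV" * 10] * 19 + ["A" * 1500]
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    truncated = truncate_sequences(sequences, max_length=1022)

batches = list(iter_batches(truncated, batch_size=8))
print(len(caught), [len(b) for b in batches], max(len(s) for s in truncated))
```

With 20 sequences and batch_size=8, the last batch is smaller than the others; a lower batch_size trades throughput for a lower peak memory footprint.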

Returns:

  • embeddings (np.ndarray or dict) – np.ndarray of shape (n_samples, D) row-aligned to df_seq (mode='protein'), or {entry: (L, D)} of raw per-residue arrays (mode='residue').

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True: an echo of df_seq plus a boolean embeddings_ok column.
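How on_failure shapes the two return values can be illustrated without running a model. The sketch below is an assumption-laden stand-in, not the library's code: apply_on_failure is a hypothetical helper, a failed entry is represented by None, and whether 'drop' keeps the embeddings_ok column in the real output is not stated in this reference.

```python
import numpy as np
import pandas as pd


def apply_on_failure(df_seq, vectors, dim, on_failure="nan"):
    """Assemble a per-protein matrix from {entry: vector or None} under a failure policy."""
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    # An entry is ok only if a vector was produced for it.
    ok = np.array([vectors.get(entry) is not None for entry in df_seq["entry"]])
    if on_failure == "raise" and not ok.all():
        failed = df_seq.loc[~ok, "entry"].tolist()
        raise RuntimeError(f"embedding failed for: {failed}")
    # Failed entries become all-NaN rows so the matrix stays row-aligned to df_seq.
    rows = [vectors[entry] if is_ok else np.full(dim, np.nan)
            for entry, is_ok in zip(df_seq["entry"], ok)]
    X = np.vstack(rows)
    df_out = df_seq.assign(embeddings_ok=ok)
    if on_failure == "drop":
        X, df_out = X[ok], df_out[ok].reset_index(drop=True)
    return X, df_out


df_seq = pd.DataFrame({"entry": ["P1", "P2", "P3"], "sequence": ["MKV", "MXX", "MAG"]})
vectors = {"P1": np.ones(4), "P2": None, "P3": np.zeros(4)}  # P2 "failed"

X_nan, df_nan = apply_on_failure(df_seq, vectors, dim=4, on_failure="nan")
X_drop, df_drop = apply_on_failure(df_seq, vectors, dim=4, on_failure="drop")
print(X_nan.shape, df_nan["embeddings_ok"].tolist())
print(X_drop.shape, df_drop["entry"].tolist())
```

Under 'nan' the matrix keeps one row per input entry, so downstream code should filter on embeddings_ok before fitting a model.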

Notes

Available models. Footprints are inference floors (real peak grows with batch_size × sequence length); in the Local? column, yes marks models that run comfortably on a typical 16 GB laptop CPU. The ESM-2 family spans a range of accuracy/speed trade-offs; ProtT5/ProstT5 are stronger but heavier; ProstT5 is structure-aware (trained on 3Di tokens).

model          params  dim   ~RAM (CPU)  Local?  best for
-------------  ------  ----  ----------  ------  ----------------------------------------
esm2_t6_8M     8 M     320   ~0.3 GB     yes     laptops, large corpora, quick tests
esm2_t12_35M   35 M    480   ~0.5 GB     yes     default; best size/quality on CPU
esm2_t30_150M  150 M   640   ~1.5 GB     yes     richer residue features, still CPU-fine
esm2_t33_650M  650 M   1280  ~3 GB       slow    strong; comfortable with a GPU
esm2_t36_3B    3 B     2560  ~12 GB      GPU     highest quality; needs a ≥12 GB GPU
esm1b          650 M   1280  ~3 GB       slow    ESM-1b parity; 1022-residue cap
prott5_xl_u50  1.2 B   1024  ~5 GB       GPU     ProtT5; matches UniProt’s embeddings
prostt5        1.2 B   1024  ~5 GB       GPU     structure-aware (3Di) embeddings

The larger ESM-2 models and both T5 models are slow on CPU and may exhaust memory; fetch_embeddings emits a RuntimeWarning suggesting a smaller model when the estimated footprint exceeds the detected device memory (override with allow_oversized=True, lower batch_size, or select a GPU via device).
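The guard described above amounts to comparing a footprint estimate with the available memory. The following is a minimal sketch under stated assumptions: check_footprint is a hypothetical helper, the footprints are the approximate CPU figures from the table, and the available memory is passed in by hand because how the library detects it is not documented here.

```python
import warnings

# Approximate CPU footprints in GB, copied from the table above.
FOOTPRINT_GB = {
    "esm2_t6_8M": 0.3, "esm2_t12_35M": 0.5, "esm2_t30_150M": 1.5,
    "esm2_t33_650M": 3.0, "esm2_t36_3B": 12.0, "esm1b": 3.0,
    "prott5_xl_u50": 5.0, "prostt5": 5.0,
}


def check_footprint(model, available_gb, allow_oversized=False):
    """Warn (but do not stop) when the model's floor exceeds available memory.

    Returns the suggested smaller model, or None if no warning was emitted.
    """
    needed = FOOTPRINT_GB[model]
    if allow_oversized or needed <= available_gb:
        return None
    # Suggest the largest model in the table that still fits, if any.
    fitting = [name for name, gb in FOOTPRINT_GB.items() if gb <= available_gb]
    suggestion = max(fitting, key=FOOTPRINT_GB.get) if fitting else None
    warnings.warn(
        f"{model} needs ~{needed} GB but only {available_gb} GB detected; "
        f"consider {suggestion}",
        RuntimeWarning,
    )
    return suggestion


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    suggestion = check_footprint("esm2_t36_3B", available_gb=4.0)
    silent = check_footprint("esm2_t36_3B", available_gb=4.0, allow_oversized=True)
print(suggestion, silent, len(caught))
```

Because the footprints are floors, a model that passes such a check can still run out of memory with a large batch_size or very long sequences.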

Compute locally vs. fetch precomputed (UniProt). fetch_embeddings computes embeddings on your machine, which works for any sequence, including mutants, designs, or non-model organisms not in any database. UniProt separately publishes precomputed ProtT5 per-protein embeddings for UniProtKB/Swiss-Prot and selected reference proteomes; when your proteins are covered and ProtT5 is acceptable, downloading those (currently a bulk per-proteome file indexed by accession) avoids local compute entirely. Prefer the precomputed route for large, fully-covered Swiss-Prot sets on a CPU-only machine; compute here when proteins are novel/mutated, when you need a non-ProtT5 model (e.g. ESM-2/ProstT5), or when you want per-residue output for encode() → CPP.run_num(). A source='uniprot' path for the precomputed route is reserved for a future release.

  • Embedding extraction is deterministic (eval mode), so no random_state / seed is needed.

  • Returned embeddings are raw (unbounded) floats; pass mode='residue' output to encode() before CPP.run_num().
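To make the raw-versus-normalized distinction concrete, here is one plausible normalization: min-max scaling of each embedding dimension to [0, 1] across all residues. This is only an illustration; minmax_normalize is a hypothetical helper, and whether encode() uses this exact scheme is not stated in this reference.

```python
import numpy as np


def minmax_normalize(emb):
    """Scale every embedding dimension to [0, 1] across all residues of all proteins."""
    stacked = np.vstack(list(emb.values()))      # shape (sum of L, D)
    lo, hi = stacked.min(axis=0), stacked.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)       # constant dimensions map to 0
    return {entry: (arr - lo) / span for entry, arr in emb.items()}


# Toy stand-in for mode='residue' output: two proteins, D = 3.
rng = np.random.default_rng(0)
emb = {"P1": rng.normal(size=(5, 3)) * 10, "P2": rng.normal(size=(7, 3)) * 10}

dict_num = minmax_normalize(emb)
all_values = np.vstack(list(dict_num.values()))
print(all_values.min(), all_values.max(), dict_num["P2"].shape)
```

The per-protein shapes are unchanged; only the value range is bounded.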

See also

Raises:
  • ValueError – On invalid arguments (unknown model, 'cls' pooling on a model without a CLS token, a pre-existing embeddings_ok column, …).

  • ImportError – If the embed extra (torch / transformers) is not installed.

  • RuntimeError – On an embedding failure under on_failure='raise'.

Examples

EmbeddingPreprocessor.fetch_embeddings downloads a protein language model (ESM-2, ESM-1b, ProtT5, ProstT5) from the Hugging Face Hub and computes its embeddings: one mean/max/cls-pooled vector per protein (mode='protein') or a per-residue (L, D) array per protein (mode='residue'). It requires the embed extra (pip install 'aaanalysis[embed]'); the heavy dependencies are imported lazily, so the rest of the class works without them.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

# A small dataset; the smallest model keeps this fast (weights downloaded once).
df_seq = aa.load_dataset(name="DOM_GSEC", n=2)

ep = aa.EmbeddingPreprocessor()
# Per-protein embeddings: one mean-pooled vector per protein.
X, df_out = ep.fetch_embeddings(df_seq, mode="protein", model="esm2_t6_8M",
                                pooling="mean", return_df=True)
print("per-protein matrix:", X.shape)
aa.display_df(df_out, n_rows=10, show_shape=True)
per-protein matrix: (4, 320)
DataFrame shape: (4, 9)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c embeddings_ok
1 Q14802 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0 37 59 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS True
2 Q86UE4 MAARSWQDELAQQAE...SPKQIKKKKKARRET 0 50 72 LGLEPKRYPG WVILVGTGALGLLLLFLLGYGWA AACAGARKKR True
3 P05067 MLPGLALLLLAAWTA...GYENPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH True
4 P14925 MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD True

Per-residue embeddings are raw, unbounded floats; normalize them with EmbeddingPreprocessor.encode() before CPP.run_num(). The per-protein matrix above feeds AAclust.select_proteins() or a classifier directly.

# Per-residue embeddings: one (L, D) array per protein.
emb = ep.fetch_embeddings(df_seq, mode="residue", model="esm2_t6_8M")
for entry, arr in emb.items():
    print(entry, "->", arr.shape)

# Normalize to [0, 1] for CPP.run_num.
dict_num = ep.encode(df_seq=df_seq, embeddings=emb)
print("encoded entries:", len(dict_num))
Q14802 -> (87, 320)
Q86UE4 -> (582, 320)
P05067 -> (770, 320)
P14925 -> (976, 320)
encoded entries: 4

Key parameters: model selects the PLM (esm2_t6_8M to esm2_t33_650M, prott5_xl_u50, prostt5); pooling is 'mean' / 'max' / 'cls' ('cls' only for models with a leading token, i.e. ESM, not ProtT5); device is 'auto' / 'cpu' / 'cuda' / 'mps'; batch_size and max_length control throughput and truncation; on_failure ('nan' / 'drop' / 'raise') handles per-entry failures; and allow_oversized bypasses the hardware-size warning. The EmbeddingPreprocessor.pool_embeddings() helper pools per-residue arrays explicitly.

# Max-pool the per-residue arrays into a per-protein matrix.
X_max = ep.pool_embeddings(emb, pooling="max", df_seq=df_seq)
print("max-pooled matrix:", X_max.shape)
max-pooled matrix: (4, 320)
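What the three pooling options compute can be shown on a tiny hand-made array. The sketch below is not the library's pool_embeddings: pool_tokens is a hypothetical helper, and it assumes a token-level array whose first row is a leading CLS token and whose last row is an end-of-sequence token, as in ESM-style tokenization.

```python
import numpy as np


def pool_tokens(tokens, pooling="mean"):
    """Reduce a token-level array of shape (1 + L + 1, D) to a single (D,) vector.

    Assumes row 0 is a leading CLS token and the last row is an end-of-sequence
    token; the residues are the rows in between.
    """
    residues = tokens[1:-1]
    if pooling == "mean":
        return residues.mean(axis=0)
    if pooling == "max":
        return residues.max(axis=0)
    if pooling == "cls":
        return tokens[0]
    raise ValueError(f"pooling must be 'mean', 'max' or 'cls', got {pooling!r}")


tokens = np.array([
    [9.0, 9.0],   # leading CLS token
    [1.0, 4.0],   # residue 1
    [3.0, 2.0],   # residue 2
    [0.0, 0.0],   # end-of-sequence token
])
pooled = {pooling: pool_tokens(tokens, pooling) for pooling in ("mean", "max", "cls")}
for pooling, vector in pooled.items():
    print(pooling, vector)
```

'mean' and 'max' depend only on the residue rows, which is why they can be applied to per-residue arrays after the fact, whereas 'cls' needs the model's leading token and is therefore unavailable for models without one.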