EmbeddingPreprocessor.pool_embeddings

EmbeddingPreprocessor.pool_embeddings(embeddings=None, pooling='mean', df_seq=None)

Pool per-residue embeddings into one vector per protein.

Reduces a {entry: (L, D)} mapping (e.g. from fetch_embeddings(mode='residue') or your own PLM run) to one (D,) vector per protein. This is the simple statistical counterpart to the richer “pooling” that CPP.run() / CPP.run_num() perform when turning per-residue values into features.
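As a rough illustration of the reduction described above, the following NumPy sketch pools each (L, D) array over its residue axis. The helper name pool_residue_embeddings and the toy data are assumptions made for this example; it is not the aaanalysis implementation, only a minimal stand-in for the same idea.

```python
import numpy as np

def pool_residue_embeddings(embeddings, pooling="mean"):
    """Illustrative stand-in (hypothetical helper, not the library code)."""
    reducers = {"mean": np.mean, "max": np.max}
    if pooling not in reducers:
        raise ValueError(f"pooling must be 'mean' or 'max', got {pooling!r}")
    # Reduce over axis 0 (the residue axis) so each (L, D) array becomes (D,).
    return {entry: reducers[pooling](np.asarray(arr), axis=0)
            for entry, arr in embeddings.items()}

# Two toy "proteins" of different length (L=3 and L=2), both with D=2.
toy = {
    "P1": np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]),
    "P2": np.array([[0.0, -1.0], [2.0, 1.0]]),
}
mean_pooled = pool_residue_embeddings(toy, pooling="mean")
max_pooled = pool_residue_embeddings(toy, pooling="max")
print(mean_pooled["P1"])  # [2. 5.]
print(max_pooled["P2"])   # [2. 1.]
```

Because the reduction runs over the residue axis only, proteins of different length all end up with vectors of the same dimension D.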

Added in version 1.1.0.

Parameters:
  • embeddings (dict) – Mapping {entry: (L, D)} of per-residue embedding arrays.

  • pooling ({'mean', 'max'}, default='mean') – Reduction applied over the residue axis. 'cls' is unavailable here because residue arrays carry no leading token; use fetch_embeddings(mode='protein', pooling='cls') instead.

  • df_seq (pd.DataFrame, optional) – DataFrame containing an entry column with unique protein identifiers. If given, return a (n_samples, D) matrix row-aligned to df_seq instead of a dict.
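To make "row-aligned to df_seq" concrete, here is a small sketch of what such an alignment could look like. It assumes the pooled vectors are already available as a dict and uses a plain list, entries, as a stand-in for df_seq["entry"].tolist(); both names are hypothetical and the method's actual internals may differ.

```python
import numpy as np

# Pooled vectors keyed by entry (D=3); the dict order differs from the table order.
pooled = {
    "B": np.array([0.0, 1.0, 2.0]),
    "A": np.array([3.0, 4.0, 5.0]),
    "C": np.array([6.0, 7.0, 8.0]),
}
# Stand-in for df_seq["entry"].tolist(); a real df_seq would be a DataFrame.
entries = ["A", "B", "C"]

# Row i of X belongs to entries[i], regardless of the dict's insertion order.
X = np.stack([pooled[entry] for entry in entries])
print(X.shape)  # (3, 3)
print(X[0])     # [3. 4. 5.]
```

The point of the alignment is that X can be passed alongside labels or other columns taken from the same table without any further reordering.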

Returns:

pooled – Mapping {entry: (D,)} of pooled vectors, or a (n_samples, D) matrix row-aligned to df_seq when df_seq is given.

Return type:

dict or np.ndarray

Raises:

ValueError – On invalid pooling, an empty embeddings dict, or an entry in df_seq missing from embeddings.
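The three error conditions listed above can be pictured with a small stand-alone check. The function check_pooling_inputs, its entries argument, and the error messages are all hypothetical and only mirror the documented conditions; the messages raised by the library itself may be worded differently.

```python
import numpy as np

def check_pooling_inputs(embeddings, pooling="mean", entries=None):
    """Hypothetical checks mirroring the documented ValueError conditions."""
    if pooling not in ("mean", "max"):
        raise ValueError(f"'pooling' ({pooling!r}) should be 'mean' or 'max'")
    if not embeddings:
        raise ValueError("'embeddings' should not be empty")
    if entries is not None:
        missing = [e for e in entries if e not in embeddings]
        if missing:
            raise ValueError(f"entries missing from 'embeddings': {missing}")

emb = {"P1": np.zeros((5, 4))}
errors = []
for kwargs in (
    {"embeddings": emb, "pooling": "cls"},         # invalid pooling
    {"embeddings": {}},                            # empty embeddings dict
    {"embeddings": emb, "entries": ["P1", "P2"]},  # 'P2' has no embedding
):
    try:
        check_pooling_inputs(**kwargs)
    except ValueError as exc:
        errors.append(str(exc))
print(len(errors))  # 3
```

In practice, wrapping the call in try/except ValueError is one way to catch a df_seq that contains entries for which no embedding was computed.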

Examples

EmbeddingPreprocessor.pool_embeddings reduces per-residue embeddings ({entry: (L, D)}, e.g. from EmbeddingPreprocessor.fetch_embeddings() with mode='residue') to one vector per protein. It is the simple statistical counterpart to the richer pooling that CPP.run() / CPP.run_num() perform.

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

# Per-residue embeddings (here random for illustration; in practice from
# fetch_embeddings(mode='residue') or your own PLM run).
rng = np.random.default_rng(0)
df_seq = aa.load_dataset(name="DOM_GSEC", n=2)
emb = {entry: rng.standard_normal((len(seq), 8))
       for entry, seq in zip(df_seq["entry"], df_seq["sequence"])}

ep = aa.EmbeddingPreprocessor()
pooled = ep.pool_embeddings(emb, pooling="mean")
print("pooled vectors:", {k: v.shape for k, v in pooled.items()})
pooled vectors: {'Q14802': (8,), 'Q86UE4': (8,), 'P05067': (8,), 'P14925': (8,)}

Pass df_seq to get a (n_samples, D) matrix row-aligned to it (ready for AAclust.select_proteins() or a classifier). pooling is 'mean' or 'max'.

X = ep.pool_embeddings(emb, pooling="max", df_seq=df_seq)
print("max-pooled matrix:", X.shape)
max-pooled matrix: (4, 8)