EmbeddingPreprocessor.pool_embeddings
- EmbeddingPreprocessor.pool_embeddings(embeddings=None, pooling='mean', df_seq=None)
Pool per-residue embeddings into one vector per protein.
Reduces a {entry: (L, D)} mapping (e.g. from fetch_embeddings(mode='residue') or your own PLM run) to one (D,) vector per protein. This is the simple statistical counterpart to the richer “pooling” that CPP.run() / CPP.run_num() perform when turning per-residue values into features.

Added in version 1.1.0.
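As a rough illustration of the reduction described above, the following NumPy sketch pools toy (L, D) arrays along the residue axis. The helper name pool_sketch, the entry IDs, and the data are assumptions made for illustration; this is not the library's implementation.

```python
import numpy as np

# Toy per-residue embeddings: two hypothetical proteins of different length, D = 3.
emb = {
    "P1": np.array([[1.0, 2.0, 3.0],
                    [3.0, 0.0, 1.0]]),   # L = 2
    "P2": np.array([[0.0, 1.0, 2.0],
                    [2.0, 1.0, 0.0],
                    [4.0, 1.0, 1.0]]),   # L = 3
}

def pool_sketch(embeddings, pooling="mean"):
    """Reduce each (L, D) array over the residue axis (axis 0) to a (D,) vector."""
    reducers = {"mean": np.mean, "max": np.max}
    reduce_fn = reducers[pooling]
    return {entry: reduce_fn(arr, axis=0) for entry, arr in embeddings.items()}

mean_pooled = pool_sketch(emb, pooling="mean")
max_pooled = pool_sketch(emb, pooling="max")
print(mean_pooled["P1"])  # [2. 1. 2.]
print(max_pooled["P2"])   # [4. 1. 2.]
```

Because the reduction runs over axis 0, proteins of different length L all end up with vectors of the same size D.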
- Parameters:
embeddings (dict) – Mapping {entry: (L, D)} of per-residue embedding arrays.
pooling ({'mean', 'max'}, default='mean') – Reduction over residues. ('cls' is unavailable here because residue arrays carry no leading token; use fetch_embeddings(mode='protein', pooling='cls').)
df_seq (pd.DataFrame, optional) – DataFrame containing an entry column with unique protein identifiers. If given, return a (n_samples, D) matrix row-aligned to df_seq instead of a dict.
- Returns:
pooled – {entry: (D,)} of pooled vectors, or a (n_samples, D) matrix row-aligned to df_seq when df_seq is given.
- Return type:
dict or np.ndarray
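The matrix form of the return value can be pictured as stacking the pooled vectors in the order of the entry column. A minimal sketch, assuming a plain list of entry IDs stands in for df_seq["entry"] and that pooled vectors are already available; the names and values are illustrative and do not reflect the library's internals.

```python
import numpy as np

# Hypothetical pooled vectors keyed by entry (D = 2), stored in arbitrary order.
pooled = {
    "B": np.array([3.0, 4.0]),
    "A": np.array([1.0, 2.0]),
    "C": np.array([5.0, 6.0]),
}

# Stand-in for df_seq["entry"]: the order the matrix rows must follow.
entries = ["A", "B", "C"]

# Row i of X is the pooled vector of entries[i].
X = np.vstack([pooled[entry] for entry in entries])
print(X.shape)  # (3, 2)
print(X[0])     # [1. 2.]
```

Row alignment matters because downstream labels or features are typically indexed by the rows of df_seq rather than by dictionary order.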
See also
EmbeddingPreprocessor.fetch_embeddings(): obtain the per-residue arrays.
- Raises:
ValueError – On invalid pooling, an empty embeddings dict, or an entry in df_seq missing from embeddings.
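The three documented error conditions can be mirrored with a few explicit checks. The sketch below is a hypothetical validator (validate_pool_args and raises_value_error are names invented here, and the error messages are not the library's); it only shows which inputs would be rejected under the rules listed above.

```python
def validate_pool_args(embeddings, pooling, entries=None):
    """Hypothetical checks mirroring the three documented ValueError cases."""
    if pooling not in ("mean", "max"):
        raise ValueError(f"'pooling' should be 'mean' or 'max', got {pooling!r}")
    if not embeddings:
        raise ValueError("'embeddings' should not be empty")
    if entries is not None:
        missing = [e for e in entries if e not in embeddings]
        if missing:
            raise ValueError(f"entries missing from 'embeddings': {missing}")

def raises_value_error(**kwargs):
    """Return True if validate_pool_args rejects the given arguments."""
    try:
        validate_pool_args(**kwargs)
    except ValueError:
        return True
    return False

emb = {"A": [[0.0, 1.0]]}  # placeholder per-residue data for one entry

print(raises_value_error(embeddings=emb, pooling="cls"))                  # True
print(raises_value_error(embeddings={}, pooling="mean"))                  # True
print(raises_value_error(embeddings=emb, pooling="mean", entries=["B"]))  # True
print(raises_value_error(embeddings=emb, pooling="max", entries=["A"]))   # False
```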
Examples
EmbeddingPreprocessor.pool_embeddings reduces per-residue embeddings ({entry: (L, D)}, e.g. from EmbeddingPreprocessor.fetch_embeddings with mode='residue') to one vector per protein. It is the simple statistical counterpart to the richer pooling that CPP.run / CPP.run_num perform.

    import numpy as np
    import aaanalysis as aa
    aa.options["verbose"] = False

    # Per-residue embeddings (here random for illustration; in practice from
    # fetch_embeddings(mode='residue') or your own PLM run).
    rng = np.random.default_rng(0)
    df_seq = aa.load_dataset(name="DOM_GSEC", n=2)
    emb = {entry: rng.standard_normal((len(seq), 8))
           for entry, seq in zip(df_seq["entry"], df_seq["sequence"])}

    ep = aa.EmbeddingPreprocessor()
    pooled = ep.pool_embeddings(emb, pooling="mean")
    print("pooled vectors:", {k: v.shape for k, v in pooled.items()})

pooled vectors: {'Q14802': (8,), 'Q86UE4': (8,), 'P05067': (8,), 'P14925': (8,)}
Pass df_seq to get a (n_samples, D) matrix row-aligned to it (ready for AAclust.select_proteins or a classifier). pooling is 'mean' or 'max'.

    X = ep.pool_embeddings(emb, pooling="max", df_seq=df_seq)
    print("max-pooled matrix:", X.shape)

max-pooled matrix: (4, 8)