aaanalysis.EmbeddingPreprocessor.build_cat

EmbeddingPreprocessor.build_cat(df_scales=None, df_stds=None, cat_min_th=0.5, subcat_min_th=0.7, metric='correlation', random_state=0)

Build a two-level pseudo-category table by clustering pseudo-scales via AAclust.

Two independent AAclust runs at different correlation thresholds produce coarser cat labels and finer subcat labels for each embedding dimension. Mirrors the AAontology df_cat schema so the result is a drop-in for the df_cat argument of CPP.__init__().

When df_stds is supplied, clustering becomes std-aware: each dimension is represented by the concatenation of its per-AA means and per-AA stds, z-scored per column, which gives a descriptor matrix of shape (D, 40) instead of (D, 20). Two dimensions with similar per-AA means but very different per-AA stds will then not collapse into the same cluster.
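The std-aware representation described above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's internal code: the helper name std_aware_descriptor, the random toy inputs, and the guard for constant columns are all hypothetical.

```python
import numpy as np

def std_aware_descriptor(means, stds):
    """Build a (D, 40) descriptor from (20, D) per-AA means and stds (hypothetical helper)."""
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    if means.shape != stds.shape:
        raise ValueError("means and stds must have the same shape")
    # One row per dimension: 20 per-AA means followed by 20 per-AA stds -> (D, 40)
    desc = np.hstack([means.T, stds.T])
    # Z-score each of the 40 columns across the D dimensions
    col_mean = desc.mean(axis=0)
    col_std = desc.std(axis=0)
    col_std[col_std == 0] = 1.0  # assumption: leave constant columns unscaled
    return (desc - col_mean) / col_std

# Toy inputs standing in for df_scales / df_stds values (20 amino acids, D = 8)
rng = np.random.default_rng(0)
means = rng.normal(size=(20, 8))
stds = rng.uniform(0.1, 1.0, size=(20, 8))

desc = std_aware_descriptor(means, stds)
corr = np.corrcoef(desc)  # row-wise Pearson correlation between dimensions
print(desc.shape, corr.shape)  # (8, 40) (8, 8)
```

Because the mean half and the std half are scaled column by column, two dimensions only correlate strongly in this sketch when both their per-AA means and their per-AA stds follow similar patterns.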

Parameters:
  • df_scales (pd.DataFrame, shape (20, D)) – Pseudo-scale DataFrame produced by build_scales() (or a user-supplied analog with the same shape). Must have at least 3 columns.

  • df_stds (pd.DataFrame, shape (20, D), optional) – Per-AA standard deviations matching df_scales exactly in shape, index, and columns. Produce via build_scales(..., return_std=True). When supplied, enables std-aware clustering (see Notes); when None (default), mean-only clustering is used. Must contain no NaN — drop the same rows you dropped from df_scales.

  • cat_min_th (float, default=0.5) – AAclust correlation threshold for the coarser (cat) level. Lower values produce fewer, larger clusters.

  • subcat_min_th (float, default=0.7) – AAclust correlation threshold for the finer (subcat) level. Must be greater than cat_min_th.

  • metric ({'correlation', 'cosine'}, default='correlation') – Distance metric forwarded to AAclust.fit(). Controls the optional cluster-merging step and medoid selection only; the k-optimization phase is always Pearson-correlation-based.

  • random_state (int, default=0) – Random seed threaded through AAclust for reproducible cluster IDs.
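The constraints listed above can be checked before calling build_cat. The helper below is a hypothetical pre-flight check written only from the documented requirements; it is not part of aaanalysis, and the library's own validation and error messages may differ.

```python
import numpy as np
import pandas as pd

def check_build_cat_inputs(df_scales, df_stds=None, cat_min_th=0.5, subcat_min_th=0.7):
    """Hypothetical check mirroring the documented parameter constraints."""
    if df_scales.shape[1] < 3:
        raise ValueError("df_scales must have at least 3 columns")
    if not subcat_min_th > cat_min_th:
        raise ValueError("subcat_min_th must be greater than cat_min_th")
    if df_stds is not None:
        # df_stds must match df_scales in shape, index, and columns
        same_labels = (df_stds.index.equals(df_scales.index)
                       and df_stds.columns.equals(df_scales.columns))
        if df_stds.shape != df_scales.shape or not same_labels:
            raise ValueError("df_stds must match df_scales in shape, index, and columns")
        if df_stds.isna().any().any():
            raise ValueError("df_stds must not contain NaN")

# Toy inputs: 20 amino acids x 4 dimensions (made-up values)
aa_index = list("ACDEFGHIKLMNPQRSTVWY")
cols = [f"dim_{i}" for i in range(4)]
rng = np.random.default_rng(0)
df_scales = pd.DataFrame(rng.normal(size=(20, 4)), index=aa_index, columns=cols)
df_stds = pd.DataFrame(rng.uniform(0.1, 1.0, size=(20, 4)), index=aa_index, columns=cols)

check_build_cat_inputs(df_scales, df_stds)  # passes silently
```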

Returns:

df_cat – Pseudo-category DataFrame with columns scale_id, category ("PLM_cat_<k>"), subcategory ("PLM_subcat_<k>"), scale_name, scale_description. The scale_id column matches the column labels of df_scales. Drop-in for the df_cat argument of CPP.__init__().

Return type:

pd.DataFrame, shape (D, 5)

Notes

  • The two AAclust runs are independent. subcat labels do not necessarily nest within cat labels — they are two views over the same pseudo-scales at different correlation thresholds.

  • The metric parameter affects only the optional post-hoc merging step and medoid selection, not k-optimization. Experimenting with non-Pearson similarity during k-optimization would require a deeper change to AAclust.

  • Std-aware recipe (when df_stds is supplied). A composition of three textbook ingredients, not a single named method: (i) each dimension is represented by its per-AA (mean, std), the sufficient statistics of a 1-D Gaussian over that AA's residue embeddings; (ii) per-column z-scoring across the D dimensions puts the mean half and std half on a common footing so neither dominates row-Pearson [MilliganCooper88]; (iii) AAclust then clusters the (D, 40) descriptor matrix by Pearson row-correlation, in the same tradition as gene-expression feature clustering [Eisen98]. This recipe is not a closed-form approximation of Bhattacharyya / symmetric-KL between per-AA Gaussians (under equal variance, Bhattacharyya reduces to a function of (μ₁ − μ₂)² alone, which would motivate dropping the std half).
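For reference, the equal-variance remark about Bhattacharyya can be made explicit with the standard closed form for two univariate Gaussians N(μ₁, σ₁²) and N(μ₂, σ₂²). This is textbook background, not something computed by build_cat:

```latex
D_B = \frac{1}{4}\,\frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}
    + \frac{1}{2}\,\ln\!\left(\frac{\sigma_1^2 + \sigma_2^2}{2\,\sigma_1 \sigma_2}\right),
\qquad
\sigma_1 = \sigma_2 = \sigma \;\Longrightarrow\; D_B = \frac{(\mu_1 - \mu_2)^2}{8\,\sigma^2}.
```

With equal variances the logarithmic term vanishes, so for a fixed σ only the squared mean difference remains and the stds carry no additional information.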

See also

  • AAclust: the underlying clustering algorithm.

  • build_scales(): produces the expected input(s); pass return_std=True to get df_stds for std-aware mode.

Examples

EmbeddingPreprocessor.build_cat clusters the pseudo-scales into a two-level pseudo-category table (a df_cat), giving each embedding dimension a coarse category label and a finer subcategory label. The result is a drop-in for the df_cat argument of CPP.__init__().

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

# Load a small example dataset
df_seq = aa.load_dataset(name="DOM_GSEC", n=10)

# Toy embeddings: one random (sequence length, 8) array per entry
rng = np.random.default_rng(0)
dict_num = {e: rng.normal(size=(len(s), 8))
            for e, s in zip(df_seq["entry"], df_seq["sequence"])}

# Build pseudo-scales, drop rows containing NaN, then cluster into categories
ep = aa.EmbeddingPreprocessor()
df_scales = ep.build_scales(df_seq=df_seq, dict_num=dict_num).dropna()
df_cat = ep.build_cat(df_scales=df_scales, cat_min_th=0.3, subcat_min_th=0.6)
df_cat.head()
UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
scale_id category subcategory scale_name scale_description
0 dim_0 Embeddings Embeddings_cat0_subcat0 dim_0
1 dim_1 Embeddings Embeddings_cat1_subcat3 dim_1
2 dim_2 Embeddings Embeddings_cat2_subcat1 dim_2
3 dim_3 Embeddings Embeddings_cat0_subcat2 dim_3
4 dim_4 Embeddings Embeddings_cat0_subcat0 dim_4
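Since the Notes state that subcategory labels do not necessarily nest within category labels, it can be useful to check this on a returned table. The sketch below uses a hand-made df_cat with hypothetical label values (only three of the documented columns are included), so it runs without aaanalysis; the same groupby would apply to a real df_cat.

```python
import pandas as pd

# Hypothetical df_cat excerpt with made-up labels
df_cat = pd.DataFrame({
    "scale_id": [f"dim_{i}" for i in range(6)],
    "category": ["PLM_cat_0", "PLM_cat_0", "PLM_cat_1",
                 "PLM_cat_1", "PLM_cat_2", "PLM_cat_2"],
    "subcategory": ["PLM_subcat_0", "PLM_subcat_0", "PLM_subcat_1",
                    "PLM_subcat_2", "PLM_subcat_2", "PLM_subcat_3"],
})

# Count how many distinct categories each subcategory spans
n_cats = df_cat.groupby("subcategory")["category"].nunique()

# Subcategories spanning more than one category are not nested
non_nested = n_cats[n_cats > 1].index.tolist()
print(non_nested)  # ['PLM_subcat_2']

# Cross-tabulation gives the full category x subcategory overlap
overlap = pd.crosstab(df_cat["category"], df_cat["subcategory"])
print(overlap)
```

An empty non_nested list would mean the two levels happen to form a strict hierarchy for that dataset and those thresholds.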