EmbeddingPreprocessor.build_cat

EmbeddingPreprocessor.build_cat(df_scales, df_stds=None, cat_min_th=0.5, subcat_min_th=0.7, metric='correlation', random_state=0)

Build a two-level pseudo-category table by clustering pseudo-scales via AAclust.

Two independent AAclust runs at different correlation thresholds produce coarser cat labels and finer subcat labels for each embedding dimension. Mirrors the AAontology df_cat schema so the result is a drop-in for the df_cat argument of CPP.__init__().

When df_stds is supplied, clustering becomes std-aware: each dimension is represented by the concatenation of its per-amino-acid (AA) means and standard deviations, z-scored per column, which gives a descriptor matrix of shape (D, 40) instead of (D, 20). Two dimensions with similar per-AA means but very different per-AA stds then no longer collapse into the same cluster.
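The std-aware descriptor described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's code: the helper name build_std_aware_descriptor, the toy data, the population standard deviation (ddof=0), and the handling of constant columns are all assumptions.

```python
import numpy as np
import pandas as pd

def build_std_aware_descriptor(df_scales, df_stds):
    """Hypothetical helper: stack per-AA means and stds into a (D, 40) matrix."""
    # Transpose so that each row is one embedding dimension: (D, 20) each.
    means = df_scales.T.to_numpy(dtype=float)
    # Align df_stds to df_scales before transposing (same AAs, same dimensions).
    stds = df_stds.loc[df_scales.index, df_scales.columns].T.to_numpy(dtype=float)
    X = np.hstack([means, stds])  # (D, 40): mean half followed by std half
    # z-score each of the 40 columns across the D dimensions.
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # assumption: leave constant columns at zero instead of dividing by 0
    return (X - mu) / sd

# Toy inputs (random values, for shape illustration only).
rng = np.random.default_rng(0)
aa_letters = list("ACDEFGHIKLMNPQRSTVWY")
cols = [f"dim_{i}" for i in range(8)]
df_scales_toy = pd.DataFrame(rng.normal(size=(20, 8)), index=aa_letters, columns=cols)
df_stds_toy = pd.DataFrame(rng.uniform(0.1, 2.0, size=(20, 8)), index=aa_letters, columns=cols)

X = build_std_aware_descriptor(df_scales_toy, df_stds_toy)
print(X.shape)
```

Each row of X would then be compared with the other rows by Pearson correlation, so a dimension's mean profile and std profile both contribute to its similarity to other dimensions.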

Added in version 1.1.0.

Parameters:

df_scales (pd.DataFrame, shape (20, D)) – Pseudo-scale DataFrame produced by build_scales() (or a user-supplied analog with the same shape). Must have at least 3 columns.
df_stds (pd.DataFrame, shape (20, D), optional) – Per-AA standard deviations matching df_scales exactly in shape, index, and columns. Produce via build_scales(..., return_std=True). When supplied, enables std-aware clustering (see Notes); when None (default), mean-only clustering is used. Must contain no NaN — drop the same rows you dropped from df_scales.
cat_min_th (float, default=0.5) – AAclust correlation threshold for the coarser (cat) level. Lower values produce fewer, larger clusters.
subcat_min_th (float, default=0.7) – AAclust correlation threshold for the finer (subcat) level. Must be greater than cat_min_th.
metric ({'correlation', 'cosine'}, default='correlation') – Distance metric forwarded to AAclust.fit(). Controls the optional cluster-merging step and medoid selection only; the k-optimization phase is always Pearson-correlation-based.
random_state (int, default=0) – Random seed threaded through AAclust for reproducible cluster IDs.
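The constraints listed for these parameters can be checked before calling build_cat. The following is a minimal pre-flight sketch; check_build_cat_inputs is a hypothetical helper that is not part of aaanalysis, and the library performs its own validation, whose messages and order may differ.

```python
import numpy as np
import pandas as pd

def check_build_cat_inputs(df_scales, df_stds=None, cat_min_th=0.5, subcat_min_th=0.7):
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    problems = []
    if df_scales.shape[1] < 3:
        problems.append("df_scales needs at least 3 columns")
    if not subcat_min_th > cat_min_th:
        problems.append("subcat_min_th must be greater than cat_min_th")
    if df_stds is not None:
        if df_stds.shape != df_scales.shape:
            problems.append("df_stds shape differs from df_scales")
        elif not (df_stds.index.equals(df_scales.index)
                  and df_stds.columns.equals(df_scales.columns)):
            problems.append("df_stds index/columns differ from df_scales")
        if df_stds.isna().any().any():
            problems.append("df_stds contains NaN")
    return problems

# Toy inputs with matching shape, index, and columns.
aa_letters = list("ACDEFGHIKLMNPQRSTVWY")
cols = ["dim_0", "dim_1", "dim_2"]
df_scales_toy = pd.DataFrame(np.zeros((20, 3)), index=aa_letters, columns=cols)
df_stds_toy = pd.DataFrame(np.ones((20, 3)), index=aa_letters, columns=cols)

ok = check_build_cat_inputs(df_scales_toy, df_stds_toy)
bad = check_build_cat_inputs(df_scales_toy, df_stds_toy, cat_min_th=0.7, subcat_min_th=0.5)
print(ok, bad)
```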

Returns:

df_cat – Pseudo-category DataFrame with columns scale_id, category ("PLM_cat_<k>"), subcategory ("PLM_subcat_<k>"), scale_name, scale_description. The scale_id column matches the column labels of df_scales. Drop-in for the df_cat argument of CPP.__init__().

Return type:

pd.DataFrame, shape (D, 5)

Notes

The two AAclust runs are independent. subcat labels do not necessarily nest within cat labels — they are two views over the same pseudo-scales at different correlation thresholds.
The metric parameter affects only the optional cluster-merging step and medoid selection, not k-optimization. Experimenting with non-Pearson similarity during k-optimization would require a deeper change to AAclust.
Std-aware recipe (when df_stds is supplied). This is a composition of three textbook ingredients, not a single named method: (i) each dimension is represented by its per-AA (mean, std), the sufficient statistics of a 1-D Gaussian over that AA's residue embeddings; (ii) per-column z-scoring across the D dimensions puts the mean half and the std half on a common footing so that neither dominates the row-wise Pearson correlation [MilliganCooper88]; (iii) AAclust then clusters the (D, 40) descriptor matrix by Pearson row-correlation, in the same tradition as gene-expression feature clustering [Eisen98]. The recipe is not a closed-form approximation of the Bhattacharyya distance or the symmetric KL divergence between per-AA Gaussians: under equal variance, Bhattacharyya reduces to a function of (μ₁ − μ₂)² alone, which would motivate dropping the std half.
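The equal-variance remark can be verified numerically. The sketch below uses the standard closed form of the Bhattacharyya distance between two 1-D Gaussians; the function name and the example values are illustrative, and nothing here is part of build_cat.

```python
import math

def bhattacharyya_gauss_1d(mu1, sigma1, mu2, sigma2):
    """Bhattacharyya distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    var_sum = sigma1 ** 2 + sigma2 ** 2
    mean_term = 0.25 * (mu1 - mu2) ** 2 / var_sum
    var_term = 0.5 * math.log(var_sum / (2.0 * sigma1 * sigma2))
    return mean_term + var_term

# Equal variances: the log term vanishes and only (mu1 - mu2)^2 / (8 sigma^2) remains.
d_equal = bhattacharyya_gauss_1d(0.0, 1.5, 2.0, 1.5)
closed_form = (0.0 - 2.0) ** 2 / (8.0 * 1.5 ** 2)

# Equal means but different variances: the distance is still positive,
# which is the information the std half of the descriptor carries.
d_same_mean = bhattacharyya_gauss_1d(0.0, 0.5, 0.0, 2.0)

print(math.isclose(d_equal, closed_form), d_same_mean > 0)
```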

See also

AAclust: the underlying clustering algorithm.
build_scales(): produces the expected input(s); pass return_std=True to get df_stds for std-aware mode.

Examples

EmbeddingPreprocessor.build_cat clusters the pseudo-scales into a two-level pseudo-category table (a df_cat), giving each embedding dimension a coarse category label and a finer subcategory label. The result is a drop-in for the df_cat argument of CPP.__init__().

import numpy as np
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
rng = np.random.default_rng(0)
dict_num = {e: rng.normal(size=(len(s), 8))
            for e, s in zip(df_seq["entry"], df_seq["sequence"])}

embp = aa.EmbeddingPreprocessor()
df_scales = embp.build_scales(df_seq=df_seq, dict_num=dict_num).dropna()
df_cat = embp.build_cat(df_scales=df_scales, cat_min_th=0.3, subcat_min_th=0.6)
df_cat.head()

UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.

	scale_id	category	subcategory	scale_name
0	dim_0	Embeddings	Embeddings_cat0_subcat0	dim_0
1	dim_1	Embeddings	Embeddings_cat1_subcat3	dim_1
2	dim_2	Embeddings	Embeddings_cat2_subcat1	dim_2
3	dim_3	Embeddings	Embeddings_cat0_subcat2	dim_3
4	dim_4	Embeddings	Embeddings_cat0_subcat0	dim_4

Further parameters. EmbeddingPreprocessor.build_cat also accepts df_stds (per-AA standard deviations matching df_scales exactly in shape, index, and columns), metric (distance metric forwarded to AAclust.fit()), and random_state (random seed threaded through AAclust for reproducible cluster IDs).

# Further parameters: pass matching per-AA std devs (``df_stds``) alongside
# ``df_scales``, choose the redundancy ``metric``, and fix ``random_state`` for
# reproducible AAclust cluster IDs.
df_scales_full, df_stds = embp.build_scales(df_seq=df_seq, dict_num=dict_num,
                                            return_std=True)
df_scales_full = df_scales_full.dropna()
df_stds = df_stds.loc[df_scales_full.index]
df_cat_repro = embp.build_cat(df_scales=df_scales_full, df_stds=df_stds,
                              cat_min_th=0.3, subcat_min_th=0.6,
                              metric='cosine', random_state=42)
aa.display_df(df_cat_repro, n_rows=10, show_shape=True)

DataFrame shape: (8, 5)


	scale_id	category	subcategory	scale_name
1	dim_0	Embeddings	Embeddings_cat0_subcat2	dim_0
2	dim_1	Embeddings	Embeddings_cat1_subcat3	dim_1
3	dim_2	Embeddings	Embeddings_cat0_subcat0	dim_2
4	dim_3	Embeddings	Embeddings_cat0_subcat0	dim_3
5	dim_4	Embeddings	Embeddings_cat0_subcat4	dim_4
6	dim_5	Embeddings	Embeddings_cat1_subcat1	dim_5
7	dim_6	Embeddings	Embeddings_cat1_subcat5	dim_6
8	dim_7	Embeddings	Embeddings_cat1_subcat1	dim_7
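Because the Notes state that subcat labels need not nest within cat labels, it can be useful to check a df_cat-like table for subcategories that span several categories. The sketch below is one way to do that with pandas; find_non_nested_subcats is a hypothetical helper, and the toy table uses made-up labels rather than real build_cat output.

```python
import pandas as pd

def find_non_nested_subcats(df_cat, col_cat="category", col_subcat="subcategory"):
    """Hypothetical helper: list subcategory labels that occur in more than one category."""
    n_cats = df_cat.groupby(col_subcat)[col_cat].nunique()
    return n_cats[n_cats > 1].index.tolist()

# Toy table with illustrative labels: subcat_1 appears under both cat_0 and cat_1.
df_toy = pd.DataFrame({
    "scale_id": [f"dim_{i}" for i in range(5)],
    "category": ["cat_0", "cat_0", "cat_1", "cat_1", "cat_1"],
    "subcategory": ["subcat_0", "subcat_1", "subcat_1", "subcat_2", "subcat_2"],
})

non_nested = find_non_nested_subcats(df_toy)
print(non_nested)  # ['subcat_1']
```

An empty list would mean that every subcategory sits inside exactly one category for that particular table.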