aaanalysis.EmbeddingPreprocessor.build_cat
- EmbeddingPreprocessor.build_cat(df_scales=None, df_stds=None, cat_min_th=0.5, subcat_min_th=0.7, metric='correlation', random_state=0)
Build a two-level pseudo-category table by clustering pseudo-scales via AAclust.
Two independent AAclust runs at different correlation thresholds produce coarser cat labels and finer subcat labels for each embedding dimension. Mirrors the AAontology df_cat schema so the result is a drop-in for the df_cat argument of CPP.__init__().
When df_stds is supplied, clustering becomes std-aware: each dimension is represented by the per-column z-scored concatenation of its per-AA (mean, std) (shape (D, 40) instead of (D, 20)). Two dimensions with similar per-AA means but very different per-AA stds will then not collapse into the same cluster.
- Parameters:
df_scales (pd.DataFrame, shape (20, D)) – Pseudo-scale DataFrame produced by build_scales() (or a user-supplied analog with the same shape). Must have at least 3 columns.
df_stds (pd.DataFrame, shape (20, D), optional) – Per-AA standard deviations matching df_scales exactly in shape, index, and columns. Produce via build_scales(..., return_std=True). When supplied, enables std-aware clustering (see Notes); when None (default), mean-only clustering is used. Must contain no NaN: drop the same rows you dropped from df_scales.
cat_min_th (float, default=0.5) – AAclust correlation threshold for the coarser (cat) level. Lower values produce fewer, larger clusters.
subcat_min_th (float, default=0.7) – AAclust correlation threshold for the finer (subcat) level. Must be greater than cat_min_th.
metric ({'correlation', 'cosine'}, default='correlation') – Distance metric forwarded to AAclust.fit(). Controls the optional cluster-merging step and medoid selection only; the k-optimization phase is always Pearson-correlation-based.
random_state (int, default=0) – Random seed threaded through AAclust for reproducible cluster IDs.
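The constraints listed above can be checked before calling build_cat(). The following is a minimal sketch, not part of aaanalysis: the helper name check_build_cat_inputs and its error messages are hypothetical, and it only mirrors the rules stated in the parameter descriptions (at least 3 columns, subcat_min_th greater than cat_min_th, df_stds aligned with df_scales and free of NaN).

```python
import numpy as np
import pandas as pd


def check_build_cat_inputs(df_scales, df_stds=None, cat_min_th=0.5, subcat_min_th=0.7):
    """Hypothetical pre-flight check mirroring the documented constraints."""
    if df_scales.shape[1] < 3:
        raise ValueError("df_scales must have at least 3 columns")
    if subcat_min_th <= cat_min_th:
        raise ValueError("subcat_min_th must be greater than cat_min_th")
    if df_stds is not None:
        # df_stds must match df_scales in shape, index, and columns.
        if df_stds.shape != df_scales.shape:
            raise ValueError("df_stds and df_scales differ in shape")
        if not df_stds.index.equals(df_scales.index):
            raise ValueError("df_stds and df_scales differ in index")
        if not df_stds.columns.equals(df_scales.columns):
            raise ValueError("df_stds and df_scales differ in columns")
        if df_stds.isna().any().any():
            raise ValueError("df_stds must contain no NaN")
    return True


# Toy inputs (random stand-ins, not real pseudo-scales).
aa_letters = list("ACDEFGHIKLMNPQRSTVWY")
dims = [f"dim_{i}" for i in range(4)]
rng = np.random.default_rng(0)
df_scales = pd.DataFrame(rng.normal(size=(20, 4)), index=aa_letters, columns=dims)
df_stds = pd.DataFrame(rng.uniform(0.1, 1.0, size=(20, 4)), index=aa_letters, columns=dims)

ok = check_build_cat_inputs(df_scales, df_stds, cat_min_th=0.3, subcat_min_th=0.6)
print(ok)
```

Whether build_cat() itself raises on these conditions, and with which exception type, is not stated here; the sketch only makes the documented rules explicit.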
- Returns:
df_cat – Pseudo-category DataFrame with columns scale_id, category ("PLM_cat_<k>"), subcategory ("PLM_subcat_<k>"), scale_name, scale_description. The scale_id column matches the column labels of df_scales. Drop-in for the df_cat argument of CPP.__init__().
- Return type:
pd.DataFrame, shape (D, 5)
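Because the category and subcategory columns come from two independent clustering runs, it can be useful to check whether any subcategory spans more than one category in a returned table. A minimal sketch on a hand-made table follows; df_cat_toy and its labels are illustrative assumptions, not output of build_cat().

```python
import pandas as pd

# Hand-made stand-in for a df_cat with the documented column names.
df_cat_toy = pd.DataFrame({
    "scale_id": [f"dim_{i}" for i in range(6)],
    "category": ["PLM_cat_0", "PLM_cat_0", "PLM_cat_1",
                 "PLM_cat_1", "PLM_cat_1", "PLM_cat_0"],
    "subcategory": ["PLM_subcat_0", "PLM_subcat_0", "PLM_subcat_1",
                    "PLM_subcat_1", "PLM_subcat_2", "PLM_subcat_2"],
})

# Count how many distinct categories each subcategory touches.
n_cats_per_subcat = df_cat_toy.groupby("subcategory")["category"].nunique()

# Subcategories that straddle more than one category break strict nesting.
straddling = sorted(n_cats_per_subcat[n_cats_per_subcat > 1].index)
is_nested = len(straddling) == 0

print(straddling, is_nested)
```

In this toy table one subcategory covers scales from both categories, so the labels are not strictly hierarchical; the same groupby can be applied to a real df_cat.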
Notes
The two AAclust runs are independent. subcat labels do not necessarily nest within cat labels; they are two views over the same pseudo-scales at different correlation thresholds.
The metric parameter only affects post-hoc merging and medoid selection, not k-optimization. To experiment with non-Pearson similarity during k-optimization, a deeper AAclust change is required.
Std-aware recipe (when df_stds is supplied). A composition of three textbook ingredients, not a single named method: (i) each dimension is represented by its per-AA (mean, std), the sufficient statistics of a 1-D Gaussian over that AA's residue embeddings; (ii) per-column z-scoring across the D dimensions puts the mean half and std half on a common footing so neither dominates row-Pearson [MilliganCooper88]; (iii) AAclust then clusters the (D, 40) descriptor matrix by Pearson row-correlation, in the same tradition as gene-expression feature clustering [Eisen98]. This recipe is not a closed-form approximation of Bhattacharyya / symmetric-KL between per-AA Gaussians (under equal variance Bhattacharyya reduces to a function of (μ₁ − μ₂)² alone, which would motivate dropping the std half).
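Steps (i) and (ii) of the std-aware recipe can be illustrated with plain NumPy. This is a sketch under assumptions, not the library's implementation: the helper names zscore_columns and build_std_aware_descriptor are hypothetical, the data are synthetic, and AAclust itself (step iii) is replaced by a direct look at the row-Pearson correlation it would operate on.

```python
import numpy as np


def zscore_columns(x):
    """Z-score each column across rows; constant columns are left at 0."""
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    return (x - mu) / sd


def build_std_aware_descriptor(means, stds):
    """means, stds: arrays of shape (20, D). Returns a (D, 40) descriptor."""
    # One row per dimension: 20 per-AA means followed by 20 per-AA stds.
    concat = np.hstack([means.T, stds.T])
    # Per-column z-scoring across the D dimensions.
    return zscore_columns(concat)


rng = np.random.default_rng(0)
n_aa, n_dim = 20, 6
means = rng.normal(size=(n_aa, n_dim))
stds = rng.uniform(0.5, 1.5, size=(n_aa, n_dim))

# Dimensions 0 and 1: nearly identical per-AA means ...
means[:, 1] = means[:, 0] + 0.01 * rng.normal(size=n_aa)
# ... but opposite per-AA std profiles.
stds[:, 0] = np.linspace(0.1, 2.0, n_aa)
stds[:, 1] = np.linspace(2.0, 0.1, n_aa)

descriptor = build_std_aware_descriptor(means, stds)

# Row-Pearson between dimensions 0 and 1, without and with the std half.
r_mean_only = np.corrcoef(means.T)[0, 1]
r_std_aware = np.corrcoef(descriptor)[0, 1]

print(descriptor.shape, round(float(r_mean_only), 3), round(float(r_std_aware), 3))
```

With means alone the two dimensions look almost perfectly correlated; once the std half is appended, their row correlation drops, which is the behaviour the recipe relies on to keep such dimensions in separate clusters.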
See also
AAclust: the underlying clustering algorithm.
build_scales(): produces the expected input(s); pass return_std=True to get df_stds for std-aware mode.
Examples
EmbeddingPreprocessor.build_cat clusters the pseudo-scales into a two-level pseudo-category table (a df_cat) so the embedding dimensions get coarse category and finer subcategory labels: the drop-in df_cat for CPP.run().
    import numpy as np
    import aaanalysis as aa
    aa.options["verbose"] = False
    df_seq = aa.load_dataset(name="DOM_GSEC", n=10)
    # Random 8-dimensional per-residue arrays stand in for real embeddings
    rng = np.random.default_rng(0)
    dict_num = {e: rng.normal(size=(len(s), 8)) for e, s in zip(df_seq["entry"], df_seq["sequence"])}
    ep = aa.EmbeddingPreprocessor()
    # Build pseudo-scales and drop rows containing NaN
    df_scales = ep.build_scales(df_seq=df_seq, dict_num=dict_num).dropna()
    # Cluster the pseudo-scales at two correlation thresholds
    df_cat = ep.build_cat(df_scales=df_scales, cat_min_th=0.3, subcat_min_th=0.6)
    df_cat.head()
/tmp/claude-501/ipykernel_85604/3735151386.py:11: UserWarning: Pseudo-scales are dataset-dependent (averaged over df_seq). For reproducible cross-dataset comparison, compute them once on a fixed reference corpus and reuse the resulting df_scales.
  df_scales = ep.build_scales(df_seq=df_seq, dict_num=dict_num).dropna()
   scale_id  category    subcategory              scale_name  scale_description
0  dim_0     Embeddings  Embeddings_cat0_subcat0  dim_0
1  dim_1     Embeddings  Embeddings_cat1_subcat3  dim_1
2  dim_2     Embeddings  Embeddings_cat2_subcat1  dim_2
3  dim_3     Embeddings  Embeddings_cat0_subcat2  dim_3
4  dim_4     Embeddings  Embeddings_cat0_subcat0  dim_4