StructurePreprocessor.build_cat

StructurePreprocessor.build_cat(features, dim_names_override=None)

Build the df_cat metadata frame for features.

Pure registry lookup that requires no corpus. df_cat[category] is always 'Structure' for every StructurePreprocessor feature; the per-key semantics live in df_cat[subcategory] (see registry).

Added in version 1.1.0.

Parameters:
  • features (list of str) – Feature keys from the StructurePreprocessor registry, in the order they appear along the D axis of the encoder outputs.

  • dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal the total dimensionality across features.

Returns:

df_cat – One row per dimension, with the columns scale_id, category, subcategory, scale_name, and scale_description. category is the top-level color/redundancy bucket; subcategory carries the fine-grained semantic split ('DSSP_SS_3state', 'Flexibility_bfactor', etc.).

Return type:

pd.DataFrame, shape (D_total, 5)
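
To make the idea of a pure registry lookup concrete, the following is a minimal sketch in plain Python. The TOY_REGISTRY contents, the 'ss3' feature key with its three dimension names, and the build_cat_rows helper are hypothetical illustrations and not aaanalysis internals; reusing the dimension name as both scale_id and scale_name is likewise an assumption. The sketch only shows how each feature key can expand into one metadata row per dimension, in feature order, without reading any sequence or structure data.

```python
# Hypothetical toy registry: feature key -> (subcategory, dimension names).
# None of these entries are taken from the real aaanalysis registry.
TOY_REGISTRY = {
    "plddt": ("AlphaFold pLDDT (raw)", ["plddt"]),
    "bfactor": ("B-factor (CA mean)", ["bfactor"]),
    "ss3": ("DSSP_SS_3state", ["ss3_H", "ss3_E", "ss3_C"]),  # assumed 3-dim feature
}

COLUMNS = ["scale_id", "category", "subcategory", "scale_name", "scale_description"]


def build_cat_rows(features):
    """Return one metadata row (dict) per dimension, following feature order."""
    rows = []
    for key in features:
        if key not in TOY_REGISTRY:
            raise ValueError(f"Unknown feature key: {key!r}")
        subcategory, dim_names = TOY_REGISTRY[key]
        for name in dim_names:
            rows.append({
                "scale_id": name,
                "category": "Structure",  # fixed top-level bucket
                "subcategory": subcategory,
                "scale_name": name,
                "scale_description": f"Structure/{subcategory}",
            })
    return rows


rows = build_cat_rows(["plddt", "ss3"])
print(len(rows))                          # number of dimensions (D_total)
print([row["scale_id"] for row in rows])  # dimension order along the D axis
```

Passing such rows to pd.DataFrame(rows, columns=COLUMNS) would give a frame of shape (D_total, 5), matching the documented return type.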

Examples

build_cat returns the corpus-free df_cat metadata frame. It names each structure dimension with its category (always Structure, the locked redundancy/color bucket) and a descriptive subcategory used for the CPPPlot.feature_map y-axis. The result is a drop-in df_cat for CPP.run_num.

import warnings
from pathlib import Path
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
stp = aa.StructurePreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

df_cat = stp.build_cat(features=['plddt', 'contact_count_8A', 'bfactor'])
print('categories:', df_cat[ut.COL_CAT].unique().tolist())
df_cat
categories: ['Structure']

   scale_id     category   subcategory            scale_name   scale_description
0  plddt        Structure  AlphaFold pLDDT (raw)  plddt        Structure/AlphaFold pLDDT (raw)
1  contacts_8A  Structure  CA-CA contacts (8 A)   contacts_8A  Structure/CA-CA contacts (8 A)
2  bfactor      Structure  B-factor (CA mean)     bfactor      Structure/B-factor (CA mean)

Further parameters. StructurePreprocessor.build_cat also accepts dim_names_override: replacement names for the D columns, whose length must equal the total dimensionality across features.
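
The length requirement on dim_names_override can be illustrated with a small stand-alone sketch. The resolve_dim_names helper and the example names are hypothetical and are not part of aaanalysis; how the library itself applies the override to df_cat, and which error it raises on a mismatch, is not stated here, so the ValueError below is an assumption.

```python
def resolve_dim_names(default_names, dim_names_override=None):
    """Pick the names for the D columns, enforcing the length rule.

    Hypothetical helper: it only mirrors the documented constraint that the
    override must have one name per dimension across all features.
    """
    if dim_names_override is None:
        return list(default_names)
    if len(dim_names_override) != len(default_names):
        raise ValueError(
            f"dim_names_override has {len(dim_names_override)} names, "
            f"expected {len(default_names)}"
        )
    return list(dim_names_override)


default_names = ["plddt", "contacts_8A", "bfactor"]  # three dimensions in total
custom_names = ["confidence", "contacts", "flexibility"]  # illustrative names

names_default = resolve_dim_names(default_names)
names_custom = resolve_dim_names(default_names, custom_names)
print(names_default)
print(names_custom)
```

Checking the length up front keeps a misaligned override from silently mislabeling dimensions further down the pipeline.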