AnnotationPreprocessor.build_cat

AnnotationPreprocessor.build_cat(features, dim_names_override=None)

Build the df_cat metadata frame for features (corpus-free).

df_cat['category'] is either 'PTMs' or 'Functional sites'; df_cat['subcategory'] carries a finer, per-key label (for example FUNC_hotspot).

Added in version 1.1.0.

Parameters:
  • features (list of str) – Registry keys, in the order they appear along the D axis.

  • dim_names_override (list of str, optional) – Replacement names for the D columns.

Returns:

df_cat – One row per dimension: scale_id, category, subcategory, scale_name, scale_description.

Return type:

pd.DataFrame, shape (D, 5)

Raises:

ValueError – On invalid or unregistered feature keys in features.
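The contract described above (one row per feature key, five fixed columns, ValueError on an unknown key) can be illustrated with a small pandas-only stand-in. This is a sketch, not the library's implementation: build_cat_sketch, TOY_REGISTRY, and the toy_ptm entry are hypothetical names made up for illustration, and the scale_name and scale_description values simply mirror the pattern visible in the example output further down.

```python
import pandas as pd

COLUMNS = ["scale_id", "category", "subcategory", "scale_name", "scale_description"]

# Hypothetical registry for illustration only; the real registry lives inside aaanalysis.
TOY_REGISTRY = {
    "hotspot": ("Functional sites", "FUNC_hotspot"),
    "toy_ptm": ("PTMs", "PTM_toy"),  # made-up key and subcategory
}


def build_cat_sketch(features, registry=TOY_REGISTRY):
    """Return one metadata row per feature key, in the order given."""
    unknown = [key for key in features if key not in registry]
    if unknown:
        raise ValueError(f"Unregistered feature keys: {unknown}")
    rows = []
    for key in features:
        category, subcategory = registry[key]
        rows.append({
            "scale_id": key,
            "category": category,
            "subcategory": subcategory,
            "scale_name": key,
            "scale_description": f"{category}/{subcategory}",
        })
    return pd.DataFrame(rows, columns=COLUMNS)


df_sketch = build_cat_sketch(["hotspot", "toy_ptm"])
print(df_sketch.shape)  # (D, 5) with D = 2 here

# An unknown key is rejected before any row is built.
try:
    build_cat_sketch(["not_registered"])
except ValueError as err:
    print("ValueError:", err)
```

The sketch leaves out dim_names_override because the page does not say which column the replacement names are written to.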

Examples

build_cat returns the corpus-free df_cat metadata frame. It tags each annotation dimension with its category (PTMs for the closed UniProt vocabulary, Functional sites for the open one), and each category has a locked color. The result is a drop-in df_cat for CPP.run_num.

import warnings
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut

# Silence library output and warnings to keep the example output short.
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

ap = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
# Ingest the user table.
df_annot = ap.ingest(df_user)

# Build the metadata frame for the single 'hotspot' dimension.
df_cat = ap.build_cat(features=['hotspot'])
# Look up the color assigned to the category of the first row.
print('category:', df_cat[ut.COL_CAT].iloc[0],
      '| color:', ut.DICT_COLOR_CAT[df_cat[ut.COL_CAT].iloc[0]])
df_cat
category: Functional sites | color: #2C6E9E
   scale_id          category   subcategory  scale_name              scale_description
0   hotspot  Functional sites  FUNC_hotspot     hotspot  Functional sites/FUNC_hotspot
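Before handing a frame like the one printed above to a downstream step, it can be worth checking it against the documented return contract. The sketch below uses pandas only; check_df_cat is a hypothetical helper, not part of aaanalysis, and it assumes the columns appear in the order listed under Returns. The frame is rebuilt by hand from the printed output so the snippet runs without the library.

```python
import pandas as pd

EXPECTED_COLUMNS = ["scale_id", "category", "subcategory", "scale_name", "scale_description"]
ALLOWED_CATEGORIES = {"PTMs", "Functional sites"}


def check_df_cat(df_cat, features):
    """Raise ValueError if df_cat breaks the documented (D, 5) contract."""
    if list(df_cat.columns) != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {list(df_cat.columns)}")
    if len(df_cat) != len(features):
        raise ValueError(f"Expected {len(features)} rows, got {len(df_cat)}")
    bad = set(df_cat["category"]) - ALLOWED_CATEGORIES
    if bad:
        raise ValueError(f"Unexpected categories: {sorted(bad)}")
    return True


# Frame rebuilt by hand from the printed example output.
df_example = pd.DataFrame(
    [{
        "scale_id": "hotspot",
        "category": "Functional sites",
        "subcategory": "FUNC_hotspot",
        "scale_name": "hotspot",
        "scale_description": "Functional sites/FUNC_hotspot",
    }],
    columns=EXPECTED_COLUMNS,
)

ok = check_df_cat(df_example, features=["hotspot"])
print(ok)
```

With the real library, the same helper could be called as check_df_cat(df_cat, features=['hotspot']) directly on the frame returned by build_cat.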

Further parameters. AnnotationPreprocessor.build_cat also accepts dim_names_override, a list of replacement names for the D columns.