AnnotationPreprocessor.build_scales

AnnotationPreprocessor.build_scales(df_seq, dict_num, features, return_std=False, dim_names_override=None)

Build df_scales by context-free, per-amino-acid (AA) averaging over the corpus.

Mirrors StructurePreprocessor.build_scales(): for each canonical amino acid and each of the D dimensions, the pseudo-scale entry is the mean of the normalized per-residue values over all occurrences of that AA in the corpus. This is required for the cor > max_cor redundancy gate in CPP.run_num() to be discriminative; a df_scales whose entries are all equal disables the gate.
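The averaging described above can be sketched with plain NumPy and pandas. This is a minimal illustration, not the library's implementation: per_aa_means is a hypothetical helper, the input values are assumed to be normalized already, and the toy data are invented for the example.

```python
import numpy as np
import pandas as pd

CANONICAL_AAS = list("ACDEFGHIKLMNPQRSTVWY")


def per_aa_means(df_seq, dict_num, dim_names):
    """Hypothetical helper: context-free per-AA means of per-residue values."""
    n_dims = len(dim_names)
    sums = {aa: np.zeros(n_dims) for aa in CANONICAL_AAS}
    counts = {aa: 0 for aa in CANONICAL_AAS}
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        values = np.asarray(dict_num[entry], dtype=float)  # shape (L_entry, D)
        if values.shape != (len(seq), n_dims):
            raise ValueError(f"{entry}: unexpected tensor shape {values.shape}")
        # Pool every occurrence of each AA, regardless of sequence context.
        for aa, row in zip(seq, values):
            if aa in sums:  # non-canonical residues are skipped (assumption)
                sums[aa] += row
                counts[aa] += 1
    rows = [
        sums[aa] / counts[aa] if counts[aa] else np.full(n_dims, np.nan)
        for aa in CANONICAL_AAS
    ]
    return pd.DataFrame(rows, index=CANONICAL_AAS, columns=dim_names)


# Toy corpus: one protein, one dimension.
df_seq_toy = pd.DataFrame({"entry": ["P1"], "sequence": ["ACDAD"]})
dict_num_toy = {"P1": np.array([[0.0], [0.0], [1.0], [0.5], [0.0]])}
df_scales_sketch = per_aa_means(df_seq_toy, dict_num_toy, ["hotspot"])
```

In this toy corpus A occurs twice (values 0.0 and 0.5) and D occurs twice (1.0 and 0.0), so their entries are the two-value means; amino acids that never occur stay NaN.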

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.

  • dict_num (dict[str, np.ndarray]) – Per-residue tensors {entry: (L_entry, D) ndarray} from encode() (or combined via aaanalysis.combine_dict_nums()).

  • features (list of str) – Registry keys in the same order as the dict_num D-axis layout.

  • return_std (bool, default=False) – If True, also return per-AA standard deviations.

  • dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D.

Returns:

  • df_scales (pd.DataFrame, shape (20, D)) – Rows are the 20 canonical AAs; columns are dim names; cells are per-AA means of normalized values (NaN where the AA is absent).

  • df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations; returned only when return_std=True.

Raises:

ValueError – On missing corpus, mismatched D, missing entries, or invalid keys.
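The error conditions listed above can be checked before calling build_scales. The sketch below is an assumption about what such checks might look like, not the library's own validation: check_corpus is a hypothetical helper, and it only covers an empty corpus, entries missing from dict_num, row counts that do not match the sequence length, and a D that differs between entries.

```python
import numpy as np
import pandas as pd


def check_corpus(df_seq, dict_num):
    """Hypothetical pre-flight check; returns a list of problem descriptions."""
    if df_seq.empty or not dict_num:
        return ["empty corpus"]
    problems = []
    n_dims_seen = set()
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            problems.append(f"{entry}: missing from dict_num")
            continue
        arr = np.asarray(dict_num[entry])
        if arr.ndim != 2:
            problems.append(f"{entry}: expected a 2-D array, got ndim={arr.ndim}")
            continue
        if arr.shape[0] != len(seq):
            problems.append(f"{entry}: {arr.shape[0]} rows for {len(seq)} residues")
        n_dims_seen.add(arr.shape[1])
    # All entries must share the same D for the columns to be comparable.
    if len(n_dims_seen) > 1:
        problems.append(f"inconsistent D across entries: {sorted(n_dims_seen)}")
    return problems


# Toy corpus with two deliberate faults: P3 has no tensor, and D differs.
df_seq_bad = pd.DataFrame({"entry": ["P1", "P2", "P3"],
                           "sequence": ["ACD", "ACDE", "AC"]})
dict_num_bad = {"P1": np.zeros((3, 2)), "P2": np.zeros((4, 3))}
problems = check_corpus(df_seq_bad, dict_num_bad)

# A consistent corpus yields no problems.
problems_ok = check_corpus(df_seq_bad.iloc[:1], {"P1": np.zeros((3, 2))})
```

Returning a list instead of raising on the first fault makes it easier to fix a corpus in one pass; the real method raises ValueError.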

Warning

UserWarning

Pseudo-scales depend on the content of df_seq and dict_num, so a different corpus yields different scales.

Examples

build_scales collapses an annotation dict_num into context-free per-amino-acid means, giving a df_scales of shape (20, D). This keeps the redundancy correlation gate of CPP.run_num live and feeds the scale-based CPP.run.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

ap = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = ap.ingest(df_user)

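# Encode the annotations into per-residue tensors {entry: (L_entry, D)}.
dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot, features=['hotspot'])
# Collapse the tensors into per-AA means; df_scales has shape (20, D).
df_scales = ap.build_scales(df_seq=df_seq, dict_num=dict_num,
                            features=['hotspot'])
df_scales.head()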
   hotspot
A     0.00
C     0.00
D     0.46
E     0.00
F     0.00
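Why an all-equal df_scales disables a cor > max_cor gate can be illustrated with Pearson correlation. The sketch below assumes the gate compares the correlation between two scale columns against max_cor; how CPP.run_num actually computes it is not shown here, and is_redundant, the column names, and the threshold value are hypothetical.

```python
import numpy as np
import pandas as pd

aas = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

# All-equal scales: every AA has the same value in every dimension.
df_flat = pd.DataFrame({"dim_a": np.full(20, 0.5),
                        "dim_b": np.full(20, 0.5)}, index=aas)

# Varied scales: dim_b is a linear function of dim_a, so the two are redundant.
base = rng.random(20)
df_varied = pd.DataFrame({"dim_a": base,
                          "dim_b": 0.9 * base + 0.05}, index=aas)

max_cor = 0.7  # hypothetical threshold


def is_redundant(df, col_1, col_2, max_cor):
    """Hypothetical gate: True if the Pearson correlation exceeds max_cor."""
    x = df[col_1].to_numpy()
    y = df[col_2].to_numpy()
    # Constant columns have zero variance, so the correlation is 0/0 = NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        cor = np.corrcoef(x, y)[0, 1]
    # Any comparison with NaN is False, so the gate never fires on flat scales.
    return bool(cor > max_cor)


flat_flagged = is_redundant(df_flat, "dim_a", "dim_b", max_cor)
varied_flagged = is_redundant(df_varied, "dim_a", "dim_b", max_cor)
```

With the flat scales the comparison is always False, so redundant dimensions pass unfiltered; with per-AA means that vary, the same comparison flags the linearly dependent pair.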

Further parameters. AnnotationPreprocessor.build_scales also accepts return_std (if True, also return per-AA standard deviations) and dim_names_override (replacement names for the D columns; length must equal D).
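What these two options produce can be sketched with a pandas groupby on toy data. This is an illustration under assumptions, not the library's code: the population standard deviation (ddof=0) is a placeholder since the estimator used by build_scales is not stated, and the dimension names are invented.

```python
import numpy as np
import pandas as pd

CANONICAL_AAS = list("ACDEFGHIKLMNPQRSTVWY")

# Toy per-residue values for the sequence "AACC" with D = 2.
seq = "AACC"
values = np.array([[0.0, 1.0],
                   [2.0, 1.0],
                   [4.0, 0.0],
                   [4.0, 2.0]])

df_res = pd.DataFrame(values, columns=["dim_0", "dim_1"])
df_res["aa"] = list(seq)

# Per-AA mean and standard deviation; absent AAs become NaN after reindexing.
df_means = df_res.groupby("aa").mean().reindex(CANONICAL_AAS)
df_stds = df_res.groupby("aa").std(ddof=0).reindex(CANONICAL_AAS)  # ddof is an assumption

# Analogue of dim_names_override: the replacement list must have length D.
new_names = ["site_score", "site_flag"]  # hypothetical names
if len(new_names) != df_means.shape[1]:
    raise ValueError("replacement names must match the number of dimensions")
df_means.columns = new_names
df_stds.columns = new_names
```

Both frames keep the (20, D) layout, so df_stds can be read cell by cell alongside df_means to judge how consistent each per-AA mean is across the corpus.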