AnnotationPreprocessor.build_scales

AnnotationPreprocessor.build_scales(df_seq, dict_num, features, return_std=False, dim_names_override=None)

Build df_scales by context-free per-amino-acid (AA) averaging of the corpus.

Mirrors StructurePreprocessor.build_scales(): for each canonical amino acid and each of the D dimensions, the pseudo-scale entry is the mean of the normalized per-residue values over all occurrences of that AA in the corpus. This is required so that the cor > max_cor redundancy gate of CPP.run_num() stays discriminative (an all-equal df_scales disables it).
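The per-AA averaging described above can be sketched in plain NumPy and pandas. This is an illustration only, not the library's implementation: per_aa_stats, CANONICAL_AA, and the toy corpus are hypothetical names, the input arrays are assumed to be already normalized, and the population standard deviation (ddof=0) is an assumption.

```python
import numpy as np
import pandas as pd

CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")  # hypothetical constant for this sketch


def per_aa_stats(sequences, arrays, dim_names):
    """Pool per-residue rows by amino acid and return (means, stds) tables.

    sequences: {entry: sequence string}
    arrays:    {entry: (L_entry, D) array of already normalized values}
    """
    frames = []
    for entry, seq in sequences.items():
        values = np.asarray(arrays[entry], dtype=float)
        if values.shape != (len(seq), len(dim_names)):
            raise ValueError(f"{entry}: expected shape {(len(seq), len(dim_names))}, "
                             f"got {values.shape}")
        df = pd.DataFrame(values, columns=dim_names)
        df["aa"] = list(seq)  # one row per residue, labelled by its amino acid
        frames.append(df)
    pooled = pd.concat(frames, ignore_index=True)
    grouped = pooled.groupby("aa")[dim_names]
    # Reindex to the 20 canonical AAs: absent AAs become NaN rows,
    # non-canonical residues are dropped.
    df_means = grouped.mean().reindex(CANONICAL_AA)
    df_stds = grouped.std(ddof=0).reindex(CANONICAL_AA)  # ddof=0 is an assumption
    return df_means, df_stds


# Toy corpus: two entries, one dimension.
sequences = {"P1": "ACDA", "P2": "DD"}
arrays = {"P1": np.array([[0.0], [0.5], [1.0], [1.0]]),
          "P2": np.array([[0.0], [0.2]])}
df_means, df_stds = per_aa_stats(sequences, arrays, dim_names=["dim0"])
print(df_means.dropna())
```

Because residues are pooled without regard to their position or neighbours, the same amino acid always maps to a single value per dimension, which is what makes the resulting table usable as a scale.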

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.
dict_num (dict[str, np.ndarray]) – Per-residue tensors {entry: (L_entry, D) ndarray} from encode() (or combined via aaanalysis.combine_dict_nums()).
features (list of str) – Registry keys in the same order as the dict_num D-axis layout.
return_std (bool, default=False) – If True, also return per-AA standard deviations.
dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D.

Returns:

df_scales (pd.DataFrame, shape (20, D)) – Rows are the 20 canonical AAs; columns are dim names; cells are per-AA means of normalized values (NaN where the AA is absent).
df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations, only when return_std=True.

Raises:

ValueError – On missing corpus, mismatched D, missing entries, or invalid keys.
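To catch such problems before calling build_scales, the inputs can be checked for consistency up front. The helper below is a hypothetical sketch (find_corpus_problems is not part of aaanalysis), and its checks follow the parameter descriptions above rather than the library's own validation, which may differ.

```python
import numpy as np
import pandas as pd


def find_corpus_problems(df_seq, dict_num):
    """Return a list of problems; an empty list means the inputs look consistent."""
    if df_seq.empty or not dict_num:
        return ["empty corpus"]
    problems = []
    expected_d = None  # D is taken from the first valid entry
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            problems.append(f"{entry}: no array in dict_num")
            continue
        arr = np.asarray(dict_num[entry])
        if arr.ndim != 2 or arr.shape[0] != len(seq):
            problems.append(f"{entry}: shape {arr.shape} does not match "
                            f"sequence length {len(seq)}")
            continue
        if expected_d is None:
            expected_d = arr.shape[1]
        elif arr.shape[1] != expected_d:
            problems.append(f"{entry}: D={arr.shape[1]}, expected {expected_d}")
    return problems


df_demo = pd.DataFrame({"entry": ["P1", "P2", "P3"],
                        "sequence": ["ACD", "AC", "AA"]})
dict_demo = {"P1": np.zeros((3, 2)), "P3": np.zeros((2, 3))}
problems = find_corpus_problems(df_demo, dict_demo)
print(problems)
```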

Warning

UserWarning: Pseudo-scales depend on the content of df_seq and dict_num, so a different corpus yields different values.

Examples

build_scales collapses an annotation dict_num into context-free per-amino-acid means, returned as a df_scales of shape (20, D). This keeps the redundancy correlation gate of CPP.run_num live and feeds the scale-based CPP.run.

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

annp = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = annp.ingest(df_user)

dict_num = annp.encode(df_seq=df_seq, df_annot=df_annot, features=['hotspot'])
df_scales = annp.build_scales(df_seq=df_seq, dict_num=dict_num,
                              features=['hotspot'])
df_scales.head()

	hotspot
A	0.00
C	0.00
D	0.46
E	0.00
F	0.00

Further parameters. AnnotationPreprocessor.build_scales also accepts return_std (if True, also return per-AA standard deviations) and dim_names_override (replacement names for the D columns; length must equal D).

# Further parameters: ``return_std=True`` also returns the per-AA standard
# deviations, and ``dim_names_override`` renames the D columns.
df_scales_named, df_scales_std = annp.build_scales(
    df_seq=df_seq, dict_num=dict_num, features=['hotspot'],
    return_std=True, dim_names_override=['hotspot_dim'])
aa.display_df(df_scales_std, n_rows=10, show_shape=True)

DataFrame shape: (20, 1)

	hotspot_dim
A	0.000000
C	0.000000
D	0.460000
E	0.000000
F	0.000000
G	0.000000
H	0.000000
I	0.000000
K	0.000000
L	0.000000
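The first rows of the mean and standard-deviation tables above look identical. A hand calculation shows how that can happen, under two assumptions that the outputs are consistent with but do not confirm: each site's score is written to its 1-based start position with zeros elsewhere, and the standard deviation is the population form (ddof=0).

```python
import numpy as np

sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"

# Assumption: each score lands on its 1-based start position, zeros elsewhere.
values = np.zeros(len(sequence))
for start, score in [(3, 0.92), (16, 0.40)]:
    values[start - 1] = score

residues = np.array(list(sequence))
d_values = values[residues == "D"]  # D occurs at positions 3 and 23
s_values = values[residues == "S"]  # S occurs only at position 16

d_mean = d_values.mean()
d_std = d_values.std()  # NumPy default ddof=0 (population std, an assumption here)
print(d_values, round(d_mean, 2), round(d_std, 2))
```

With exactly two observations, one of them zero, the population standard deviation equals the mean, so D shows 0.46 in both tables. Amino acids that never carry a score have mean and standard deviation 0.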