AnnotationPreprocessor.build_scales
- AnnotationPreprocessor.build_scales(df_seq, dict_num, features, return_std=False, dim_names_override=None)
Build df_scales by context-free per-amino-acid (AA) averaging of the corpus.
Mirrors StructurePreprocessor.build_scales(): for each canonical amino acid and each of the D dimensions, the pseudo-scale entry is the mean of the normalized per-residue values over all occurrences of that AA. Required so that the cor > max_cor redundancy gate of CPP.run_num() is discriminative (an all-equal df_scales disables it).
Added in version 1.1.0.
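The remark about the redundancy gate can be made concrete without the library. The sketch below is not the CPP.run_num() implementation; it assumes the gate compares a Pearson correlation between two scale columns against max_cor, and the helper names pearson and exceeds_max_cor are hypothetical. Under that assumption, constant scales have zero variance, the correlation is undefined (NaN), and the comparison can never be true.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float("nan") if denom == 0 else float((xc * yc).sum() / denom)


def exceeds_max_cor(x, y, max_cor=0.7):
    # Assumed form of the gate: a `cor > max_cor` comparison.
    # Any comparison against NaN evaluates to False.
    return pearson(x, y) > max_cor


rng = np.random.default_rng(0)
scale_a = rng.random(20)          # one value per canonical amino acid
scale_b = 2.0 * scale_a + 1.0     # linear copy of scale_a, so r is ~1
flat = np.full(20, 0.5)           # an all-equal pseudo-scale

redundant_pair = exceeds_max_cor(scale_a, scale_b)  # gate fires
flat_pair = exceeds_max_cor(flat, flat)             # gate can never fire
```

With informative scales the redundant pair is flagged; with all-equal scales nothing is ever flagged, which is why per-AA pseudo-scales are needed.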
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.
dict_num (dict[str, np.ndarray]) – Per-residue tensors {entry: (L_entry, D) ndarray} from encode() (or combined via aaanalysis.combine_dict_nums()).
features (list of str) – Registry keys in the same order as the dict_num D-axis layout.
return_std (bool, default=False) – If True, also return per-AA standard deviations.
dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D.
- Returns:
df_scales (pd.DataFrame, shape (20, D)) – Rows are the 20 canonical AAs; columns are dim names; cells are per-AA means of normalized values (NaN where the AA is absent).
df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations, returned only when return_std=True.
- Raises:
ValueError – On missing corpus, mismatched D, missing entries, or invalid keys.
Warning
- UserWarning
Pseudo-scales depend on the content of df_seq + dict_num.
Examples
build_scales collapses an annotation dict_num into context-free per-amino-acid means, a df_scales of shape (20, D) that keeps the redundancy correlation gate of CPP.run_num live and feeds the scale-based CPP.run.

    import warnings
    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    ap = aa.AnnotationPreprocessor(verbose=False)
    df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                           'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

    # A small user/predictor table -> Functional sites (open vocabulary).
    df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                            ut.COL_START: [3, 16],
                            ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                            ut.COL_SCORE: [0.92, 0.40]})
    df_annot = ap.ingest(df_user)
    dict_num = ap.encode(df_seq=df_seq, df_annot=df_annot, features=['hotspot'])
    df_scales = ap.build_scales(df_seq=df_seq, dict_num=dict_num, features=['hotspot'])
    df_scales.head()

       hotspot
    A     0.00
    C     0.00
    D     0.46
    E     0.00
    F     0.00

Further parameters.
AnnotationPreprocessor.build_scales also accepts: return_std (if True, also return per-AA standard deviations) and dim_names_override (replacement names for the D columns; length must equal D).
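To make the averaging itself tangible, here is a minimal sketch in plain NumPy and pandas. It is not the library's implementation: the helper per_aa_stats is hypothetical, the dict_num is built by hand on the assumption that encode() writes each score at its 1-based start position and 0 elsewhere (which is what the example output suggests), and the population standard deviation (ddof=0) is an arbitrary choice, since the text does not say which convention return_std uses.

```python
import numpy as np
import pandas as pd

CANONICAL_AAS = list("ACDEFGHIKLMNPQRSTVWY")


def per_aa_stats(df_seq, dict_num, dim_names):
    """Context-free per-AA mean and std over every residue in the corpus."""
    frames = []
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        values = np.asarray(dict_num[entry], dtype=float)  # (L_entry, D)
        if values.shape != (len(seq), len(dim_names)):
            raise ValueError(f"shape mismatch for entry {entry!r}")
        frame = pd.DataFrame(values, columns=dim_names)
        frame["aa"] = list(seq)  # one row per residue, labelled by its AA
        frames.append(frame)
    grouped = pd.concat(frames, ignore_index=True).groupby("aa")[dim_names]
    # Reindexing to the 20 canonical AAs leaves NaN rows for absent AAs.
    df_means = grouped.mean().reindex(CANONICAL_AAS)
    df_stds = grouped.std(ddof=0).reindex(CANONICAL_AAS)  # ddof=0 is an assumption
    return df_means, df_stds


# Hand-built stand-in for the encode() output of the example above.
sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
df_seq = pd.DataFrame({"entry": ["AF_TINY"], "sequence": [sequence]})
hotspot = np.zeros((len(sequence), 1))
hotspot[3 - 1, 0] = 0.92   # position 3 is a D
hotspot[16 - 1, 0] = 0.40  # position 16 is an S
dict_num = {"AF_TINY": hotspot}

df_means, df_stds = per_aa_stats(df_seq, dict_num, ["hotspot"])

# A corpus that lacks most amino acids yields NaN rows for them.
df_small = pd.DataFrame({"entry": ["AF_TINY"], "sequence": ["ACD"]})
df_means_small, _ = per_aa_stats(df_small, {"AF_TINY": hotspot[:3]}, ["hotspot"])
```

D occurs twice in the sequence and only one occurrence is annotated, so its mean is (0.92 + 0) / 2 = 0.46, matching the hotspot column in the example output. The small corpus also illustrates the warning that pseudo-scales depend on the content of df_seq and dict_num: there D occurs once, so the same annotation gives a mean of 0.92.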