aaanalysis.AnnotationPreprocessor.build_scales
- AnnotationPreprocessor.build_scales(df_seq=None, dict_num=None, features=None, return_std=False, dim_names_override=None)
Build df_scales by context-free per-AA averaging of the corpus.

Mirrors StructurePreprocessor.build_scales(): for each canonical amino acid and each D dimension, the pseudo-scale entry is the mean of the normalized per-residue values over occurrences of that AA. Required so CPP.run_num()’s cor > max_cor redundancy gate is discriminative (an all-equal df_scales disables it).
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.
dict_num (dict[str, np.ndarray]) – Per-residue tensors {entry: (L_entry, D) ndarray} from encode() (or combined via aaanalysis.combine_dict_nums()).
features (list of str) – Registry keys in the same order as the dict_num D-axis layout.
return_std (bool, default=False) – If True, also return per-AA standard deviations.
dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D.
- Returns:
df_scales (pd.DataFrame, shape (20, D)) – Rows are the 20 canonical AAs; columns are dim names; cells are per-AA means of normalized values (NaN where the AA is absent).
df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations, only when return_std=True.
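
The per-AA averaging described above can be sketched with plain pandas and NumPy. This is an illustration under stated assumptions, not the library's implementation: the helper name per_aa_scales is hypothetical, the per-residue values are assumed to be normalized already, and the population standard deviation (ddof=0) is an arbitrary choice, since the source does not say which estimator build_scales() uses.

```python
import numpy as np
import pandas as pd

CANONICAL_AAS = list("ACDEFGHIKLMNPQRSTVWY")


def per_aa_scales(df_seq, dict_num, dim_names):
    """Hypothetical helper: average per-residue values by amino acid."""
    n_dims = len(dim_names)
    letters, rows = [], []
    # Collect every residue letter together with its D-dimensional value row
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        arr = np.asarray(dict_num[entry], dtype=float)
        if arr.shape != (len(seq), n_dims):
            raise ValueError(
                f"{entry}: expected shape {(len(seq), n_dims)}, got {arr.shape}"
            )
        letters.extend(seq)
        rows.append(arr)
    df_long = pd.DataFrame(np.vstack(rows), columns=dim_names)
    df_long["aa"] = letters
    grouped = df_long.groupby("aa")[dim_names]
    # reindex keeps all 20 rows; AAs absent from the corpus become NaN
    df_scales = grouped.mean().reindex(CANONICAL_AAS)
    # ddof=0 is an assumption made for this sketch
    df_stds = grouped.std(ddof=0).reindex(CANONICAL_AAS)
    return df_scales, df_stds


# Toy corpus: two entries, D = 2 (values assumed to be pre-normalized)
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACA", "CD"]})
dict_num = {
    "P1": np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]),
    "P2": np.array([[0.1, 0.9], [0.2, 0.4]]),
}
df_scales, df_stds = per_aa_scales(df_seq, dict_num, ["dim_a", "dim_b"])
```

In this toy corpus only A, C, and D occur, so the remaining 17 rows of df_scales are NaN, matching the "NaN where the AA is absent" behavior stated for the return value.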
- Raises:
ValueError – On missing corpus, mismatched D, missing entries, or invalid keys.
Warning
- UserWarning
Pseudo-scales depend on the content of df_seq + dict_num.
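
One way to see why an all-equal df_scales would leave a correlation-based redundancy gate without effect is the generic property that the Pearson correlation of constant vectors is undefined. The sketch below shows only that property. How CPP.run_num() computes and applies its gate is not shown in this section, and the threshold value and column names are hypothetical.

```python
import numpy as np
import pandas as pd

aas = list("ACDEFGHIKLMNPQRSTVWY")
rng = np.random.default_rng(0)

# All-equal pseudo-scales: every AA has the same value in every dimension
df_flat = pd.DataFrame(
    {"dim_a": np.full(20, 0.5), "dim_b": np.full(20, 0.5)}, index=aas
)

# Varied pseudo-scales: dim_b is nearly a copy of dim_a, so it is redundant
base = rng.random(20)
df_varied = pd.DataFrame(
    {"dim_a": base, "dim_b": base + 0.01 * rng.random(20)}, index=aas
)

max_cor = 0.7  # hypothetical threshold, chosen for illustration only

# Zero variance makes the Pearson correlation undefined (NaN)
with np.errstate(invalid="ignore", divide="ignore"):
    cor_flat = df_flat["dim_a"].corr(df_flat["dim_b"])
cor_varied = df_varied["dim_a"].corr(df_varied["dim_b"])

# Any comparison with NaN is False, so the flat case can never be flagged
flat_redundant = bool(cor_flat > max_cor)
varied_redundant = bool(cor_varied > max_cor)
```

With constant columns the comparison is always False, so no pair of dimensions is ever flagged as redundant. Pseudo-scales that vary across amino acids give the comparison something to discriminate on.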