aaanalysis.AnnotationPreprocessor.build_scales

AnnotationPreprocessor.build_scales(df_seq=None, dict_num=None, features=None, return_std=False, dim_names_override=None)

Build df_scales by context-free per-AA averaging of the corpus.

Mirrors StructurePreprocessor.build_scales(): for each canonical amino acid and each of the D dimensions, the pseudo-scale entry is the mean of the normalized per-residue values over all occurrences of that amino acid in the corpus. These pseudo-scales are required for the cor > max_cor redundancy gate in CPP.run_num() to be discriminative; a df_scales whose entries are all equal disables the gate.
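A minimal sketch of why constant scales would disable a correlation-based gate. How CPP.run_num() actually computes cor is not shown on this page; the sketch assumes a Pearson correlation between two scale columns, and the helper names pearson and is_redundant as well as the threshold value 0.7 are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(sxx * syy)
    return math.nan if denom == 0 else sxy / denom

def is_redundant(scale_a, scale_b, max_cor=0.7):
    """Hypothetical gate: flag a pair whose correlation exceeds max_cor."""
    return pearson(scale_a, scale_b) > max_cor

# An all-equal scale has zero variance, so its correlation is undefined
# (NaN), and any comparison against NaN evaluates to False.
constant = [0.5] * 20
# Two per-AA scales that vary and are linearly related.
varied_a = [i / 19 for i in range(20)]
varied_b = [2 * v + 1 for v in varied_a]

gate_constant = is_redundant(constant, constant)
gate_varied = is_redundant(varied_a, varied_b)
```

Under these assumptions the gate never fires for the constant pair but does fire for the correlated pair, which is the behavior the paragraph above describes.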

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.

  • dict_num (dict[str, np.ndarray]) – Per-residue tensors {entry: (L_entry, D) ndarray} from encode() (or combined via aaanalysis.combine_dict_nums()).

  • features (list of str) – Registry keys in the same order as the dict_num D-axis layout.

  • return_std (bool, default=False) – If True, also return per-AA standard deviations.

  • dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D.

Returns:

  • df_scales (pd.DataFrame, shape (20, D)) – Rows are the 20 canonical AAs; columns are dim names; cells are per-AA means of normalized values (NaN where the AA is absent).

  • df_stds (pd.DataFrame, shape (20, D)) – Per-AA standard deviations, only when return_std=True.
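The averaging described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: the helper name per_aa_stats is hypothetical, the values in dict_num are assumed to be normalized already, non-canonical residues are skipped, and the standard deviation uses NumPy's default population form (ddof=0), which may differ from what build_scales() does.

```python
import numpy as np
import pandas as pd

CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")

def per_aa_stats(df_seq, dict_num, dim_names):
    """Context-free per-AA mean and std over every residue in the corpus."""
    # Collect the D-dimensional rows observed for each canonical amino acid.
    rows = {aa: [] for aa in CANONICAL_AA}
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            raise ValueError(f"entry {entry!r} missing from dict_num")
        values = np.asarray(dict_num[entry], dtype=float)
        if values.shape != (len(seq), len(dim_names)):
            raise ValueError(f"shape mismatch for entry {entry!r}")
        for aa, row in zip(seq, values):
            if aa in rows:  # skip non-canonical residues (assumption)
                rows[aa].append(row)
    # Amino acids that never occur in the corpus get an all-NaN row.
    nan_row = np.full(len(dim_names), np.nan)
    means = [np.mean(rows[aa], axis=0) if rows[aa] else nan_row
             for aa in CANONICAL_AA]
    stds = [np.std(rows[aa], axis=0) if rows[aa] else nan_row
            for aa in CANONICAL_AA]
    df_scales = pd.DataFrame(means, index=CANONICAL_AA, columns=dim_names)
    df_stds = pd.DataFrame(stds, index=CANONICAL_AA, columns=dim_names)
    return df_scales, df_stds

# Toy corpus: two entries, D = 2 (all values are made up for illustration).
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACA", "CD"]})
dict_num = {
    "P1": np.array([[0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]),
    "P2": np.array([[0.1, 0.9], [0.2, 0.4]]),
}
df_scales, df_stds = per_aa_stats(df_seq, dict_num, ["dim_a", "dim_b"])
```

In this toy corpus, A occurs twice in P1, so its row is the mean of the first and third residue vectors; W never occurs, so its row is NaN in both frames.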

Raises:

ValueError – On missing corpus, mismatched D, missing entries, or invalid keys.

Warning

UserWarning

Pseudo-scales depend on the content of df_seq and dict_num, so they change whenever the corpus changes.