aaanalysis.StructurePreprocessor.build_scales

StructurePreprocessor.build_scales(df_seq=None, dict_num=None, features=None, return_std=False, dim_names_override=None)

Build df_scales by context-free per-AA averaging of the encoded corpus.

Mirrors EmbeddingPreprocessor.build_scales(): for each canonical amino acid a and each of the D dimensions d, the pseudo-scale entry is the mean of dict_num[entry][i, d] over all (entry, i) pairs where residue i of that entry's sequence in df_seq equals a. Non-canonical residues are skipped; AAs absent from the corpus get NaN rows.
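The averaging rule above can be sketched in plain NumPy and pandas. This is a minimal illustration, not the library's implementation: the helper name per_aa_means, the toy data, and the alphabetical ordering of CANONICAL_AA (assumed to match ut.LIST_CANONICAL_AA) are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Assumption: the 20 canonical amino acids in alphabetical one-letter order.
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")


def per_aa_means(df_seq, dict_num, dim_names):
    """Hypothetical helper: context-free per-AA mean of per-residue values."""
    n_dims = len(dim_names)
    aa_index = {aa: k for k, aa in enumerate(CANONICAL_AA)}
    sums = np.zeros((len(CANONICAL_AA), n_dims))
    counts = np.zeros(len(CANONICAL_AA))
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        arr = dict_num[entry]  # shape (L_entry, D_total)
        for i, aa in enumerate(seq):
            k = aa_index.get(aa)
            if k is None:  # non-canonical residue: skipped
                continue
            sums[k] += arr[i]
            counts[k] += 1
    # AAs never seen in the corpus keep NaN rows.
    means = np.full((len(CANONICAL_AA), n_dims), np.nan)
    seen = counts > 0
    means[seen] = sums[seen] / counts[seen, None]
    return pd.DataFrame(means, index=CANONICAL_AA, columns=dim_names)


# Toy corpus: two entries, two dimensions; "X" is non-canonical.
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACA", "CX"]})
dict_num = {
    "P1": np.array([[0.2, 0.0], [0.4, 1.0], [0.6, 1.0]]),
    "P2": np.array([[0.8, 0.0], [0.5, 0.5]]),
}
df_scales = per_aa_means(df_seq, dict_num, ["d0", "d1"])
print(df_scales.loc[["A", "C", "D"]])
```

In this toy corpus, A is averaged over its two occurrences in P1, C is averaged across both entries, the X residue contributes nothing, and every other amino acid keeps a NaN row.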

This is the dataset-dependent step. The values feed CPP.run_num()’s redundancy filter (df_scales.corr() arm); a meaningful corpus is required to make max_cor discriminative. Compute the pseudo-scales once on a fixed reference corpus and reuse them for cross-dataset comparability.
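To make the role of df_scales.corr() concrete, the following sketch shows how pairwise Pearson correlations between pseudo-scale columns can flag redundant dimensions against a threshold. It is only an illustration of the idea: the column names, the threshold value 0.7, and the use of the absolute correlation are assumptions, and the actual filter inside CPP.run_num() may differ.

```python
import numpy as np
import pandas as pd

aa = list("ACDEFGHIKLMNPQRSTVWY")
base = np.linspace(0.0, 1.0, len(aa))

# Toy pseudo-scales: dim_b is a linear rescaling of dim_a (fully redundant),
# dim_c is symmetric around the midpoint and therefore uncorrelated with both.
df_scales = pd.DataFrame(
    {
        "dim_a": base,
        "dim_b": 0.5 * base + 0.1,
        "dim_c": (base - 0.5) ** 2,
    },
    index=aa,
)

max_cor = 0.7  # hypothetical threshold, chosen for illustration only
corr = df_scales.corr().abs()  # pairwise Pearson correlation between columns

# Collect each unordered pair of dimensions whose correlation exceeds max_cor.
redundant_pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > max_cor
]
print(redundant_pairs)
```

If the corpus is too small or too uniform, many columns collapse to near-identical per-AA profiles and almost every pair exceeds the threshold, which is one way to read the requirement for a meaningful corpus.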

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.

  • dict_num (dict[str, np.ndarray]) – Combined per-residue tensors {entry: (L_entry, D_total) ndarray} — typically the output of aaanalysis.combine_dict_nums(). Every entry in df_seq must be a key; per entry, L_entry == len(sequence); D_total must equal sum(REGISTRY[f]['num_dims'] for f in features) (i.e. the encoder outputs in feature-key order).

  • features (list of str) – Feature keys from the StructurePreprocessor registry in the same order as the dict_num D-axis layout. Used to name the D dimensions of the output and to validate D_total.

  • return_std (bool, default=False) – If True, also return per-AA standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.

  • dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D_total. None uses the registry default names.
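The shape constraints on df_seq, dict_num, and features can be checked up front before calling build_scales. The sketch below is a hypothetical pre-flight check, not part of aaanalysis: the function name check_inputs, the feature names feat_a and feat_b, and the plain dict standing in for the registry's num_dims lookup are assumptions for illustration.

```python
import numpy as np
import pandas as pd


def check_inputs(df_seq, dict_num, num_dims_per_feature):
    """Hypothetical check mirroring the documented input constraints.

    num_dims_per_feature maps each feature key to its number of dimensions,
    in the same order as the D-axis layout of the arrays in dict_num.
    """
    d_total = sum(num_dims_per_feature.values())
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        if entry not in dict_num:
            raise ValueError(f"entry {entry!r} is missing from dict_num")
        arr = dict_num[entry]
        if arr.ndim != 2 or arr.shape[0] != len(seq):
            raise ValueError(
                f"entry {entry!r}: expected {len(seq)} rows, got shape {arr.shape}"
            )
        if arr.shape[1] != d_total:
            raise ValueError(
                f"entry {entry!r}: expected D_total={d_total}, got {arr.shape[1]}"
            )
    return d_total


df_seq = pd.DataFrame({"entry": ["P1"], "sequence": ["ACD"]})
num_dims = {"feat_a": 1, "feat_b": 2}  # hypothetical feature keys
dict_num = {"P1": np.zeros((3, 3))}    # L_entry = 3, D_total = 1 + 2

d_total = check_inputs(df_seq, dict_num, num_dims)
print(d_total)
```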

Returns:

  • df_scales (pd.DataFrame, shape (20, D_total)) – Pseudo-scale DataFrame. Rows are the 20 canonical AAs (ut.LIST_CANONICAL_AA); columns are dim names. Cells are context-free per-AA means of normalized encoder outputs (each in [0, 1]); NaN where the AA is absent from the corpus.

  • df_stds (pd.DataFrame, shape (20, D_total)) – Per-AA standard deviations, returned only when return_std=True.
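The documented behaviour of df_stds for an amino acid that occurs exactly once (std=0) is what a population standard deviation (ddof=0) produces. Whether the library computes it this way or handles the single-occurrence case separately is not stated here, so the sketch below is an inference shown on toy values rather than a description of the implementation.

```python
import numpy as np

# Per-residue values collected for one amino acid, shape (n_occurrences, D_total).
values_once = np.array([[0.3, 0.9]])              # AA seen exactly once
values_many = np.array([[0.2, 0.0], [0.6, 1.0]])  # AA seen twice

# NumPy's default ddof=0 gives the population standard deviation per dimension.
std_once = values_once.std(axis=0)
std_many = values_many.std(axis=0)

# An AA absent from the corpus has no values at all; its row is filled with NaN.
std_absent = np.full(values_many.shape[1], np.nan)

print(std_once, std_many, std_absent)
```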

Raises:

ValueError – If df_seq or dict_num is missing, the D dimension is mismatched, entries are missing, or feature keys are invalid.

Warns:

UserWarning – Pseudo-scales depend on the content of df_seq and dict_num.