aaanalysis.StructurePreprocessor.build_scales
- StructurePreprocessor.build_scales(df_seq=None, dict_num=None, features=None, return_std=False, dim_names_override=None)
Build df_scales by context-free per-AA averaging of the encoded corpus.

Mirrors EmbeddingPreprocessor.build_scales(): for each canonical amino acid a and each D dimension d, the pseudo-scale entry is the mean of dict_num[entry][i, d] over all (entry, i) pairs where df_seq[sequence][entry][i] == a. Non-canonical residues are skipped; AAs absent from the corpus get NaN rows.

This is the dataset-dependent step. The values feed CPP.run_num()’s redundancy filter (df_scales.corr() arm); a meaningful corpus is required to make max_cor discriminative. Compute pseudo-scales once on a fixed reference corpus and reuse them for cross-dataset comparability.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. Used here as the source of empirical amino-acid contexts.
dict_num (dict[str, np.ndarray]) – Combined per-residue tensors {entry: (L_entry, D_total) ndarray}, typically the output of aaanalysis.combine_dict_nums(). Every entry in df_seq must be a key; per entry, L_entry == len(sequence); D_total must equal sum(REGISTRY[f]['num_dims'] for f in features) (i.e. the encoder outputs in feature-key order).
features (list of str) – Feature keys from the StructurePreprocessor registry in the same order as the dict_num D-axis layout. Used to name the D dimensions of the output and to validate D_total.
return_std (bool, default=False) – If True, also return per-AA standard deviations in a second DataFrame of the same shape. AAs occurring exactly once receive std=0; AAs absent from the corpus receive NaN.
dim_names_override (list of str, optional) – Replacement names for the D columns; length must equal D_total. None uses the registry default names.
- Returns:
df_scales (pd.DataFrame, shape (20, D_total)) – Pseudo-scale DataFrame. Rows are the 20 canonical AAs (ut.LIST_CANONICAL_AA); columns are dim names. Cells are context-free per-AA means of normalized encoder outputs (each in [0, 1]); NaN where the AA is absent from the corpus.
df_stds (pd.DataFrame, shape (20, D_total)) – Per-AA standard deviations, returned only when return_std=True.
- Raises:
ValueError – On missing df_seq/dict_num, mismatched D, missing entries, or invalid feature keys.
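The shape constraints listed under Parameters can be pre-checked before calling build_scales. The following is a minimal sketch, not the library's own validation: check_inputs and its d_total argument are hypothetical names introduced here, the expected D is passed in directly rather than derived from the registry, and raising on a length mismatch is an assumption based on the documented L_entry == len(sequence) requirement.

```python
import numpy as np
import pandas as pd


def check_inputs(df_seq, dict_num, d_total):
    """Hypothetical pre-check mirroring the documented shape constraints."""
    if df_seq is None or dict_num is None:
        raise ValueError("'df_seq' and 'dict_num' are both required")
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        # Every entry in df_seq must be a key of dict_num.
        if entry not in dict_num:
            raise ValueError(f"entry {entry!r} missing from 'dict_num'")
        arr = np.asarray(dict_num[entry])
        if arr.ndim != 2:
            raise ValueError(f"{entry!r}: expected a 2D array, got {arr.ndim}D")
        # One row per residue (assumed to be enforced; see lead-in).
        if arr.shape[0] != len(seq):
            raise ValueError(
                f"{entry!r}: length mismatch, {arr.shape[0]} rows vs {len(seq)} residues"
            )
        # One column per feature dimension.
        if arr.shape[1] != d_total:
            raise ValueError(
                f"{entry!r}: D mismatch, {arr.shape[1]} columns vs expected {d_total}"
            )


df_seq = pd.DataFrame({"entry": ["P1"], "sequence": ["ACD"]})
good = {"P1": np.zeros((3, 4))}
bad = {"P1": np.zeros((2, 4))}  # one row short of len("ACD")

check_inputs(df_seq, good, d_total=4)  # passes silently

try:
    check_inputs(df_seq, bad, d_total=4)
    caught = None
except ValueError as err:
    caught = str(err)
```

Running such a check on the reference corpus first tends to give more specific error messages than a failure deep inside the averaging step.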
Warning
- UserWarning
Pseudo-scales depend on the content of df_seq + dict_num.
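To make the averaging and its dataset dependence concrete, here is a minimal NumPy/pandas sketch of context-free per-AA means and standard deviations. It is an illustration under stated assumptions, not the library's implementation: per_aa_mean_std and CANONICAL_AA are hypothetical names, the amino-acid order may differ from ut.LIST_CANONICAL_AA, the toy values are made up, and the population standard deviation (ddof=0) is assumed because it yields the documented std=0 for an AA that occurs exactly once.

```python
import numpy as np
import pandas as pd

# The 20 canonical amino acids (one-letter codes); the library's own list
# (ut.LIST_CANONICAL_AA) may be ordered differently.
CANONICAL_AA = list("ACDEFGHIKLMNPQRSTVWY")


def per_aa_mean_std(df_seq, dict_num, dim_names):
    """Hypothetical helper: context-free per-AA mean and std of per-residue features."""
    rows = {aa: [] for aa in CANONICAL_AA}
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        arr = np.asarray(dict_num[entry], dtype=float)
        for i, aa in enumerate(seq):
            if aa in rows:  # non-canonical residues are skipped
                rows[aa].append(arr[i])
    n_dims = len(dim_names)
    means = np.full((len(CANONICAL_AA), n_dims), np.nan)  # NaN for absent AAs
    stds = np.full((len(CANONICAL_AA), n_dims), np.nan)
    for k, aa in enumerate(CANONICAL_AA):
        if rows[aa]:
            stacked = np.vstack(rows[aa])
            means[k] = stacked.mean(axis=0)
            stds[k] = stacked.std(axis=0)  # ddof=0, so a single occurrence gives 0
    df_scales = pd.DataFrame(means, index=CANONICAL_AA, columns=dim_names)
    df_stds = pd.DataFrame(stds, index=CANONICAL_AA, columns=dim_names)
    return df_scales, df_stds


# Toy corpus: two entries, two feature dimensions (made-up values in [0, 1]).
df_seq = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["AAX", "AG"]})
dict_num = {
    "P1": np.array([[0.2, 0.0], [0.4, 1.0], [0.9, 0.9]]),
    "P2": np.array([[0.6, 0.5], [1.0, 0.0]]),
}
dims = ["dim_0", "dim_1"]
df_scales, df_stds = per_aa_mean_std(df_seq, dict_num, dims)

# The same residue type gets a different value on a different corpus.
df_scales_p1, _ = per_aa_mean_std(df_seq.iloc[:1], dict_num, dims)
```

In this toy corpus, alanine averages 0.4 on dim_0 when both entries are used and 0.3 when only P1 is used, which is why a fixed reference corpus is recommended for cross-dataset comparisons.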