StructurePreprocessor

class StructurePreprocessor(verbose=True)

Bases: object

Preprocessing class ([pro], requires aaanalysis[pro]) for protein structure features (PDB / CIF / AlphaFold).

Turns local structure files into the [0, 1]-normalized per-residue dict_num consumed by CPP.run_num(). Each encode_* method reads one kind of source from a folder and returns an (L, D) tensor per protein: Define Secondary Structure of Proteins (DSSP)-derived geometry (encode_dssp()), PDB ATOM-record features (encode_pdb()), AlphaFold Predicted Aligned Error (PAE) summaries (encode_pae()), or domain segmentation (encode_domains()). If the input files are not yet available locally, fetch_alphafold() downloads them first. A secondary, scale-based path (build_scales() / build_cat()) feeds the amino acid (AA)-scale CPP.run().

Added in version 1.1.0.

Parameters: verbose (bool)

Methods

`build_cat`(features[, dim_names_override])	Build the `df_cat` metadata frame for `features`.
`build_scales`(df_seq, dict_num, features[, ...])	Build `df_scales` by context-free per-amino acid (AA) averaging of the encoded corpus.
`encode`(df_seq, *, features[, pdb_folder, ...])	Encode a mixed feature list, routing each key to the right backend.
`encode_domains`(df_seq[, domain_folder, ...])	Read pre-computed domain segmentation files into `dict_domains`.
`encode_dssp`(df_seq[, pdb_folder, ss_mode, ...])	Run Define Secondary Structure of Proteins (DSSP) and the per-feature encoders to build a `[0, 1]`-normalized `dict_dssp`.
`encode_pae`(df_seq, pae_folder, features[, ...])	Load AlphaFold PAE sidecar JSONs and produce `dict_pae`.
`encode_pdb`(df_seq, pdb_folder, features[, ...])	Extract per-residue features from PDB ATOM records into `dict_pdb`.
`fetch_alphafold`(df_seq, out_folder[, ...])	Download AlphaFold model + Predicted Aligned Error (PAE) files for every entry into a folder.
`get_domains`(df_seq[, pdb_folder, ...])	Run a domain-segmentation tool and append a `chopping` column.
`get_dssp`(df_seq, pdb_folder[, features, ...])	Run Define Secondary Structure of Proteins (DSSP) and append per-residue list columns to `df_seq`.

__init__(verbose=True)

Parameters: verbose (bool, default=True) – If True, verbose output is enabled.

Notes

This is the structure-side member of the per-residue dict_num family, alongside EmbeddingPreprocessor (protein language model (PLM) embeddings) and AnnotationPreprocessor (post-translational modification (PTM) / functional sites). All three emit [0, 1]-normalized tensors that NumericalFeature.get_parts() slices into the per-part inputs of CPP.run_num(), and that stack along the D axis via aaanalysis.combine_dict_nums(). The accompanying (df_scales, df_cat) pair names the D dimensions for the redundancy filter and output columns.
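
The stacking along the D axis described above can be pictured with plain NumPy. This is only a sketch of the idea, not the implementation of aaanalysis.combine_dict_nums(): the helper stack_along_d, the entry name P1, and the array contents are hypothetical, and the real function may handle missing entries or length mismatches differently.

```python
import numpy as np

# Hypothetical per-residue tensors for one protein of length L = 5.
L = 5
dict_struct = {"P1": np.full((L, 3), 0.2)}  # e.g. 3 structure dimensions
dict_embed = {"P1": np.full((L, 4), 0.7)}   # e.g. 4 embedding dimensions


def stack_along_d(*dict_nums):
    """Concatenate (L, D_i) arrays per entry into one (L, sum(D_i)) array.

    Illustrative helper (assumption): only entries present in every input
    dict are kept, and all arrays of an entry must share the same L.
    """
    shared = set.intersection(*(set(d) for d in dict_nums))
    out = {}
    for entry in sorted(shared):
        arrays = [d[entry] for d in dict_nums]
        lengths = {a.shape[0] for a in arrays}
        if len(lengths) != 1:
            raise ValueError(f"Length mismatch for {entry}: {sorted(lengths)}")
        # axis=1 is the feature (D) axis; the residue axis (L) is unchanged
        out[entry] = np.concatenate(arrays, axis=1)
    return out


combined = stack_along_d(dict_struct, dict_embed)
print(combined["P1"].shape)
```

Because every preprocessor emits values in [0, 1], the concatenated columns stay on a comparable scale without further rescaling.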

Feature value range. All features are normalized to [0, 1] (NaN for unresolved positions). Use the table below to de-normalize back to raw units if needed:

Feature key	Raw range	Recipe → normalized	Inverse (de-normalize)
`ss3` / `ss8`	{0, 1} (one-hot)	identity	identity
`rasa`	[0, ~1.2]	`clip(x, 0, 1)`	identity (clipped)
`phi_psi_sincos`	[-1, 1]	`(x + 1) / 2`	`x * 2 - 1` (in [-1, 1])
`bfactor`	[0, 100+] Å²	`clip(x / 100, 0, 1)`	`x * 100` (lossy when ≥1)
`depth`	[0, ~15] Å	`clip(x / 15, 0, 1)`	`x * 15` (lossy when ≥1)
`plddt`	[0, 100]	`x / 100`	`x * 100`
`plddt_disorder`	{0, 1}	identity	identity
`plddt_tier`	{0, 1} (4-dim one-hot)	identity	identity
`chi1_sincos` / `chi2_sincos`	[-1, 1]	`(x + 1) / 2`	`x * 2 - 1` (in [-1, 1])
`ca_centroid_dist`	[0, ~40] Å	`clip(x / 40, 0, 1)`	`x * 40` (lossy when ≥1)
`ca_centroid_dist_norm`	[0, ~2] (Rg units)	`clip(x / 2, 0, 1)`	`x * 2` (lossy when ≥1)
`contact_count_8A`	[0, ~30]	`clip(x / 30, 0, 1)`	`x * 30` (lossy when ≥1)
`contact_count_12A`	[0, ~80]	`clip(x / 80, 0, 1)`	`x * 80` (lossy when ≥1)
`hse`	[0, ~30]	`clip(x / 30, 0, 1)`	`x * 30` (lossy when ≥1)
`pae_row_*` / `pae_local_mean` / `pae_distal_mean` / `pae_band_means`	[0, 31.75] Å	`clip(x / 31.75, 0, 1)`	`x * 31.75`
`pae_asymmetry`	[0, ~10] Å	`clip(x / 10, 0, 1)`	`x * 10` (lossy when ≥1)

The recipes are the source of truth in feature_registry.NORMALIZATION_RECIPES; this table is generated to match.
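
The two recipe families in the table (scale-and-clip, and the affine map for sine / cosine pairs) can be reproduced with NumPy as shown below. This is a minimal sketch under stated assumptions: the helper names are hypothetical and do not come from feature_registry, the scales are copied from the table, and the (sin, cos) ordering of the pair is chosen for illustration only.

```python
import numpy as np


def normalize_scaled(x, scale):
    """Recipe clip(x / scale, 0, 1); NaN values pass through unchanged."""
    return np.clip(np.asarray(x, dtype=float) / scale, 0.0, 1.0)


def denormalize_scaled(x_norm, scale):
    """Inverse x * scale; values that were clipped at 1 are not recovered."""
    return np.asarray(x_norm, dtype=float) * scale


def normalize_sincos(x):
    """Recipe (x + 1) / 2 for sine / cosine components in [-1, 1]."""
    return (np.asarray(x, dtype=float) + 1.0) / 2.0


def denormalize_sincos(x_norm):
    """Inverse x * 2 - 1, giving values back in [-1, 1]."""
    return np.asarray(x_norm, dtype=float) * 2.0 - 1.0


# bfactor uses scale 100: the raw value 140.0 is clipped and comes back as 100.0
bfactor_raw = np.array([12.5, 80.0, 140.0, np.nan])
bfactor_norm = normalize_scaled(bfactor_raw, scale=100.0)
bfactor_back = denormalize_scaled(bfactor_norm, scale=100.0)
print(bfactor_norm)
print(bfactor_back)

# A dihedral angle survives the round trip through its (sin, cos) pair
phi = np.deg2rad(-60.0)
sincos_norm = normalize_sincos([np.sin(phi), np.cos(phi)])
sin_phi, cos_phi = denormalize_sincos(sincos_norm)
phi_back = np.rad2deg(np.arctan2(sin_phi, cos_phi))
print(round(float(phi_back), 3))
```

The clipped value shows why the table marks the scale-and-clip inverses as lossy: once a raw value exceeds the scale, only the scale itself can be recovered.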

Feature categorization. Every feature key emits category='Structure' (the top-level redundancy / color bucket; see ut.DICT_COLOR_CAT['Structure'] = #2E6E5E, deep teal-green). The fine-grained split (Secondary structure (3-state), B-factor (CA mean), AlphaFold pLDDT (raw), etc.) lives in subcategory and is what CPPPlot.feature_map displays on the y-axis. Subcategory names follow the AAontology convention (descriptive name with source / detail in parentheses). With check_cat=True, the redundancy filter therefore groups all Structure features into one bucket; build_scales populates df_scales so that the max_cor gate can discriminate within that bucket.

Requires aaanalysis[pro] (biopython) plus a mkdssp / dssp binary on PATH. The depth feature additionally requires the msms binary; install it via conda install -c bioconda msms.

Single-chain PDBs only: the chain whose ATOM sequence best matches df_seq['sequence'] is selected automatically.

See also

EmbeddingPreprocessor: the PLM-embedding analog.
AnnotationPreprocessor: the PTM / functional-site analog.
NumericalFeature and CPP: the downstream consumers.
aaanalysis.combine_dict_nums(): stitch multiple dict_nums.

Examples

encode_pdb extracts per-residue features straight from PDB / CIF ATOM records into a [0, 1]-normalized dict_num. It covers AlphaFold model-file features (plddt, plddt_disorder, plddt_tier, sidechain chi*_sincos, ca_centroid_dist*, contact_count_*, hse) as well as experimental bfactor / depth. Here we use the bundled AF_TINY fixture (depth is omitted because it needs the external msms binary).

import warnings
from pathlib import Path
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
strp = aa.StructurePreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

feats = ['plddt', 'plddt_disorder', 'plddt_tier',
         'contact_count_8A', 'bfactor']
dict_pdb = strp.encode_pdb(df_seq=df_seq, pdb_folder=str(PDB_FIXTURES),
                          features=feats)
arr = dict_pdb['AF_TINY']
print('shape (L, D):', arr.shape)
print('value range:', round(float(np.nanmin(arr)), 3),
      '..', round(float(np.nanmax(arr)), 3))

shape (L, D): (30, 8)
value range: 0.0 .. 1.0

Each row is a residue, each column a feature dimension in [0, 1]. Pass return_df=True to get the (dict_num, df_seq_out) form; the pdb_ok column of df_seq_out flags entries whose structure file failed to load.

# Further parameters: plddt_disorder_threshold sets the pLDDT below which a
# residue counts as disordered, on_failure governs unreadable files, and
# return_df=True also returns a per-row status frame (pdb_ok).
dict_pdb_thr, df_pdb_status = strp.encode_pdb(
    df_seq=df_seq, pdb_folder=str(PDB_FIXTURES),
    features=['plddt', 'plddt_disorder'],
    plddt_disorder_threshold=50.0, on_failure='nan', return_df=True)
aa.display_df(df_pdb_status, n_rows=10, show_shape=True)

DataFrame shape: (1, 3)

	entry	sequence	pdb_ok
1	AF_TINY	ACDEFGHIKLMNPQRSTVWYACDEFGHIKL	True
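
A status frame like the one above can be used to keep only the entries that loaded and to check how many residues are resolved. The sketch below runs on hypothetical stand-in data rather than on real encode_pdb output: the entries P1 and P2 and their arrays are made up, and the all-NaN array for the failed entry is an assumption about what on_failure='nan' produces.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the (dict_num, df_seq_out) pair.
df_status = pd.DataFrame({"entry": ["P1", "P2"],
                          "sequence": ["ACDE", "FGHI"],
                          "pdb_ok": [True, False]})
dict_pdb = {
    "P1": np.array([[0.9, 0.0], [0.8, 0.0], [np.nan, np.nan], [0.4, 1.0]]),
    "P2": np.full((4, 2), np.nan),  # assumption: failed entries are all-NaN
}

# Keep only the entries whose structure file loaded
ok_entries = df_status.loc[df_status["pdb_ok"], "entry"].tolist()
dict_ok = {entry: dict_pdb[entry] for entry in ok_entries}

# Fraction of residues with at least one non-NaN feature value
coverage = {entry: float(np.mean(~np.isnan(arr).all(axis=1)))
            for entry, arr in dict_ok.items()}
print(ok_entries)
print(coverage)
```

Checking coverage before downstream analysis makes it easier to spot proteins where most positions are unresolved.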