StructurePreprocessor

class StructurePreprocessor(verbose=True)

Bases: object

Preprocessing class ([pro], requires aaanalysis[pro]) for protein structure features (PDB / CIF / AlphaFold).

Turns local structure files into the [0, 1]-normalized per-residue dict_num consumed by CPP.run_num(). Each encode_* method reads one kind of source from a folder and returns an (L, D) tensor per protein (L residues, D feature dimensions):

  • encode_dssp(): Define Secondary Structure of Proteins (DSSP)-derived geometry.

  • encode_pdb(): PDB ATOM-record features.

  • encode_pae(): AlphaFold Predicted Aligned Error (PAE) summaries.

  • encode_domains(): domain segmentation.

fetch_alphafold() downloads the input files when they are not already available locally. A secondary, scale-based path (build_scales() / build_cat()) feeds the amino acid (AA)-scale CPP.run().
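To make the dict_num layout concrete, here is a minimal NumPy sketch of two per-residue dictionaries for one protein being joined along the D axis. The arrays, the entry name P1, and the helper stack_along_d are hypothetical illustrations; this is not the implementation of aaanalysis.combine_dict_nums(), only the shape logic it is described as following.

```python
import numpy as np

# Hypothetical encoder outputs for one 5-residue protein "P1" (values invented).
dict_a = {"P1": np.array([[0.9], [0.8], [np.nan], [0.4], [0.7]])}   # shape (5, 1)
dict_b = {"P1": np.array([[1, 0, 0],
                          [1, 0, 0],
                          [0, 1, 0],
                          [0, 0, 1],
                          [0, 0, 1]], dtype=float)}                 # shape (5, 3)


def stack_along_d(*dict_nums):
    """Concatenate per-protein (L, D_i) arrays along the feature axis (axis=1).

    Hypothetical helper: only entries present in every input are kept, and
    all arrays for one entry must share the same number of residues L.
    """
    shared = set(dict_nums[0]).intersection(*dict_nums[1:])
    out = {}
    for entry in sorted(shared):
        arrays = [d[entry] for d in dict_nums]
        if len({a.shape[0] for a in arrays}) != 1:
            raise ValueError(f"Length mismatch for entry {entry!r}")
        out[entry] = np.concatenate(arrays, axis=1)
    return out


combined = stack_along_d(dict_a, dict_b)
print(combined["P1"].shape)  # one row per residue, D = 1 + 3 columns
```

Unresolved positions stay NaN after stacking, so downstream code should use NaN-aware reductions such as np.nanmin and np.nanmax.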

Added in version 1.1.0.

Parameters:

verbose (bool)

Methods

build_cat(features[, dim_names_override])

Build the df_cat metadata frame for features.

build_scales(df_seq, dict_num, features[, ...])

Build df_scales by context-free per-amino acid (AA) averaging of the encoded corpus.

encode_domains(df_seq[, domain_folder, ...])

Read pre-computed domain segmentation files into dict_domains.

encode_dssp(df_seq[, pdb_folder, ss_mode, ...])

Run Define Secondary Structure of Proteins (DSSP) and the per-feature encoders to build a [0, 1]-normalized dict_dssp.

encode_pae(df_seq, pae_folder, features[, ...])

Load AlphaFold PAE sidecar JSONs and produce dict_pae.

encode_pdb(df_seq, pdb_folder, features[, ...])

Extract per-residue features from PDB ATOM records into dict_pdb.

fetch_alphafold(df_seq, out_folder[, ...])

Download AlphaFold model + Predicted Aligned Error (PAE) files for every entry into a folder.

get_domains(df_seq[, pdb_folder, ...])

Run a domain-segmentation tool and append a chopping column.

get_dssp(df_seq, pdb_folder[, features, ...])

Run Define Secondary Structure of Proteins (DSSP) and append per-residue list columns to df_seq.
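The "context-free per-amino acid averaging" named for build_scales can be pictured as pooling every residue of the encoded corpus by its amino acid letter and averaging each feature dimension. The sketch below uses pandas with invented data; per_aa_mean is a hypothetical helper, and the real build_scales may apply further steps (for example re-normalization or handling of missing amino acids) that are not shown here.

```python
import numpy as np
import pandas as pd


def per_aa_mean(df_seq, dict_num, dim_names):
    """Average each feature dimension over all residues sharing an amino acid.

    Hypothetical helper: NaN positions are skipped by the groupby mean.
    """
    frames = []
    for entry, seq in zip(df_seq["entry"], df_seq["sequence"]):
        frame = pd.DataFrame(dict_num[entry], columns=dim_names)
        frame.insert(0, "aa", list(seq))  # one amino acid letter per residue row
        frames.append(frame)
    pooled = pd.concat(frames, ignore_index=True)
    return pooled.groupby("aa").mean()


# Invented toy corpus: two short proteins with one feature dimension.
df_seq_toy = pd.DataFrame({"entry": ["P1", "P2"], "sequence": ["ACA", "CA"]})
dict_num_toy = {
    "P1": np.array([[0.2], [0.6], [0.4]]),
    "P2": np.array([[np.nan], [0.9]]),
}

df_scales_sketch = per_aa_mean(df_seq_toy, dict_num_toy, dim_names=["plddt"])
print(df_scales_sketch)
```

The result has one row per amino acid and one column per feature dimension, which is the general shape of a scale table; sequence context is deliberately ignored.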

__init__(verbose=True)
Parameters:

verbose (bool, default=True) – If True, verbose outputs are enabled.

Notes

  • This is the structure-side member of the per-residue dict_num family, alongside EmbeddingPreprocessor (protein language model (PLM) embeddings) and AnnotationPreprocessor (post-translational modification (PTM) / functional sites). All three emit [0, 1]-normalized tensors that NumericalFeature.get_parts() slices into the per-part inputs of CPP.run_num(), and that stack along the D axis via aaanalysis.combine_dict_nums(). The accompanying (df_scales, df_cat) pair names the D dimensions for the redundancy filter and output columns.

  • Feature value range. Always normalized to [0, 1] (NaN for unresolved positions). Use the table below to de-normalize back to raw units if needed:

    The recipes are the source of truth in feature_registry.NORMALIZATION_RECIPES; this table is generated to match.

  • Feature categorization. Every feature key emits category='Structure' (the top-level redundancy / color bucket; see ut.DICT_COLOR_CAT['Structure'] = #2E6E5E deep teal-green). The fine-grained split (Secondary structure (3-state), B-factor (CA mean), AlphaFold pLDDT (raw), etc.) lives in subcategory and is what CPPPlot.feature_map displays on the y-axis. Subcategory names follow the AAontology convention (descriptive name with source / detail in parentheses). The redundancy filter’s check_cat=True arm therefore groups all Structure features into one bucket; build_scales populates df_scales so the max_cor gate can discriminate within that bucket.

  • Requires aaanalysis[pro] (biopython) plus a mkdssp / dssp binary on PATH. The depth feature additionally requires the msms binary; install via conda install -c bioconda msms.

  • Single-chain PDBs only; the chain whose ATOM sequence best matches df_seq['sequence'] is selected automatically.
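The de-normalization mentioned in the notes above can be sketched for the simplest case, a linear min-max recipe. The bounds used here (0 to 100 for a pLDDT-like score) and the helper names are assumptions for illustration only; the actual per-feature recipes live in feature_registry.NORMALIZATION_RECIPES and may involve clipping or non-linear transforms that a plain linear inverse would not undo.

```python
import numpy as np


def normalize_01(raw, lo, hi):
    """Min-max scale raw values into [0, 1]; NaN positions stay NaN."""
    raw = np.asarray(raw, dtype=float)
    scaled = (raw - lo) / (hi - lo)
    return np.clip(scaled, 0.0, 1.0)  # np.clip propagates NaN unchanged


def denormalize_01(scaled, lo, hi):
    """Invert the linear min-max scaling (exact only for unclipped values)."""
    scaled = np.asarray(scaled, dtype=float)
    return scaled * (hi - lo) + lo


# Invented raw scores on an assumed 0..100 scale, with one unresolved residue.
raw_scores = np.array([92.5, 71.0, np.nan, 48.0])
norm = normalize_01(raw_scores, lo=0.0, hi=100.0)
back = denormalize_01(norm, lo=0.0, hi=100.0)
print(norm)
print(back)
```

Values that were clipped during normalization cannot be recovered, so a round trip is only exact for raw values inside the assumed bounds.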

Examples

encode_pdb extracts per-residue features straight from PDB / CIF ATOM records — AlphaFold model-file features (plddt, plddt_disorder, plddt_tier, sidechain chi*_sincos, ca_centroid_dist*, contact_count_*, hse) and experimental bfactor / depth — into a [0, 1]-normalized dict_num. Here we use the bundled AF_TINY fixture (depth is omitted as it needs the external msms binary).

import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import aaanalysis as aa

# Silence library output and warnings to keep the example output short
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

# Folder with the PDB test fixtures bundled inside the installed package
PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
stp = aa.StructurePreprocessor(verbose=False)

# Single-entry input: entry name plus its amino acid sequence
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})

# Features to extract; depth is left out because it needs the msms binary
feats = ['plddt', 'plddt_disorder', 'plddt_tier',
         'contact_count_8A', 'bfactor']
dict_pdb = stp.encode_pdb(df_seq=df_seq, pdb_folder=str(PDB_FIXTURES),
                          features=feats)

# Inspect the per-residue array of the single entry (NaN-aware min / max)
arr = dict_pdb['AF_TINY']
print('shape (L, D):', arr.shape)
print('value range:', round(float(np.nanmin(arr)), 3),
      '..', round(float(np.nanmax(arr)), 3))

shape (L, D): (30, 8)
value range: 0.0 .. 1.0

Each row is a residue, each column a feature dimension in [0, 1]. Pass return_df=True for the (dict_num, df_seq_out) form whose pdb_ok column flags entries whose structure file failed to load.
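As a hedged sketch of how the pdb_ok flag might be used downstream, the snippet below filters a mocked (dict_num, df_seq_out) pair with pandas. All data are invented stand-ins, pdb_ok is assumed to be boolean-like, and whether failed entries are missing from dict_num or present as all-NaN arrays is not stated in this page, so the code tolerates the missing case explicitly.

```python
import numpy as np
import pandas as pd

# Mocked stand-ins for the return_df=True output (values invented).
df_seq_out = pd.DataFrame({
    "entry": ["P1", "P2", "P3"],
    "sequence": ["ACD", "EFG", "HIK"],
    "pdb_ok": [True, False, True],   # assumed boolean-like flag
})
dict_num = {
    "P1": np.array([[0.1], [0.5], [0.9]]),
    "P3": np.array([[0.3], [np.nan], [0.7]]),
}

# Keep only entries whose structure file loaded; list the failures for review.
ok_mask = df_seq_out["pdb_ok"].astype(bool)
df_ok = df_seq_out[ok_mask].reset_index(drop=True)
failed = df_seq_out.loc[~ok_mask, "entry"].tolist()

# Restrict the arrays to the retained entries (skip any that are absent).
dict_ok = {e: dict_num[e] for e in df_ok["entry"] if e in dict_num}

# Fraction of residues per entry with no NaN in any feature dimension.
coverage = {e: float(np.mean(~np.isnan(a).any(axis=1)))
            for e, a in dict_ok.items()}
print(failed, coverage)
```

Checking residue coverage alongside pdb_ok helps separate entries that failed to load entirely from entries that loaded but have unresolved positions.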