aaanalysis.StructurePreprocessor

class aaanalysis.StructurePreprocessor(verbose=True)

Bases: object

Preprocessing class for protein structure features (PDB / CIF / AlphaFold) [Breimann25a].

Structure-side companion to EmbeddingPreprocessor, with the same instance-based interface. It produces the dict_num tensor that NumericalFeature.get_parts() slices into per-part inputs for CPP.run_num(), plus the (df_scales, df_cat) metadata pair that names the D feature dimensions.

Added in version 1.1.0.

Parameters:

verbose (bool)

Methods

build_cat([features, dim_names_override])

Build the df_cat metadata frame for features.

build_scales([df_seq, dict_num, features, ...])

Build df_scales by context-free per-AA averaging of the encoded corpus.

encode_domains([df_seq, domain_folder, ...])

Read pre-computed domain segmentation files into dict_domains.

encode_dssp([df_seq, pdb_folder, features, ...])

Run DSSP and the per-feature encoders to build a [0, 1]-normalized dict_dssp.

encode_pae([df_seq, pae_folder, features, ...])

Load AlphaFold PAE sidecar JSONs and produce dict_pae.

encode_pdb([df_seq, pdb_folder, features, ...])

Extract per-residue features from PDB ATOM records into dict_pdb.

get_domains([df_seq, pdb_folder, ...])

Run a domain-segmentation tool and append a chopping column.

get_dssp([df_seq, pdb_folder, features, ...])

Run DSSP and append per-residue list columns to df_seq.

__init__(verbose=True)
Parameters:

verbose (bool, default=True) – If True, verbose outputs are enabled.

Notes

  • Feature value range: always normalized to [0, 1] (NaN for unresolved positions). Use the table below to de-normalize back to raw units if needed:

    The recipes are the source of truth in feature_registry.NORMALIZATION_RECIPES; this table is generated to match.

  • Feature categorization. Every feature key emits category='Structure' (the top-level redundancy / color bucket; see ut.DICT_COLOR_CAT['Structure'] = #2E6E5E deep teal-green). The fine-grained split (Secondary structure (3-state), B-factor (CA mean), AlphaFold pLDDT (raw), etc.) lives in subcategory and is what CPPPlot.feature_map displays on the y-axis. Subcategory names follow the AAontology convention (descriptive name with source / detail in parentheses). The redundancy filter’s check_cat=True arm therefore groups all Structure features into one bucket; build_scales populates df_scales so the max_cor gate can discriminate within that bucket.

  • Requires aaanalysis[pro] (biopython) plus a mkdssp / dssp binary on PATH. The depth feature additionally requires the msms binary; install via conda install -c bioconda msms.

  • Single-chain PDBs only: the chain whose ATOM sequence best matches df_seq['sequence'] is selected automatically.
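As a rough illustration of the value-range note above, the sketch below maps normalized values back to raw units. It assumes a plain divide-by-constant recipe (raw = normalized × divisor); the DIVISORS table and the denormalize helper are hypothetical names used only here, and the two divisors are the ones quoted for the PAE features in the Examples. The authoritative recipes remain those in feature_registry.NORMALIZATION_RECIPES, which may use other transforms.

```python
import numpy as np

# Hypothetical divisor table for illustration only; the authoritative recipes
# live in feature_registry.NORMALIZATION_RECIPES.
DIVISORS = {
    "pae_row_mean": 31.75,   # Å, divisor quoted in the PAE example
    "pae_asymmetry": 10.0,   # Å, divisor quoted in the PAE example
}

def denormalize(values, feature):
    """Map [0, 1]-normalized values to raw units, assuming raw = norm * divisor."""
    return np.asarray(values, dtype=float) * DIVISORS[feature]

norm = np.array([0.0, 0.5, 1.0, np.nan])   # NaN marks an unresolved position
raw = denormalize(norm, "pae_row_mean")
print(raw)
```

NaN entries pass through unchanged, so unresolved positions stay marked after the conversion.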

Examples

Prepare AlphaFold model and PAE features for CPP.run_num using a small synthetic AlphaFold-style fixture (AF_TINY). The fixture ships in aaanalysis/_data/pdb_test/, so this notebook is self-contained and runs without any external downloads.

Each encoder normalizes its output to [0, 1] per the recipes documented in StructurePreprocessor’s class docstring.

import warnings
from pathlib import Path

import numpy as np
import pandas as pd
import aaanalysis as aa

aa.options['verbose'] = False
warnings.filterwarnings('ignore')

PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
stp = aa.StructurePreprocessor(verbose=False)
df_seq = pd.DataFrame({
    'entry':     ['AF_TINY'],
    'sequence':  ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL'],
})
df_seq

encode_pdb reads pLDDT from the B-factor column of AF_TINY.pdb, computes side-chain chi1 and CA-CA contact density, and normalizes all outputs to [0, 1].

pdb_feats = ['plddt', 'plddt_disorder', 'plddt_tier',
             'chi1_sincos', 'ca_centroid_dist_norm', 'contact_count_8A']
dict_pdb = stp.encode_pdb(
    df_seq=df_seq, pdb_folder=str(PDB_FIXTURES),
    features=pdb_feats)
print('dict_pdb shape per entry:', {k: v.shape for k, v in dict_pdb.items()})
print('value range:', float(np.nanmin(dict_pdb['AF_TINY'])),
      float(np.nanmax(dict_pdb['AF_TINY'])))
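To make the pLDDT step concrete, here is a minimal standalone sketch of reading per-residue B-factors from CA ATOM records using the fixed-width PDB column layout. It is not the library's implementation: atom_line and ca_bfactors are hypothetical helpers, the coordinates are invented, and dividing by 100 is an assumed normalization based on pLDDT spanning 0 to 100.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, occ, bfac):
    """Format one ATOM record with the standard fixed-width PDB columns."""
    return (f"ATOM  {serial:>5} {name:<4} {resname:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{bfac:6.2f}")

def ca_bfactors(pdb_text):
    """Return the B-factor of every CA atom, in file order."""
    values = []
    for line in pdb_text.splitlines():
        # Columns (1-based): record 1-6, atom name 13-16, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return values

# Invented records for illustration (not taken from AF_TINY.pdb).
pdb_text = "\n".join([
    atom_line(1, " N  ", "ALA", "A", 1, 11.104, 6.134, -6.504, 1.0, 91.2),
    atom_line(2, " CA ", "ALA", "A", 1, 11.639, 6.071, -5.147, 1.0, 92.5),
    atom_line(3, " CA ", "CYS", "A", 2, 12.000, 7.000, -4.000, 1.0, 48.0),
])
plddt_raw = ca_bfactors(pdb_text)
plddt_norm = [v / 100.0 for v in plddt_raw]  # assumed recipe: divide by 100
print(plddt_raw, plddt_norm)
```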

encode_pae loads AF_TINY_pae.json (the AF predicted-aligned-error matrix) and summarizes it per residue as row-mean / local-mean / distal-mean / asymmetry. All outputs are normalized to [0, 1] (divisor 31.75 Å for most keys, 10 Å for pae_asymmetry).

import shutil
import tempfile

# Stage the PAE sidecar in a scratch folder under the name AF_TINY.json.
pae_dir = tempfile.mkdtemp()
shutil.copy(PDB_FIXTURES / 'AF_TINY_pae.json', Path(pae_dir) / 'AF_TINY.json')

pae_feats = ['pae_row_mean', 'pae_local_mean', 'pae_distal_mean',
             'pae_asymmetry']
dict_pae = stp.encode_pae(
    df_seq=df_seq, pae_folder=pae_dir,
    features=pae_feats, local_window=5)
print('dict_pae shape per entry:', {k: v.shape for k, v in dict_pae.items()})
print('local vs distal mean PAE on AF_TINY:',
      float(np.nanmean(dict_pae['AF_TINY'][:, 1])), 'vs',
      float(np.nanmean(dict_pae['AF_TINY'][:, 2])))
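The four per-residue summaries can be sketched with plain NumPy as below. This is one plausible reading, not the library's code: the function pae_summaries is hypothetical, "local" is assumed to mean sequence separation between 1 and local_window, "distal" anything farther, and asymmetry the row mean of |PAE[i, j] − PAE[j, i]|. The divisors are the ones quoted above.

```python
import numpy as np

# Divisors quoted in the text: 31.75 Å for the three means, 10 Å for asymmetry.
DIVISORS = np.array([31.75, 31.75, 31.75, 10.0])

def pae_summaries(pae, local_window=5):
    """Return an (L, 4) array: row / local / distal mean and asymmetry."""
    pae = np.asarray(pae, dtype=float)
    idx = np.arange(pae.shape[0])
    sep = np.abs(idx[:, None] - idx[None, :])      # sequence separation |i - j|
    local = (sep > 0) & (sep <= local_window)      # assumed: self excluded
    distal = sep > local_window
    row_mean = pae.mean(axis=1)
    local_mean = np.nanmean(np.where(local, pae, np.nan), axis=1)
    distal_mean = np.nanmean(np.where(distal, pae, np.nan), axis=1)
    asymmetry = np.abs(pae - pae.T).mean(axis=1)   # assumed definition
    raw = np.column_stack([row_mean, local_mean, distal_mean, asymmetry])
    return np.clip(raw / DIVISORS, 0.0, 1.0)

# Synthetic 12-residue matrix (not AF_TINY): error grows with separation,
# and the upper triangle is 1 Å worse than the lower one.
n = 12
i, j = np.indices((n, n))
pae = 2.0 * np.abs(i - j) + (j > i)
summary = pae_summaries(pae, local_window=5)
print(summary.shape)
```

With a matrix whose error grows with sequence separation, the local mean is lower than the distal mean for every residue, which is the contrast the print statement in the example above reports for AF_TINY.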

aa.combine_dict_nums stitches the two dict_num tensors along the D (feature) axis. build_scales derives df_scales by per-AA averaging over the user corpus, which the redundancy filter’s max_cor arm needs to be meaningful. build_cat produces the (D, 5) df_cat metadata frame.

feats = pdb_feats + pae_feats
dict_num = aa.combine_dict_nums(dict_nums=[dict_pdb, dict_pae])
df_scales = stp.build_scales(
    df_seq=df_seq, dict_num=dict_num, features=feats)
df_cat = stp.build_cat(features=feats)
print('df_scales:', df_scales.shape, ' df_cat:', df_cat.shape)
df_cat.head(8)
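The two operations described above, stitching along the D axis and context-free per-AA averaging, can be sketched on toy data as follows. The helpers stitch and per_aa_mean are hypothetical stand-ins written for this illustration; they assume each dict maps an entry to an (L, D) array and that df_scales is indexed by amino acid with one column per feature.

```python
import numpy as np
import pandas as pd

def stitch(dict_a, dict_b):
    """Concatenate two {entry: (L, D)} dicts along the feature (D) axis."""
    return {k: np.concatenate([dict_a[k], dict_b[k]], axis=1) for k in dict_a}

def per_aa_mean(sequences, dict_num, names):
    """Average each feature over all positions holding the same amino acid."""
    rows = {}
    for entry, seq in sequences.items():
        for aa, vec in zip(seq, dict_num[entry]):
            rows.setdefault(aa, []).append(vec)
    data = {aa: np.nanmean(np.vstack(v), axis=0) for aa, v in sorted(rows.items())}
    return pd.DataFrame.from_dict(data, orient="index", columns=names)

# Toy corpus: one 3-residue entry with 1 + 2 feature dimensions.
sequences = {"toy": "ACA"}
dict_a = {"toy": np.array([[0.2], [0.9], [0.6]])}
dict_b = {"toy": np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])}
dict_num_toy = stitch(dict_a, dict_b)
df_scales_toy = per_aa_mean(sequences, dict_num_toy, ["f1", "f2", "f3"])
print(df_scales_toy)
```

Because the averaging ignores sequence context, two features that differ only in where along the chain they vary can still collapse to similar per-AA profiles, which is what a correlation-based gate such as max_cor then compares.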

Every category resolves to the fixed Structure color (#2E6E5E), which closes the v1 CPPPlot defect where unknown categories raised ValueError.

import aaanalysis.utils as ut
print('unique categories:', df_cat['category'].unique().tolist())
print('color for Structure:', ut.DICT_COLOR_CAT.get('Structure'))

With a real corpus this (df_scales, df_cat, dict_num) triple plugs into NumericalFeature.get_parts(...) and CPP.run_num(...) exactly the way the integration test in tests/unit/cpp_tests/test_run_num_structural.py does. AF-DB bulk downloads can use the <entry>.cif.gz resolver path; PAE sidecars can use the AF-DB canonical filename AF-<entry>-F1-predicted_aligned_error_v4.json directly without renaming.
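For orientation, a filename lookup that accepts both sidecar names used on this page might look like the sketch below. find_pae_file is a hypothetical helper, not part of aaanalysis, and the candidate order is an assumption; P12345 and Q99999 are placeholder accessions.

```python
import tempfile
from pathlib import Path

def find_pae_file(folder, entry):
    """Return the first existing PAE sidecar for `entry`, or None."""
    candidates = [
        f"{entry}.json",                                    # name used in the example above
        f"AF-{entry}-F1-predicted_aligned_error_v4.json",   # AF-DB canonical name
    ]
    for name in candidates:
        path = Path(folder) / name
        if path.is_file():
            return path
    return None

with tempfile.TemporaryDirectory() as tmp:
    # Create an empty placeholder file under the AF-DB canonical name.
    (Path(tmp) / "AF-P12345-F1-predicted_aligned_error_v4.json").write_text("[]")
    found = find_pae_file(tmp, "P12345")
    missing = find_pae_file(tmp, "Q99999")
    found_name = found.name
print(found_name, missing)
```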