aaanalysis.StructurePreprocessor
- class aaanalysis.StructurePreprocessor(verbose=True)
Bases: object
Preprocessing class for protein structure features (PDB / CIF / AlphaFold) [Breimann25a].
Mirrors EmbeddingPreprocessor’s instance-based shape but is the structure-side companion: it produces the dict_num tensor that NumericalFeature.get_parts() slices into per-part inputs for CPP.run_num(), plus the (df_scales, df_cat) metadata pair that names the D dimensions.
Added in version 1.1.0.
- Parameters:
verbose (bool)
Methods
build_cat([features, dim_names_override]): Build the df_cat metadata frame for features.
build_scales([df_seq, dict_num, features, ...]): Build df_scales by context-free per-AA averaging of the encoded corpus.
encode_domains([df_seq, domain_folder, ...]): Read pre-computed domain segmentation files into dict_domains.
encode_dssp([df_seq, pdb_folder, features, ...]): Run DSSP and the per-feature encoders to build a [0, 1]-normalized dict_dssp.
encode_pae([df_seq, pae_folder, features, ...]): Load AlphaFold PAE sidecar JSONs and produce dict_pae.
encode_pdb([df_seq, pdb_folder, features, ...]): Extract per-residue features from PDB ATOM records into dict_pdb.
get_domains([df_seq, pdb_folder, ...]): Run a domain-segmentation tool and append a chopping column.
get_dssp([df_seq, pdb_folder, features, ...]): Run DSSP and append per-residue list columns to df_seq.
- __init__(verbose=True)
- Parameters:
verbose (bool, default=True) – If True, verbose outputs are enabled.
Notes
Feature value range: always normalized to [0, 1] (NaN for unresolved positions). Use the table below to de-normalize back to raw units if needed:
The recipes are the source of truth in feature_registry.NORMALIZATION_RECIPES; this table is generated to match.
Feature categorization. Every feature key emits category='Structure' (the top-level redundancy / color bucket; see ut.DICT_COLOR_CAT['Structure'] = #2E6E5E, deep teal-green). The fine-grained split (Secondary structure (3-state), B-factor (CA mean), AlphaFold pLDDT (raw), etc.) lives in subcategory and is what CPPPlot.feature_map displays on the y-axis. Subcategory names follow the AAontology convention (descriptive name with source / detail in parentheses). The redundancy filter’s check_cat=True arm therefore groups all Structure features into one bucket; build_scales populates df_scales so the max_cor gate can discriminate within that bucket.
Requires aaanalysis[pro] (biopython) plus a mkdssp / dssp binary on PATH. The depth feature additionally requires the msms binary; install via conda install -c bioconda msms.
Single-chain PDBs only: the chain whose ATOM sequence best matches df_seq[sequence] is selected automatically.
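The automatic chain selection mentioned in the Notes can be pictured as a best-match search over the ATOM sequences of the chains. The following is a minimal sketch under stated assumptions, not the library's implementation: the helper pick_best_chain and the use of difflib.SequenceMatcher as the similarity score are hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def pick_best_chain(target_seq, chain_seqs):
    """Return (chain_id, score) for the chain most similar to target_seq.

    chain_seqs maps a chain ID to the one-letter sequence read from its
    ATOM records. The similarity measure is an assumption of this sketch.
    """
    if not chain_seqs:
        raise ValueError("no chains to choose from")
    scores = {
        chain_id: SequenceMatcher(None, target_seq, seq, autojunk=False).ratio()
        for chain_id, seq in chain_seqs.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical two-chain structure: chain B matches the query exactly.
chains = {"A": "MKTAYIAKQR", "B": "ACDEFGHIKL"}
best_chain, best_score = pick_best_chain("ACDEFGHIKL", chains)
print(best_chain, round(best_score, 2))
```

A ratio-based score tolerates missing or unresolved residues in the ATOM records, which an exact string comparison would not.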
See also
EmbeddingPreprocessor: the PLM-embedding analog.
AnnotationPreprocessor: the PTM / functional-site analog.
NumericalFeature and CPP: the downstream consumers.
aaanalysis.combine_dict_nums(): stitch multiple dict_nums.
Examples
Run AF model + PAE features through CPP.run_num against a small synthetic AlphaFold-style fixture (AF_TINY). The fixture ships in aaanalysis/_data/pdb_test/ so this notebook is self-contained and runs without any external downloads.
Each encoder normalizes its output to [0, 1] per the recipes documented in StructurePreprocessor’s class docstring.

    import warnings
    from pathlib import Path
    import numpy as np
    import pandas as pd
    import aaanalysis as aa

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
    stp = aa.StructurePreprocessor(verbose=False)
    df_seq = pd.DataFrame({
        'entry': ['AF_TINY'],
        'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL'],
    })
    df_seq
Reads pLDDT from the B-factor column of AF_TINY.pdb; computes side-chain chi1 and CA-CA contact density; outputs normalized to [0, 1].

    pdb_feats = ['plddt', 'plddt_disorder', 'plddt_tier',
                 'chi1_sincos', 'ca_centroid_dist_norm', 'contact_count_8A']
    dict_pdb = stp.encode_pdb(
        df_seq=df_seq, pdb_folder=str(PDB_FIXTURES), features=pdb_feats)
    print('dict_pdb shape per entry:', {k: v.shape for k, v in dict_pdb.items()})
    print('value range:',
          float(np.nanmin(dict_pdb['AF_TINY'])),
          float(np.nanmax(dict_pdb['AF_TINY'])))
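To make the "pLDDT from the B-factor column" step concrete, here is a minimal stdlib sketch of reading that field from fixed-column PDB ATOM records. It is not the code behind encode_pdb: the helpers make_atom_line and read_ca_plddt are hypothetical, and dividing by 100 is an assumed normalization based on the 0 to 100 range of pLDDT.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, b_factor):
    """Build a fixed-column PDB ATOM record (coordinates set to 0 for brevity)."""
    return (
        f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b_factor:6.2f}"
    )


def read_ca_plddt(pdb_lines, scale=100.0):
    """Return per-residue pLDDT / scale, read from the B-factor field of CA atoms."""
    values = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()  # atom name, columns 13-16
        if atom_name != "CA":
            continue
        b_factor = float(line[60:66])    # tempFactor, columns 61-66
        values.append(b_factor / scale)  # assumed normalization to [0, 1]
    return values


# Hypothetical two-residue fragment; the N atom is skipped by the CA filter.
lines = [
    make_atom_line(1, " N  ", "ALA", "A", 1, 91.5),
    make_atom_line(2, " CA ", "ALA", "A", 1, 90.0),
    make_atom_line(3, " CA ", "CYS", "A", 2, 50.0),
]
plddt = read_ca_plddt(lines)
print(plddt)
```

AlphaFold models store the same pLDDT on every atom of a residue, so taking the CA atom is one simple way to get a single value per residue.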
Loads AF_TINY_pae.json (the AF predicted-aligned-error matrix); summarizes per residue as row-mean / local-mean / distal-mean / asymmetry. All normalized to [0, 1] (divisor 31.75 Å for most keys, 10 Å for pae_asymmetry).

    import shutil
    import tempfile

    pae_dir = tempfile.mkdtemp()
    shutil.copy(PDB_FIXTURES / 'AF_TINY_pae.json', Path(pae_dir) / 'AF_TINY.json')
    pae_feats = ['pae_row_mean', 'pae_local_mean', 'pae_distal_mean', 'pae_asymmetry']
    dict_pae = stp.encode_pae(
        df_seq=df_seq, pae_folder=pae_dir, features=pae_feats, local_window=5)
    print('dict_pae shape per entry:', {k: v.shape for k, v in dict_pae.items()})
    print('local vs distal mean PAE on AF_TINY:',
          float(np.nanmean(dict_pae['AF_TINY'][:, 1])),
          'vs',
          float(np.nanmean(dict_pae['AF_TINY'][:, 2])))
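The per-residue PAE summaries can be sketched in a few lines of plain Python. This is an illustration only, not the logic of encode_pae: the helper summarize_pae is hypothetical, and the definitions of "local" (|i - j| <= local_window) and "distal" (everything further away) are assumptions. The divisor 31.75 Å is the one quoted above, and multiplying by it de-normalizes a value back to Å.

```python
import math

PAE_DIVISOR = 31.75  # divisor quoted above for most PAE keys (in Angstrom)


def mean_or_nan(values):
    """Arithmetic mean, or NaN when there is nothing to average."""
    return sum(values) / len(values) if values else math.nan


def summarize_pae(pae, local_window=5):
    """Per-residue (row_mean, local_mean, distal_mean), each divided by PAE_DIVISOR.

    Assumption of this sketch: 'local' means |i - j| <= local_window and
    'distal' means all positions further away along the sequence.
    """
    rows = []
    for i, row in enumerate(pae):
        local = [v for j, v in enumerate(row) if abs(i - j) <= local_window]
        distal = [v for j, v in enumerate(row) if abs(i - j) > local_window]
        rows.append(tuple(
            mean_or_nan(part) / PAE_DIVISOR for part in (row, local, distal)
        ))
    return rows


# Hypothetical 4-residue PAE matrix in Angstrom: error grows with |i - j|.
pae = [[5.0 * abs(i - j) for j in range(4)] for i in range(4)]
summary = summarize_pae(pae, local_window=1)

# De-normalize the first residue's row mean back to Angstrom.
row_mean_raw = summary[0][0] * PAE_DIVISOR
print(summary[0], row_mean_raw)
```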
combine_dict_nums stitches the two dict_nums along the D axis. build_scales populates the per-AA-averaged df_scales from the user corpus, which the redundancy filter’s max_cor arm needs to be meaningful. build_cat produces the (D, 5) metadata.

    feats = pdb_feats + pae_feats
    dict_num = aa.combine_dict_nums(dict_nums=[dict_pdb, dict_pae])
    df_scales = stp.build_scales(
        df_seq=df_seq, dict_num=dict_num, features=feats)
    df_cat = stp.build_cat(features=feats)
    print('df_scales:', df_scales.shape, ' df_cat:', df_cat.shape)
    df_cat.head(8)
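The two ideas above, stitching along the D axis and context-free per-AA averaging, can be sketched with nested lists. This is a simplified illustration, not the code behind combine_dict_nums or build_scales: the helpers combine_along_d and per_aa_average are hypothetical, and skipping NaN values during averaging is an assumption.

```python
import math


def combine_along_d(rows_a, rows_b):
    """Concatenate two (L, D) matrices row by row into (L, D_a + D_b)."""
    if len(rows_a) != len(rows_b):
        raise ValueError("both matrices must cover the same residues")
    return [a + b for a, b in zip(rows_a, rows_b)]


def per_aa_average(sequences, matrices):
    """Average each feature column over all residues of the same amino acid.

    sequences: dict entry -> sequence; matrices: dict entry -> (L, D) lists.
    NaN values (unresolved positions) are skipped (assumption of this sketch).
    Returns a dict mapping each amino acid to a list of D means.
    """
    sums, counts = {}, {}
    for entry, seq in sequences.items():
        for aa, row in zip(seq, matrices[entry]):
            aa_sums = sums.setdefault(aa, [0.0] * len(row))
            aa_counts = counts.setdefault(aa, [0] * len(row))
            for d, value in enumerate(row):
                if not math.isnan(value):
                    aa_sums[d] += value
                    aa_counts[d] += 1
    return {
        aa: [total / n if n else math.nan
             for total, n in zip(sums[aa], counts[aa])]
        for aa in sums
    }


# Hypothetical toy corpus: one entry, three residues, 1 + 1 feature columns.
seqs = {"P1": "ACA"}
part_a = {"P1": [[0.2], [0.5], [0.4]]}
part_b = {"P1": [[1.0], [0.0], [math.nan]]}
combined = {k: combine_along_d(part_a[k], part_b[k]) for k in seqs}
scales = per_aa_average(seqs, combined)
print(scales)
```

Because the average ignores sequence context, every occurrence of an amino acid contributes equally. The result is a per-AA scale that a correlation-based gate can compare across features.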
Every category resolves to the locked Structure color (#2E6E5E), which closes the v1 CPPPlot defect where unknown categories raised ValueError.

    import aaanalysis.utils as ut

    print('unique categories:', df_cat['category'].unique().tolist())
    print('color for Structure:', ut.DICT_COLOR_CAT.get('Structure'))
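The failure mode described above can be illustrated with a strict category-to-color lookup. This is a hypothetical sketch, not CPPPlot's code: the function color_for and both registries are invented for illustration; only the Structure hex value comes from the text.

```python
def color_for(category, color_by_category):
    """Strict lookup: unknown categories raise instead of falling back silently."""
    try:
        return color_by_category[category]
    except KeyError:
        raise ValueError(f"no color registered for category {category!r}") from None


# Before: a registry without a 'Structure' entry (hypothetical, empty here).
old_registry = {}
try:
    color_for("Structure", old_registry)
    raised = False
except ValueError:
    raised = True

# After: the 'Structure' color quoted in the text is registered.
new_registry = {"Structure": "#2E6E5E"}
structure_color = color_for("Structure", new_registry)
print(raised, structure_color)
```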
With a real corpus this (df_scales, df_cat, dict_num) triple plugs into NumericalFeature.get_parts(...) and CPP.run_num(...) exactly the way the integration test in tests/unit/cpp_tests/test_run_num_structural.py does. AF-DB bulk downloads can use the <entry>.cif.gz resolver path; PAE sidecars can use the AF-DB canonical filename AF-<entry>-F1-predicted_aligned_error_v4.json directly without renaming.
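A filename resolver that accepts both naming schemes for PAE sidecars might look like the sketch below. It is an assumption-laden illustration, not the resolver inside encode_pae: the helpers candidate_pae_names and find_pae_file are hypothetical, and so is the priority order. The plain <entry>.json name mirrors the example above, and the canonical name is the AF-DB pattern quoted in the text.

```python
import tempfile
from pathlib import Path


def candidate_pae_names(entry):
    """File names to try for an entry's PAE sidecar, in (assumed) priority order."""
    return [
        f"{entry}.json",                                   # plain name used in the example
        f"AF-{entry}-F1-predicted_aligned_error_v4.json",  # AF-DB canonical name
    ]


def find_pae_file(pae_folder, entry):
    """Return the first existing candidate path, or None if no sidecar is found."""
    for name in candidate_pae_names(entry):
        path = Path(pae_folder) / name
        if path.is_file():
            return path
    return None


# Hypothetical folder holding one AF-DB-style sidecar for accession P12345.
with tempfile.TemporaryDirectory() as folder:
    canonical = Path(folder) / "AF-P12345-F1-predicted_aligned_error_v4.json"
    canonical.write_text("[]")
    found = find_pae_file(folder, "P12345")
    found_name = found.name if found else None
    missing = find_pae_file(folder, "Q00000")
print(found_name, missing)
```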