StructurePreprocessor
- class StructurePreprocessor(verbose=True)
Bases: object
Preprocessing class ([pro], requires aaanalysis[pro]) for protein structure features (PDB / CIF / AlphaFold).
Turns local structure files into the [0, 1]-normalized per-residue dict_num consumed by CPP.run_num(). Each encode_* method reads one kind of source from a folder and returns a (L, D) tensor per protein: Define Secondary Structure of Proteins (DSSP)-derived geometry (encode_dssp()), PDB ATOM-record features (encode_pdb()), AlphaFold Predicted Aligned Error (PAE) summaries (encode_pae()), or domain segmentation (encode_domains()). fetch_alphafold() downloads the input files first when you do not already have them locally. A secondary scale-based path (build_scales() / build_cat()) feeds the amino acid (AA)-scale CPP.run().
Added in version 1.1.0.
- Parameters:
verbose (bool)
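As a rough illustration of the data layout described above, the sketch below builds a toy dict_num by hand: one (L, D) table per entry, min-max scaled to [0, 1] per column, with NaN kept for unresolved positions. It does not call aaanalysis. The helper minmax_columns and the toy values are hypothetical, and the library's actual normalization recipes may differ.

```python
import math

def minmax_columns(rows):
    """Scale each column of an (L, D) list of lists to [0, 1]; NaN stays NaN."""
    scaled = [list(row) for row in rows]
    for d in range(len(rows[0])):
        observed = [row[d] for row in rows if not math.isnan(row[d])]
        if not observed:
            continue  # column has no resolved positions at all
        lo, hi = min(observed), max(observed)
        span = (hi - lo) or 1.0  # constant column: avoid division by zero
        for row in scaled:
            if not math.isnan(row[d]):
                row[d] = (row[d] - lo) / span
    return scaled

# Hypothetical raw per-residue values: 4 residues (L=4), 2 feature dims (D=2).
raw = [
    [90.0, 12.0],
    [70.0, math.nan],  # unresolved position in the second dimension
    [50.0, 20.0],
    [80.0, 16.0],
]

# A dict_num maps each entry ID to its (L, D) table.
dict_num = {"P1": minmax_columns(raw)}

print("shape (L, D):", (len(dict_num["P1"]), len(dict_num["P1"][0])))
for row in dict_num["P1"]:
    print(row)
```

In practice the tables are NumPy arrays, as in the Examples section; plain lists are used here only to keep the sketch dependency-free.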
Methods
build_cat(features[, dim_names_override]) – Build the df_cat metadata frame for features.
build_scales(df_seq, dict_num, features[, ...]) – Build df_scales by context-free per-amino acid (AA) averaging of the encoded corpus.
encode_domains(df_seq[, domain_folder, ...]) – Read pre-computed domain segmentation files into dict_domains.
encode_dssp(df_seq[, pdb_folder, ss_mode, ...]) – Run Define Secondary Structure of Proteins (DSSP) and the per-feature encoders to build a [0, 1]-normalized dict_dssp.
encode_pae(df_seq, pae_folder, features[, ...]) – Load AlphaFold PAE sidecar JSONs and produce dict_pae.
encode_pdb(df_seq, pdb_folder, features[, ...]) – Extract per-residue features from PDB ATOM records into dict_pdb.
fetch_alphafold(df_seq, out_folder[, ...]) – Download AlphaFold model + Predicted Aligned Error (PAE) files for every entry into a folder.
get_domains(df_seq[, pdb_folder, ...]) – Run a domain-segmentation tool and append a chopping column.
get_dssp(df_seq, pdb_folder[, features, ...]) – Run Define Secondary Structure of Proteins (DSSP) and append per-residue list columns to df_seq.
- __init__(verbose=True)
- Parameters:
verbose (bool, default=True) – If True, verbose outputs are enabled.
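The build_scales summary above describes context-free per-amino-acid averaging. One plausible reading, sketched below without calling aaanalysis, is to pool every residue of the corpus by amino acid letter and average each feature dimension, skipping NaN. The helper average_by_amino_acid and the toy data are hypothetical; the real method takes df_seq, dict_num and features, and may differ in its details.

```python
import math
from collections import defaultdict

def average_by_amino_acid(sequences, dict_num):
    """Mean of each feature dimension over all residues sharing one letter."""
    pooled = defaultdict(list)  # (amino acid, dimension) -> observed values
    n_dims = 0
    for entry, seq in sequences.items():
        for aa, row in zip(seq, dict_num[entry]):
            n_dims = max(n_dims, len(row))
            for d, value in enumerate(row):
                if not math.isnan(value):  # unresolved positions are skipped
                    pooled[(aa, d)].append(value)
    # Letters with no resolved value in any dimension are left out entirely.
    letters = sorted({aa for aa, _ in pooled})
    return {
        aa: [
            sum(pooled[(aa, d)]) / len(pooled[(aa, d)])
            if pooled[(aa, d)] else math.nan
            for d in range(n_dims)
        ]
        for aa in letters
    }

# Hypothetical corpus: two short proteins with D=2 normalized features.
sequences = {"P1": "ACA", "P2": "CA"}
dict_num = {
    "P1": [[0.25, 0.0], [0.5, 1.0], [0.75, math.nan]],
    "P2": [[0.25, 0.5], [0.5, 1.0]],
}

scales = average_by_amino_acid(sequences, dict_num)
for aa, means in scales.items():
    print(aa, means)
```

Because sequence position is ignored, the result is one value per amino acid and dimension, which is the shape an AA-scale needs.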
Notes
This is the structure-side member of the per-residue dict_num family, alongside EmbeddingPreprocessor (protein language model (PLM) embeddings) and AnnotationPreprocessor (post-translational modification (PTM) / functional sites). All three emit [0, 1]-normalized tensors that NumericalFeature.get_parts() slices into the per-part inputs of CPP.run_num(), and that stack along the D axis via aaanalysis.combine_dict_nums(). The accompanying (df_scales, df_cat) pair names the D dimensions for the redundancy filter and output columns.
Feature value range: always normalized to [0, 1] (NaN for unresolved positions). Use the table below to de-normalize back to raw units if needed:
The recipes are the source of truth in feature_registry.NORMALIZATION_RECIPES; this table is generated to match.
Feature categorization. Every feature key emits category='Structure' (the top-level redundancy / color bucket; see ut.DICT_COLOR_CAT['Structure'] = #2E6E5E, deep teal-green). The fine-grained split (Secondary structure (3-state), B-factor (CA mean), AlphaFold pLDDT (raw), etc.) lives in subcategory and is what CPPPlot.feature_map displays on the y-axis. Subcategory names follow the AAontology convention (descriptive name with source / detail in parentheses). The redundancy filter's check_cat=True arm therefore groups all Structure features into one bucket; build_scales populates df_scales so the max_cor gate can discriminate within that bucket.
Requires aaanalysis[pro] (biopython) plus a mkdssp / dssp binary on PATH. The depth feature additionally requires the msms binary; install via conda install -c bioconda msms.
Single-chain PDBs only: the chain whose ATOM sequence best matches df_seq[sequence] is selected automatically.
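De-normalization is only as reliable as the recipe behind it. The sketch below assumes a plain min-max recipe, raw = lo + x * (hi - lo), and a hypothetical RECIPES mapping. AlphaFold reports pLDDT on a 0 to 100 scale, but the B-factor bounds here are invented for illustration. The real bounds and transforms live in feature_registry.NORMALIZATION_RECIPES and may not be min-max at all.

```python
import math

# Hypothetical (lo, hi) bounds per feature; NOT the library's actual recipes.
RECIPES = {
    "plddt": (0.0, 100.0),    # pLDDT is reported on a 0..100 scale
    "bfactor": (0.0, 200.0),  # invented bound, for illustration only
}

def denormalize(value, feature, recipes=RECIPES):
    """Invert a min-max scaling; NaN (unresolved position) stays NaN."""
    if math.isnan(value):
        return value
    lo, hi = recipes[feature]
    return lo + value * (hi - lo)

print("plddt:", denormalize(0.875, "plddt"))
print("bfactor:", denormalize(0.25, "bfactor"))
print("unresolved:", denormalize(math.nan, "plddt"))
```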
See also
- EmbeddingPreprocessor: the PLM-embedding analog.
- AnnotationPreprocessor: the PTM / functional-site analog.
- NumericalFeature and CPP: the downstream consumers.
- aaanalysis.combine_dict_nums(): stitch multiple dict_nums.
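To make the idea of stacking dict_nums along the D axis concrete, here is a minimal stand-in that concatenates per-residue rows for entries present in every input. The helper stack_dict_nums is hypothetical and is not aaanalysis.combine_dict_nums(), whose signature and handling of mismatched entries are not documented in this section.

```python
def stack_dict_nums(*dict_nums):
    """Concatenate per-residue rows along D for entries present in every input."""
    shared = set(dict_nums[0]).intersection(*dict_nums[1:])
    stacked = {}
    for entry in sorted(shared):
        tables = [dn[entry] for dn in dict_nums]
        # All inputs must describe the same residues, so L has to agree.
        if len({len(table) for table in tables}) != 1:
            raise ValueError(f"{entry}: inputs disagree on sequence length L")
        stacked[entry] = [
            [value for row in rows for value in row] for rows in zip(*tables)
        ]
    return stacked

# Hypothetical inputs: a (2, 2) structure table and a (2, 1) embedding table.
dict_struct = {"P1": [[0.1, 0.2], [0.3, 0.4]]}
dict_embed = {"P1": [[0.9], [0.8]], "P2": [[0.5]]}  # P2 has no structure

combined = stack_dict_nums(dict_struct, dict_embed)
print(combined)
```

Dropping entries that are missing from any input is a design choice of this sketch; padding the missing block with NaN would be an equally reasonable alternative.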
Examples
encode_pdb extracts per-residue features straight from PDB / CIF ATOM records into a [0, 1]-normalized dict_num: AlphaFold model-file features (plddt, plddt_disorder, plddt_tier, sidechain chi*_sincos, ca_centroid_dist*, contact_count_*, hse) and experimental bfactor / depth. Here we use the bundled AF_TINY fixture (depth is omitted as it needs the external msms binary).
    import warnings
    from pathlib import Path
    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut
    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')
    # Folder of test structures bundled with the package
    PDB_FIXTURES = Path(aa.__file__).resolve().parent / '_data' / 'pdb_test'
    stp = aa.StructurePreprocessor(verbose=False)
    # One row per protein: entry ID plus its sequence
    df_seq = pd.DataFrame({'entry': ['AF_TINY'], 'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
    feats = ['plddt', 'plddt_disorder', 'plddt_tier', 'contact_count_8A', 'bfactor']
    dict_pdb = stp.encode_pdb(df_seq=df_seq, pdb_folder=str(PDB_FIXTURES), features=feats)
    arr = dict_pdb['AF_TINY']
    print('shape (L, D):', arr.shape)
    # nanmin / nanmax ignore the NaN used for unresolved positions
    print('value range:', round(float(np.nanmin(arr)), 3), '..', round(float(np.nanmax(arr)), 3))
Output:
    shape (L, D): (30, 8)
    value range: 0.0 .. 1.0
Each row is a residue, each column a feature dimension in [0, 1]. Pass return_df=True for the (dict_num, df_seq_out) form whose pdb_ok column flags entries whose structure file failed to load.