StructurePreprocessor.fetch_alphafold

StructurePreprocessor.fetch_alphafold(df_seq=None, out_folder=None, file_format='pdb', timeout=30.0, skip_existing=True, on_failure='nan', return_df=False, max_workers=None)

Download AlphaFold model + Predicted Aligned Error (PAE) files for every entry into a folder.

Fetches each entry’s F1 structure and PAE sidecar from the AlphaFold Protein Structure Database [Varadi22] (https://alphafold.ebi.ac.uk) and saves them under the canonical filenames that encode_pdb(), encode_pae(), and get_dssp() already resolve, so a single call populates the pdb_folder / pae_folder those methods consume. The download URLs are resolved through the AlphaFold API, so the fetch tracks the current data version automatically (the file naming moved from v4 to v6 and will move again). This is the fetch_ (web) acquisition counterpart to the local get_ tools; it downloads all entries in one bulk call.

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers (UniProt accessions). The entry values are used as the AlphaFold-DB accessions to download.

  • out_folder (str or pathlib.Path) – Destination directory for the downloaded files. Its parent must exist; the leaf directory is created if absent.

  • file_format ({'pdb', 'cif'}, default='pdb') –

    Structure file type to download from AlphaFold DB (both are accepted by the downstream resolvers):

    • 'pdb': legacy Protein Data Bank format; matches the encode_pdb / get_dssp examples.

    • 'cif': Crystallographic Information File (mmCIF), the modern format without the line/column limits that constrain 'pdb' for very large structures.

  • timeout (float, default=30.0) – Per-request timeout in seconds.

  • skip_existing (bool, default=True) – If True, an entry whose model and PAE files both already exist in out_folder is not re-downloaded; an entry with only one of the two present re-fetches only the missing file.

  • on_failure ({'nan', 'drop', 'raise'}, default='nan') – Policy for entries missing from AlphaFold DB (a 404 on either file): 'nan' keeps the row marked not-ok; 'drop' removes it from the returned status; 'raise' raises RuntimeError.

  • return_df (bool, default=False) – If True, also return an echo of df_seq with an appended boolean alphafold_ok column as a second element.

  • max_workers (int, optional) – Number of threads for concurrent downloads. None (the default) or 1 fetches entries sequentially. A value greater than 1 downloads on a thread pool; the status table is reassembled in input order and is identical to the sequential result. Concurrency is opt-in because parallel requests to AlphaFold DB risk HTTP-429 throttling, which can turn otherwise successful downloads into failures.
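The input-order guarantee described for max_workers can be illustrated with the standard library alone. The sketch below is not the library's implementation: fake_download and fetch_all are hypothetical stand-ins, the entry labels are placeholders, and it assumes one independent task per entry. It relies on the fact that ThreadPoolExecutor.map yields results in the order of its inputs, regardless of which task finishes first.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_download(entry):
    """Hypothetical stand-in for one per-entry download (no network access)."""
    # Shorter labels sleep longer, so tasks finish out of input order.
    time.sleep(0.01 * (4 - len(entry)))
    return {"entry": entry, "alphafold_ok": True}


def fetch_all(entries, max_workers=None):
    """Run fake_download sequentially (None or 1) or on a thread pool (>1)."""
    if max_workers is None or max_workers <= 1:
        return [fake_download(e) for e in entries]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in input order, not completion order.
        return list(pool.map(fake_download, entries))


entries = ["A", "BB", "CCC"]  # placeholder labels, not real accessions
sequential = fetch_all(entries)
threaded = fetch_all(entries, max_workers=3)
print(sequential == threaded)
```

A small max_workers value is probably the safer starting point, since the throttling risk noted above grows with the number of parallel requests.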

Returns:

  • df_status (pd.DataFrame, shape (n_samples, 7)) – Per-entry download status with columns entry, model_ok, pae_ok, alphafold_ok, skipped, model_path, pae_path.

  • df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean alphafold_ok column.
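How the on_failure policy shapes the returned status can be sketched in plain Python. This is an illustration of the documented behaviour, not the library's code: apply_on_failure is a hypothetical helper, the rows are plain dictionaries that mirror a subset of the df_status columns, the entry labels are placeholders, and it assumes that alphafold_ok is true only when both model_ok and pae_ok are true.

```python
def apply_on_failure(rows, on_failure="nan"):
    """Apply a missing-entry policy to status rows (illustrative only)."""
    if on_failure not in {"nan", "drop", "raise"}:
        raise ValueError(
            f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}"
        )
    # Assumption: an entry counts as ok only when both files were obtained.
    rows = [dict(r, alphafold_ok=r["model_ok"] and r["pae_ok"]) for r in rows]
    missing = [r["entry"] for r in rows if not r["alphafold_ok"]]
    if on_failure == "raise" and missing:
        raise RuntimeError(f"Missing from AlphaFold DB: {missing}")
    if on_failure == "drop":
        return [r for r in rows if r["alphafold_ok"]]
    return rows  # 'nan': keep every row, with failures marked not-ok


rows = [
    {"entry": "ENTRY_A", "model_ok": True, "pae_ok": True},
    {"entry": "ENTRY_B", "model_ok": True, "pae_ok": False},
]
kept = apply_on_failure(rows, on_failure="nan")
dropped = apply_on_failure(rows, on_failure="drop")
print([r["entry"] for r in kept], [r["entry"] for r in dropped])
```

With the real DataFrame, the equivalent of the 'drop' branch would be a boolean filter on the alphafold_ok column.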

Notes

  • Only single-fragment models are fetched (the canonical AF-<entry>-F1-…-v4 filenames). Proteins fragmented across F2, F3, … in AlphaFold DB have no F1 model and are reported as not-ok.

  • Network failures other than a 404 (timeout, 5xx) raise RuntimeError and abort the bulk download. They are not absorbed by on_failure, which governs only the case of an entry missing from AlphaFold DB (AF-DB).

Raises:
  • ValueError – On invalid arguments.

  • RuntimeError – On network / response failure, or if any entry is missing under on_failure='raise'.
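Because a transient network failure aborts the whole bulk call with RuntimeError, a caller may want to wrap the call in a retry loop. The following is a minimal sketch under stated assumptions: fetch_with_retry and flaky_fetch are hypothetical names, the fetch argument is any zero-argument callable, and the backoff schedule is an arbitrary choice rather than anything the library prescribes.

```python
import time


def fetch_with_retry(fetch, attempts=3, wait=0.0):
    """Call fetch() until it succeeds or the attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except RuntimeError:
            if attempt == attempts:
                raise  # give up and surface the last error
            time.sleep(wait * attempt)  # linear backoff between attempts


calls = {"n": 0}


def flaky_fetch():
    """Hypothetical stand-in that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated timeout")
    return "df_status"


result = fetch_with_retry(flaky_fetch, attempts=3)
print(result, calls["n"])
```

In practice the callable could wrap the real call, for example a lambda around stp.fetch_alphafold(df_seq=df_seq, out_folder='af_files/'). With skip_existing=True, entries completed before the failure are not downloaded again on the next attempt. Note that on_failure='raise' also raises RuntimeError, so this wrapper would retry missing-entry errors too; the default 'nan' policy avoids that.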

Examples

fetch_alphafold bulk-downloads each entry’s AlphaFold-DB model file (AF-<entry>-F1-model_v4.pdb/.cif) and its PAE sidecar from https://alphafold.ebi.ac.uk into one folder, saving them under the names encode_pdb / encode_pae / get_dssp already resolve. A single call therefore populates the folder the encoders consume, with no glue code in between. It is the fetch_ (web) acquisition verb, the structure-side analog of AnnotationPreprocessor.fetch_uniprot.

It returns a per-entry status DataFrame (entry, model_ok, pae_ok, alphafold_ok, skipped, model_path, pae_path). A 404 (accession not in AF-DB, or fragmented F2+ proteins) is the soft failure governed by on_failure; other network errors raise RuntimeError.

Requires aaanalysis[pro] and network access — not executed here.

import aaanalysis as aa

# df_seq has an 'entry' column of UniProt accessions.
# stp = aa.StructurePreprocessor()
# df_status = stp.fetch_alphafold(df_seq=df_seq, out_folder='af_files/')
# # then encode straight from the populated folder:
# dict_pdb = stp.encode_pdb(df_seq=df_seq, pdb_folder='af_files/',
#                          features=['plddt', 'bfactor'])
# dict_pae = stp.encode_pae(df_seq=df_seq, pae_folder='af_files/',
#                          features=['pae_row_mean'])

Further parameters. StructurePreprocessor.fetch_alphafold also accepts:

  • file_format – Structure file type to download.

  • timeout – Per-request timeout in seconds.

  • skip_existing – If True, an entry whose model and PAE files both already exist in out_folder is not re-downloaded; an entry with only one of the two present re-fetches only the missing file.

  • return_df – If True, also return an echo of df_seq with an appended boolean alphafold_ok column as a second element.

  • max_workers – Number of threads for concurrent downloads; None / 1 (default) fetches sequentially, >1 downloads on a thread pool with input-order, byte-identical results (opt-in: parallel AlphaFold-DB requests risk HTTP-429 throttling).
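The skip_existing rule amounts to a per-file existence check. The sketch below shows that decision in isolation; files_to_fetch is a hypothetical helper, the file names are placeholders rather than the canonical AlphaFold-DB names, and the behaviour for skip_existing=False (always fetch both files) is an assumption, since the text above only describes the True case.

```python
from pathlib import Path
import tempfile


def files_to_fetch(model_path, pae_path, skip_existing=True):
    """Return which of the two files ('model', 'pae') still need downloading."""
    if not skip_existing:
        # Assumption: with skip_existing=False both files are always re-fetched.
        return ["model", "pae"]
    needed = []
    if not Path(model_path).exists():
        needed.append("model")
    if not Path(pae_path).exists():
        needed.append("pae")
    return needed


with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "example_model.pdb"  # placeholder name
    pae = Path(tmp) / "example_pae.json"     # placeholder name
    model.write_text("placeholder")          # only the model file exists
    only_missing = files_to_fetch(model, pae)
    forced = files_to_fetch(model, pae, skip_existing=False)
    pae.write_text("{}")                     # now both files exist
    nothing = files_to_fetch(model, pae)

print(only_missing, forced, nothing)
```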