StructurePreprocessor.fetch_alphafold
- StructurePreprocessor.fetch_alphafold(df_seq=None, out_folder=None, file_format='pdb', timeout=30.0, skip_existing=True, on_failure='nan', return_df=False, max_workers=None)
Download AlphaFold model + Predicted Aligned Error (PAE) files for every entry into a folder.
Fetches each entry’s F1 structure and PAE sidecar from the AlphaFold Protein Structure Database [Varadi22] (https://alphafold.ebi.ac.uk), saving them under the canonical filenames encode_pdb()/encode_pae()/get_dssp() already resolve, so a single call populates the pdb_folder/pae_folder those methods consume. The download URLs are resolved through the AlphaFold API, so the fetch tracks the current data version automatically (the file naming moved v4→v6 and will move again). This is the fetch_ (web) acquisition counterpart to the local get_ tools; it downloads all entries in one bulk call.
Added in version 1.1.0.
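Resolving download URLs through the API, rather than building versioned filenames by hand, can be sketched as follows. This is a minimal illustration and not the library's implementation: the endpoint template and the JSON field names (pdbUrl, cifUrl, paeDocUrl) are assumptions about the AlphaFold API that should be checked against the live service, resolve_urls is a hypothetical helper, and the record is a hand-written placeholder, not real API output.

```python
# Assumed prediction endpoint; queried once per accession in a real client.
API_TEMPLATE = "https://alphafold.ebi.ac.uk/api/prediction/{entry}"

# Assumed JSON field names in an API record; verify against the live API.
MODEL_URL_KEYS = {"pdb": "pdbUrl", "cif": "cifUrl"}
PAE_URL_KEY = "paeDocUrl"


def resolve_urls(record, file_format="pdb"):
    """Pick the model and PAE download URLs out of one API record (hypothetical helper)."""
    if file_format not in MODEL_URL_KEYS:
        raise ValueError(f"file_format must be 'pdb' or 'cif', got {file_format!r}")
    # The version suffix lives inside the URLs, so no version is hard-coded here.
    return record[MODEL_URL_KEYS[file_format]], record[PAE_URL_KEY]


# Placeholder record standing in for a parsed API response.
record = {
    "pdbUrl": "https://example.org/model.pdb",
    "cifUrl": "https://example.org/model.cif",
    "paeDocUrl": "https://example.org/pae.json",
}
model_url, pae_url = resolve_urls(record, file_format="cif")
```

Because the URLs are read from the response, a change in the data version needs no change in the client code.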
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers (UniProt accessions). The entry values are used as the AlphaFold-DB accessions to download.
out_folder (str or pathlib.Path) – Destination directory for the downloaded files. Its parent must exist; the leaf directory is created if absent.
file_format ({'pdb', 'cif'}, default='pdb') – Structure file type to download from AlphaFold DB (both are accepted by the downstream resolvers):
'pdb': legacy Protein Data Bank format; matches the encode_pdb/get_dssp examples.
'cif': macromolecular Crystallographic Information File (mmCIF), the modern format without the line/column limits that constrain 'pdb' for very large structures.
timeout (float, default=30.0) – Per-request timeout in seconds.
skip_existing (bool, default=True) – If True, an entry whose model and PAE files both already exist in out_folder is not re-downloaded; an entry with only one of the two present re-fetches only the missing file.
on_failure ({'nan', 'drop', 'raise'}, default='nan') – Policy for entries missing from AlphaFold DB (a 404 on either file):
'nan' keeps the row marked not-ok;
'drop' removes it from the returned status;
'raise' raises RuntimeError.
return_df (bool, default=False) – If True, also return an echo of df_seq with an appended boolean alphafold_ok column as a second element.
max_workers (int, optional) – Number of threads for concurrent downloads. None or 1 (default) fetches entries sequentially. Greater than 1 downloads on a thread pool; the status table is reassembled in input order and is identical to the sequential result. Concurrency is opt-in because parallel requests to AlphaFold DB risk HTTP-429 throttling that can turn successful downloads into failures.
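The input-order guarantee described for max_workers can be illustrated with the standard library alone. The sketch below is not the library's implementation: fetch_one is a hypothetical stand-in that performs no network I/O, fetch_all is an illustrative name, and the accessions are arbitrary examples. It relies on ThreadPoolExecutor.map, which yields results in the order of its inputs regardless of which task finishes first.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fetch_one(entry):
    """Hypothetical stand-in for a per-entry download (no network I/O)."""
    # Sleep a varying amount so entries finish out of order under a thread pool.
    time.sleep(0.01 * (ord(entry[-1]) % 3))
    return {"entry": entry, "alphafold_ok": True}


def fetch_all(entries, max_workers=None):
    """Return one status record per entry, always in input order."""
    if max_workers is None or max_workers <= 1:
        # Sequential path, matching the documented default.
        return [fetch_one(e) for e in entries]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of its inputs.
        return list(pool.map(fetch_one, entries))


entries = ["P05067", "P12345", "Q9Y6K9"]
sequential = fetch_all(entries)
parallel = fetch_all(entries, max_workers=3)
```

Here sequential and parallel hold the same records in the same order, which is the property the status table is documented to preserve.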
- Returns:
df_status (pd.DataFrame, shape (n_samples, 7)) – Per-entry download status with columns entry, model_ok, pae_ok, alphafold_ok, skipped, model_path, pae_path.
df_seq_out (pd.DataFrame) – Returned only when return_df=True. Echo of df_seq plus a boolean alphafold_ok column.
Notes
Only single-fragment models are fetched (the canonical AF-<entry>-F1-…-v4 filenames). Proteins fragmented across F2, F3, … in AlphaFold DB have no F1 model and are reported as not-ok.
Network failures other than a 404 (timeout, 5xx) raise RuntimeError and abort the bulk download. They are not absorbed by on_failure, which governs only the case of an entry missing from the AlphaFold Database (AF-DB).
- Raises:
ValueError – On invalid arguments.
RuntimeError – On network / response failure, or if any entry is missing under on_failure='raise'.
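The three on_failure policies can be sketched on plain status records. This is an illustration under stated assumptions, not the library's code: apply_on_failure is a hypothetical helper, and a list of dicts stands in for the status DataFrame so the example needs only the standard library.

```python
def apply_on_failure(rows, on_failure="nan"):
    """Apply the missing-entry policy to a list of status records (hypothetical helper)."""
    if on_failure not in ("nan", "drop", "raise"):
        raise ValueError(f"on_failure must be 'nan', 'drop' or 'raise', got {on_failure!r}")
    missing = [r["entry"] for r in rows if not r["alphafold_ok"]]
    if on_failure == "raise" and missing:
        raise RuntimeError(f"Entries missing from AlphaFold DB: {missing}")
    if on_failure == "drop":
        # Remove not-ok rows from the returned status.
        return [r for r in rows if r["alphafold_ok"]]
    # 'nan' (or 'raise' with nothing missing): keep every row as marked.
    return rows


# Two illustrative records: entry "A" succeeded, entry "B" returned a 404.
rows = [
    {"entry": "A", "model_ok": True, "pae_ok": True, "alphafold_ok": True},
    {"entry": "B", "model_ok": False, "pae_ok": False, "alphafold_ok": False},
]
kept = apply_on_failure(rows, "nan")
dropped = apply_on_failure(rows, "drop")
```

Only the missing-entry case passes through this policy; timeouts and 5xx responses are documented to raise RuntimeError before any policy is applied.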
Examples
fetch_alphafold bulk-downloads each entry’s AlphaFold-DB model file (AF-<entry>-F1-model_v4.pdb/.cif) and its PAE sidecar from https://alphafold.ebi.ac.uk into one folder, saving them under the names encode_pdb/encode_pae/get_dssp already resolve, so a single call populates the folder the encoders consume, with no glue. It is the fetch_ (web) acquisition verb, the structure-side analog of AnnotationPreprocessor.fetch_uniprot.
It returns a per-entry status DataFrame (entry, model_ok, pae_ok, alphafold_ok, skipped, model_path, pae_path). A 404 (accession not in AF-DB, or fragmented F2+ proteins) is the soft failure governed by on_failure; other network errors raise RuntimeError.
Requires aaanalysis[pro] and network access; not executed here.

    import aaanalysis as aa
    # df_seq has an 'entry' column of UniProt accessions.
    # stp = aa.StructurePreprocessor()
    # df_status = stp.fetch_alphafold(df_seq=df_seq, out_folder='af_files/')
    # # then encode straight from the populated folder:
    # dict_pdb = stp.encode_pdb(df_seq=df_seq, pdb_folder='af_files/',
    #                           features=['plddt', 'bfactor'])
    # dict_pae = stp.encode_pae(df_seq=df_seq, pae_folder='af_files/',
    #                           features=['pae_row_mean'])

Further parameters.
StructurePreprocessor.fetch_alphafold also accepts:
file_format – Structure file type to download;
timeout – Per-request timeout in seconds;
skip_existing – If True, an entry whose model and PAE files both already exist in out_folder is not re-downloaded; an entry with only one of the two present re-fetches only the missing file;
return_df – If True, also return an echo of df_seq with an appended boolean alphafold_ok column as a second element;
max_workers – Number of threads for concurrent downloads; None/1 (default) fetches sequentially, >1 downloads on a thread pool with input-order, byte-identical results (opt-in: parallel AlphaFold-DB requests risk HTTP-429 throttling).
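The skip_existing rule (skip when both files exist, re-fetch only the missing one otherwise) can be sketched with pathlib. This is a minimal illustration, not the library's implementation: files_to_fetch is a hypothetical helper, and the file names are placeholders rather than the canonical AlphaFold-DB names.

```python
from pathlib import Path
import tempfile


def files_to_fetch(model_path, pae_path, skip_existing=True):
    """Return the subset of the two target paths that still need a download."""
    targets = [Path(model_path), Path(pae_path)]
    if not skip_existing:
        # Re-download both files regardless of what is on disk.
        return targets
    return [p for p in targets if not p.exists()]


with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model.pdb"  # placeholder name, not the canonical one
    pae = Path(tmp) / "pae.json"     # placeholder name, not the canonical one
    model.write_text("")             # pretend the model was fetched earlier
    only_pae = files_to_fetch(model, pae)
    both = files_to_fetch(model, pae, skip_existing=False)
    pae.write_text("")               # now both files are present
    nothing = files_to_fetch(model, pae)
```

With only the model present, the PAE file alone is scheduled; once both exist, the entry is skipped entirely.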