AnnotationPreprocessor.fetch_uniprot

AnnotationPreprocessor.fetch_uniprot(df_seq, features=None, evidence='manual', timeout=30.0, max_workers=None)

Fetch UniProt features for every entry and map them to df_annot.

Queries the UniProt REST API for each protein accession in df_seq and maps the returned post-translational modification (PTM) and site annotations into the canonical df_annot schema, ready to be passed to encode(). Evidence can be filtered to retain only experimentally confirmed or manually curated entries.

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers (UniProt accessions). The entry values are used as the UniProtKB accessions to fetch.

  • features (list of str, optional) – Registry keys to keep (e.g. ['phospho', 'disulfide']). None keeps every built-in key.

  • evidence ({'experimental', 'manual', 'all'}, default='manual') – Evidence allow-set. 'experimental' keeps only ECO:0000269 (experimental evidence); 'manual' additionally keeps ECO:0007744 (combinatorial evidence used in manual assertion); 'all' disables evidence filtering. Raw ECO codes are retained in the evidence column regardless of the setting.

  • timeout (float, default=30.0) – Per-request timeout in seconds.

  • max_workers (int, optional) – Number of threads for concurrent fetches. None (default) or 1 fetches entries sequentially. Values greater than 1 fetch on a thread pool; rows are still concatenated in input order, so the resulting df_annot is identical to the sequential result. Concurrency is opt-in because parallel requests to UniProt risk HTTP-429 throttling, which can turn otherwise successful fetches into failures.
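
The evidence allow-sets described above can be illustrated with a minimal sketch. This is not the library's implementation: the EVIDENCE_ALLOW mapping, the filter_by_evidence helper, the plain-dict row layout, and the storage of ECO codes as a list per row are all assumptions made for illustration.

```python
# Assumed allow-sets, taken from the parameter description; None disables filtering.
EVIDENCE_ALLOW = {
    "experimental": {"ECO:0000269"},
    "manual": {"ECO:0000269", "ECO:0007744"},
    "all": None,
}


def filter_by_evidence(rows, evidence="manual"):
    """Keep rows that carry at least one ECO code from the allow-set."""
    if evidence not in EVIDENCE_ALLOW:
        raise ValueError(f"evidence must be one of {sorted(EVIDENCE_ALLOW)}")
    allowed = EVIDENCE_ALLOW[evidence]
    if allowed is None:
        return list(rows)
    # Assumption: each row stores its raw ECO codes as a list of strings.
    return [row for row in rows if allowed.intersection(row["evidence"])]


# Made-up annotation rows for illustration only.
demo_rows = [
    {"protein_id": "X1", "feature_type": "phospho", "evidence": ["ECO:0000269"]},
    {"protein_id": "X1", "feature_type": "disulfide", "evidence": ["ECO:0000250"]},
    {"protein_id": "X2", "feature_type": "phospho", "evidence": ["ECO:0007744"]},
]

kept = filter_by_evidence(demo_rows, evidence="manual")
print([row["evidence"] for row in kept])
```

In this sketch the by-similarity row (ECO:0000250) is the only one dropped under 'manual', while 'all' returns every row unchanged.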

Returns:

df_annot – Canonical per-residue annotation schema with columns protein_id, start, end, aa, feature_type, category, source, evidence, score, bond_id (positions are 1-based, UniProt-canonical frame).

Return type:

pd.DataFrame

Raises:

Examples

fetch_uniprot queries the UniProt REST API once per entry and maps the returned features array into the canonical df_annot schema:

  • Bond features (disulfide / cross-link) expand to two endpoint rows that share a bond_id.

  • Signal, propeptide, and transit-peptide cleavage P1 anchors are taken from the ends of the processing spans.

  • SITE features are routed by their description.

With evidence='manual' (the default), experimental (ECO:0000269) and combinatorial (ECO:0007744) annotations are kept, and by-similarity annotations (ECO:0000250) are dropped.

Requires the aaanalysis[pro] extra (requests) and network access; the call below is therefore commented out and not executed here.

import aaanalysis as aa

# ap = aa.AnnotationPreprocessor()
# df_annot = ap.fetch_uniprot(df_seq=df_seq,
#     features=['phospho', 'disulfide', 'binding'],
#     evidence='manual')
# # df_annot then feeds ap.encode(df_seq=df_seq, df_annot=df_annot, ...)
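
The bond expansion mentioned in the description above (one bond feature becoming two endpoint rows with a shared bond_id) might look roughly like the following sketch. The expand_bond helper, the bond_id format, and the reduced row layout are hypothetical, and the nested location structure is an assumption modelled on UniProt's JSON feature records rather than taken from this page.

```python
def expand_bond(protein_id, feature, bond_index):
    """Turn one bond feature into two single-residue rows sharing a bond_id."""
    start = feature["location"]["start"]["value"]
    end = feature["location"]["end"]["value"]
    # Hypothetical bond_id format; the library's real identifier may differ.
    bond_id = f"{protein_id}_bond{bond_index}"
    return [
        {
            "protein_id": protein_id,
            "start": pos,  # 1-based position of one bonded residue
            "end": pos,
            "feature_type": "disulfide",
            "bond_id": bond_id,
        }
        for pos in (start, end)
    ]


# Made-up feature record for illustration only.
demo_feature = {
    "type": "Disulfide bond",
    "location": {"start": {"value": 6}, "end": {"value": 127}},
}

bond_rows = expand_bond("X1", demo_feature, bond_index=0)
for row in bond_rows:
    print(row["start"], row["bond_id"])
```

Keeping one row per bonded residue, linked by bond_id, lets a per-residue schema represent a pairwise feature without a dedicated partner column.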

Further parameters. AnnotationPreprocessor.fetch_uniprot also accepts:

  • timeout – Per-request timeout in seconds.

  • max_workers – Number of threads for concurrent fetches. None or 1 (default) fetches sequentially; values greater than 1 fetch on a thread pool and return byte-identical results in input order. Concurrency is opt-in because parallel UniProt requests risk HTTP-429 throttling.
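
One common way to obtain input-order results from a thread pool is Executor.map from the standard library, which yields results in the order of its inputs rather than in completion order. The sketch below assumes that approach; fetch_one is a hypothetical stand-in for a single per-entry request, and nothing here is taken from the library's source.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Made-up identifiers and delays; the first entry is deliberately the slowest.
DELAYS = {"X1": 0.05, "X2": 0.01, "X3": 0.0}


def fetch_one(entry):
    """Hypothetical stand-in for one per-entry request; returns annotation rows."""
    time.sleep(DELAYS[entry])
    return [{"protein_id": entry, "start": 1, "end": 1}]


def fetch_all(entries, max_workers=None):
    """Fetch entries sequentially or on a thread pool, preserving input order."""
    if max_workers is None or max_workers <= 1:
        per_entry = [fetch_one(entry) for entry in entries]
    else:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            # Executor.map yields results in input order, not completion order.
            per_entry = list(pool.map(fetch_one, entries))
    # Flatten the per-entry row lists in the order the entries were given.
    return [row for rows in per_entry for row in rows]


entries = ["X1", "X2", "X3"]
sequential = fetch_all(entries)
threaded = fetch_all(entries, max_workers=3)
print(sequential == threaded)
```

Even though the slowest request finishes last in the threaded run, its rows still come first, which is what makes the two results comparable row for row.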