AnnotationPreprocessor.fetch_uniprot
- AnnotationPreprocessor.fetch_uniprot(df_seq, features=None, evidence='manual', timeout=30.0, max_workers=None)
Fetch UniProt features for every entry and map to df_annot.
Queries the UniProt REST API for each protein accession in df_seq and maps the returned post-translational modification (PTM) and site annotations into the canonical df_annot schema, ready to be passed to encode(). Evidence can be filtered to retain only experimentally confirmed or manually curated entries.
Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers (UniProt accessions). The entry values are used as the UniProtKB accessions to fetch.
features (list of str, optional) – Registry keys to keep (e.g. ['phospho', 'disulfide']). None keeps every built-in key.
evidence ({'experimental', 'manual', 'all'}, default='manual') – Evidence allow-set. 'experimental' keeps only ECO:0000269; 'manual' also keeps ECO:0007744 (combinatorial, manual); 'all' disables evidence filtering. Raw ECO codes are retained in the evidence column regardless.
timeout (float, default=30.0) – Per-request timeout in seconds.
max_workers (int, optional) – Number of threads for concurrent fetches. None or 1 (default) fetches entries sequentially. Greater than 1 fetches on a thread pool; rows are concatenated in input order and the df_annot is identical to the sequential result. Concurrency is opt-in because parallel requests to UniProt risk HTTP-429 throttling that can turn successful fetches into failures.
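The evidence allow-sets described above can be pictured as a lookup from mode to permitted ECO codes. The following is a minimal sketch, not the library's implementation: EVIDENCE_ALLOW and filter_by_evidence are hypothetical names, and the rows are hand-written sample records rather than real UniProt output.

```python
# Hypothetical illustration of the evidence allow-sets; not aaanalysis internals.
EVIDENCE_ALLOW = {
    "experimental": {"ECO:0000269"},
    "manual": {"ECO:0000269", "ECO:0007744"},
    "all": None,  # None means: do not filter on evidence at all
}


def filter_by_evidence(rows, evidence="manual"):
    """Keep only rows whose ECO code is allowed for the chosen mode."""
    if evidence not in EVIDENCE_ALLOW:
        raise ValueError(f"evidence must be one of {sorted(EVIDENCE_ALLOW)}")
    allowed = EVIDENCE_ALLOW[evidence]
    if allowed is None:
        return list(rows)
    # The raw ECO code stays on each surviving row, as in the evidence column.
    return [row for row in rows if row["evidence"] in allowed]


# Hand-written sample records (assumed shape, subset of df_annot columns).
rows = [
    {"protein_id": "P12345", "feature_type": "phospho", "evidence": "ECO:0000269"},
    {"protein_id": "P12345", "feature_type": "phospho", "evidence": "ECO:0007744"},
    {"protein_id": "P12345", "feature_type": "disulfide", "evidence": "ECO:0000250"},
]

for mode in ("experimental", "manual", "all"):
    kept = filter_by_evidence(rows, evidence=mode)
    print(mode, [row["evidence"] for row in kept])
```

Under these assumptions, the by-similarity code ECO:0000250 survives only with evidence='all'.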
- Returns:
df_annot – Canonical per-residue annotation schema with columns protein_id, start, end, aa, feature_type, category, source, evidence, score, bond_id (positions are 1-based, UniProt-canonical frame).
- Return type:
pd.DataFrame
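Because start and end are 1-based, using them to slice a Python string needs an offset. A small sketch, assuming end is inclusive (as in UniProt feature locations); residues_at is a hypothetical helper and the sequence is made up for illustration.

```python
def residues_at(sequence, start, end):
    """Return the residues covered by a 1-based, inclusive [start, end] span."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError("span must satisfy 1 <= start <= end <= len(sequence)")
    # Shift only the start: Python slices are 0-based and end-exclusive.
    return sequence[start - 1:end]


seq = "MKTAYIAKQR"  # made-up sequence
print(residues_at(seq, 3, 3))  # single-residue annotation -> T
print(residues_at(seq, 2, 4))  # three-residue span -> KTA
```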
- Raises:
ValueError – On invalid arguments.
RuntimeError – On UniProt network / response failure.
Examples
fetch_uniprot queries the UniProt REST API per entry and maps the features array into the canonical df_annot schema: bond features (disulfide / cross-link) expand to two endpoints sharing a bond_id; signal / propeptide / transit cleavage P1 anchors come from the processing-span ends; SITE is description-routed. evidence='manual' (default) keeps experimental ECO:0000269 + combinatorial ECO:0007744, dropping by-similarity ECO:0000250.
Requires aaanalysis[pro] (requests) and network access; not executed here.
    import aaanalysis as aa
    # ap = aa.AnnotationPreprocessor()
    # df_annot = ap.fetch_uniprot(df_seq=df_seq,
    #                             features=['phospho', 'disulfide', 'binding'],
    #                             evidence='manual')
    # # df_annot then feeds ap.encode(df_seq=df_seq, df_annot=df_annot, ...)
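The bond expansion described above, where one disulfide or cross-link feature becomes two endpoint rows sharing a bond_id, might look roughly like the sketch below. It is an illustration under assumptions: expand_bond and the bond_id format are hypothetical, the input is a hand-written record, and only a subset of the df_annot columns is shown.

```python
def expand_bond(protein_id, sequence, start, end, feature_type, bond_index):
    """Turn one bond spanning start..end into two single-residue endpoint rows."""
    # Hypothetical identifier format; the real bond_id scheme is not documented here.
    bond_id = f"{protein_id}_{feature_type}_{bond_index}"
    rows = []
    for pos in (start, end):
        rows.append({
            "protein_id": protein_id,
            "start": pos,               # 1-based position of this endpoint
            "end": pos,
            "aa": sequence[pos - 1],    # residue at the endpoint
            "feature_type": feature_type,
            "bond_id": bond_id,         # shared by both endpoints
        })
    return rows


# Made-up sequence with cysteines at positions 2 and 6.
endpoints = expand_bond("P12345", "ACDEFCGH", start=2, end=6,
                        feature_type="disulfide", bond_index=1)
for row in endpoints:
    print(row["start"], row["aa"], row["bond_id"])
```

Grouping the resulting rows by bond_id recovers the pair of residues that form each bond.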
Further parameters.
AnnotationPreprocessor.fetch_uniprot also accepts:
timeout – Per-request timeout in seconds.
max_workers – Number of threads for concurrent fetches; None / 1 (default) fetches sequentially, >1 fetches on a thread pool with input-order, byte-identical results (opt-in: parallel UniProt requests risk HTTP-429 throttling).
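One common way to obtain input-order results from a thread pool is concurrent.futures.ThreadPoolExecutor.map, which yields results in the order of its inputs regardless of which task finishes first. The sketch below assumes that approach purely for illustration; it does not claim to show how fetch_uniprot is implemented. fetch_one is a hypothetical stand-in that sleeps instead of calling UniProt, and fetch_all is likewise not part of aaanalysis.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Simulated latencies: the first entry is the slowest, so completion order
# differs from input order when the calls run concurrently.
DELAYS = {"A": 0.03, "B": 0.02, "C": 0.01}


def fetch_one(entry):
    """Stand-in for one HTTP request; returns a single annotation row."""
    time.sleep(DELAYS[entry])
    return {"protein_id": entry}


def fetch_all(entries, max_workers=None):
    """Fetch sequentially by default, or on a thread pool when max_workers > 1."""
    if max_workers is None or max_workers <= 1:
        return [fetch_one(entry) for entry in entries]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map preserves input order in its results.
        return list(pool.map(fetch_one, entries))


entries = ["A", "B", "C"]
sequential = fetch_all(entries)
parallel = fetch_all(entries, max_workers=3)
print(sequential == parallel)
```

With a real API behind fetch_one, a small max_workers keeps the request rate modest, which matters given the HTTP-429 throttling risk noted above.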