AnnotationPreprocessor.to_df_seq
- AnnotationPreprocessor.to_df_seq(df_seq, df_annot, feature_type, match_residue_type=True, exclude_other_annotations=True, pos_col=None, aa_context_col='aa_context')
Project annotations onto df_seq for AAWindowSampler negative sampling.

Builds a df_seq copy with (1) a positives column listing the 1-based positions annotated with feature_type (the test anchors) and (2) an aa_context per-residue eligibility mask where '1' marks an eligible reference anchor and '0' everything excluded, so AAWindowSampler can draw residue-type-matched references.

Eligibility rules:

- The feature_type positives are always excluded from the reference pool (they are the test anchors).
- exclude_other_annotations=True (default) additionally excludes any residue carrying a different feature_type. This keeps the reference set from being contaminated by, e.g., glyco-Ser when profiling phospho-Ser (which would inflate the score).
- match_residue_type=True (default) restricts eligible anchors to the amino-acid types observed among the feature_type positives across df_annot (e.g. {S, T, Y} for phospho): the residue-type-matched negative. Set False for residue-type-agnostic classes (e.g. predictor hotspots): the reference is then any non-annotated residue.
Added in version 1.1.0.
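The eligibility rules above can be sketched in plain Python for a single sequence. This is an illustrative reimplementation, not the library's code: eligibility_mask and its other_annotated argument are hypothetical names, and the allowed residue types are taken from this one sequence only, whereas to_df_seq collects them across all of df_annot.

```python
def eligibility_mask(sequence, positives, other_annotated=(),
                     match_residue_type=True, exclude_other_annotations=True):
    """Return a '1'/'0' string with one character per residue (illustrative)."""
    pos_set = set(positives)
    # Residues carrying a different feature type (1-based positions)
    other_set = set(other_annotated) if exclude_other_annotations else set()
    # Residue types observed among the positives (1-based -> 0-based index)
    allowed = {sequence[p - 1] for p in pos_set}
    flags = []
    for i, aa in enumerate(sequence, start=1):
        # Positives are always excluded; other annotations only on request
        eligible = i not in pos_set and i not in other_set
        if match_residue_type:
            eligible = eligible and aa in allowed
        flags.append("1" if eligible else "0")
    return "".join(flags)


seq = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
# Positives D3 and S16: the only other D or S in the sequence is D23
mask = eligibility_mask(seq, positives=[3, 16])
print(mask)
```

With match_residue_type=False the same call marks every residue except the two positives as eligible.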
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. The coordinate frame the annotation positions are projected onto.
df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).
feature_type (str) – The registry key whose annotated residues become positives.
match_residue_type (bool, default=True) – Restrict eligible reference anchors to the residue types of the positives (residue-type-matched negative).
exclude_other_annotations (bool, default=True) – Exclude residues carrying any other feature_type from the reference pool.
pos_col (str, optional) – Name of the emitted positives column; defaults to ut.COL_POS.
aa_context_col (str, default='aa_context') – Name of the emitted per-residue eligibility-mask column.
- Returns:
df_seq_out – A copy of df_seq with the pos_col (list[int]) and aa_context_col (str of '1'/'0', one char per residue) columns appended.
- Return type:
pd.DataFrame
- Raises:
ValueError – On invalid arguments, missing schema columns, or a target-column name collision with an existing df_seq column.
Warning
- UserWarning
If no residue is annotated with feature_type in df_annot.
Notes
Feed the result to AAWindowSampler with context_in='1' to keep only eligible reference anchors:

    df_ws = ap.to_df_seq(df_seq, df_annot, feature_type="phospho")
    df_ref = aa.AAWindowSampler().sample_same_protein(
        df_seq=df_ws, pos_col="pos", window_size=9,
        aa_context_col="aa_context", context_in="1",
        min_distance_to_pos=9)
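To see which residues a '1' filter would keep, the mask can be decoded back into 1-based positions. A minimal sketch in plain Python; eligible_positions is a hypothetical helper and not part of aaanalysis, and it only reads the mask, it does not reproduce the sampler's distance filtering.

```python
def eligible_positions(aa_context):
    """Return the 1-based positions flagged '1' in an eligibility mask."""
    return [i for i, flag in enumerate(aa_context, start=1) if flag == "1"]


# Mask of a 30-residue sequence with a single eligible residue
example_mask = "0" * 22 + "1" + "0" * 7
positions = eligible_positions(example_mask)
print(positions)  # [23]
```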
Examples
to_df_seq exports a chosen feature_type as AAWindowSampler anchors: a df_seq with a pos column of 1-based annotated positions plus an aa_context eligibility mask. match_residue_type=True marks only same-residue-type positions as eligible ('1') for residue-type-matched negative sampling. This is the seq-mode window-split path, where an annotation is the window label.

    import warnings
    import numpy as np
    import pandas as pd
    import aaanalysis as aa
    import aaanalysis.utils as ut

    aa.options['verbose'] = False
    warnings.filterwarnings('ignore')

    ap = aa.AnnotationPreprocessor(verbose=False)
    df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                           'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
    # A small user/predictor table -> Functional sites (open vocabulary).
    df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                            ut.COL_START: [3, 16],
                            ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                            ut.COL_SCORE: [0.92, 0.40]})
    df_annot = ap.ingest(df_user)
    df_pos = ap.to_df_seq(df_seq=df_seq, df_annot=df_annot, feature_type='hotspot')
    df_pos[['entry', 'pos', 'aa_context']]

         entry      pos                      aa_context
    0  AF_TINY  [3, 16]  000000000000000000000010000000

Further parameters. AnnotationPreprocessor.to_df_seq also accepts:

- exclude_other_annotations: Exclude residues carrying any other feature_type from the reference pool.
- pos_col: Name of the emitted positives column; defaults to ut.COL_POS.
- aa_context_col: Name of the emitted per-residue eligibility-mask column.