aaanalysis.AnnotationPreprocessor.to_df_seq
- AnnotationPreprocessor.to_df_seq(df_seq=None, df_annot=None, feature_type=None, match_residue_type=True, exclude_other_annotations=True, pos_col=None, aa_context_col='aa_context')
Project annotations onto df_seq for AAWindowSampler negative sampling.
Builds a df_seq copy with (1) a positives column listing the 1-based positions annotated with feature_type (the test anchors) and (2) an aa_context per-residue eligibility mask, where '1' marks an eligible reference anchor and '0' marks every excluded residue, so AAWindowSampler can draw residue-type-matched references.
Eligibility rules:
- The feature_type positives are always excluded from the reference pool (they are the test anchors).
- exclude_other_annotations=True (default) additionally excludes any residue carrying a different feature_type. This keeps the reference set from being contaminated by, e.g., glyco-Ser when profiling phospho-Ser (which would inflate the score).
- match_residue_type=True (default) restricts eligible anchors to the amino-acid types observed among the feature_type positives across df_annot (e.g. {S, T, Y} for phospho), i.e. the residue-type-matched negative. Set it to False for residue-type-agnostic classes (e.g. predictor hotspots): the reference is then any non-annotated residue.
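The eligibility rules above can be sketched in plain Python for a single sequence. This is an illustrative sketch only, not the library's implementation: the function name eligibility_mask and the (position, feature_type) tuple format are assumptions made for the example, and the residue types are collected from one sequence here, whereas the method is documented to pool them across df_annot.

```python
def eligibility_mask(sequence, annotations, feature_type,
                     match_residue_type=True,
                     exclude_other_annotations=True):
    """Hypothetical helper: return (positives, mask) for one sequence.

    annotations is an iterable of (position_1_based, feature_type) tuples,
    an assumed format chosen for this sketch only.
    """
    positives = sorted({p for p, ft in annotations if ft == feature_type})
    others = {p for p, ft in annotations if ft != feature_type}
    # Residue types seen among the positives (single-sequence simplification).
    allowed = {sequence[p - 1] for p in positives}

    mask = []
    for pos, aa in enumerate(sequence, start=1):
        eligible = pos not in positives          # rule 1: never a positive
        if exclude_other_annotations and pos in others:
            eligible = False                     # rule 2: other annotations
        if match_residue_type and aa not in allowed:
            eligible = False                     # rule 3: residue-type match
        mask.append("1" if eligible else "0")
    return positives, "".join(mask)


seq = "MSTAYSKT"
annot = [(2, "phospho"), (3, "phospho"), (6, "glyco")]

print(eligibility_mask(seq, annot, "phospho"))
# ([2, 3], '00000001')
print(eligibility_mask(seq, annot, "phospho", exclude_other_annotations=False))
# ([2, 3], '00000101')
print(eligibility_mask(seq, annot, "phospho", match_residue_type=False))
# ([2, 3], '10011011')
```

With the defaults, only the unannotated Thr at position 8 remains eligible: the glyco-Ser at position 6 is dropped by the second rule, and every residue other than S or T is dropped by the third.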
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. This is the coordinate frame the annotation positions are projected onto.
df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).
feature_type (str) – The registry key whose annotated residues become positives.
match_residue_type (bool, default=True) – Restrict eligible reference anchors to the residue types of the positives (residue-type-matched negative).
exclude_other_annotations (bool, default=True) – Exclude residues carrying any other feature_type from the reference pool.
pos_col (str, optional) – Name of the emitted positives column; defaults to ut.COL_POS.
aa_context_col (str, default='aa_context') – Name of the emitted per-residue eligibility-mask column.
- Returns:
df_seq_out – A copy of df_seq with the pos_col (list[int]) and aa_context_col (str of '1'/'0', one char per residue) columns appended.
- Return type:
pd.DataFrame
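To inspect a returned mask, the '1'/'0' string can be decoded back into 1-based positions. The helper below is a hypothetical convenience written for this example, not part of aaanalysis; it relies only on the documented format of one character per residue.

```python
def eligible_positions(sequence, aa_context):
    """Hypothetical helper: list (position_1_based, residue) pairs flagged '1'."""
    if len(sequence) != len(aa_context):
        raise ValueError("mask must have one character per residue")
    return [(pos, aa)
            for pos, (aa, flag) in enumerate(zip(sequence, aa_context), start=1)
            if flag == "1"]


print(eligible_positions("MSTAYSKT", "00000101"))
# [(6, 'S'), (8, 'T')]
```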
- Raises:
ValueError – On invalid arguments, missing schema columns, or a target-column name collision with an existing df_seq column.
Warning
- UserWarning
If no residue is annotated with feature_type in df_annot.
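A caller who wants to detect this case programmatically can record warnings with the standard-library warnings module. In the sketch below, project_stub is a hypothetical stand-in that only mimics the documented behaviour (a UserWarning when nothing is annotated); it is not the aaanalysis API, and its message text is made up.

```python
import warnings


def project_stub(n_positives):
    """Hypothetical stand-in: warns when no positives exist, as documented."""
    if n_positives == 0:
        warnings.warn("no residue annotated with feature_type", UserWarning)
    return n_positives


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")   # make sure the warning is not suppressed
    project_stub(0)

# True if the call emitted at least one UserWarning
empty_feature = any(issubclass(w.category, UserWarning) for w in caught)
print(empty_feature)
# True
```

In real use, the call inside the with block would be the to_df_seq call itself, and a True flag is a cue to check the feature_type key before sampling.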
Notes
Feed the result to AAWindowSampler with context_in='1' to keep only eligible reference anchors:
    import aaanalysis as aa
    # ap is an AnnotationPreprocessor instance; df_seq and df_annot are as described under Parameters
    df_ws = ap.to_df_seq(df_seq, df_annot, feature_type="phospho")
    df_ref = aa.AAWindowSampler().sample_same_protein(
        df_seq=df_ws, pos_col="pos", window_size=9,
        aa_context_col="aa_context", context_in="1", min_distance_to_pos=9)