AnnotationPreprocessor.to_df_seq

AnnotationPreprocessor.to_df_seq(df_seq, df_annot, feature_type, match_residue_type=True, exclude_other_annotations=True, pos_col=None, aa_context_col='aa_context')

Project annotations onto df_seq for AAWindowSampler negative sampling.

Builds a copy of df_seq with two added columns: (1) a positives column listing the 1-based positions annotated with feature_type (the test anchors), and (2) an aa_context per-residue eligibility mask in which '1' marks an eligible reference anchor and '0' marks an excluded residue. AAWindowSampler can then draw residue-type-matched references from the eligible positions.

Eligibility rules:

  • The feature_type positives are always excluded from the reference pool (they are the test anchors).

  • exclude_other_annotations=True (default) additionally excludes any residue carrying a different feature_type. This keeps the reference set from being contaminated by, e.g., glyco-Ser when profiling phospho-Ser (which would inflate the score).

  • match_residue_type=True (default) restricts eligible anchors to the amino-acid types observed among the feature_type positives across df_annot (e.g. {S, T, Y} for phospho), giving a residue-type-matched negative. Set it to False for residue-type-agnostic classes (e.g. predictor hotspots); the reference is then any non-annotated residue.
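The rules above can be illustrated with a minimal pure-Python sketch for a single sequence. It is not the library's implementation: the helper name eligibility_mask and its arguments (other_annotated, allowed_types) are hypothetical and only mirror the documented behavior.

```python
def eligibility_mask(sequence, positives, other_annotated=(),
                     allowed_types=None, exclude_other_annotations=True):
    """Return a '1'/'0' string with one character per residue (hypothetical helper)."""
    positives = set(positives)
    # Residues carrying a different feature_type only matter if exclusion is on.
    other = set(other_annotated) if exclude_other_annotations else set()
    mask = []
    for pos, residue in enumerate(sequence, start=1):  # positions are 1-based
        # Positives are always excluded from the reference pool.
        eligible = pos not in positives and pos not in other
        # Residue-type matching: keep only residue types seen among the positives.
        if allowed_types is not None:
            eligible = eligible and residue in allowed_types
        mask.append("1" if eligible else "0")
    return "".join(mask)


sequence = "ACDEFGHIKLMNPQRSTVWYACDEFGHIKL"
positives = [3, 16]  # D at position 3, S at position 16
allowed = {sequence[p - 1] for p in positives}

# Analogue of match_residue_type=True: only the second D (position 23) is eligible.
mask_matched = eligibility_mask(sequence, positives, allowed_types=allowed)
# Analogue of match_residue_type=False: every non-positive residue is eligible.
mask_agnostic = eligibility_mask(sequence, positives)
print(mask_matched)
print(mask_agnostic)
```

In this sketch the set of allowed residue types is taken from one protein only, whereas the documented method collects it across all of df_annot.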

Added in version 1.1.0.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. It defines the coordinate frame onto which the annotation positions are projected.

  • df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).

  • feature_type (str) – The registry key whose annotated residues become positives.

  • match_residue_type (bool, default=True) – Restrict eligible reference anchors to the residue types of the positives (residue-type-matched negative).

  • exclude_other_annotations (bool, default=True) – Exclude residues carrying any other feature_type from the reference pool.

  • pos_col (str, optional) – Name of the emitted positives column; defaults to ut.COL_POS.

  • aa_context_col (str, default='aa_context') – Name of the emitted per-residue eligibility-mask column.

Returns:

df_seq_out – A copy of df_seq with the pos_col (list[int]) and aa_context_col (str of '1'/'0', one char per residue) columns appended.

Return type:

pd.DataFrame
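Because the mask holds one character per residue, it can be mapped back to 1-based positions when inspecting the result. A small sketch, where eligible_positions is a hypothetical helper and not part of the library:

```python
def eligible_positions(aa_context):
    """List the 1-based positions flagged '1' in an eligibility mask."""
    return [i for i, flag in enumerate(aa_context, start=1) if flag == "1"]


# Mask taken from the example output on this page.
aa_context = "000000000000000000000010000000"
print(eligible_positions(aa_context))  # [23]
```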

Raises:

ValueError – On invalid arguments, missing schema columns, or a target-column name collision with an existing df_seq column.

Warning

UserWarning – If no residue is annotated with feature_type in df_annot.

Notes

Feed the result to AAWindowSampler with context_in='1' to keep only eligible reference anchors:

df_ws = ap.to_df_seq(df_seq, df_annot, feature_type="phospho")
df_ref = aa.AAWindowSampler().sample_same_protein(
    df_seq=df_ws, pos_col="pos", window_size=9,
    aa_context_col="aa_context", context_in="1",
    min_distance_to_pos=9)
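The effect of a minimum-distance constraint can be sketched in plain Python. This assumes that min_distance_to_pos requires a reference anchor to lie at least that many residues away from every positive; that reading is an assumption and is not confirmed by this page. The helper filter_by_distance and the candidate positions are hypothetical.

```python
def filter_by_distance(eligible, positives, min_distance):
    """Keep eligible anchors at least min_distance residues from every positive."""
    return [p for p in eligible
            if all(abs(p - q) >= min_distance for q in positives)]


eligible = [5, 12, 23, 28]  # hypothetical 1-based eligible anchors
positives = [3, 16]
kept = filter_by_distance(eligible, positives, min_distance=9)
print(kept)  # [28]
```

Under this assumption, a large minimum distance on a short sequence can leave few or no reference anchors.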

Examples

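to_df_seq exports a chosen feature_type as AAWindowSampler anchors: a df_seq with a pos column of 1-based annotated positions plus an aa_context eligibility mask. With match_residue_type=True, only positions whose residue type matches that of a positive are marked eligible ('1') for residue-type-matched negative sampling. This is the seq-mode window-split path, in which an annotation serves as the window label. In the example, the hotspots at positions 3 (D) and 16 (S) are the positives, so the only eligible reference anchor is the second D at position 23.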

import warnings
import numpy as np
import pandas as pd
import aaanalysis as aa
import aaanalysis.utils as ut
aa.options['verbose'] = False
warnings.filterwarnings('ignore')

ap = aa.AnnotationPreprocessor(verbose=False)
df_seq = pd.DataFrame({'entry': ['AF_TINY'],
                       'sequence': ['ACDEFGHIKLMNPQRSTVWYACDEFGHIKL']})
# A small user/predictor table -> Functional sites (open vocabulary).
df_user = pd.DataFrame({ut.COL_PROTEIN_ID: ['AF_TINY', 'AF_TINY'],
                        ut.COL_START: [3, 16],
                        ut.COL_FEATURE_TYPE: ['hotspot', 'hotspot'],
                        ut.COL_SCORE: [0.92, 0.40]})
df_annot = ap.ingest(df_user)

df_pos = ap.to_df_seq(df_seq=df_seq, df_annot=df_annot,
                      feature_type='hotspot')
df_pos[['entry', 'pos', 'aa_context']]
     entry      pos                      aa_context
0  AF_TINY  [3, 16]  000000000000000000000010000000

Further parameters. AnnotationPreprocessor.to_df_seq also accepts: exclude_other_annotations (exclude residues carrying any other feature_type from the reference pool), pos_col (name of the emitted positives column, which defaults to ut.COL_POS), and aa_context_col (name of the emitted per-residue eligibility-mask column).