aaanalysis.AnnotationPreprocessor.to_df_seq

AnnotationPreprocessor.to_df_seq(df_seq=None, df_annot=None, feature_type=None, match_residue_type=True, exclude_other_annotations=True, pos_col=None, aa_context_col='aa_context')

Project annotations onto df_seq for AAWindowSampler negative sampling.

Builds a copy of df_seq with two added columns: (1) a positives column listing the 1-based positions annotated with feature_type (the test anchors), and (2) an aa_context per-residue eligibility mask in which '1' marks an eligible reference anchor and '0' marks an excluded residue. AAWindowSampler can then draw residue-type-matched references from the eligible positions.

Eligibility rules:

  • The feature_type positives are always excluded from the reference pool (they are the test anchors).

  • exclude_other_annotations=True (default) additionally excludes any residue carrying a different feature_type. This keeps the reference set from being contaminated by, e.g., glyco-Ser when profiling phospho-Ser, which would inflate the score.

  • match_residue_type=True (default) restricts eligible anchors to the amino acid types observed among the feature_type positives across df_annot (e.g. {S, T, Y} for phospho), giving a residue-type-matched negative set. Set it to False for residue-type-agnostic classes (e.g. predictor hotspots); the reference pool is then any non-annotated residue.
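The three rules above can be illustrated for a single sequence with a minimal pure-Python sketch. This is not the library's implementation: the helper eligibility_mask and its arguments are hypothetical, and when matched_types is not given it derives the residue types from this one sequence only, whereas the documented behavior pools them across all of df_annot.

```python
def eligibility_mask(sequence, positives, other_positions=(),
                     match_residue_type=True, exclude_other_annotations=True,
                     matched_types=None):
    """Return a '1'/'0' string with one character per residue (hypothetical helper)."""
    pos_set = set(positives)
    other_set = set(other_positions)
    if match_residue_type and matched_types is None:
        # Simplification: use the residue types of this sequence's positives only.
        matched_types = {sequence[p - 1] for p in pos_set}
    mask = []
    for i, residue in enumerate(sequence, start=1):  # positions are 1-based
        eligible = i not in pos_set                  # positives are never references
        if exclude_other_annotations and i in other_set:
            eligible = False                         # carries a different feature_type
        if match_residue_type and residue not in matched_types:
            eligible = False                         # wrong residue type
        mask.append("1" if eligible else "0")
    return "".join(mask)


seq = "MSTSAYKS"
# Position 2 (S) is the positive; position 4 (S) carries another annotation.
mask_default = eligibility_mask(seq, positives=[2], other_positions=[4])
mask_agnostic = eligibility_mask(seq, positives=[2], other_positions=[4],
                                 match_residue_type=False)
```

With the defaults, only the unannotated Ser at position 8 remains eligible. Turning off match_residue_type opens every residue that is neither a positive nor otherwise annotated.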

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. These sequences define the coordinate frame onto which the annotation positions are projected.

  • df_annot (pd.DataFrame) – Canonical per-residue annotation schema (from fetch_uniprot() or ingest()).

  • feature_type (str) – The registry key whose annotated residues become positives.

  • match_residue_type (bool, default=True) – Restrict eligible reference anchors to the residue types of the positives (residue-type-matched negative).

  • exclude_other_annotations (bool, default=True) – Exclude residues carrying any other feature_type from the reference pool.

  • pos_col (str, optional) – Name of the emitted positives column; defaults to ut.COL_POS.

  • aa_context_col (str, default='aa_context') – Name of the emitted per-residue eligibility-mask column.

Returns:

df_seq_out – A copy of df_seq with the pos_col (list[int]) and aa_context_col (str of '1'/'0', one char per residue) columns appended.

Return type:

pd.DataFrame
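The documented return contract implies a few invariants per row that are cheap to verify before sampling. The sketch below checks them for one row using plain Python values; check_projection_row is a hypothetical helper and not part of the library.

```python
def check_projection_row(sequence, positives, aa_context):
    """Validate one row of the projected output and return the number of eligible anchors."""
    if len(aa_context) != len(sequence):
        raise ValueError("aa_context must have one character per residue")
    if set(aa_context) - {"0", "1"}:
        raise ValueError("aa_context may contain only '0' and '1'")
    for p in positives:
        if not 1 <= p <= len(sequence):
            raise ValueError(f"position {p} is outside the sequence (1-based)")
        if aa_context[p - 1] != "0":
            raise ValueError(f"positive position {p} must be excluded from the reference pool")
    return aa_context.count("1")


n_eligible = check_projection_row("MSTSAYKS", positives=[2], aa_context="00000001")
```

A row with zero eligible anchors is still valid under these checks, but it cannot contribute references, so the returned count is worth inspecting.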

Raises:

ValueError – On invalid arguments, missing schema columns, or a target-column name collision with an existing df_seq column.

Warns:

UserWarning – If no residue is annotated with feature_type in df_annot.
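Because an unmatched feature_type produces a warning rather than an exception, a pipeline may want to capture it explicitly. The sketch below shows one way to do that with the standard warnings module. to_df_seq_stub is a hypothetical stand-in that mimics only the documented warning, and call_and_collect_warnings is an illustrative helper; neither belongs to the library.

```python
import warnings


def to_df_seq_stub(n_positives):
    """Hypothetical stand-in: warns when no residue carries the feature type."""
    if n_positives == 0:
        warnings.warn("No residue is annotated with feature_type", UserWarning)
    return n_positives


def call_and_collect_warnings(func, *args, **kwargs):
    """Call func and return (result, list of UserWarning messages it emitted)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure repeated warnings are recorded
        result = func(*args, **kwargs)
    messages = [str(w.message) for w in caught
                if issubclass(w.category, UserWarning)]
    return result, messages


result, messages = call_and_collect_warnings(to_df_seq_stub, 0)
```

In real use, the same wrapper could be applied to the to_df_seq call so that an empty positives set is logged or turned into an error, depending on what the pipeline needs.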

Notes

Feed the result to AAWindowSampler with context_in='1' to keep only eligible reference anchors:

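import aaanalysis as aa

# Assumes df_seq and df_annot are loaded and the constructor needs no arguments
ap = aa.AnnotationPreprocessor()
df_ws = ap.to_df_seq(df_seq, df_annot, feature_type="phospho")
# pos_col="pos" assumes the default positives column name (ut.COL_POS) is "pos"
df_ref = aa.AAWindowSampler().sample_same_protein(
    df_seq=df_ws, pos_col="pos", window_size=9,
    aa_context_col="aa_context", context_in="1",
    min_distance_to_pos=9)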