aaanalysis.AAWindowSampler.sample_synthetic

AAWindowSampler.sample_synthetic(df_seq=None, n=100, window_size=9, generator='global_freq', pos_col=None, label_ref=0, role='Control', seed=None)

Generate synthetic control windows. The output always uses the 'segments' layout (output_mode='segments').

Synthetic windows have no source protein, so no per-residue view exists and output_mode is not exposed. Synthetic rows use entry_win = "synth_{i}", where i is a counter that restarts with every call. Concatenating the outputs of multiple sample_synthetic() calls may therefore produce colliding entry_win values; deduplicate on the window column instead.

Parameters:
  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.

  • n (int, default=100) – Maximum total number of synthetic windows. Fewer are returned (with a warning) if too few candidate windows pass the filters.

  • window_size (int, default=9) – Length of each synthetic window in residues.

  • generator (str, list/tuple of str, or dict, default='global_freq') –

    Synthesis recipe. Accepts three shapes:

    • str: a single built-in generator ('uniform', 'global_freq', 'position_specific', 'scrambled') or an AAontology preset name (see Notes).

    • list[str] or tuple[str, ...]: at least two distinct AAontology preset names. Their priors are combined into a multiplicative joint prior over the 20 canonical AAs (see Notes / [LiuDeber99]). Duplicate components are rejected.

    • dict[str, Real]: a custom frequency table. Keys are single-character symbols and values are non-negative probabilities summing to 1.0 (within 1e-6). Keys define the alphabet, so the sampler is not restricted to amino acids; any single-character symbols (e.g. nucleotides) work. Keys are case-sensitive ('A' and 'a' are distinct).

  • pos_col (str, optional) – Name of the df_seq column that holds, for each row, the 1-based positive positions. See Notes.

  • label_ref (int or float, default=0) – Label assigned to the synthetic rows.

  • role (str, default='Control') – Role tag stored in the output’s role column.

  • seed (int, optional) – Per-call seed; falls back to the class-level random_state.

Returns:

df_seq – Sampled windows in the segments schema; one row per window with the columns entry_win, entry, sequence, window, source_position, label, role, and strategy (see Examples).

Return type:

pd.DataFrame

Notes

df_seq is consumed differently by each generator:

  • generator='global_freq': source of empirical amino-acid frequencies across all sequences.

  • generator='position_specific' / 'scrambled': source of test windows extracted at the 1-based positions in pos_col. pos_col is required for these two generators.

  • generator='uniform', AAontology preset generators, list-mix generators, and custom dict generators: df_seq is not consumed for synthesis itself; pos_col is still optional and only used as the source of test windows for the max_similarity_to_test filter.

Built-in generators

  • 'uniform': each residue drawn uniformly from the 20 canonical AAs.

  • 'global_freq': residues drawn from the empirical AA frequency in df_seq.

  • 'position_specific': per-position frequency of the test windows.

  • 'scrambled': shuffle a randomly chosen test window.
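The four built-in recipes can be sketched with the standard library alone. This is a minimal illustration of the sampling ideas listed above, not the library's implementation; the function names (draw_uniform, draw_global_freq, draw_position_specific, draw_scrambled) are hypothetical, and restricting 'global_freq' to the symbols observed in the input is an assumption.

```python
import random
from collections import Counter

AA20 = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def draw_uniform(rng, window_size):
    """'uniform': every residue drawn uniformly from the canonical 20."""
    return "".join(rng.choice(AA20) for _ in range(window_size))


def draw_global_freq(rng, sequences, window_size):
    """'global_freq': residues drawn from the empirical frequency over all sequences."""
    counts = Counter("".join(sequences))
    symbols = sorted(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=window_size))


def draw_position_specific(rng, test_windows):
    """'position_specific': position i is drawn from column i of the test windows."""
    if len({len(w) for w in test_windows}) != 1:
        raise ValueError("test windows must share one length")
    size = len(test_windows[0])
    # Picking a random test window per column samples the per-position frequency.
    return "".join(rng.choice(test_windows)[i] for i in range(size))


def draw_scrambled(rng, test_windows):
    """'scrambled': shuffle the residues of one randomly chosen test window."""
    chars = list(rng.choice(test_windows))
    rng.shuffle(chars)
    return "".join(chars)


rng = random.Random(0)
toy_test_windows = ["ACDEF", "ACDEG", "KCDEF"]  # made-up windows for illustration
examples = {
    "uniform": draw_uniform(rng, 9),
    "global_freq": draw_global_freq(rng, ["AAAC", "CC"], 9),
    "position_specific": draw_position_specific(rng, toy_test_windows),
    "scrambled": draw_scrambled(rng, toy_test_windows),
}
```

The last two recipes need test windows as input, which is why pos_col is required for them.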

AAontology preset generators load a curated scale via aaanalysis.load_scales() and normalize its per-amino-acid values into a probability distribution. Composition presets are true AA-frequency distributions; conformation presets are normalized propensities used as physicochemically-biased priors.

Composition (3)

  • 'aa_composition': Dayhoff 1978a (canonical baseline)

  • 'aa_composition_surface': Fukuchi-Nishikawa 2001 (surface composition)

  • 'aa_composition_mp': Cedano 1997 (membrane proteins)

Conformation (7)

  • 'alpha_helix': Chou-Fasman 1978b

  • 'beta_sheet': Chou-Fasman 1978b

  • 'beta_strand': Lifson-Sander 1979

  • 'beta_turn': Chou-Fasman 1978b

  • 'coil': Nagano 1973

  • 'linker': George-Heringa 2003 (medium 6-14 AA)

  • 'pi_helix': Fodje-Al-Karadaghi 2002
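As a rough sketch of the normalization step for preset generators: non-negative per-residue scale values can be turned into a probability distribution by dividing by their sum. The exact normalization used by the library is not stated here, so sum-normalization is an assumption, normalize_scale is a hypothetical helper, and the three-residue scale is made up for readability.

```python
def normalize_scale(scale):
    """Turn non-negative per-residue scale values into probabilities (sum-normalization)."""
    if any(v < 0 for v in scale.values()):
        raise ValueError("scale values must be non-negative")
    total = sum(scale.values())
    if total <= 0:
        raise ValueError("scale must have positive total mass")
    return {aa: v / total for aa, v in scale.items()}


# Toy propensities for three residues (made-up numbers, not a real AAontology scale).
toy_scale = {"A": 1.0, "L": 0.75, "G": 0.25}
prior = normalize_scale(toy_scale)
```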

Mixed-prior generator (list of preset names) combines the per-AA probability vectors of the listed presets via element-wise product followed by renormalization, producing a Bayesian-style joint prior (e.g. generator=['aa_composition_mp', 'alpha_helix'] for a membrane-helix prior). Combining hydrophobicity-like composition with helicity-like conformation as the basis for transmembrane characterization is supported by [LiuDeber99].
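The element-wise product followed by renormalization can be written out in a few lines. The sketch below uses a hypothetical helper mix_priors and two made-up three-residue priors; the real presets cover all 20 canonical AAs.

```python
def mix_priors(*priors):
    """Multiply per-residue probabilities element-wise, then renormalize to sum 1."""
    if len(priors) < 2:
        raise ValueError("need at least two priors to mix")
    keys = list(priors[0])
    for p in priors[1:]:
        if set(p) != set(keys):
            raise ValueError("all priors must cover the same residues")
    joint = {k: 1.0 for k in keys}
    for p in priors:
        for k in keys:
            joint[k] *= p[k]
    total = sum(joint.values())
    if total == 0:
        raise ValueError("priors have no residue with non-zero joint mass")
    return {k: v / total for k, v in joint.items()}


# Made-up priors over three residues (illustrative values only).
composition = {"A": 0.5, "L": 0.25, "G": 0.25}
helix = {"A": 0.5, "L": 0.375, "G": 0.125}
joint = mix_priors(composition, helix)
```

Residues favored by both priors gain weight in the joint prior (here A), while residues disfavored by either one lose weight (here G).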

Custom alphabet generator (dict) lets the user supply an arbitrary character-to-frequency table. Keys must be single characters (any symbol), values must be non-negative and sum to 1.0. The sampler is then no longer restricted to amino acids.
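The constraints on a custom frequency table (single-character keys, non-negative values, sum of 1.0 within 1e-6) can be checked as follows. This is a sketch only: check_freq_table and draw_custom are hypothetical names, not part of aaanalysis, and the error messages are illustrative.

```python
import math
import random
from numbers import Real


def check_freq_table(table, tol=1e-6):
    """Validate a symbol -> probability table; raise ValueError on the first problem."""
    if not isinstance(table, dict) or not table:
        raise ValueError("frequency table must be a non-empty dict")
    for key, value in table.items():
        if not isinstance(key, str) or len(key) != 1:
            raise ValueError(f"key {key!r} is not a single character")
        if isinstance(value, bool) or not isinstance(value, Real):
            raise ValueError(f"value for {key!r} is not a real number")
        if not math.isfinite(value) or value < 0:
            raise ValueError(f"value for {key!r} must be finite and non-negative")
    total = sum(table.values())
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total}, expected 1.0")
    return table


def draw_custom(table, window_size, rng):
    """Draw one window whose symbols follow the given frequency table."""
    symbols = list(table)
    weights = [table[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=window_size))


dna = check_freq_table({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
window = draw_custom(dna, window_size=8, rng=random.Random(0))
```

Because keys are compared as plain characters, {"A": 0.5, "a": 0.5} is a valid two-symbol alphabet in this sketch.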

Single polymorphic generator: the three accepted shapes (a built-in or preset str, a list[str] for a multiplicative preset mix, and a dict[str, float] for a custom-alphabet frequency table) all answer the same conceptual question ("recipe for one window"), so a single parameter is preferred over three mutually exclusive named parameters. The dispatch-on-shape complexity is absorbed by the check_synth_generator validator.
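Dispatch on shape can be pictured as a small classifier over the argument's type. The function below is a hypothetical stand-in written for illustration; it is not the actual check_synth_generator, whose internals are not shown here.

```python
def classify_generator(generator):
    """Return 'name', 'mix', or 'custom' depending on the shape of the argument."""
    if isinstance(generator, str):
        return "name"  # built-in generator or preset name
    if isinstance(generator, dict):
        return "custom"  # custom symbol -> probability table
    if isinstance(generator, (list, tuple)):
        if not all(isinstance(g, str) for g in generator):
            raise TypeError("mix components must be strings")
        if len(generator) < 2:
            raise ValueError("a mix needs at least two preset names")
        if len(set(generator)) != len(generator):
            raise ValueError("duplicate mix components are not allowed")
        return "mix"  # multiplicative preset mix
    raise TypeError(f"unsupported generator type: {type(generator).__name__}")


shapes = [
    classify_generator("alpha_helix"),
    classify_generator(["aa_composition_mp", "alpha_helix"]),
    classify_generator({"A": 0.5, "C": 0.5}),
]
```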

Examples

Generate synthetic control windows from a configurable amino-acid prior. Useful as a third-distribution control alongside sample_same_protein (negatives) and sample_different_protein (unlabeled). Synthetic windows have no source protein, so output_mode is fixed to 'segments'.

import aaanalysis as aa
import pandas as pd
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2"],
    "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2,
                 "YWVTSRQPNMLKIHGFEDCA" * 2],
    "pos":      [[10], [15]],
})
sampler = aa.AAWindowSampler(random_state=0)

The first call shows the output schema. The eight-column segments schema is shared across all AAWindowSampler methods; synthetic rows use entry_win = synth_{i} and leave entry empty because there is no source protein.

df = sampler.sample_synthetic(df_seq=df_seq, n=5, window_size=9, seed=0)
aa.display_df(df=df, show_shape=True)
DataFrame shape: (5, 8)
   entry_win  entry  sequence  window     source_position  label  role     strategy
1  synth_0                     PGAATWPRM  -1               0      Control  synthetic:global_freq
2  synth_1                     WTAVAREVM  -1               0      Control  synthetic:global_freq
3  synth_2                     GKADQPPIY  -1               0      Control  synthetic:global_freq
4  synth_3                     YQQQIDRMH  -1               0      Control  synthetic:global_freq
5  synth_4                     LVWINHNHI  -1               0      Control  synthetic:global_freq

df_seq is consumed differently per generator — see the generator section below. pos_col is required only for the 'position_specific' and 'scrambled' generators (which read test windows from the positions in pos_col). For all other generators, pos_col is optional and is used only as the source of test windows for the anti-leakage filter.

n sets the total number of synthetic windows and window_size sets the residue length of each window:

df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=12,
                              generator="global_freq", seed=0)
aa.display_df(df=df[["entry_win", "window"]], show_shape=True)
DataFrame shape: (3, 2)
  entry_win window
1 synth_0 PGAATWPRMWTA
2 synth_1 VAREVMGKADQP
3 synth_2 PIYYQQQIDRMH

generator sets the synthesis recipe and accepts three shapes:

  • str — a built-in generator ('uniform', 'global_freq', 'position_specific', 'scrambled') or an AAontology preset name (e.g. 'aa_composition', 'alpha_helix').

  • list[str] — at least two distinct preset names; their per-AA priors are combined into a multiplicative joint prior.

  • dict[str, float] — a custom single-character → probability table. Keys define the alphabet, so the sampler is not restricted to amino acids.

Built-in generators draw uniformly over the canonical 20 AAs ('uniform'), from the empirical AA frequency in df_seq ('global_freq'), from the per-position frequencies of the test windows ('position_specific'), or by shuffling a randomly chosen test window ('scrambled'). The call below uses 'global_freq':

df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                              generator="global_freq", seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)
DataFrame shape: (3, 2)
  window strategy
1 PGAATWPRM synthetic:global_freq
2 WTAVAREVM synthetic:global_freq
3 GKADQPPIY synthetic:global_freq

AAontology preset generators use a curated scale as the per-AA prior. Composition presets ('aa_composition', 'aa_composition_surface', 'aa_composition_mp') are true AA-frequency distributions; conformation presets ('alpha_helix', 'beta_sheet', 'beta_strand', 'beta_turn', 'coil', 'linker', 'pi_helix') are normalized propensities used as physicochemically-biased priors:

df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                              generator="alpha_helix", seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)
DataFrame shape: (3, 2)
  window strategy
1 NFAASWMQM synthetic:alpha_helix
2 WSATAREVM synthetic:alpha_helix
3 GKADPPNIY synthetic:alpha_helix

Mixed-prior generator: a list of at least two distinct preset names combines their per-AA priors via element-wise product followed by renormalization. This is useful, for example, for a membrane-helix prior (['aa_composition_mp', 'alpha_helix']):

df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                              generator=["aa_composition_mp", "alpha_helix"], seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)
DataFrame shape: (3, 2)
  window strategy
1 MFAASVLQL synthetic:mix:a..._mp+alpha_helix
2 VSATAQETL synthetic:mix:a..._mp+alpha_helix
3 GIACNMMIY synthetic:mix:a..._mp+alpha_helix

Custom-alphabet generator: a dict[str, float] over any single-character alphabet (e.g. DNA). This is the only generator path that can produce non-amino-acid windows. Keys are case-sensitive and values must sum to 1.0 (within 1e-6):

df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=8,
                              generator={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
                              seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)
DataFrame shape: (3, 2)
  window strategy
1 GCAATTGG synthetic:custom:A+C+G+T
2 GTTATAGA synthetic:custom:A+C+G+T
3 TGCCAAGG synthetic:custom:A+C+G+T

role and label_ref are semantic tags applied to all synthetic rows (defaults: role='Control', label_ref=0). Override them for non-PU-learning workflows:

df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                              generator="global_freq",
                              role="background", label_ref=-1, seed=0)
aa.display_df(df=df[["window", "role", "label"]], show_shape=True)
DataFrame shape: (3, 3)
  window role label
1 PGAATWPRM background -1
2 WTAVAREVM background -1
3 GKADQPPIY background -1

seed sets a per-call seed and falls back to the class-level random_state set at construction. A fixed seed yields deterministic output:

df_a = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                generator="global_freq", seed=42)
df_b = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                generator="global_freq", seed=42)
print("deterministic:", list(df_a["window"]) == list(df_b["window"]))
deterministic: True
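The fallback from a per-call seed to a class-level random_state can be pictured with a toy class. ToySampler is hypothetical and only mirrors the documented precedence rule; creating a fresh generator on every call is an assumption about how determinism is achieved, not a description of AAWindowSampler's internals.

```python
import random


class ToySampler:
    """Toy stand-in that mirrors the seed / random_state precedence."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def sample(self, n, window_size, seed=None):
        # A per-call seed wins; otherwise fall back to the class-level random_state.
        effective_seed = self.random_state if seed is None else seed
        rng = random.Random(effective_seed)
        alphabet = "ACDEFGHIKLMNPQRSTVWY"
        return ["".join(rng.choices(alphabet, k=window_size)) for _ in range(n)]


toy = ToySampler(random_state=0)
same_seed_matches = toy.sample(3, 9, seed=42) == toy.sample(3, 9, seed=42)
fallback_matches = toy.sample(3, 9) == ToySampler(random_state=1).sample(3, 9, seed=0)
```

In this sketch both flags come out True: an explicit seed overrides random_state, and a call without a seed behaves as if random_state had been passed as the seed.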

Synthetic outputs use entry_win = synth_{i} with a counter that restarts on every call, so concatenating multiple sample_synthetic outputs may collide on entry_win. Deduplicate on the window column instead.
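The collision and the suggested deduplication can be reproduced with plain pandas. The two frames below are hand-built stand-ins for two sample_synthetic() outputs (illustrative values, reduced to two columns), so the snippet runs without the sampler.

```python
import pandas as pd

# Hand-built stand-ins for two sample_synthetic() outputs (illustrative values only).
df_a = pd.DataFrame({"entry_win": ["synth_0", "synth_1"],
                     "window": ["PGAATWPRM", "WTAVAREVM"]})
df_b = pd.DataFrame({"entry_win": ["synth_0", "synth_1"],
                     "window": ["WTAVAREVM", "GKADQPPIY"]})

df_all = pd.concat([df_a, df_b], ignore_index=True)

# entry_win restarts at synth_0 in each call, so it is not unique after concatenation.
n_id_collisions = int(df_all["entry_win"].duplicated().sum())

# Deduplicate on the window content instead of the identifier.
df_unique = df_all.drop_duplicates(subset="window").reset_index(drop=True)
```

Deduplicating on entry_win here would wrongly discard the distinct window GKADQPPIY, whereas deduplicating on window keeps all three distinct windows.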