AAWindowSampler.sample_synthetic

AAWindowSampler.sample_synthetic(df_seq, n=100, window_size=9, generator='global_freq', pos_col=None, label_ref=0, role='Control', seed=None)

Generate synthetic control windows. The output always uses the 'segments' layout (output_mode='segments').

Synthetic windows have no source protein, so no per-residue view exists and output_mode is not exposed. Synthetic rows use entry_win = "synth_{i}", where i is a per-call counter. Concatenating the outputs of several sample_synthetic() calls can therefore produce colliding entry_win values; deduplicate on the window column instead.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum total number of synthetic windows. Fewer windows are returned (with a warning) if too few candidates pass the filters.
window_size (int, default=9) – Length of each synthetic window in residues.
generator (str, list/tuple of str, or dict, default='global_freq') –
Synthesis recipe. Accepts three shapes:
- str: a single built-in generator ('uniform', 'global_freq', 'position_specific', 'scrambled') or an AAontology preset name (see Notes).
- list[str] or tuple[str, ...]: at least two distinct AAontology preset names. Their priors are combined into a multiplicative joint prior over the 20 canonical amino acids (AAs) (see Notes / [LiuDeber99]). Duplicate components are rejected.
- dict[str, Real]: a custom frequency table. Keys are single-character symbols and values are non-negative probabilities summing to 1.0 (within 1e-6). Keys define the alphabet, so the sampler is not restricted to amino acids; any single-character symbols (e.g. nucleotides) work. Keys are case-sensitive ('A' and 'a' are distinct).
pos_col (str, optional) – Column with per-row 1-based positive positions. See Notes.
label_ref (int or float, default=0) – Label assigned to the synthetic rows.
role (str, default='Control') – Role tag stored in the output’s role column.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.

Returns:

df_seq_out – Sampled windows; one row per window with entry, sequence, role, strategy, and entry_win columns.

Return type:

pd.DataFrame

Notes

df_seq is consumed differently by each generator:

generator='global_freq': source of empirical amino-acid frequencies across all sequences.
generator='position_specific' / 'scrambled': source of test windows extracted at the 1-based positions in pos_col. pos_col is required for these two generators.
generator='uniform', AAontology preset generators, list-mix generators, and custom dict generators: df_seq is not consumed for synthesis itself; pos_col is still optional and only used as the source of test windows for the max_similarity_to_test filter.

Built-in generators

'uniform': each residue drawn uniformly from the 20 canonical AAs.
'global_freq': residues drawn from the empirical AA frequency in df_seq.
'position_specific': per-position frequency of the test windows.
'scrambled': shuffle a randomly chosen test window.
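
The 'uniform' and 'global_freq' recipes can be illustrated with a short standard-library sketch. It is not the library's implementation: the helper names global_frequencies and sample_windows are hypothetical, and the sketch assumes each residue is drawn independently from the chosen distribution.

```python
import random
from collections import Counter

CANONICAL_AAS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def global_frequencies(sequences):
    """Empirical residue frequencies pooled over all sequences (hypothetical helper)."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: count / total for aa, count in counts.items()}


def sample_windows(freqs, n, window_size, seed=None):
    """Draw n windows; each residue is sampled independently from freqs (assumption)."""
    rng = random.Random(seed)
    symbols = sorted(freqs)
    weights = [freqs[s] for s in symbols]
    return ["".join(rng.choices(symbols, weights=weights, k=window_size))
            for _ in range(n)]


sequences = ["ACDEFGHIKLMNPQRSTVWY" * 2, "YWVTSRQPNMLKIHGFEDCA" * 2]

# 'uniform'-style prior: equal weight for every canonical residue
uniform = {aa: 1 / len(CANONICAL_AAS) for aa in CANONICAL_AAS}
# 'global_freq'-style prior: empirical frequencies in the input sequences
empirical = global_frequencies(sequences)

windows = sample_windows(empirical, n=3, window_size=9, seed=0)
```

In this toy input every residue occurs equally often, so the empirical prior coincides with the uniform one; real proteomes would give a skewed distribution.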

AAontology preset generators load a curated scale via aaanalysis.load_scales() and normalize its per-amino-acid values into a probability distribution. Composition presets are true AA-frequency distributions; conformation presets are normalized propensities used as physicochemically-biased priors.

Composition (3)

'aa_composition': Dayhoff 1978a (canonical baseline)
'aa_composition_surface': Fukuchi-Nishikawa 2001 (surface composition)
'aa_composition_mp': Cedano 1997 (membrane proteins)

Conformation (7)

'alpha_helix': Chou-Fasman 1978b
'beta_sheet': Chou-Fasman 1978b
'beta_strand': Lifson-Sander 1979
'beta_turn': Chou-Fasman 1978b
'coil': Nagano 1973
'linker': George-Heringa 2003 (medium 6-14 AA)
'pi_helix': Fodje-Al-Karadaghi 2002

Mixed-prior generator (list of preset names) combines the per-AA probability vectors of the listed presets via element-wise product followed by renormalization, producing a Bayesian-style joint prior (e.g. generator=['aa_composition_mp', 'alpha_helix'] for a membrane-helix prior). Combining hydrophobicity-like composition with helicity-like conformation as the basis for transmembrane characterization is supported by [LiuDeber99].
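
The normalization and the product-then-renormalize step described above can be sketched as follows. The three-letter alphabet and its values are invented for illustration (they are not AAontology scale values), and normalize and mix_priors are hypothetical helper names rather than library functions.

```python
def normalize(scale):
    """Turn non-negative per-residue scores into probabilities summing to 1."""
    total = sum(scale.values())
    if total <= 0 or any(v < 0 for v in scale.values()):
        raise ValueError("scale values must be non-negative with a positive sum")
    return {aa: v / total for aa, v in scale.items()}


def mix_priors(*priors):
    """Element-wise product of the priors, renormalized to sum to 1."""
    shared = set(priors[0])
    if any(set(p) != shared for p in priors):
        raise ValueError("all priors must cover the same alphabet")
    product = {aa: 1.0 for aa in shared}
    for prior in priors:
        for aa in shared:
            product[aa] *= prior[aa]
    return normalize(product)


# Toy values, invented for illustration only
composition = normalize({"A": 2.0, "L": 6.0, "K": 2.0})  # A=0.2, L=0.6, K=0.2
helicity = normalize({"A": 5.0, "L": 4.0, "K": 1.0})     # A=0.5, L=0.4, K=0.1
joint = mix_priors(composition, helicity)
```

Because the priors are multiplied, a residue keeps high probability only if every component favors it; here K is down-weighted by both components and ends up far below its share in either prior.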

Custom alphabet generator (dict) lets the user supply an arbitrary character-to-frequency table. Keys must be single characters (any symbol), values must be non-negative and sum to 1.0. The sampler is then no longer restricted to amino acids.
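
The constraints on a custom table can be expressed as a small check. This is a minimal sketch of the stated rules, not the library's check_synth_generator validator; the function name check_frequency_table is hypothetical.

```python
import math
import random
from numbers import Real


def check_frequency_table(table, tol=1e-6):
    """Validate a custom symbol -> probability table (hypothetical helper)."""
    if not isinstance(table, dict) or not table:
        raise ValueError("table must be a non-empty dict")
    for symbol, prob in table.items():
        if not isinstance(symbol, str) or len(symbol) != 1:
            raise ValueError(f"key {symbol!r} is not a single character")
        if isinstance(prob, bool) or not isinstance(prob, Real) or prob < 0:
            raise ValueError(f"value for {symbol!r} must be a non-negative number")
    if not math.isclose(sum(table.values()), 1.0, rel_tol=0.0, abs_tol=tol):
        raise ValueError("probabilities must sum to 1.0")
    return table


# A DNA alphabet: the keys define which symbols can appear in a window
dna = check_frequency_table({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25})
rng = random.Random(0)
window = "".join(rng.choices(list(dna), weights=list(dna.values()), k=8))
```

Since keys are case-sensitive, a table such as {"A": 0.5, "a": 0.5} passes this check as a two-symbol alphabet.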

Single polymorphic generator: the three accepted shapes (built-in / preset str, list[str] for a multiplicative preset mix, dict[str, float] for a custom-alphabet frequency table) all answer the same conceptual question ("recipe for one window"), so a single parameter is preferred over three mutually exclusive named parameters. The dispatch-on-shape complexity is absorbed by the check_synth_generator validator.

Examples

Generate synthetic control windows from a configurable amino-acid prior. Useful as a third-distribution control alongside sample_same_protein (negatives) and sample_different_protein (unlabeled). Synthetic windows have no source protein, so output_mode is fixed to 'segments'.

import aaanalysis as aa
import pandas as pd
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2"],
    "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2,
                 "YWVTSRQPNMLKIHGFEDCA" * 2],
    "pos":      [[10], [15]],
})
aaws = aa.AAWindowSampler(random_state=0)

This first call shows the output schema. The eight-column segments schema is shared across all AAWindowSampler methods; synthetic rows use entry_win = synth_{i} and leave entry empty since there is no source protein.

df = aaws.sample_synthetic(df_seq=df_seq, n=5, window_size=9, seed=0)
aa.display_df(df=df, show_shape=True)

DataFrame shape: (5, 8)

	entry_win	window	source_position	role	strategy
1	synth_0	PGAATWPRM	-1	Control	synthetic:global_freq
2	synth_1	WTAVAREVM	-1	Control	synthetic:global_freq
3	synth_2	GKADQPPIY	-1	Control	synthetic:global_freq
4	synth_3	YQQQIDRMH	-1	Control	synthetic:global_freq
5	synth_4	LVWINHNHI	-1	Control	synthetic:global_freq

df_seq is consumed differently per generator — see the generator section below. pos_col is required only for the 'position_specific' and 'scrambled' generators (which read test windows from the positions in pos_col). For all other generators, pos_col is optional and is used only as the source of test windows for the anti-leakage filter.

n sets the total number of synthetic windows and window_size the residue length of each window:

df = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=12,
                           generator="global_freq", seed=0)
aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (3, 2)

	entry_win	window
1	synth_0	PGAATWPRMWTA
2	synth_1	VAREVMGKADQP
3	synth_2	PIYYQQQIDRMH

generator is the synthesis recipe. It accepts three shapes:

- str: a built-in generator ('uniform', 'global_freq', 'position_specific', 'scrambled') or an AAontology preset name (e.g. 'aa_composition', 'alpha_helix').
- list[str]: at least two distinct preset names; their per-AA priors are combined into a multiplicative joint prior.
- dict[str, float]: a custom table mapping single characters to probabilities. Keys define the alphabet, so the sampler is not restricted to amino acids.

Built-in generators draw uniformly over the canonical 20 ('uniform'), from the empirical AA frequency in df_seq ('global_freq'), per-position in the test windows ('position_specific'), or by shuffling a test window ('scrambled'):

df = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                           generator="global_freq", seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

	window	strategy
1	PGAATWPRM	synthetic:global_freq
2	WTAVAREVM	synthetic:global_freq
3	GKADQPPIY	synthetic:global_freq

AAontology preset generators use a curated scale as the per-AA prior. Composition presets ('aa_composition', 'aa_composition_surface', 'aa_composition_mp') are true AA-frequency distributions; conformation presets ('alpha_helix', 'beta_sheet', 'beta_strand', 'beta_turn', 'coil', 'linker', 'pi_helix') are normalized propensities used as physicochemically-biased priors:

df = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                           generator="alpha_helix", seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

	window	strategy
1	NFAASWMQM	synthetic:alpha_helix
2	WSATAREVM	synthetic:alpha_helix
3	GKADPPNIY	synthetic:alpha_helix

Mixed-prior generator: a list of at least two distinct preset names combines their per-AA priors via element-wise product followed by renormalization. This is useful, for example, for a membrane-helix prior (['aa_composition_mp', 'alpha_helix']):

df = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                           generator=["aa_composition_mp", "alpha_helix"], seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

	window	strategy
1	MFAASVLQL	synthetic:mix:a..._mp+alpha_helix
2	VSATAQETL	synthetic:mix:a..._mp+alpha_helix
3	GIACNMMIY	synthetic:mix:a..._mp+alpha_helix

Custom-alphabet generator: a dict[str, float] over any single-character alphabet (e.g. DNA). This is the only generator path that can produce non-amino-acid windows. Keys are case-sensitive and values must sum to 1.0 (within 1e-6):

df = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=8,
                           generator={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
                           seed=0)
aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

	window	strategy
1	GCAATTGG	synthetic:custom:A+C+G+T
2	GTTATAGA	synthetic:custom:A+C+G+T
3	TGCCAAGG	synthetic:custom:A+C+G+T

role and label_ref are semantic tags applied to all synthetic rows; label_ref is stored in the label column. Defaults: role='Control', label_ref=0. Override them for workflows other than PU learning:

df = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                           generator="global_freq",
                           role="background", label_ref=-1, seed=0)
aa.display_df(df=df[["window", "role", "label"]], show_shape=True)

DataFrame shape: (3, 3)

	window	role	label
1	PGAATWPRM	background	-1
2	WTAVAREVM	background	-1
3	GKADQPPIY	background	-1

seed is a per-call seed; if omitted, the class-level random_state set at construction is used. A fixed seed yields deterministic output:

df_a = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                             generator="global_freq", seed=42)
df_b = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                             generator="global_freq", seed=42)
print("deterministic:", list(df_a["window"]) == list(df_b["window"]))

deterministic: True
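
The fallback rule can be sketched with a toy class. SamplerSketch is a hypothetical stand-in, not AAWindowSampler, and it assumes a fresh random generator is created on every call, which may differ from how the library manages its random state.

```python
import random


class SamplerSketch:
    """Hypothetical stand-in illustrating per-call seed fallback."""

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

    def __init__(self, random_state=None):
        self.random_state = random_state

    def sample(self, n, window_size, seed=None):
        # A per-call seed wins; otherwise fall back to the class-level state
        effective = seed if seed is not None else self.random_state
        rng = random.Random(effective)  # assumption: fresh generator per call
        return ["".join(rng.choices(self.ALPHABET, k=window_size))
                for _ in range(n)]


sampler = SamplerSketch(random_state=0)
same_seed = sampler.sample(3, 9, seed=42) == sampler.sample(3, 9, seed=42)
fallback = sampler.sample(3, 9) == sampler.sample(3, 9, seed=0)
```

Both comparisons evaluate to True in this sketch: the first because the seed is fixed, the second because omitting seed falls back to random_state=0.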

Synthetic outputs use entry_win = synth_{i}, where i is a per-call counter, so concatenating the outputs of several sample_synthetic calls can produce colliding entry_win values. Deduplicate on the window column instead.
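
A minimal pandas sketch of the collision and the suggested deduplication. The two frames are hand-built stand-ins for sample_synthetic outputs, reduced to two columns and reusing windows from the examples above; they are not the result of real calls.

```python
import pandas as pd

# Hand-built stand-ins for two sample_synthetic() outputs (illustrative only)
df_a = pd.DataFrame({"entry_win": ["synth_0", "synth_1"],
                     "window": ["PGAATWPRM", "WTAVAREVM"]})
df_b = pd.DataFrame({"entry_win": ["synth_0", "synth_1"],
                     "window": ["WTAVAREVM", "GKADQPPIY"]})

df_all = pd.concat([df_a, df_b], ignore_index=True)

# entry_win repeats across calls, so it cannot serve as a unique key
has_id_collision = bool(df_all["entry_win"].duplicated().any())

# Deduplicate on the window sequence itself, keeping the first occurrence
df_unique = df_all.drop_duplicates(subset="window").reset_index(drop=True)
```

Deduplicating on entry_win here would wrongly discard the distinct window GKADQPPIY, while deduplicating on window removes only the repeated sequence.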

pos_col names the column of per-row 1-based positive positions. It is required for the 'position_specific' and 'scrambled' generators, which build their prior from the test windows extracted at those positions (for the other generators it is optional and only feeds the anti-leakage filter):

# pos_col drives the 'position_specific' generator (per-position freq of the test windows)
df_ps = aaws.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                              generator="position_specific", pos_col="pos", seed=0)
aa.display_df(df=df_ps[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

	window	strategy
1	LHIHLMNPQ	synthetic:position_specific
2	LKIKGMEPQ	synthetic:position_specific
3	GHIHLMNDQ	synthetic:position_specific
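
The two pos_col-driven generators can be sketched with the standard library. The helper names are hypothetical, and the sketch assumes test windows are centered on the 1-based position and that windows running past a sequence end are skipped. The centering assumption is at least consistent with the output above, where every sampled residue matches one of the two centered windows column by column.

```python
import random
from collections import Counter


def extract_window(sequence, pos, window_size):
    """Window centered on a 1-based position; None if it does not fit (assumption)."""
    start = pos - 1 - window_size // 2
    end = start + window_size
    if start < 0 or end > len(sequence):
        return None
    return sequence[start:end]


def position_specific(test_windows, n, seed=None):
    """Sample each column from the residues observed at that column."""
    rng = random.Random(seed)
    columns = [Counter(col) for col in zip(*test_windows)]

    def draw(counter):
        return rng.choices(list(counter), weights=list(counter.values()), k=1)[0]

    return ["".join(draw(c) for c in columns) for _ in range(n)]


def scrambled(test_windows, n, seed=None):
    """Shuffle the residues of a randomly chosen test window."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        residues = list(rng.choice(test_windows))
        rng.shuffle(residues)
        out.append("".join(residues))
    return out


# Same toy input as df_seq above: (sequence, 1-based positions)
rows = [("ACDEFGHIKLMNPQRSTVWY" * 2, [10]),
        ("YWVTSRQPNMLKIHGFEDCA" * 2, [15])]
test_windows = [w for seq, positions in rows for p in positions
                if (w := extract_window(seq, p, 9)) is not None]

ps_windows = position_specific(test_windows, n=3, seed=0)
sc_windows = scrambled(test_windows, n=3, seed=0)
```

The position-specific draw preserves which residues can occur at each column, whereas scrambling preserves the exact composition of one test window but destroys positional information.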