aaanalysis.AAWindowSampler.sample_synthetic
- AAWindowSampler.sample_synthetic(df_seq=None, n=100, window_size=9, generator='global_freq', pos_col=None, label_ref=0, role='Control', seed=None)
Generate synthetic control windows. Always returns output_mode='segments'.
Synthetic windows have no source protein, so output_mode is not exposed (no per-residue view exists). Synthetic rows use entry_win = "synth_{i}" with a per-call counter, so concatenating multiple sample_synthetic() outputs may collide on entry_win; deduplicate on the window column instead.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum total number of synthetic windows. Fewer are returned (with a warning) if too few candidates pass the filters.
window_size (int, default=9) – Length of each synthetic window in residues.
generator (str, list/tuple of str, or dict, default='global_freq') – Synthesis recipe. Accepts three shapes:
  - str: a single built-in generator ('uniform', 'global_freq', 'position_specific', 'scrambled') or an AAontology preset name (see Notes).
  - list[str] or tuple[str, ...]: at least two distinct AAontology preset names. Their priors are combined into a multiplicative joint prior over the 20 canonical AAs (see Notes / [LiuDeber99]). Duplicate components are rejected.
  - dict[str, Real]: a custom frequency table. Keys are single-character symbols and values are non-negative probabilities summing to 1.0 (within 1e-6). Keys define the alphabet, so the sampler is not restricted to amino acids; any single-character symbols (e.g. nucleotides) work. Keys are case-sensitive ('A' and 'a' are distinct).
pos_col (str, optional) – Column with per-row 1-based positive positions. See Notes.
label_ref (int or float, default=0) – Label assigned to the synthetic rows.
role (str, default='Control') – Role tag stored in the output's role column.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.
- Returns:
df_seq – Sampled windows; one row per window in the eight-column segments schema: entry_win, entry, sequence, window, source_position, label, role, and strategy.
- Return type:
pd.DataFrame
Notes
df_seq is consumed differently by each generator:
  - generator='global_freq': source of empirical amino-acid frequencies across all sequences.
  - generator='position_specific' / 'scrambled': source of test windows extracted at the 1-based positions in pos_col. pos_col is required for these two generators.
  - generator='uniform', AAontology preset generators, list-mix generators, and custom dict generators: df_seq is not consumed for synthesis itself; pos_col is still optional and only used as the source of test windows for the max_similarity_to_test filter.
Built-in generators
  - 'uniform': each residue drawn uniformly from the 20 canonical AAs.
  - 'global_freq': residues drawn from the empirical AA frequency in df_seq.
  - 'position_specific': per-position frequency of the test windows.
  - 'scrambled': shuffle a randomly chosen test window.
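The four built-in recipes can be sketched in plain Python. This is an illustrative approximation using only the standard library, not the aaanalysis implementation; the helper names (sample_uniform, sample_global_freq, sample_position_specific, sample_scrambled) and the toy inputs are hypothetical.

```python
import random
from collections import Counter

AA20 = "ACDEFGHIKLMNPQRSTVWY"  # the 20 canonical amino acids


def sample_uniform(rng, window_size):
    """Draw every residue uniformly from the 20 canonical amino acids."""
    return "".join(rng.choice(AA20) for _ in range(window_size))


def sample_global_freq(rng, sequences, window_size):
    """Draw residues from the empirical frequency across all sequences."""
    counts = Counter("".join(sequences))
    letters = sorted(counts)
    weights = [counts[aa] for aa in letters]
    return "".join(rng.choices(letters, weights=weights, k=window_size))


def sample_position_specific(rng, test_windows):
    """Draw each position from the residues observed at that position."""
    columns = list(zip(*test_windows))  # one tuple of residues per position
    return "".join(rng.choice(column) for column in columns)


def sample_scrambled(rng, test_windows):
    """Shuffle the residues of one randomly chosen test window."""
    residues = list(rng.choice(test_windows))
    rng.shuffle(residues)
    return "".join(residues)


rng = random.Random(0)
sequences = ["ACDEFGHIKLMNPQRSTVWY", "YWVTSRQPNMLKIHGFEDCA"]  # toy data
test_windows = ["ACDEFGHIK", "ACDKLMNPQ", "YWDEFTSRQ"]        # toy data

print(sample_uniform(rng, 9))
print(sample_global_freq(rng, sequences, 9))
print(sample_position_specific(rng, test_windows))
print(sample_scrambled(rng, test_windows))
```

Note that 'position_specific' and 'scrambled' need test windows as input, which is why pos_col is required for those two generators.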
AAontology preset generators load a curated scale via aaanalysis.load_scales() and normalize its per-amino-acid values into a probability distribution. Composition presets are true AA-frequency distributions; conformation presets are normalized propensities used as physicochemically-biased priors.
Composition (3)
  - 'aa_composition': Dayhoff 1978a (canonical baseline)
  - 'aa_composition_surface': Fukuchi-Nishikawa 2001 (surface composition)
  - 'aa_composition_mp': Cedano 1997 (membrane proteins)
Conformation (7)
  - 'alpha_helix': Chou-Fasman 1978b
  - 'beta_sheet': Chou-Fasman 1978b
  - 'beta_strand': Lifson-Sander 1979
  - 'beta_turn': Chou-Fasman 1978b
  - 'coil': Nagano 1973
  - 'linker': George-Heringa 2003 (medium 6-14 AA)
  - 'pi_helix': Fodje-Al-Karadaghi 2002
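A minimal sketch of how a scale might become a prior, and how several priors can be combined by the element-wise product and renormalization described in these Notes. It assumes normalization means dividing non-negative values by their sum, which the documentation does not spell out; the helper names (normalize, mix_priors) and the three-letter toy scales are hypothetical and the numbers are made up.

```python
import math


def normalize(scale):
    """Turn non-negative per-residue values into probabilities (assumed: v / sum)."""
    values = list(scale.values())
    total = sum(values)
    if total <= 0 or any(v < 0 for v in values):
        raise ValueError("scale values must be non-negative with a positive sum")
    return {aa: v / total for aa, v in scale.items()}


def mix_priors(*priors):
    """Multiply the priors residue by residue, then renormalize to sum to 1."""
    residues = priors[0].keys()
    product = {aa: math.prod(p[aa] for p in priors) for aa in residues}
    return normalize(product)


# Toy three-letter scales; real presets cover the 20 canonical amino acids.
composition = normalize({"A": 2.0, "L": 6.0, "K": 2.0})
helix = normalize({"A": 4.0, "L": 4.0, "K": 2.0})
joint = mix_priors(composition, helix)

for aa, p in joint.items():
    print(aa, round(p, 3))
```

Because the combination is multiplicative, a residue that is unlikely under either prior is down-weighted in the joint prior, and a residue with zero probability in any component is excluded entirely.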
Mixed-prior generator (list of preset names) combines the per-AA probability vectors of the listed presets via element-wise product followed by renormalization, producing a Bayesian-style joint prior (e.g. generator=['aa_composition_mp', 'alpha_helix'] for a membrane-helix prior). Combining hydrophobicity-like composition with helicity-like conformation as the basis for transmembrane characterization is supported by [LiuDeber99].
Custom alphabet generator (dict) lets the user supply an arbitrary character-to-frequency table. Keys must be single characters (any symbol), values must be non-negative and sum to 1.0. The sampler is then no longer restricted to amino acids.
Single polymorphic generator: the three accepted shapes (built-in / preset str, list[str] for a multiplicative preset mix, dict[str, float] for a custom-alphabet frequency table) all answer the same conceptual question ("recipe for one window"), so a single parameter is preferred over three mutually-exclusive named parameters. The dispatch-on-shape complexity is absorbed by the check_synth_generator validator.
Examples
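Before the library examples, the dispatch-on-shape idea behind the polymorphic generator parameter can be sketched as follows. This is a hypothetical stand-in written from the rules stated in the Parameters section; it is not the library's check_synth_generator, and the function name classify_generator and its return labels are assumptions.

```python
from numbers import Real

BUILT_INS = {"uniform", "global_freq", "position_specific", "scrambled"}
PRESETS = {
    "aa_composition", "aa_composition_surface", "aa_composition_mp",
    "alpha_helix", "beta_sheet", "beta_strand", "beta_turn",
    "coil", "linker", "pi_helix",
}


def classify_generator(generator):
    """Return 'built_in', 'preset', 'mix', or 'custom' for a valid recipe."""
    if isinstance(generator, str):
        if generator in BUILT_INS:
            return "built_in"
        if generator in PRESETS:
            return "preset"
        raise ValueError(f"unknown generator name: {generator!r}")
    if isinstance(generator, (list, tuple)):
        if not all(isinstance(name, str) for name in generator):
            raise TypeError("mix components must be strings")
        if len(generator) < 2:
            raise ValueError("a mix needs at least two preset names")
        if len(set(generator)) != len(generator):
            raise ValueError("duplicate mix components are rejected")
        unknown = [name for name in generator if name not in PRESETS]
        if unknown:
            raise ValueError(f"not preset names: {unknown}")
        return "mix"
    if isinstance(generator, dict):
        if not generator:
            raise ValueError("the frequency table must not be empty")
        for key, value in generator.items():
            if not (isinstance(key, str) and len(key) == 1):
                raise ValueError("keys must be single characters")
            if isinstance(value, bool) or not isinstance(value, Real) or value < 0:
                raise ValueError("values must be non-negative numbers")
        if abs(sum(generator.values()) - 1.0) > 1e-6:
            raise ValueError("values must sum to 1.0 (within 1e-6)")
        return "custom"
    raise TypeError("generator must be a str, list/tuple of str, or dict")


print(classify_generator("global_freq"))
print(classify_generator(["aa_composition_mp", "alpha_helix"]))
print(classify_generator({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}))
```

Keeping all three shapes behind one validator means callers only ever pass generator, and the type of the argument selects the branch.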
Generate synthetic control windows from a configurable amino-acid prior. Useful as a third-distribution control alongside sample_same_protein (negatives) and sample_different_protein (unlabeled). Synthetic windows have no source protein, so output_mode is fixed to 'segments'.

    import aaanalysis as aa
    import pandas as pd

    aa.options["verbose"] = False
    df_seq = pd.DataFrame({
        "entry": ["P1", "P2"],
        "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2, "YWVTSRQPNMLKIHGFEDCA" * 2],
        "pos": [[10], [15]],
    })
    sampler = aa.AAWindowSampler(random_state=0)

First call: anchor schema. The eight-column segments schema is shared across all AAWindowSampler methods; synthetic rows use entry_win = synth_{i} and leave entry empty since there is no source protein.

    df = sampler.sample_synthetic(df_seq=df_seq, n=5, window_size=9, seed=0)
    aa.display_df(df=df, show_shape=True)

DataFrame shape: (5, 8)

       entry_win  entry  sequence  window     source_position  label  role     strategy
    1  synth_0                     PGAATWPRM  -1               0      Control  synthetic:global_freq
    2  synth_1                     WTAVAREVM  -1               0      Control  synthetic:global_freq
    3  synth_2                     GKADQPPIY  -1               0      Control  synthetic:global_freq
    4  synth_3                     YQQQIDRMH  -1               0      Control  synthetic:global_freq
    5  synth_4                     LVWINHNHI  -1               0      Control  synthetic:global_freq

df_seq is consumed differently per generator; see the generator examples below. pos_col is required only for the 'position_specific' and 'scrambled' generators (which read test windows from the positions in pos_col). For all other generators, pos_col is optional and is used only as the source of test windows for the anti-leakage filter.

Total number of synthetic windows (n) and residue length per window (window_size):

    df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=12,
                                  generator="global_freq", seed=0)
    aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (3, 2)

       entry_win  window
    1  synth_0    PGAATWPRMWTA
    2  synth_1    VAREVMGKADQP
    3  synth_2    PIYYQQQIDRMH

generator is the synthesis recipe. It accepts three shapes:
  - str: a built-in generator ('uniform', 'global_freq', 'position_specific', 'scrambled') or an AAontology preset name (e.g. 'aa_composition', 'alpha_helix').
  - list[str]: at least two distinct preset names; their per-AA priors are combined into a multiplicative joint prior.
  - dict[str, float]: a custom single-character to probability table. Keys define the alphabet, so the sampler is not restricted to amino acids.
Built-in generators draw uniformly over the canonical 20 ('uniform'), from the empirical AA frequency in df_seq ('global_freq'), per-position in the test windows ('position_specific'), or by shuffling a test window ('scrambled'):

    df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                  generator="global_freq", seed=0)
    aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

       window     strategy
    1  PGAATWPRM  synthetic:global_freq
    2  WTAVAREVM  synthetic:global_freq
    3  GKADQPPIY  synthetic:global_freq

AAontology preset generators use a curated scale as the per-AA prior. Composition presets ('aa_composition', 'aa_composition_surface', 'aa_composition_mp') are true AA-frequency distributions; conformation presets ('alpha_helix', 'beta_sheet', 'beta_strand', 'beta_turn', 'coil', 'linker', 'pi_helix') are normalized propensities used as physicochemically-biased priors:

    df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                  generator="alpha_helix", seed=0)
    aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

       window     strategy
    1  NFAASWMQM  synthetic:alpha_helix
    2  WSATAREVM  synthetic:alpha_helix
    3  GKADPPNIY  synthetic:alpha_helix

Mixed-prior generator: a list of at least two distinct preset names combines their per-AA priors via element-wise product followed by renormalization. Useful e.g. for a membrane-helix prior (['aa_composition_mp', 'alpha_helix']):

    df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                  generator=["aa_composition_mp", "alpha_helix"], seed=0)
    aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

       window     strategy
    1  MFAASVLQL  synthetic:mix:a..._mp+alpha_helix
    2  VSATAQETL  synthetic:mix:a..._mp+alpha_helix
    3  GIACNMMIY  synthetic:mix:a..._mp+alpha_helix

Custom-alphabet generator: a dict[str, float] over any single-character alphabet (e.g. DNA). This is the only generator path that produces non-amino-acid windows. Keys are case-sensitive and values must sum to 1.0 (within 1e-6):

    df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=8,
                                  generator={"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
                                  seed=0)
    aa.display_df(df=df[["window", "strategy"]], show_shape=True)

DataFrame shape: (3, 2)

       window    strategy
    1  GCAATTGG  synthetic:custom:A+C+G+T
    2  GTTATAGA  synthetic:custom:A+C+G+T
    3  TGCCAAGG  synthetic:custom:A+C+G+T

role and label_ref are semantic tags applied to all synthetic rows. Defaults: role='Control', label_ref=0. Override for non-PU-learning workflows:

    df = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                  generator="global_freq", role="background",
                                  label_ref=-1, seed=0)
    aa.display_df(df=df[["window", "role", "label"]], show_shape=True)

DataFrame shape: (3, 3)

       window     role        label
    1  PGAATWPRM  background  -1
    2  WTAVAREVM  background  -1
    3  GKADQPPIY  background  -1

seed is the per-call seed; it falls back to the class-level random_state set at construction. A fixed seed yields deterministic output:

    df_a = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                    generator="global_freq", seed=42)
    df_b = sampler.sample_synthetic(df_seq=df_seq, n=3, window_size=9,
                                    generator="global_freq", seed=42)
    print("deterministic:", list(df_a["window"]) == list(df_b["window"]))

deterministic: True
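The per-call seed with a class-level fallback can be illustrated with a small stand-in class. This is a sketch under the assumption that a call without seed reseeds from random_state; the class TinySampler and its sample method are hypothetical and do not reflect how AAWindowSampler is implemented internally.

```python
import random


class TinySampler:
    """Hypothetical stand-in used only to illustrate the seed fallback."""

    ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

    def __init__(self, random_state=None):
        self.random_state = random_state

    def sample(self, n=3, window_size=9, seed=None):
        # A per-call seed wins; otherwise fall back to the class-level value.
        effective_seed = self.random_state if seed is None else seed
        rng = random.Random(effective_seed)
        return [
            "".join(rng.choice(self.ALPHABET) for _ in range(window_size))
            for _ in range(n)
        ]


tiny = TinySampler(random_state=0)
same_seed = tiny.sample(seed=42) == tiny.sample(seed=42)
fallback = tiny.sample() == tiny.sample(seed=0)
print("same seed, same windows:", same_seed)
print("no seed falls back to random_state:", fallback)
```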
Synthetic outputs use entry_win = synth_{i} with a per-call counter, so concatenating multiple sample_synthetic outputs may collide on entry_win. Deduplicate on the window column instead.
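The collision and the suggested deduplication can be shown with pandas alone. The two frames below are hand-written stand-ins for two sample_synthetic() outputs, reduced to the two relevant columns; rebuilding entry_win at the end is one possible follow-up and an assumption of this sketch, not something the library is documented to do.

```python
import pandas as pd

# Hand-written stand-ins for two separate sample_synthetic() calls.
df_a = pd.DataFrame({
    "entry_win": ["synth_0", "synth_1"],
    "window": ["PGAATWPRM", "WTAVAREVM"],
})
df_b = pd.DataFrame({
    "entry_win": ["synth_0", "synth_1"],
    "window": ["WTAVAREVM", "GKADQPPIY"],
})

df_all = pd.concat([df_a, df_b], ignore_index=True)

# The per-call counter restarts at 0, so identifiers repeat across calls.
has_collisions = bool(df_all["entry_win"].duplicated().any())
print("entry_win collisions:", has_collisions)

# Deduplicate on the window content rather than on entry_win.
df_unique = df_all.drop_duplicates(subset="window").reset_index(drop=True)

# Optional: assign fresh identifiers so entry_win is unique again.
df_unique["entry_win"] = [f"synth_{i}" for i in range(len(df_unique))]
print(df_unique)
```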