obtain_samples

obtain_samples(df_seq, pos_col='pos', strategy='same_protein', n=None, window_size=9, reliable_negatives=False, df_feat=None, max_similarity_to_test=None, plot=False, seed=None, n_jobs=None, verbose=False)[source]

Obtain a balanced training set from a described sampling situation in one call.

The first golden pipeline of a typical workflow: instead of hand-driving AAWindowSampler (roles, strategies, per-call seeds) and, for positive-unlabeled problems, dPULearn, the parameters here simply describe the situation. The known positives are the rows of df_seq carrying 1-based anchors in pos_col; n reference windows are drawn to balance them according to strategy and concatenated into one labeled 'segments'-mode set, alongside a small balance / leakage / coverage report. A thin, stateless facade: it adds no sampling algorithm and its defaults are byte-identical to the equivalent explicit primitive calls.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame with an entry column of unique protein identifiers, a sequence column of full protein sequences, and a pos_col column of 1-based positive positions.
pos_col (str, default='pos') – Column with per-row 1-based positive (test) positions; list-like, scalar, or a comma/semicolon-separated string per row. Rows with empty / missing cells carry no positive.
strategy (str, default='same_protein') –
The sampling situation, mapped to one AAWindowSampler call:
- 'same_protein' — within-protein hard negatives (sample_same_protein()).
- 'different_protein' — cross-protein unlabeled windows (sample_different_protein()); the natural unlabeled pool U and the required situation for reliable_negatives.
- 'synthetic' — synthetic control windows (sample_synthetic()).
n (int, optional) – Number of reference/negative windows to draw. If None (default), it matches the number of positive windows, yielding a balanced set.
window_size (int, default=9) – Length of each sampled window in residues.
reliable_negatives (bool, default=False) – If True, engage the positive-unlabeled (PU) path: sample an unlabeled cross-protein pool and refine it into reliable negatives with dPULearn. Requires df_feat and strategy='different_protein'.
df_feat (pd.DataFrame, optional) – Feature DataFrame with a feature column. Required when reliable_negatives is True (used to build the feature matrix X for dPULearn); ignored otherwise. The features must be compatible with windows anchored as a length-window_size TMD with the default JMD flanks.
max_similarity_to_test (float in [0, 1], optional) – If set, drop sampled windows whose per-position identity to any test window exceeds this threshold (anti-leakage); passed through to AAWindowSampler.
plot (bool, default=False) – If True, draw a stacked sequence-logo comparison of the sampled groups — one probability logo per role (e.g. Test vs the references) with a bits info-content bar on top, labeled "<role> (n=…)" and using P-site x-axis notation — so their composition / conservation can be compared, and return its list of Axes pairs.
seed (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are reproducible.
n_jobs (int, optional) – Number of CPU cores (>=1) for building the feature matrix on the PU path. If None, the optimized number is used.
verbose (bool, default=False) – If True, verbose progress information is printed.

Returns:

df_samples (pd.DataFrame) – The balanced training set in 'segments' mode: positive test windows (label=1, role='Test') plus the sampled references (label=0), with full row provenance in the role / strategy columns.
ax (list of (matplotlib.axes.Axes, matplotlib.axes.Axes) or None) – Per-group (logo, info-bar) axes pairs of the sequence-logo comparison, one per role group, or None when plot=False.
df_eval (pd.DataFrame) – One-row balance / leakage / coverage report (positive / negative counts, balance ratio, number of source proteins, protein coverage, and maximum identity of any sampled window to a test window).

See also

AAWindowSampler for the window-sampling primitives this pipeline wraps.
dPULearn for the reliable-negative identification used by the PU path.
AALogoPlot for the sequence-logo comparison drawn when plot=True.

Examples

The aaanalysis.pipe (ap) module provides high-level golden pipelines — stateless, one-call wrappers over the AAanalysis primitives. ap.obtain_samples is the first pipeline of a typical workflow: instead of hand-driving :class:AAWindowSampler (and, for positive-unlabeled problems, :class:dPULearn), the parameters simply describe the sampling situation. The known positives are the rows of df_seq carrying 1-based anchors in pos_col; reference windows are drawn to balance them and returned as one labeled 'segments'-mode training set, alongside a quick balance / leakage / coverage report.

We load the DOM_GSEC example dataset (see [Breimann25]) and mark a positive anchor at the center of every membrane-protein (label == 1); the remaining proteins carry no positive and form the cross-protein unlabeled pool:

import matplotlib.pyplot as plt
import numpy as np
import aaanalysis as aa
import aaanalysis.pipe as ap
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC", n=20)[["entry", "sequence", "label"]].copy()
df_seq["pos"] = (df_seq["sequence"].str.len() // 2).astype(int)
df_seq.loc[df_seq["label"] == 0, "pos"] = np.nan   # proteins without a positive = unlabeled pool

aa.display_df(df_seq, n_rows=10, show_shape=True)

DataFrame shape: (40, 4)

	entry	sequence	pos
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	nan
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	nan
3	Q969W9	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	nan
4	P53801	MAPGVARGPTPYWRL...GLFKEENPYARFENN	nan
5	Q8IUW5	MAPRALPGSAVLAAA...EVPATPVKRERSGTE	nan
6	P01135	MVPSAGQLALFALGI...LLKGRTACCHSETVV	nan
7	O43914	MGGLEPCSRLLLLPL...SDVYSDLNTQRPYYK	nan
8	P05556	MNLQPIFWIGLISSV...KSAVTTVVNPKYEGK	nan
9	P16234	MGTSHPAFLVLGCLL...DIGIDSSDLVEDSFL	nan
10	P50895	MEPPDAPAQARGAPR...SGGARGGSGGFGDEC	nan

Situation 1 — within-protein hard negatives (``strategy=’same_protein’``). A single call returns the balanced training set df_samples (positive Test windows plus sampled Negative windows), an optional summary plot, and the validation report df_eval. n sets the number of references (default: match the positives); window_size sets the window length; seed makes it reproducible:

df_samples, ax, df_eval = ap.obtain_samples(df_seq, strategy="same_protein",
                                             n=20, window_size=9, seed=42)

aa.display_df(df_samples, n_rows=10, show_shape=True)

DataFrame shape: (40, 8)

	entry_win	entry	sequence	window	source_position	label	role	strategy
1	P05067_381-389	P05067	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	TPGDENEHA	385	1	Test	test
2	P14925_484-492	P14925	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	QQPGEGPWE	488	1	Test	test
3	P70180_264-272	P70180	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	RIMLAVHRH	268	1	Test	test
4	Q03157_323-331	Q03157	MGPTSPAARGQGRRW...HGYENPTYRFLEERP	ERRMRQINE	327	1	Test	test
5	Q06481_377-385	Q06481	MAATGTAAAAATGRL...GYENPTYKYLEQMQI	YFETSADDN	381	1	Test	test
6	P35613_188-196	P35613	MAAALFVLLGFALLG...HQNDKGKNVRQRNSS	TEFKVDSDD	192	1	Test	test
7	P35070_85-93	P35070	MDRAARCSGASSLPL...DITPINEDIEETNIA	VVAEQTPSC	89	1	Test	test
8	P09803_438-446	P09803	MGARCRSFSALLLLL...RFKKLADMYGGGEDD	LKTAKGLDF	442	1	Test	test
9	P19022_449-457	P19022	MCRIAGALRTLLPLL...PRFKKLADMYGGGDD	PIDFETNRM	453	1	Test	test
10	P16070_367-375	P16070	MDKFWWHAAWGLCLV...DETRNLQNVDMKIGV	LIHHEHHEE	371	1	Test	test

The df_eval report summarizes class balance, the number of source proteins, protein coverage, and the maximum identity of any sampled window to a test window (the leakage indicator):

aa.display_df(df_eval, n_rows=10, show_shape=True)

DataFrame shape: (1, 6)

	n_positive	n_negative	balance_ratio	n_source_proteins	protein_coverage	max_similarity_to_test
1	20	20	1.000000	20	0.500000	0.333333

Set plot=True to also get a sequence-logo comparison of the sampled groups as the second return value — one information-content logo per role (e.g. Test vs Negative), so you can directly compare their amino-acid composition and conservation. It returns a list of Axes (None when plot=False):

aa.plot_settings()
df_samples, ax, df_eval = ap.obtain_samples(df_seq, strategy="same_protein",
                                             n=20, window_size=9, plot=True, seed=42)
plt.tight_layout()
plt.show()

../_images/ap_obtain_samples_1_output_7_0.png

Situation 2 — cross-protein unlabeled windows (``strategy=’different_protein’``) and synthetic controls (``strategy=’synthetic’``). max_similarity_to_test drops sampled windows too similar to any test window (anti-leakage); verbose and n_jobs are threaded through:

df_diff, _, df_eval_diff = ap.obtain_samples(df_seq, strategy="different_protein",
                                              n=20, window_size=9,
                                              max_similarity_to_test=0.7,
                                              seed=42, verbose=False, n_jobs=1)
df_synth, _, _ = ap.obtain_samples(df_seq, strategy="synthetic", n=20, window_size=9, seed=42)

aa.display_df(df_eval_diff, n_rows=10, show_shape=True)

DataFrame shape: (1, 6)

	n_positive	n_negative	balance_ratio	n_source_proteins	protein_coverage	max_similarity_to_test
1	20	20	1.000000	15	0.375000	0.333333

Reliable negatives (positive-unlabeled path). Set reliable_negatives=True to refine the cross-protein unlabeled pool into reliable negatives with :class:dPULearn. This path requires a feature set df_feat (used to build the feature matrix X) and strategy='different_protein' — it raises a clear error otherwise:

df_feat = aa.load_features().head(8)
df_rel, _, df_eval_rel = ap.obtain_samples(df_seq, strategy="different_protein",
                                            reliable_negatives=True, df_feat=df_feat,
                                            n=10, window_size=9, seed=42, n_jobs=1)

aa.display_df(df_eval_rel, n_rows=10, show_shape=True)

DataFrame shape: (1, 6)

/Users/stephanbreimann/Programming/1Packages/aa-wt-obtain-samples-logo/aaanalysis/pipe/_obtain_samples.py:134: UserWarning: 4 window(s) lack the flanking residues needed to build the feature matrix and are skipped on the reliable-negatives path; widen the sequences or lower 'window_size' to keep them.
  warnings.warn(f"{n_dropped} window(s) lack the flanking residues needed to build the "
/Users/stephanbreimann/Programming/1Packages/aa-wt-obtain-samples-logo/aaanalysis/feature_engineering/_backend/cpp_run.py:163: UserWarning: CPP is using the Python kernel fallback — the compiled Cython extension is not available in this install. Output is bit-exact with the Cython path but ~2x slower. Reinstall via pip install --force-reinstall aaanalysis to fetch a prebuilt wheel.
  warnings.warn(

	n_positive	n_negative	balance_ratio	n_source_proteins	protein_coverage	max_similarity_to_test
1	20	10	0.500000	7	0.175000	0.333333

Passing pos_col selects a differently named anchor column. The returned df_samples is a 'segments'-mode df_seq ready to feed the next pipeline (e.g. ap.find_features) — the positives and sampled references carry full provenance in the label / role / strategy columns.