AAWindowSampler.sample_same_protein

AAWindowSampler.sample_same_protein(df_seq, n=100, window_size=9, pos_col='pos', min_distance_to_pos=None, max_distance_to_pos=None, label_test=1, label_ref=0, role='Negative', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, motif_pwm=None, motif_score_threshold=None, motif_match='in', seed=None)

Sample windows from proteins that contain at least one test position.

Draws up to n reference windows from the same proteins that carry a labeled test position, making it the natural source for within-protein hard negatives. Windows are distributed roughly uniformly across eligible proteins and filtered by the similarity thresholds set on AAWindowSampler. Complement this method with sample_different_protein() when an unlabeled cross-protein pool is also needed.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum total number of sampled windows across all eligible proteins. n is split roughly uniformly across eligible source proteins (each protein gets about n / n_proteins windows); shortfalls from proteins with small candidate pools are redistributed round-robin. Fewer than n windows are returned (with a warning) if the eligible space cannot supply n.
window_size (int, default=9) – Length of each sampled window in residues.
pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.
min_distance_to_pos (int, optional) – Minimum residue distance from the nearest positive on the same protein (a sampled P1 anchor c is admitted only if min(|c - p| for p in positives) >= min_distance_to_pos). None (default) drops this lower bound — sampled windows are allowed to overlap positive windows.
max_distance_to_pos (int, optional) – Maximum residue distance from the nearest positive on the same protein (a sampled P1 anchor c is admitted only if min(|c - p| for p in positives) <= max_distance_to_pos). None (default) drops this upper bound — sampled windows may sit anywhere on the protein.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled reference positions / rows.
role (str, default='Negative') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') –
Output schema (see Notes):
- 'segments': one row per sampled window with schema [entry_win, entry, sequence, window, source_position, label, role, strategy].
- 'sequences': one row per source protein with a per-residue labels list (label_test at positives, label_ref at sampled positions, None elsewhere).
aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.
context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.
context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.
motif_pwm (pd.DataFrame, optional) – Position Weight Matrix (PWM) of shape (window_size, 20) whose columns are the 20 canonical amino acid (AA) letters in any order (reindexed internally to ut.LIST_CANONICAL_AA). Required together with motif_score_threshold when motif filtering is desired.
motif_score_threshold (float, optional) – PWM score threshold; required when motif_pwm is set.
motif_match ({'in', 'out'}, default='in') – 'in' keeps windows with score >= threshold; 'out' keeps the rest.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.
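The quota split described for n can be pictured with a small sketch. This is an illustration in plain Python, not the library's implementation: split_quota and pool_sizes are hypothetical names, and the exact redistribution order inside AAWindowSampler is an assumption.

```python
def split_quota(n, pool_sizes):
    """Split n windows across proteins (hypothetical helper).

    pool_sizes maps entry -> number of candidate anchors on that protein.
    """
    entries = list(pool_sizes)
    quota = {entry: 0 for entry in entries}
    if not entries:
        return quota
    # Uniform base share, capped by what each protein can supply
    base = n // len(entries)
    for entry in entries:
        quota[entry] = min(base, pool_sizes[entry])
    remaining = n - sum(quota.values())
    # Round-robin: hand out one window at a time to proteins with spare candidates
    while remaining > 0:
        progressed = False
        for entry in entries:
            if remaining == 0:
                break
            if quota[entry] < pool_sizes[entry]:
                quota[entry] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            break  # eligible space exhausted: fewer than n windows are returned
    return quota


print(split_quota(9, {"P1": 30, "P2": 34}))   # {'P1': 5, 'P2': 4}
print(split_quota(10, {"P1": 2, "P2": 34}))   # {'P1': 2, 'P2': 8}
print(split_quota(10, {"P1": 2, "P2": 3}))    # {'P1': 2, 'P2': 3}
```

The last call shows the shortfall case: both pools are exhausted, so only five windows are available.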

Returns:

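df_seq_out – Sampled output. With output_mode='segments', one row per sampled window with the columns entry_win, entry, sequence, window, source_position, label, role, and strategy; with output_mode='sequences', one row per source protein with a per-residue labels list.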

Return type:

pd.DataFrame

Notes

Each row of df_seq whose pos_col cell is a non-empty list / tuple / array of 1-based integer positions is a “positive” row; rows with empty / None / NaN cells are skipped. Sampled windows are drawn from the same proteins as the positives; the positive windows themselves drive the max_similarity_to_test filter. The (min_distance_to_pos, max_distance_to_pos) band is exposed only on this method; sample_different_protein() and sample_motif_matched() sample from proteins with no listed positives, so the band has nothing to act on.

With the default None / None band, sampled centers can sit directly on or adjacent to positive anchors, producing windows that overlap positive windows by up to window_size - 1 residues. For hard-negative-style sampling that excludes positional overlap, set min_distance_to_pos=window_size; to constrain sampled windows to a defined neighborhood of positives (e.g. local hard negatives), pair with a finite max_distance_to_pos. Content-level overlap is controlled separately by max_similarity_to_test.
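The admission rule for the distance band can be written out directly from the parameter descriptions. The sketch below is a standalone illustration (in_band is a hypothetical helper, not part of aaanalysis) and ignores sequence-boundary constraints on where a full window fits.

```python
def in_band(c, positives, min_distance_to_pos=None, max_distance_to_pos=None):
    """Return True if anchor c lies inside the distance band (hypothetical helper)."""
    nearest = min(abs(c - p) for p in positives)
    if min_distance_to_pos is not None and nearest < min_distance_to_pos:
        return False
    if max_distance_to_pos is not None and nearest > max_distance_to_pos:
        return False
    return True


positives = [5, 25]  # 1-based positives on one protein
# Default None / None band: every anchor is admitted, including the positives
print(in_band(5, positives))                              # True
# min_distance_to_pos=window_size excludes positional overlap for window_size=5
print(in_band(7, positives, min_distance_to_pos=5))       # False
# A finite upper bound keeps anchors in the neighborhood of a positive
band = [c for c in range(1, 41) if in_band(c, positives, 2, 6)]
print(all(c in band for c in (8, 9, 27)), 12 in band)     # True False
```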

Protein iteration order is randomized under the seed; output is independent of df_seq row order.

output_mode='segments' returns one row per sampled window with schema [entry_win, entry, sequence, window, source_position, label, role, strategy]. output_mode='sequences' returns one row per source protein with a labels list of length len(sequence) carrying label_test at positives, label_ref at sampled positions, and None elsewhere.

Examples

Draw fixed-length amino-acid windows from proteins that contain at least one labeled position. The sampled windows are commonly used as negative training rows alongside the positives (PU-learning, hard-negative mining). Positions in pos_col are interpreted as P1-style anchors.

import aaanalysis as aa
import pandas as pd
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2"],
    "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2,
                 "YWVTSRQPNMLKIHGFEDCA" * 2],
    "pos":      [[5, 25], [15]],
})
aaws = aa.AAWindowSampler(random_state=0)

A first call shows the default segments schema. Each row is one sampled window with the eight-column segments schema; entry_win = <entry>_<start>-<end> (1-based inclusive) is globally unique by construction, and source_position is the 1-based P1 anchor that drove the sample.

df = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                 n=6, window_size=5,
                                 min_distance_to_pos=2, seed=0)
aa.display_df(df=df, show_shape=True)

DataFrame shape: (6, 8)

	entry_win	entry	sequence	window	source_position	role	strategy
1	P1_6-10	P1	ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY	GHIKL	8	Negative	same_protein
2	P1_15-19	P1	ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY	RSTVW	17	Negative	same_protein
3	P1_33-37	P1	ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY	PQRST	35	Negative	same_protein
4	P2_20-24	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	AYWVT	22	Negative	same_protein
5	P2_36-40	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	FEDCA	38	Negative	same_protein
6	P2_32-36	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	KIHGF	34	Negative	same_protein
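The relation between source_position, window, and entry_win in the rows above can be checked by hand. The sketch below assumes the anchor sits at the center of an odd-length window, which matches the displayed rows; window_from_anchor is a hypothetical helper, and the behavior for even window sizes is not covered here.

```python
def window_from_anchor(entry, sequence, anchor, window_size):
    """Rebuild (entry_win, window) from a 1-based centered anchor (hypothetical helper)."""
    half = window_size // 2
    start = anchor - half            # 1-based, inclusive
    end = start + window_size - 1    # 1-based, inclusive
    if start < 1 or end > len(sequence):
        raise ValueError("window does not fit inside the sequence")
    window = sequence[start - 1:end]  # convert to 0-based slicing
    return f"{entry}_{start}-{end}", window


seq_p1 = "ACDEFGHIKLMNPQRSTVWY" * 2
print(window_from_anchor("P1", seq_p1, anchor=8, window_size=5))   # ('P1_6-10', 'GHIKL')
print(window_from_anchor("P1", seq_p1, anchor=35, window_size=5))  # ('P1_33-37', 'PQRST')
```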

pos_col identifies the positive rows: cells holding a list/tuple/array of 1-based positions. Rows with empty / None / NaN cells are skipped. Sampled windows come from the same proteins as the positives; how far their anchors sit from those positives is controlled by min_distance_to_pos and max_distance_to_pos.

n sets the total number of sampled windows across all eligible proteins, and window_size sets the residue length of each window:

df = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                 n=9, window_size=7, seed=0)
aa.display_df(df=df[["entry_win", "source_position", "window"]], show_shape=True)

DataFrame shape: (9, 3)

	entry_win	source_position	window
1	P1_5-11	8	FGHIKLM
2	P1_34-40	37	QRSTVWY
3	P1_31-37	34	MNPQRST
4	P1_3-9	6	DEFGHIK
5	P1_4-10	7	EFGHIKL
6	P2_17-23	20	EDCAYWV
7	P2_28-34	31	PNMLKIH
8	P2_21-27	24	YWVTSRQ
9	P2_9-15	12	NMLKIHG

min_distance_to_pos sets the minimum residue distance between a sampled window’s anchor and any positive on the same protein. Larger values push samples further away from the labeled positions:

df = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                 n=6, window_size=5,
                                 min_distance_to_pos=8, seed=0)
aa.display_df(df=df[["entry", "source_position", "window"]], show_shape=True)

DataFrame shape: (6, 3)

	entry	source_position	window
1	P1	17	RSTVW
2	P1	34	NPQRS
3	P1	35	PQRST
4	P2	37	GFEDC
5	P2	7	SRQPN
6	P2	28	RQPNM

Two output schemas:

'segments' (default) — one row per sampled window.
'sequences' — one row per source protein, with a per-residue labels list of length len(sequence) carrying label_test at known positives, label_ref at sampled positions, and None elsewhere. Ready as the target vector for a sliding-window classifier.

df = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                 n=3, window_size=5,
                                 output_mode="sequences", seed=0)
aa.display_df(df=df, show_shape=True)

DataFrame shape: (2, 3)

	entry	sequence	labels
1	P1	ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY	[None, None, None, None, 1, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 1, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None]
2	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	[None, None, None, None, None, None, None, None, None, None, None, None, None, None, 1, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
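The labels list for the P1 row above can be rebuilt from the positives and the sampled anchors. This is a minimal sketch with a hypothetical helper build_labels; writing positives last, so they win if a sampled anchor coincides with one, is an assumption rather than documented behavior.

```python
def build_labels(sequence, positives, sampled, label_test=1, label_ref=0):
    """Per-residue labels: label_test at positives, label_ref at sampled anchors, None elsewhere."""
    labels = [None] * len(sequence)
    for pos in sampled:
        labels[pos - 1] = label_ref   # positions are 1-based
    for pos in positives:
        labels[pos - 1] = label_test  # assumption: positives take precedence
    return labels


seq_p1 = "ACDEFGHIKLMNPQRSTVWY" * 2
labels_p1 = build_labels(seq_p1, positives=[5, 25], sampled=[7, 37])
print(len(labels_p1))                                              # 40
print([i + 1 for i, v in enumerate(labels_p1) if v is not None])   # [5, 7, 25, 37]
```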

Semantic tags. Defaults assume PU-learning / hard-negative mining: role='Negative', label_test=1 (applied to positives in output_mode='sequences'), label_ref=0 (applied to sampled reference rows / positions):

df = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                 n=3, window_size=5,
                                 role="background", label_ref=-1, seed=0)
aa.display_df(df=df[["entry_win", "role", "label"]], show_shape=True)

DataFrame shape: (3, 3)

	entry_win	role	label
1	P1_5-9	background	-1
2	P1_35-39	background	-1
3	P2_17-21	background	-1

Filter eligible residues by per-position annotation (e.g. topology, disorder, secondary structure). aa_context_col is a df_seq column whose cells are strings (or sequences) of single-character tags with one tag per residue. context_in whitelists tag values; context_out blacklists them. The three are validated jointly: providing context_in / context_out without aa_context_col raises an error.

df_seq_topo = df_seq.assign(topo=["MMMMMMMMMMTTTTTTTTTT" * 2,
                                    "TTTTTTTTTTMMMMMMMMMM" * 2])
df = aaws.sample_same_protein(df_seq=df_seq_topo, pos_col="pos",
                                 n=6, window_size=3,
                                 aa_context_col="topo", context_in="T", seed=0)
aa.display_df(df=df[["entry", "source_position", "window"]], show_shape=True)

DataFrame shape: (6, 3)

	entry	source_position	window
1	P1	13	NPQ
2	P1	31	LMN
3	P1	14	PQR
4	P2	6	SRQ
5	P2	7	RQP
6	P2	30	NML
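The effect of context_in and context_out on the candidate pool can be sketched as a filter over per-residue tags. eligible_anchors is a hypothetical helper; applying the tag test to the anchor residue only (not to every residue of the window) is an assumption, although it is consistent with the rows above, where the window around anchor 31 on P1 starts on an 'M'-tagged residue.

```python
def eligible_anchors(tags, context_in=None, context_out=None):
    """Return 1-based positions whose tag passes the whitelist / blacklist (hypothetical helper)."""
    def as_set(value):
        if value is None:
            return None
        return {value} if isinstance(value, str) else set(value)

    allow, deny = as_set(context_in), as_set(context_out)
    keep = []
    for pos, tag in enumerate(tags, start=1):
        if allow is not None and tag not in allow:
            continue
        if deny is not None and tag in deny:
            continue
        keep.append(pos)
    return keep


topo_p1 = "MMMMMMMMMMTTTTTTTTTT" * 2
pool = eligible_anchors(topo_p1, context_in="T")
print(pool[:3], pool[-3:])                                # [11, 12, 13] [38, 39, 40]
# With only two tag values, whitelisting 'T' equals blacklisting 'M'
print(pool == eligible_anchors(topo_p1, context_out="M"))  # True
```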

Optional PWM-based filter on the candidate pool. motif_pwm is a position-weight matrix of shape (window_size, 20):

pd.DataFrame (preferred — safer) — columns are the 20 canonical amino acids in any order; reindexed internally.
np.ndarray — columns must be in alphabetical order (ACDEFGHIKLMNPQRSTVWY); the validator cannot detect a wrong order.

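motif_score_threshold is the score cutoff (sum of per-position values). motif_match='in' keeps windows with score >= threshold; 'out' keeps the rest. All three are validated jointly. In the example below no 5-residue window contains two A residues, so the threshold of 2.0 removes every candidate and the result is empty: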

aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
pwm = pd.DataFrame(0.0, index=range(5), columns=aa_cols)
pwm["A"] = 1.0  # PWM preferring A at every position
df = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                 n=6, window_size=5,
                                 motif_pwm=pwm, motif_score_threshold=2.0,
                                 motif_match="in", seed=0)
aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (0, 2)

	entry_win	window
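The PWM score is described as the sum of per-position values, which can be sketched without the library. pwm_score is a hypothetical helper and the PWM is represented here as a list of per-position dicts rather than a DataFrame; the point is only to show the best achievable score for the toy sequences under the A-preferring matrix.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"
WINDOW_SIZE = 5


def pwm_score(window, pwm):
    """Sum the PWM value of each residue at its window position (hypothetical helper)."""
    return sum(pwm[i][aa] for i, aa in enumerate(window))


# Same toy matrix as above: 1.0 for A at every position, 0.0 otherwise
pwm = [{aa: (1.0 if aa == "A" else 0.0) for aa in AA} for _ in range(WINDOW_SIZE)]
sequences = {
    "P1": "ACDEFGHIKLMNPQRSTVWY" * 2,
    "P2": "YWVTSRQPNMLKIHGFEDCA" * 2,
}
best = max(
    pwm_score(seq[i:i + WINDOW_SIZE], pwm)
    for seq in sequences.values()
    for i in range(len(seq) - WINDOW_SIZE + 1)
)
print(best)  # 1.0
```

Under these assumptions, lowering motif_score_threshold to 1.0 would admit windows that contain an A, while motif_match='out' at 2.0 would keep every window.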

Constructor-level knobs on AAWindowSampler itself (shared across all four sampling methods):

max_similarity_to_test — drop sampled windows whose per-position identity to any test window (drawn from the positives) exceeds the threshold. Anti-leakage filter.
max_similarity_within_ref — greedily drop sampled windows whose per-position identity to a previously kept sampled window exceeds the threshold. Redundancy reduction.

Set them at construction; iterative re-draw is governed by filter_iteratively (default True) and capped by max_sampling_attempts (default 10):

aaws_strict = aa.AAWindowSampler(random_state=0,
                            max_similarity_to_test=0.6,
                            max_similarity_within_ref=0.8)
df = aaws_strict.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                n=6, window_size=5,
                                min_distance_to_pos=2, seed=0)
aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (6, 2)

	entry_win	window
1	P1_6-10	GHIKL
2	P1_15-19	RSTVW
3	P1_33-37	PQRST
4	P2_20-24	AYWVT
5	P2_36-40	FEDCA
6	P2_32-36	KIHGF
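The two similarity filters can be sketched as a single greedy pass. identity and filter_windows are hypothetical helpers, per-position identity is taken as the fraction of matching positions (an assumption about the library's exact definition), and the test windows are assumed to be centered on the positives in the same way as the sampled windows.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length windows agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def filter_windows(candidates, test_windows, max_to_test=None, max_within_ref=None):
    """Greedy filter: drop windows too similar to a test window or to a kept window."""
    kept = []
    for window in candidates:
        if max_to_test is not None and any(
            identity(window, t) > max_to_test for t in test_windows
        ):
            continue  # anti-leakage filter
        if max_within_ref is not None and any(
            identity(window, k) > max_within_ref for k in kept
        ):
            continue  # redundancy reduction
        kept.append(window)
    return kept


test_windows = ["DEFGH", "IHGFE"]  # windows around positives 5/25 (P1) and 15 (P2)
candidates = ["GHIKL", "RSTVW", "PQRST", "AYWVT", "FEDCA", "KIHGF"]
print(filter_windows(candidates, test_windows, 0.6, 0.8) == candidates)        # True
print(filter_windows(["DEFGK", "GHIKL", "GHIKL"], test_windows, 0.6, 0.8))     # ['GHIKL']
```

The first call suggests why the strict sampler returns the same six windows as the unfiltered call: none of them shares more than one position with a test window. In the second call, DEFGK matches DEFGH at four of five positions (0.8 > 0.6) and the repeated GHIKL is dropped as redundant.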

seed sets a per-call seed and falls back to the class-level random_state set at construction. Protein iteration order is randomized under the seed, so output depends only on df_seq content + seed, not on row order.

df_a = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                   n=6, window_size=5, seed=7)
df_b = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                   n=6, window_size=5, seed=7)
print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))

deterministic: True
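One way to obtain a seeded order that does not depend on row order is to sort the entries into a canonical order before shuffling. The sketch below illustrates that idea with the standard library; iteration_order is a hypothetical helper, and whether AAWindowSampler uses this exact mechanism is an assumption.

```python
import random


def iteration_order(entries, seed):
    """Seeded shuffle of protein entries that ignores the input order (hypothetical helper)."""
    rng = random.Random(seed)
    order = sorted(entries)  # canonical order first, so row order is irrelevant
    rng.shuffle(order)
    return order


a = iteration_order(["P1", "P2", "P3"], seed=7)
b = iteration_order(["P3", "P1", "P2"], seed=7)
print(a == b)  # True
```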

max_distance_to_pos caps the upper distance band — sampled anchors are kept only within [min_distance_to_pos, max_distance_to_pos] residues of the nearest positive, constraining windows to a defined neighborhood of the labeled sites. context_out blacklists residues by their per-position aa_context_col tag (the complement of context_in). label_test is the label written at known positives when output_mode='sequences'.

# max_distance_to_pos: keep only anchors within [2, 6] residues of a positive
df_band = aaws.sample_same_protein(df_seq=df_seq, pos_col="pos",
                                   n=6, window_size=5,
                                   min_distance_to_pos=2, max_distance_to_pos=6,
                                   seed=0)
aa.display_df(df=df_band[["entry", "source_position", "window"]], show_shape=True)

DataFrame shape: (6, 3)

	entry	source_position	window
1	P1	8	GHIKL
2	P1	27	FGHIK
3	P1	9	HIKLM
4	P2	17	GFEDC
5	P2	13	LKIHG
6	P2	21	CAYWV

# context_out: blacklist 'M'-tagged residues; label_test: positive label in 'sequences' mode
df_seq_topo = df_seq.assign(topo=["MMMMMMMMMMTTTTTTTTTT" * 2,
                                  "TTTTTTTTTTMMMMMMMMMM" * 2])
df_ctx = aaws.sample_same_protein(df_seq=df_seq_topo, pos_col="pos",
                                  n=6, window_size=3,
                                  aa_context_col="topo", context_out="M",
                                  label_test=1, label_ref=0,
                                  output_mode="sequences", seed=0)
aa.display_df(df=df_ctx, show_shape=True)

DataFrame shape: (2, 3)

	entry	sequence	labels
1	P1	ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY	[None, None, None, None, 1, None, None, None, None, None, None, None, 0, 0, None, None, None, None, None, None, None, None, None, None, 1, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None]
2	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	[None, None, None, None, None, 0, 0, None, None, None, None, None, None, None, 1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None]