AAWindowSampler.sample_different_protein

AAWindowSampler.sample_different_protein(df_seq, n=100, window_size=9, pos_col='pos', candidate_proteins=None, label_test=1, label_ref=0, role='Unlabeled', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, motif_pwm=None, motif_score_threshold=None, motif_match='in', seed=None)

Sample windows from proteins outside the test set (proteins with no test positions).

Draws up to n reference windows exclusively from proteins that carry no labeled positive positions, making them naturally unlabeled candidates for positive-unlabeled learning [ElkanNoto08], [BekkerDavis20]. Use this method alongside sample_same_protein() to build a combined reference pool that covers both within- and cross-protein negatives.

Added in version 1.1.0.

Parameters:

df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum total number of sampled windows. Fewer are returned (with a warning) if the eligible space cannot supply n windows.
window_size (int, default=9) – Length of each sampled window in residues.
pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.
candidate_proteins (list of str, optional) – Restrict the candidate pool to these entries.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled positions / rows.
role (str, default='Unlabeled') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') – Output schema. See Notes.
aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.
context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.
context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.
motif_pwm (pd.DataFrame, optional) – Position Weight Matrix (PWM) of shape (window_size, 20) whose columns are the 20 canonical amino acid (AA) letters in any order (reindexed internally to ut.LIST_CANONICAL_AA).
motif_score_threshold (float, optional) – PWM score threshold; required when motif_pwm is set.
motif_match ({'in', 'out'}, default='in') – 'in' keeps windows with score >= threshold; 'out' keeps the rest.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.

Returns:

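df_seq_out – Sampling result. With output_mode='segments', one row per sampled window with the columns entry_win, entry, sequence, window, source_position, label, role, and strategy; with output_mode='sequences', one row per protein with entry, sequence, and labels columns. See Notes.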

Return type:

pd.DataFrame

Notes

df_seq plays a dual role: rows whose pos_col cell is a non-empty list / tuple / array of 1-based positions are positive rows — they are excluded from the candidate pool and contribute their windows only to the max_similarity_to_test filter. Rows with empty / None / NaN cells form the candidate pool from which the returned windows are drawn.
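
The split described in this note can be sketched in plain Python. The helper has_positions is hypothetical and only mirrors the rule stated here (non-empty list / tuple / array versus empty / None / NaN); it is not the library's internal code.

```python
import math

def has_positions(cell):
    """Return True if a pos_col cell holds at least one 1-based position."""
    if cell is None:
        return False
    if isinstance(cell, float) and math.isnan(cell):
        return False
    try:
        return len(cell) > 0
    except TypeError:
        # Other scalars are not covered by the rule above; this sketch
        # treats them as "no positions" (an assumption).
        return False

# Toy (entry, pos_col cell) pairs covering each case named in the note
rows = [("P1", [5]), ("P2", []), ("P3", None), ("P4", float("nan"))]

positive_entries = [entry for entry, cell in rows if has_positions(cell)]
candidate_entries = [entry for entry, cell in rows if not has_positions(cell)]
print(positive_entries)   # ['P1']
print(candidate_entries)  # ['P2', 'P3', 'P4']
```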

output_mode='segments' returns one row per sampled window with schema [entry_win, entry, sequence, window, source_position, label, role, strategy]; output_mode='sequences' returns one row per protein with a per-residue labels list carrying label_test at the positives of positive proteins, label_ref at sampled positions in candidate proteins, and None elsewhere — a single mergeable per-residue label vector across calls.
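
Because every call in 'sequences' mode labels the same residues of the same protein, two label vectors can be combined position by position. A minimal sketch, assuming a hypothetical helper merge_labels and a "first non-None value wins" policy; the source does not say how conflicting labels should be resolved.

```python
def merge_labels(labels_a, labels_b):
    """Merge two per-residue label lists of equal length.

    Policy (an assumption for this sketch): keep the label from labels_a
    where it is set, otherwise take the one from labels_b.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label vectors must describe the same sequence")
    return [a if a is not None else b for a, b in zip(labels_a, labels_b)]

# Toy vectors for one 10-residue protein (illustrative values only)
call_1 = [None, None, 0, None, None, None, None, None, None, None]
call_2 = [None, None, None, None, None, None, 0, None, None, None]
merged = merge_labels(call_1, call_2)
print(merged)
```

Residues labeled in neither call stay None, so the merged vector can be merged again with the result of a further call.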

Examples

Draw windows from proteins outside the labeled set — naturally suited as the unlabeled pool U in positive-unlabeled (PU) learning. Rows of df_seq whose pos_col cell is empty form the candidate pool; rows with positives are excluded from sampling but still contribute their windows to the anti-leakage filter.

import aaanalysis as aa
import pandas as pd
aa.options["verbose"] = False

df_seq = pd.DataFrame({
    "entry":    ["P1", "P2", "P3", "P4"],
    "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2,
                 "YWVTSRQPNMLKIHGFEDCA" * 2,
                 "MNPQRSTVWYACDEFGHIKL" * 2,
                 "GHIKLMNPQRSTVWYACDEF" * 2],
    "pos":      [[5], [], [], []],
})
aaws = aa.AAWindowSampler(random_state=0)

The first call shows the default schema. P1 is excluded from the candidate pool because it has a labeled position; the remaining proteins form the eligible pool. The eight-column segments schema is shared with the other sampling methods.

df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                      n=8, window_size=5, seed=0)
aa.display_df(df=df, show_shape=True)

DataFrame shape: (8, 8)

	entry_win	entry	sequence	window	source_position	label	role	strategy
1	P3_20-24	P3	MNPQRSTVWYACDEF...STVWYACDEFGHIKL	LMNPQ	22	0	Unlabeled	different_protein
2	P4_17-21	P4	GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF	CDEFG	19	0	Unlabeled	different_protein
3	P2_28-32	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	PNMLK	30	0	Unlabeled	different_protein
4	P2_6-10	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	RQPNM	8	0	Unlabeled	different_protein
5	P2_35-39	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	GFEDC	37	0	Unlabeled	different_protein
6	P2_14-18	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	HGFED	16	0	Unlabeled	different_protein
7	P4_28-32	P4	GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF	PQRST	30	0	Unlabeled	different_protein
8	P4_18-22	P4	GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF	DEFGH	20	0	Unlabeled	different_protein
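
The rows above are consistent with 1-based, inclusive coordinates in entry_win and, for this odd window_size, a source_position at the window center. The check below reproduces the first row by hand. parse_entry_win is a hypothetical helper, and the center rule is inferred from the displayed output rather than taken from documented behavior.

```python
def parse_entry_win(entry_win):
    """Split 'P3_20-24' into ('P3', 20, 24) with 1-based inclusive bounds."""
    entry, span = entry_win.rsplit("_", 1)
    start, end = (int(x) for x in span.split("-"))
    return entry, start, end

sequences = {"P3": "MNPQRSTVWYACDEFGHIKL" * 2}

entry, start, end = parse_entry_win("P3_20-24")
window = sequences[entry][start - 1:end]  # 1-based inclusive -> 0-based slice
center = (start + end) // 2               # middle residue for odd lengths
print(entry, window, center)              # P3 LMNPQ 22
```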

Because rows with non-empty pos_col cells are positive rows (excluded from the pool and used only by the anti-leakage filter), labeled proteins never appear in the output. This separation is what makes the sampled windows come from proteins different from the test set:

# Verify P1 (the only labeled protein) is never in the output
df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                      n=10, window_size=5, seed=0)
print("entries in output:", sorted(set(df["entry"])))

entries in output: ['P2', 'P3', 'P4']

n is the target total number of sampled windows across the entire candidate pool (not per protein); window_size is the residue length per window:

df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                      n=4, window_size=8, seed=0)
aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (4, 2)

	entry_win	window
1	P4_16-23	ACDEFGHI
2	P3_4-11	QRSTVWYA
3	P2_21-28	YWVTSRQP
4	P2_6-13	RQPNMLKI
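
Since n caps the total across the whole pool, a rough upper bound on how many windows can be returned is the number of contiguous windows in the candidate sequences. The sketch below counts them for the toy data. max_windows is a hypothetical helper, and the library's actual eligible space may be smaller once context, motif, or similarity filters apply.

```python
def max_windows(sequence, window_size):
    """Number of distinct contiguous windows of length window_size."""
    return max(len(sequence) - window_size + 1, 0)

# Candidate pool of the running example (P1 is a positive row)
candidates = {
    "P2": "YWVTSRQPNMLKIHGFEDCA" * 2,
    "P3": "MNPQRSTVWYACDEFGHIKL" * 2,
    "P4": "GHIKLMNPQRSTVWYACDEF" * 2,
}
window_size = 8
upper_bound = sum(max_windows(seq, window_size) for seq in candidates.values())
print(upper_bound)          # 99 (33 windows per 40-residue protein)

n = 4
print(min(n, upper_bound))  # 4: the request fits within the bound
```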

Restrict the candidate pool to an explicit subset of df_seq entries:

df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                      n=5, window_size=5,
                                      candidate_proteins=["P3", "P4"], seed=0)
print("entries in output:", sorted(set(df["entry"])))

entries in output: ['P3', 'P4']

Two output schemas:

'segments' (default) — one row per sampled window.
'sequences' — one row per source protein with a per-residue labels list (label_test at positives of positive proteins, label_ref at sampled positions in candidate proteins, None elsewhere) — a single mergeable per-residue label vector across calls.

df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                      n=3, window_size=5,
                                      output_mode="sequences", seed=0)
aa.display_df(df=df, show_shape=True)

DataFrame shape: (4, 3)

	entry	sequence	labels
1	P1	ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY	[None, None, None, None, 1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
2	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None]
3	P3	MNPQRSTVWYACDEF...STVWYACDEFGHIKL	[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
4	P4	GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF	[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
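
A labels list can be turned back into 1-based positions with a short comprehension. The sketch below uses a toy vector with label_ref=0 at position 30, as in the P2 row above; labeled_positions is a hypothetical helper, not part of the library.

```python
def labeled_positions(labels, value):
    """Return the 1-based positions whose label equals value."""
    return [i + 1 for i, lab in enumerate(labels) if lab == value]

# 40-residue toy vector: unlabeled everywhere except position 30
labels_p2 = [None] * 40
labels_p2[29] = 0

print(labeled_positions(labels_p2, 0))  # [30]
print(labeled_positions(labels_p2, 1))  # []
```

Position 30 matches the source_position of window P2_28-32 in the segments output for the same seed, which suggests that label_ref is written at the source position only and not across the whole window.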

Semantic tags. Defaults assume PU-learning: role='Unlabeled', label_test=1 (applied to positives in output_mode='sequences'), label_ref=0 (applied to sampled rows / positions):

df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                      n=4, window_size=5,
                                      role="candidate", label_ref=2, seed=0)
aa.display_df(df=df[["entry_win", "role", "label"]], show_shape=True)

DataFrame shape: (4, 3)

	entry_win	role	label
1	P3_20-24	candidate	2
2	P4_17-21	candidate	2
3	P2_28-32	candidate	2
4	P2_6-10	candidate	2

Filter eligible residues by per-position annotation. aa_context_col is a df_seq column with one single-character tag per residue; context_in and context_out whitelist and blacklist tag values, respectively. Providing context_in / context_out without aa_context_col raises an error. The call below keeps only residues tagged 'T':

df_seq_topo = df_seq.assign(topo=["MMMMMMMMMMTTTTTTTTTT" * 2] * 4)
df = aaws.sample_different_protein(df_seq=df_seq_topo, pos_col="pos",
                                      n=5, window_size=3,
                                      aa_context_col="topo", context_in="T", seed=0)
aa.display_df(df=df[["entry_win", "window", "source_position"]], show_shape=True)

DataFrame shape: (5, 3)

	entry_win	window	source_position
1	P4_14-16	WYA	15
2	P2_31-33	LKI	32
3	P3_12-14	CDE	13
4	P2_30-32	MLK	31
5	P3_37-39	HIK	38
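
The output above can be checked against the topo string directly. In this example every source_position carries the 'T' tag, while a window may still extend into 'M'-tagged residues (P2_30-32 starts at position 30, which is tagged 'M'). This reading is inferred from the displayed rows, not from documented behavior.

```python
topo = "MMMMMMMMMMTTTTTTTTTT" * 2

# source_position values copied from the output above
source_positions = [15, 32, 13, 31, 38]
tags = [topo[pos - 1] for pos in source_positions]  # 1-based -> 0-based
print(tags)         # ['T', 'T', 'T', 'T', 'T']

# Tags across the full span of window P2_30-32
window_tags = topo[30 - 1:32]
print(window_tags)  # MTT
```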

Optional PWM-based filter on the candidate pool. Shape and column-order rules are identical to sample_same_protein (pd.DataFrame preferred over np.ndarray). motif_match='out' is particularly useful here: it samples windows that explicitly do not match a motif — the inverse of sample_motif_matched:

aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
pwm = pd.DataFrame(0.0, index=range(3), columns=aa_cols)
pwm["A"] = 1.0  # avoid windows that look like AAA
df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                      n=5, window_size=3,
                                      motif_pwm=pwm, motif_score_threshold=1.5,
                                      motif_match="out", seed=0)
aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (5, 2)

	entry_win	window
1	P2_24-26	TSR
2	P2_6-8	RQP
3	P2_38-40	DCA
4	P3_27-29	TVW
5	P2_14-16	HGF
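
To make the threshold concrete, the sketch below scores a few windows against the same toy PWM. It assumes an additive score (the sum of the matrix entry for each residue at its position); the library's actual scoring function is not described here and may differ. pwm_score is a hypothetical helper.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"

# Same toy PWM as above, as one dict per window position: weight 1.0 for 'A'
pwm_rows = [{aa: (1.0 if aa == "A" else 0.0) for aa in AA} for _ in range(3)]

def pwm_score(window, rows):
    """Additive score: sum of the weight of each residue at its position."""
    if len(window) != len(rows):
        raise ValueError("window length must equal the number of PWM rows")
    return sum(row[aa] for aa, row in zip(window, rows))

threshold = 1.5
for window in ["AAA", "DCA", "TSR"]:
    score = pwm_score(window, pwm_rows)
    kept_by_out = score < threshold  # motif_match='out' keeps the rest
    print(window, score, kept_by_out)
```

Under this assumption, AAA scores 3.0 and would be rejected by motif_match='out', whereas DCA (1.0) and TSR (0.0) stay below the threshold and remain eligible.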

The class-level anti-leakage / redundancy filters (max_similarity_to_test, max_similarity_within_ref) apply identically to this method; see aws_sample_same_protein for a demonstration. seed sets a per-call seed and falls back to the class-level random_state when omitted, so two calls with the same seed return the same windows:

df_a = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                        n=5, window_size=5, seed=11)
df_b = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos",
                                        n=5, window_size=5, seed=11)
print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))

deterministic: True

context_out blacklists residues by their per-position aa_context_col tag (the complement of context_in). label_test is the label written at the positives of positive proteins when output_mode='sequences'.

# context_out blacklists 'M'-tagged residues; label_test sets the positive label
df_seq_topo = df_seq.assign(topo=["MMMMMMMMMMTTTTTTTTTT" * 2] * 4)
df_ctx = aaws.sample_different_protein(df_seq=df_seq_topo, pos_col="pos",
                                       n=5, window_size=3,
                                       aa_context_col="topo", context_out="M",
                                       label_test=1, label_ref=0,
                                       output_mode="sequences", seed=0)
aa.display_df(df=df_ctx, show_shape=True)

DataFrame shape: (4, 3)

	entry	sequence	labels
1	P1	ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY	[None, None, None, None, 1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
2	P2	YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA	[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, 0, None, None, None, None, None, None, None, None]
3	P3	MNPQRSTVWYACDEF...STVWYACDEFGHIKL	[None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None]
4	P4	GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF	[None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]