AAWindowSampler.sample_different_protein
- AAWindowSampler.sample_different_protein(df_seq, n=100, window_size=9, pos_col='pos', candidate_proteins=None, label_test=1, label_ref=0, role='Unlabeled', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, motif_pwm=None, motif_score_threshold=None, motif_match='in', seed=None)
Sample windows from proteins outside the test set (proteins with no test positions).
Draws up to n reference windows exclusively from proteins that carry no labeled positive positions, making them naturally unlabeled candidates for positive-unlabeled learning [ElkanNoto08], [BekkerDavis20]. Use this method alongside sample_same_protein() to build a combined reference pool that covers both within- and cross-protein negatives.

Added in version 1.1.0.
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum total number of sampled windows. Fewer are returned (with a warning) if the eligible space cannot supply n windows.
window_size (int, default=9) – Length of each sampled window in residues.
pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.
candidate_proteins (list of str, optional) – Restrict the candidate pool to these entries.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled positions / rows.
role (str, default='Unlabeled') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') – Output schema. See Notes.
aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.
context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.
context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.
motif_pwm (pd.DataFrame, optional) – Position Weight Matrix (PWM) of shape (window_size, 20) whose columns are the 20 canonical amino acid (AA) letters in any order (reindexed internally to ut.LIST_CANONICAL_AA).
motif_score_threshold (float, optional) – PWM score threshold; required when motif_pwm is set.
motif_match ({'in', 'out'}, default='in') – 'in' keeps windows with score >= threshold; 'out' keeps the rest.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.
- Returns:
df_seq_out – Sampled windows; one row per window with entry, sequence, role, strategy, and entry_win columns.
- Return type:
pd.DataFrame
Notes
df_seq plays a dual role: rows whose pos_col cell is a non-empty list / tuple / array of 1-based positions are positive rows. They are excluded from the candidate pool and contribute their windows only to the max_similarity_to_test filter. Rows with empty / None / NaN cells form the candidate pool from which the returned windows are drawn.

output_mode='segments' returns one row per sampled window with schema [entry_win, entry, sequence, window, source_position, label, role, strategy]; output_mode='sequences' returns one row per protein with a per-residue labels list carrying label_test at the positives of positive proteins, label_ref at sampled positions in candidate proteins, and None elsewhere, which gives a single mergeable per-residue label vector across calls.

Examples
Draw windows from proteins outside the labeled set, which are naturally suited as the unlabeled pool U in positive-unlabeled (PU) learning. Rows of df_seq whose pos_col cell is empty form the candidate pool; rows with positives are excluded from sampling but still contribute their windows to the anti-leakage filter.

    import aaanalysis as aa
    import pandas as pd
    aa.options["verbose"] = False
    df_seq = pd.DataFrame({
        "entry": ["P1", "P2", "P3", "P4"],
        "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2, "YWVTSRQPNMLKIHGFEDCA" * 2,
                     "MNPQRSTVWYACDEFGHIKL" * 2, "GHIKLMNPQRSTVWYACDEF" * 2],
        "pos": [[5], [], [], []],
    })
    aaws = aa.AAWindowSampler(random_state=0)
First call: anchor schema. P1 is excluded from the candidate pool because it has a labeled position; the remaining proteins form the eligible pool. The eight-column segments schema is shared with the other sampling methods.

    df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=8, window_size=5, seed=0)
    aa.display_df(df=df, show_shape=True)
DataFrame shape: (8, 8)
   entry_win  entry  sequence                           window  source_position  label  role       strategy
1  P3_20-24   P3     MNPQRSTVWYACDEF...STVWYACDEFGHIKL  LMNPQ   22               0      Unlabeled  different_protein
2  P4_17-21   P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  CDEFG   19               0      Unlabeled  different_protein
3  P2_28-32   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  PNMLK   30               0      Unlabeled  different_protein
4  P2_6-10    P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  RQPNM   8                0      Unlabeled  different_protein
5  P2_35-39   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  GFEDC   37               0      Unlabeled  different_protein
6  P2_14-18   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  HGFED   16               0      Unlabeled  different_protein
7  P4_28-32   P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  PQRST   30               0      Unlabeled  different_protein
8  P4_18-22   P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  DEFGH   20               0      Unlabeled  different_protein

pos_col plays a dual role: rows with non-empty cells are positive rows (excluded from the pool, used by the anti-leakage filter), and rows with empty cells form the candidate pool. This separation is what makes the output “different proteins from the test set”:

    # Verify P1 (the only labeled protein) is never in the output
    df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=10, window_size=5, seed=0)
    print("entries in output:", sorted(set(df["entry"])))
entries in output: ['P2', 'P3', 'P4']
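The dual role of pos_col can be pictured with a small pure-Python sketch. The helpers is_empty_cell and split_pool are hypothetical (they are not part of aaanalysis) and only mirror the rule stated above, under the assumption that a cell counts as empty when it is None, NaN, or a zero-length list / tuple:

```python
import math

def is_empty_cell(cell):
    """Return True for None, NaN, or a zero-length list/tuple (assumed 'empty' cells)."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return isinstance(cell, (list, tuple)) and len(cell) == 0

def split_pool(entries, positions):
    """Split entries into positive rows and the candidate pool (hypothetical helper)."""
    positives, candidates = [], []
    for entry, cell in zip(entries, positions):
        if is_empty_cell(cell):
            candidates.append(entry)   # eligible for sampling
        else:
            positives.append(entry)    # excluded from sampling
    return positives, candidates

positives, candidates = split_pool(
    ["P1", "P2", "P3", "P4"],
    [[5], [], None, float("nan")],
)
print("positive rows:", positives)
print("candidate pool:", candidates)
```

Under these assumptions only P1 lands in the positive rows, which matches the entries reported by the sampler above.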
n is the target total number of sampled windows across the entire candidate pool (not per protein); window_size is the residue length per window:

    df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=4, window_size=8, seed=0)
    aa.display_df(df=df[["entry_win", "window"]], show_shape=True)
DataFrame shape: (4, 2)
   entry_win  window
1  P4_16-23   ACDEFGHI
2  P3_4-11    QRSTVWYA
3  P2_21-28   YWVTSRQP
4  P2_6-13    RQPNMLKI

Restrict the candidate pool to an explicit subset of df_seq entries:

    df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=5, candidate_proteins=["P3", "P4"], seed=0)
    print("entries in output:", sorted(set(df["entry"])))
entries in output: ['P3', 'P4']
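The entry_win identifiers in the outputs above appear to encode the 1-based, inclusive start and stop of each window, with source_position at the center for odd window sizes (for example, position 22 with window_size=5 gives 20-24). The sketch below reproduces that mapping; window_around is a hypothetical helper, the centering rule is inferred from the displayed tables rather than from the library source, and even window sizes are deliberately left out because the outputs do not show how they are centered:

```python
def window_around(entry, sequence, source_position, window_size):
    """Return (entry_win, window) for an odd window centered on a 1-based position."""
    if window_size % 2 == 0:
        raise ValueError("this sketch only covers odd window sizes")
    half = window_size // 2
    start = source_position - half       # 1-based, inclusive
    stop = source_position + half        # 1-based, inclusive
    if start < 1 or stop > len(sequence):
        raise ValueError("window exceeds sequence bounds")
    window = sequence[start - 1:stop]    # convert to 0-based slicing
    return f"{entry}_{start}-{stop}", window

seq_p3 = "MNPQRSTVWYACDEFGHIKL" * 2
entry_win, window = window_around("P3", seq_p3, source_position=22, window_size=5)
print(entry_win, window)  # P3_20-24 LMNPQ
```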
Two output schemas:
'segments' (default): one row per sampled window.
'sequences': one row per source protein with a per-residue labels list (label_test at positives of positive proteins, label_ref at sampled positions in candidate proteins, None elsewhere), which gives a single mergeable per-residue label vector across calls.

    df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=3, window_size=5, output_mode="sequences", seed=0)
    aa.display_df(df=df, show_shape=True)
DataFrame shape: (4, 3)
   entry  sequence                           labels
1  P1     ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY  [None, None, None, None, 1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
2  P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None]
3  P3     MNPQRSTVWYACDEF...STVWYACDEFGHIKL  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
4  P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

Semantic tags. Defaults assume PU-learning: role='Unlabeled', label_test=1 (applied to positives in output_mode='sequences'), label_ref=0 (applied to sampled rows / positions):

    df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=4, window_size=5, role="candidate", label_ref=2, seed=0)
    aa.display_df(df=df[["entry_win", "role", "label"]], show_shape=True)
DataFrame shape: (4, 3)
   entry_win  role       label
1  P3_20-24   candidate  2
2  P4_17-21   candidate  2
3  P2_28-32   candidate  2
4  P2_6-10    candidate  2

Filter eligible residues by per-position annotation. aa_context_col is a df_seq column with one single-character tag per residue; context_in / context_out whitelist / blacklist tag values. Providing context_in / context_out without aa_context_col raises an error. The example below keeps only residues tagged T:

    df_seq_topo = df_seq.assign(topo=["MMMMMMMMMMTTTTTTTTTT" * 2] * 4)
    df = aaws.sample_different_protein(df_seq=df_seq_topo, pos_col="pos", n=5, window_size=3, aa_context_col="topo", context_in="T", seed=0)
    aa.display_df(df=df[["entry_win", "window", "source_position"]], show_shape=True)
DataFrame shape: (5, 3)
   entry_win  window  source_position
1  P4_14-16   WYA     15
2  P2_31-33   LKI     32
3  P3_12-14   CDE     13
4  P2_30-32   MLK     31
5  P3_37-39   HIK     38

Optional PWM-based filter on the candidate pool. Shape and column-order rules are identical to sample_same_protein (pd.DataFrame preferred over np.ndarray). motif_match='out' is particularly useful here: it samples windows that explicitly do not match a motif, the inverse of sample_motif_matched:

    aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
    pwm = pd.DataFrame(0.0, index=range(3), columns=aa_cols)
    pwm["A"] = 1.0  # avoid windows that look like AAA
    df = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=3, motif_pwm=pwm, motif_score_threshold=1.5, motif_match="out", seed=0)
    aa.display_df(df=df[["entry_win", "window"]], show_shape=True)
DataFrame shape: (5, 2)
   entry_win  window
1  P2_24-26   TSR
2  P2_6-8     RQP
3  P2_38-40   DCA
4  P3_27-29   TVW
5  P2_14-16   HGF

Per-call seed; falls back to the class-level random_state. See aws_sample_same_protein for a demonstration of the class-level anti-leakage / redundancy filters (max_similarity_to_test, max_similarity_within_ref), which apply identically to this method.

    df_a = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=5, seed=11)
    df_b = aaws.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=5, seed=11)
    print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))
deterministic: True
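The per-call seed with class-level fallback can be illustrated with a toy class. SamplerSketch is a hypothetical stand-in, not AAWindowSampler; it uses the standard-library random module, and how aaanalysis resolves the seed internally is an assumption here:

```python
import random

class SamplerSketch:
    """Toy stand-in (not AAWindowSampler): per-call seed with class-level fallback."""

    def __init__(self, random_state=None):
        self.random_state = random_state

    def sample(self, candidates, n, seed=None):
        # Use the per-call seed when given, otherwise the class-level random_state
        effective_seed = self.random_state if seed is None else seed
        rng = random.Random(effective_seed)
        return rng.sample(candidates, min(n, len(candidates)))

candidates = [f"P2_{i}-{i + 4}" for i in range(1, 37)]
sampler = SamplerSketch(random_state=0)

# Same per-call seed gives the same draw
same_seed = sampler.sample(candidates, 5, seed=11) == sampler.sample(candidates, 5, seed=11)
# Omitting the seed behaves like passing the class-level random_state
fallback = sampler.sample(candidates, 5) == sampler.sample(candidates, 5, seed=0)
print(same_seed, fallback)  # True True
```

Creating a fresh generator per call keeps each draw independent of earlier calls, which is one simple way to make repeated calls with the same seed reproducible.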