aaanalysis.AAWindowSampler.sample_different_protein
- AAWindowSampler.sample_different_protein(df_seq=None, n=100, window_size=9, pos_col='pos', candidate_proteins=None, label_test=1, label_ref=0, role='Unlabeled', output_mode='segments', aa_context_col=None, context_in=None, context_out=None, motif_pwm=None, motif_score_threshold=None, motif_match='in', seed=None)
Sample windows from proteins outside the test set (proteins with no test positions).
- Parameters:
df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences. See Notes.
n (int, default=100) – Maximum total number of sampled windows. Fewer are returned (with a warning) if the eligible space cannot supply them.
window_size (int, default=9) – Length of each sampled window in residues.
pos_col (str, default='pos') – Column with per-row 1-based positive positions. See Notes.
candidate_proteins (list of str, optional) – Restrict the candidate pool to these entries.
label_test (int or float, default=1) – Label assigned to positives in output_mode='sequences'.
label_ref (int or float, default=0) – Label assigned to sampled positions / rows.
role (str, default='Unlabeled') – Role tag stored in the output’s role column.
output_mode ({'segments', 'sequences'}, default='segments') – Output schema. See Notes.
aa_context_col (str, optional) – Per-residue context column used with context_in / context_out.
context_in (value or list-like, optional) – Whitelist of aa_context_col tag values for eligible residues.
context_out (value or list-like, optional) – Blacklist of aa_context_col tag values for excluded residues.
motif_pwm (pd.DataFrame, optional) – Position-weight matrix of shape (window_size, 20) whose columns are the 20 canonical AA letters in any order (reindexed internally to ut.LIST_CANONICAL_AA).
motif_score_threshold (float, optional) – PWM score threshold; required when motif_pwm is set.
motif_match ({'in', 'out'}, default='in') – 'in' keeps windows with score >= threshold; 'out' keeps the rest.
seed (int, optional) – Per-call seed; falls back to the class-level random_state.
- Returns:
df_seq – Sampled windows; one row per window with entry, sequence, role, strategy, and entry_win columns.
- Return type:
pd.DataFrame
Notes
df_seq plays a dual role: rows whose pos_col cell is a non-empty list / tuple / array of 1-based positions are positive rows. They are excluded from the candidate pool and contribute their windows only to the max_similarity_to_test filter. Rows with empty / None / NaN cells form the candidate pool from which the returned windows are drawn.
output_mode='segments' returns one row per sampled window with schema [entry_win, entry, sequence, window, source_position, label, role, strategy]; output_mode='sequences' returns one row per protein with a per-residue labels list carrying label_test at the positives of positive proteins, label_ref at sampled positions in candidate proteins, and None elsewhere. The result is a single per-residue label vector that can be merged across calls.
Examples
Draw windows from proteins outside the labeled set, which are naturally suited as the unlabeled pool U in positive-unlabeled (PU) learning. Rows of df_seq whose pos_col cell is empty form the candidate pool; rows with positives are excluded from sampling but still contribute their windows to the anti-leakage filter.

    import aaanalysis as aa
    import pandas as pd
    aa.options["verbose"] = False
    df_seq = pd.DataFrame({
        "entry": ["P1", "P2", "P3", "P4"],
        "sequence": ["ACDEFGHIKLMNPQRSTVWY" * 2,
                     "YWVTSRQPNMLKIHGFEDCA" * 2,
                     "MNPQRSTVWYACDEFGHIKL" * 2,
                     "GHIKLMNPQRSTVWYACDEF" * 2],
        "pos": [[5], [], [], []],
    })
    sampler = aa.AAWindowSampler(random_state=0)

First call: anchor schema. P1 is excluded from the candidate pool because it has a labeled position; the remaining proteins form the eligible pool. The eight-column segments schema is shared with the other sampling methods.

    df = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=8, window_size=5, seed=0)
    aa.display_df(df=df, show_shape=True)

DataFrame shape: (8, 8)
   entry_win  entry  sequence                           window  source_position  label  role       strategy
1  P3_20-24   P3     MNPQRSTVWYACDEF...STVWYACDEFGHIKL  LMNPQ   22               0      Unlabeled  different_protein
2  P4_17-21   P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  CDEFG   19               0      Unlabeled  different_protein
3  P2_28-32   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  PNMLK   30               0      Unlabeled  different_protein
4  P2_6-10    P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  RQPNM   8                0      Unlabeled  different_protein
5  P2_35-39   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  GFEDC   37               0      Unlabeled  different_protein
6  P2_14-18   P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  HGFED   16               0      Unlabeled  different_protein
7  P4_28-32   P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  PQRST   30               0      Unlabeled  different_protein
8  P4_18-22   P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  DEFGH   20               0      Unlabeled  different_protein

pos_col plays a dual role: rows with non-empty cells are positive rows (excluded from the pool, used by the anti-leakage filter), and rows with empty cells form the candidate pool. This separation is what makes the output “different proteins from the test set”:

    # Verify P1 (the only labeled protein) is never in the output
    df = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=10, window_size=5, seed=0)
    print("entries in output:", sorted(set(df["entry"])))

entries in output: ['P2', 'P3', 'P4']
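
The entry_win, window, and source_position columns in the segments table above appear to be related in a simple way for odd window sizes: the window is centred on the 1-based source_position and entry_win records its inclusive start and stop. The following is a minimal sketch of that relationship, inferred only from the rows shown on this page. The helper name window_around is hypothetical and is not part of aaanalysis, and even window sizes are deliberately left out because the page does not show how they are anchored.

```python
def window_around(entry, sequence, source_position, window_size):
    """Return (entry_win, window) for an odd-length window centred on a 1-based position.

    Hypothetical helper for illustration; not an aaanalysis function.
    """
    if window_size % 2 == 0:
        raise ValueError("this sketch only covers odd window sizes")
    half = window_size // 2
    start = source_position - half  # 1-based, inclusive
    stop = source_position + half   # 1-based, inclusive
    if start < 1 or stop > len(sequence):
        raise ValueError("window would run past the sequence ends")
    window = sequence[start - 1:stop]  # convert to 0-based slicing
    return f"{entry}_{start}-{stop}", window


p3 = "MNPQRSTVWYACDEFGHIKL" * 2
entry_win, window = window_around("P3", p3, source_position=22, window_size=5)
print(entry_win, window)  # P3_20-24 LMNPQ
```

The printed pair matches row 1 of the table above, which is the only sense in which this sketch is checked against the library's output.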
n is the target total number of sampled windows across the entire candidate pool (not per protein); window_size is the residue length per window:

    df = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=4, window_size=8, seed=0)
    aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (4, 2)
   entry_win  window
1  P4_16-23   ACDEFGHI
2  P3_4-11    QRSTVWYA
3  P2_21-28   YWVTSRQP
4  P2_6-13    RQPNMLKI

Restrict the candidate pool to an explicit subset of df_seq entries:

    df = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=5, candidate_proteins=["P3", "P4"], seed=0)
    print("entries in output:", sorted(set(df["entry"])))

entries in output: ['P3', 'P4']
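
To make the pool logic concrete, the sketch below separates rows into positive proteins and the candidate pool based on whether the pos cell is empty, then applies an optional candidate_proteins restriction. It is an illustration in plain Python under stated assumptions, not the library's implementation: the helper names is_empty_pos and split_pool are hypothetical, rows are plain dicts rather than a DataFrame, and treating candidate_proteins as a simple intersection with the pool is an assumption.

```python
import math


def is_empty_pos(cell):
    """True for None, NaN, or an empty list/tuple (the 'empty' cases named in the Notes)."""
    if cell is None:
        return True
    if isinstance(cell, float) and math.isnan(cell):
        return True
    return len(cell) == 0


def split_pool(rows, candidate_proteins=None):
    """Return (positive entries, candidate-pool entries). Hypothetical helper."""
    positives = [r["entry"] for r in rows if not is_empty_pos(r["pos"])]
    pool = [r["entry"] for r in rows if is_empty_pos(r["pos"])]
    if candidate_proteins is not None:
        allowed = set(candidate_proteins)  # assumption: plain intersection with the pool
        pool = [e for e in pool if e in allowed]
    return positives, pool


rows = [
    {"entry": "P1", "pos": [5]},           # labeled: positive row
    {"entry": "P2", "pos": []},            # empty list
    {"entry": "P3", "pos": None},          # None
    {"entry": "P4", "pos": float("nan")},  # NaN
]
print(split_pool(rows))                                   # (['P1'], ['P2', 'P3', 'P4'])
print(split_pool(rows, candidate_proteins=["P3", "P4"]))  # (['P1'], ['P3', 'P4'])
```

The two printed pools mirror the entries reported by the examples above; how the library itself handles a positive protein listed in candidate_proteins is not covered here.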
Two output schemas:
- 'segments' (default): one row per sampled window.
- 'sequences': one row per source protein with a per-residue labels list (label_test at positives of positive proteins, label_ref at sampled positions in candidate proteins, None elsewhere). The result is a single per-residue label vector that can be merged across calls.

    df = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=3, window_size=5, output_mode="sequences", seed=0)
    aa.display_df(df=df, show_shape=True)

DataFrame shape: (4, 3)
   entry  sequence                           labels
1  P1     ACDEFGHIKLMNPQR...GHIKLMNPQRSTVWY  [None, None, None, None, 1, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
2  P2     YWVTSRQPNMLKIHG...RQPNMLKIHGFEDCA  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None]
3  P3     MNPQRSTVWYACDEF...STVWYACDEFGHIKL  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]
4  P4     GHIKLMNPQRSTVWY...MNPQRSTVWYACDEF  [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 0, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]

Semantic tags. Defaults assume PU-learning: role='Unlabeled', label_test=1 (applied to positives in output_mode='sequences'), label_ref=0 (applied to sampled rows / positions):

    df = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=4, window_size=5, role="candidate", label_ref=2, seed=0)
    aa.display_df(df=df[["entry_win", "role", "label"]], show_shape=True)

DataFrame shape: (4, 3)
   entry_win  role       label
1  P3_20-24   candidate  2
2  P4_17-21   candidate  2
3  P2_28-32   candidate  2
4  P2_6-10    candidate  2

Filter eligible residues by per-position annotation. aa_context_col is a df_seq column with one single-character tag per residue; context_in / context_out whitelist / blacklist tag values. Providing context_in / context_out without aa_context_col raises an error. The example below keeps only windows whose source position is tagged "T":

    df_seq_topo = df_seq.assign(topo=["MMMMMMMMMMTTTTTTTTTT" * 2] * 4)
    df = sampler.sample_different_protein(df_seq=df_seq_topo, pos_col="pos", n=5, window_size=3, aa_context_col="topo", context_in="T", seed=0)
    aa.display_df(df=df[["entry_win", "window", "source_position"]], show_shape=True)

DataFrame shape: (5, 3)
   entry_win  window  source_position
1  P4_14-16   WYA     15
2  P2_31-33   LKI     32
3  P3_12-14   CDE     13
4  P2_30-32   MLK     31
5  P3_37-39   HIK     38

Optional PWM-based filter on the candidate pool. Shape and column-order rules are identical to sample_same_protein (pd.DataFrame preferred over np.ndarray). motif_match='out' is particularly useful here: it samples windows that explicitly do not match a motif, the inverse of sample_motif_matched:

    aa_cols = list("ACDEFGHIKLMNPQRSTVWY")
    pwm = pd.DataFrame(0.0, index=range(3), columns=aa_cols)
    pwm["A"] = 1.0  # avoid windows that look like AAA
    df = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=3, motif_pwm=pwm, motif_score_threshold=1.5, motif_match="out", seed=0)
    aa.display_df(df=df[["entry_win", "window"]], show_shape=True)

DataFrame shape: (5, 2)
   entry_win  window
1  P2_24-26   TSR
2  P2_6-8     RQP
3  P2_38-40   DCA
4  P3_27-29   TVW
5  P2_14-16   HGF

seed sets a per-call seed; when it is omitted, the class-level random_state is used. Two calls with the same seed and inputs return the same windows. See aws_sample_same_protein for a demonstration of the class-level anti-leakage / redundancy filters (max_similarity_to_test, max_similarity_within_ref), which apply identically to this method.

    df_a = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=5, seed=11)
    df_b = sampler.sample_different_protein(df_seq=df_seq, pos_col="pos", n=5, window_size=5, seed=11)
    print("deterministic:", list(df_a["entry_win"]) == list(df_b["entry_win"]))

deterministic: True
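
The output_mode='sequences' labels lists described above are presented as mergeable across calls. One way such a merge could look is sketched below in plain Python, treating None as "no label at this residue". The helper merge_labels is hypothetical and not part of aaanalysis, and raising on conflicting labels is an assumed policy chosen for this illustration rather than documented library behaviour.

```python
def merge_labels(labels_a, labels_b):
    """Combine two per-residue label lists of equal length; None means 'no label'.

    Hypothetical helper for illustration; not an aaanalysis function.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must cover the same sequence length")
    merged = []
    for position, (a, b) in enumerate(zip(labels_a, labels_b), start=1):
        if a is not None and b is not None and a != b:
            # Assumed policy: refuse to guess when two calls disagree.
            raise ValueError(f"conflicting labels at position {position}: {a} vs {b}")
        merged.append(a if a is not None else b)
    return merged


call_1 = [None, None, 0, None, None, None]  # one sampled position
call_2 = [None, None, None, None, 0, None]  # a different sampled position
print(merge_labels(call_1, call_2))  # [None, None, 0, None, 0, None]
```

Applied per protein, the same function would combine the labels column from two calls on the same df_seq, provided the rows are aligned by entry first.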