SequenceFeature.get_labels_tiered
- static SequenceFeature.get_labels_tiered(targets=None, q_pos=0.8, list_q_neg=(0.8, 0.5, 0.3), df_parts=None, dict_num_parts=None, label_test=1, label_ref=0)
Build tiered binary labels sharing a fixed positive set, with row-matched parts.
Holds the positive set fixed at targets >= Q(q_pos) and sweeps a series of negative cuts targets <= Q(q_neg), one for each q_neg in list_q_neg, dropping the middle band each time. This compares CPP settings against the same positives while the negatives move toward more extreme low values. Like get_labels_ovo(), each tier drops samples, so this method applies the selection for you: pass df_parts (for CPP.run()) and/or dict_num_parts (for CPP.run_num()) and it returns, per tier, the row-matched copies alongside the binary labels.

Added in version 1.1.0.
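The tiering rule described above can be sketched in plain NumPy. This is only an illustration, not the library's implementation: the helper name tier_masks is hypothetical, and it assumes Q is np.quantile with its default linear interpolation.

```python
import numpy as np

def tier_masks(targets, q_pos=0.8, list_q_neg=(0.8, 0.5, 0.3), label_test=1, label_ref=0):
    """Hypothetical re-implementation of the tiering rule, for illustration only."""
    targets = np.asarray(targets, dtype=float)
    # Fixed positive set, shared by every tier (assumes Q == np.quantile)
    is_pos = targets >= np.quantile(targets, q_pos)
    out = {}
    for q_neg in list_q_neg:
        # Negatives sit at or below the tier cut; positives take precedence on ties
        is_neg = (targets <= np.quantile(targets, q_neg)) & ~is_pos
        keep = is_pos | is_neg  # samples in the middle band are dropped
        labels = np.where(is_pos[keep], label_test, label_ref)
        out[q_neg] = (keep, labels)
    return out

targets = np.linspace(0, 1, 40)
tiers_sketch = tier_masks(targets)
# q_neg -> (n_selected, n_positive)
summary = {q: (int(keep.sum()), int((labels == 1).sum()))
           for q, (keep, labels) in tiers_sketch.items()}
print(summary)  # {0.8: (40, 8), 0.5: (28, 8), 0.3: (20, 8)}
```

Under these assumptions the positive count stays at 8 in every tier while the number of retained samples shrinks as q_neg decreases.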
- Parameters:
targets (array-like, shape (n_samples,)) – Continuous target values for samples.
q_pos (float, default=0.8) – Quantile in (0, 1) defining the fixed positive cut: positives are targets >= Q(q_pos).
list_q_neg (sequence of float, default=(0.8, 0.5, 0.3)) – Quantiles in (0, 1); for each, negatives are targets <= Q(q_neg) (positives take precedence on ties) and the middle band is dropped.
df_parts (pd.DataFrame, optional) – Parts table (CPP.run value source) aligned row-wise with targets. When given, the returned per-tier copy is subset to that tier’s samples.
dict_num_parts (dict, optional) – Per-part numerical tensors (CPP.run_num value source, {part: array} with sample axis first) aligned row-wise with targets. Subset per tier like df_parts. At least one of df_parts/dict_num_parts is required.
label_test (int, default=1) – Value assigned to positive samples.
label_ref (int, default=0) – Value assigned to negative samples.
- Returns:
dict_labels – Dictionary mapping each q_neg to a tuple (df_parts_tier, dict_num_parts_tier, labels_tier): the row-matched df_parts copy (or None if not supplied), the row-matched dict_num_parts copy (or None if not supplied), and the binary label array over that tier’s samples. The original inputs are never modified.
- Return type:
Notes
Raises ValueError if any tier yields only one class (e.g. q_neg above q_pos leaving no negatives).
The selection is applied positionally; df_parts_tier.index records which original rows the tier retained.
Complexity: O(n_samples log n_samples x n_tiers).
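A minimal sketch of what positional selection means for the two value sources, assuming a boolean keep mask per tier. The mask, the P0..P5 index labels, and the variable names are hypothetical, and the library's internals may differ.

```python
import numpy as np
import pandas as pd

# Toy value sources, aligned row-wise (sample axis first for the arrays)
df_parts = pd.DataFrame({"tmd": list("ACDEFG")}, index=[f"P{i}" for i in range(6)])
dict_num_parts = {"tmd": np.arange(12).reshape(6, 2)}

# Hypothetical tier mask: rows 2 and 3 fall in the dropped middle band
keep = np.array([True, True, False, False, True, True])

# Select by position, not by index label, and copy so the inputs stay untouched
pos = np.flatnonzero(keep)
df_tier = df_parts.iloc[pos].copy()
dict_tier = {part: arr[pos].copy() for part, arr in dict_num_parts.items()}

print(list(df_tier.index))      # ['P0', 'P1', 'P4', 'P5']
print(dict_tier["tmd"].shape)   # (4, 2)
```

Because iloc keeps the original index labels, the retained rows can be traced back to the input table after the subset.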
See also
aaanalysis.CPP: consumes each tier’s df_parts_tier/dict_num_parts_tier and labels_tier.
get_labels_quantile(): single-cut variant that keeps all samples.
Examples
get_labels_tiered keeps the positive set fixed (targets >= Q(q_pos)) and sweeps stepwise-lower negative cuts (targets <= Q(q_neg)), dropping the middle band each tier. Pass the value source and each tier returns its row-matched df_parts/dict_num_parts subset plus the binary labels:

    import numpy as np
    import pandas as pd
    import aaanalysis as aa

    targets = np.linspace(0, 1, 40)
    df_parts = pd.DataFrame({"tmd": [f"AC{i:02d}" for i in range(40)]})
    tiers = aa.SequenceFeature.get_labels_tiered(targets, q_pos=0.8, list_q_neg=[0.8, 0.5, 0.3], df_parts=df_parts)
    # q_neg -> (n_selected, n_positive)
    {q: (len(dfp), int(y.sum())) for q, (dfp, _, y) in tiers.items()}

    {0.8: (40, 8), 0.5: (28, 8), 0.3: (20, 8)}
Like OvO, each tier already carries its row-matched df_parts, so build a CPP per tier directly:

    df_seq = aa.load_dataset(name="DOM_GSEC", n=12)
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    targets = np.linspace(0, 1, len(df_parts))
    for q_neg, (df_tier, _, binary) in aa.SequenceFeature.get_labels_tiered(
            targets, q_pos=0.7, list_q_neg=[0.5, 0.3], df_parts=df_parts).items():
        df_feat = aa.CPP(df_parts=df_tier).run(labels=binary, n_filter=3)
        print(f"q_neg={q_neg}: {len(df_tier)} samples -> {len(df_feat)} features")

    1. CPP creates 580140 features for 19 samples
       1.1 Assigning scale values to parts
       |.........................| 100.0%
       1.2 Streaming pre-filter stats (mask in stream)
       |.........................| 100.0%
    2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=520556 of 580140)
    3. CPP filtering algorithm
    4. CPP returns df of 3 unique features with general information and statistics
    q_neg=0.5: 19 samples -> 3 features
    1. CPP creates 580140 features for 14 samples
       1.1 Assigning scale values to parts
       |.........................| 100.0%
       1.2 Streaming pre-filter stats (mask in stream)
       |.........................| 100.0%
    2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=520556 of 580140)
    3. CPP filtering algorithm
    4. CPP returns df of 3 unique features with general information and statistics
    q_neg=0.3: 14 samples -> 3 features
What can go wrong? A tier that leaves no negatives (for example, because the targets are constant) raises a ValueError. A value source is supplied here so that the check is reached:

    try:
        df_parts = pd.DataFrame({"tmd": list("ACGT")})
        aa.SequenceFeature.get_labels_tiered([5.0, 5.0, 5.0, 5.0], q_pos=0.8, list_q_neg=[0.3], df_parts=df_parts)
    except ValueError as e:
        print("ValueError:", e)

    ValueError: tier q_neg=0.3 yields a single class (n_pos=4, n_neg=0); choose q_pos/q_neg that keep both groups non-empty.
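To screen quantile choices before calling the method, the class counts per tier can be estimated up front. The sketch below is not part of aaanalysis: the helper single_class_tiers is hypothetical, and it assumes Q is np.quantile and that positives take precedence on ties, as stated in the parameter description.

```python
import numpy as np

def single_class_tiers(targets, q_pos, list_q_neg):
    """Hypothetical pre-check: return {q_neg: (n_pos, n_neg)} for tiers missing a class."""
    targets = np.asarray(targets, dtype=float)
    is_pos = targets >= np.quantile(targets, q_pos)  # assumes Q == np.quantile
    bad = {}
    for q_neg in list_q_neg:
        # Samples already labelled positive cannot also count as negatives
        is_neg = (targets <= np.quantile(targets, q_neg)) & ~is_pos
        n_pos, n_neg = int(is_pos.sum()), int(is_neg.sum())
        if n_pos == 0 or n_neg == 0:
            bad[q_neg] = (n_pos, n_neg)
    return bad

# Constant targets: every sample is positive, so no negatives remain
bad_const = single_class_tiers([5.0, 5.0, 5.0, 5.0], q_pos=0.8, list_q_neg=[0.3])
# Evenly spread targets: every tier keeps both classes
bad_ok = single_class_tiers(np.linspace(0, 1, 40), q_pos=0.8, list_q_neg=[0.8, 0.5, 0.3])

print(bad_const)  # {0.3: (4, 0)}
print(bad_ok)     # {}
```

For the constant-target case the counts agree with the n_pos=4, n_neg=0 reported in the error message above.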