SequenceFeature.get_labels_tiered

static SequenceFeature.get_labels_tiered(targets, q_pos=0.8, list_q_neg=(0.8, 0.5, 0.3), df_parts=None, dict_num_parts=None, label_test=1, label_ref=0)

Build tiered binary labels sharing a fixed positive set, with row-matched parts.

Holds the positive set fixed at targets >= Q(q_pos) and sweeps a series of negative cuts, targets <= Q(q_neg) for each q_neg in list_q_neg, dropping the middle band between the two cuts each time. This allows CPP results to be compared against the same positives while the negatives move toward more extreme low values. As with get_labels_ovo(), each tier drops samples, so this method applies the selection for you: pass df_parts (for CPP.run()) and/or dict_num_parts (for CPP.run_num()), and it returns, per tier, the row-matched copies alongside the binary labels.

Added in version 1.1.0.

Parameters:

targets (array-like, shape (n_samples,)) – Continuous target values for samples.
q_pos (float, default=0.8) – Quantile in (0, 1) defining the fixed positive cut: positives are targets >= Q(q_pos).
list_q_neg (sequence of float, default=(0.8, 0.5, 0.3)) – Quantiles in (0, 1); for each, negatives are targets <= Q(q_neg) (positives take precedence on ties) and the middle band is dropped.
df_parts (pd.DataFrame, optional) – Parts table (CPP.run value source) aligned row-wise with targets. When given, the returned per-tier copy is subset to that tier’s samples.
dict_num_parts (dict, optional) – Per-part numerical tensors (CPP.run_num value source, {part: array} with sample axis first) aligned row-wise with targets. Subset per tier like df_parts. At least one of df_parts / dict_num_parts is required.
label_test (int, default=1) – Value assigned to positive samples.
label_ref (int, default=0) – Value assigned to negative samples.

Returns:

dict_labels – Dictionary mapping each q_neg to a tuple (df_parts_tier, dict_num_parts_tier, labels_tier): the row-matched df_parts copy (or None if not supplied), the row-matched dict_num_parts copy (or None if not supplied), and the binary label array over that tier’s samples. The original inputs are never modified.

Return type:

dict

Notes

Raises ValueError if any tier yields only one class (e.g. q_neg above q_pos leaving no negatives).
The selection is applied positionally; df_parts_tier.index records which original rows the tier retained.
Complexity: O(n_samples log n_samples x n_tiers).
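The selection rule described above can be sketched in plain NumPy. This is only an illustration under stated assumptions, not the library's implementation: sketch_tier_masks is a hypothetical helper, and np.quantile with its default linear interpolation is assumed for Q(.), which may differ from what get_labels_tiered uses internally.

```python
import numpy as np


def sketch_tier_masks(targets, q_pos=0.8, list_q_neg=(0.8, 0.5, 0.3)):
    """Hypothetical helper: boolean (positive, negative) masks per q_neg."""
    targets = np.asarray(targets, dtype=float)
    # The positive set is computed once and shared by all tiers
    is_pos = targets >= np.quantile(targets, q_pos)
    dict_masks = {}
    for q_neg in list_q_neg:
        # Negatives lie at or below the q_neg cut; positives take precedence on ties
        is_neg = (targets <= np.quantile(targets, q_neg)) & ~is_pos
        # Samples in neither mask form the dropped middle band
        if not is_pos.any() or not is_neg.any():
            raise ValueError(f"tier q_neg={q_neg} yields a single class")
        dict_masks[q_neg] = (is_pos, is_neg)
    return dict_masks


targets = np.linspace(0, 1, 40)
dict_masks = sketch_tier_masks(targets)
counts = {q: (int(p.sum()), int(n.sum())) for q, (p, n) in dict_masks.items()}
print(counts)  # q_neg -> (n_positive, n_negative)
# -> {0.8: (8, 32), 0.5: (8, 20), 0.3: (8, 12)}
```

The positive count stays at 8 in every tier while the negative count shrinks, which is the property the tiers are meant to provide.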

See also

aaanalysis.CPP: consumes each tier’s df_parts_tier / dict_num_parts_tier and labels_tier.
get_labels_quantile(): single-cut variant that keeps all samples.

Examples

get_labels_tiered keeps the positive set fixed (targets >= Q(q_pos)) and sweeps stepwise-lower negative cuts (targets <= Q(q_neg)), dropping the middle band each tier. Pass the value source and each tier returns its row-matched df_parts / dict_num_parts subset plus the binary labels:

import numpy as np
import pandas as pd
import aaanalysis as aa

targets = np.linspace(0, 1, 40)
df_parts = pd.DataFrame({"tmd": [f"AC{i:02d}" for i in range(40)]})
tiers = aa.SequenceFeature.get_labels_tiered(targets, q_pos=0.8, list_q_neg=[0.8, 0.5, 0.3], df_parts=df_parts)
{q: (len(dfp), int(y.sum())) for q, (dfp, _, y) in tiers.items()}  # q_neg -> (n_selected, n_positive)

{0.8: (40, 8), 0.5: (28, 8), 0.3: (20, 8)}

Like OvO, each tier already carries its row-matched df_parts, so build a CPP per tier directly:

df_seq = aa.load_dataset(name="DOM_GSEC", n=12)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
targets = np.linspace(0, 1, len(df_parts))

for q_neg, (df_tier, _, binary) in aa.SequenceFeature.get_labels_tiered(
        targets, q_pos=0.7, list_q_neg=[0.5, 0.3], df_parts=df_parts).items():
    df_feat = aa.CPP(df_parts=df_tier).run(labels=binary, n_filter=3)
    print(f"q_neg={q_neg}: {len(df_tier)} samples -> {len(df_feat)} features")

1. CPP creates 580140 features for 19 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=520556 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
q_neg=0.5: 19 samples -> 3 features
1. CPP creates 580140 features for 14 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=520556 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
q_neg=0.3: 14 samples -> 3 features

What can go wrong? A tier that yields only one class, for example constant targets where every sample counts as positive and no negatives remain, raises a ValueError (a value source is supplied so that the check is reached):

try:
    df_parts = pd.DataFrame({"tmd": list("ACGT")})
    aa.SequenceFeature.get_labels_tiered([5.0, 5.0, 5.0, 5.0], q_pos=0.8, list_q_neg=[0.3], df_parts=df_parts)
except ValueError as e:
    print("ValueError:", e)

ValueError: tier q_neg=0.3 yields a single class (n_pos=4, n_neg=0); choose q_pos/q_neg that keep both groups non-empty.

Further parameters. As with get_labels_ovo(), a numerical dict_num_parts value source is subset per tier along its sample axis, and label_test / label_ref set the label values for positive / reference samples:

targets_ex = np.linspace(0, 1, 8)
df_parts_ex = pd.DataFrame({"tmd": [f"AC{i:02d}" for i in range(8)]})
dict_num_parts = {"tmd": np.random.default_rng(0).random((8, 4, 2))}
tiers = aa.SequenceFeature.get_labels_tiered(targets_ex, q_pos=0.7, list_q_neg=[0.5, 0.3],
                                             df_parts=df_parts_ex, dict_num_parts=dict_num_parts,
                                             label_test=1, label_ref=0)
{q: (dnp["tmd"].shape, int(y.sum())) for q, (_, dnp, y) in tiers.items()}

{0.5: ((7, 4, 2), 3), 0.3: ((6, 4, 2), 3)}
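To make "row-matched" concrete, the sketch below applies one tier's selection to both kinds of value source using only NumPy and pandas. It does not call aaanalysis; the P0..P7 index labels, the variable names, and the use of np.quantile are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

label_test, label_ref = 1, 0
targets = np.linspace(0, 1, 8)
# Non-default index (hypothetical) so the retained original rows stay visible
df_parts = pd.DataFrame({"tmd": [f"AC{i:02d}" for i in range(8)]},
                        index=[f"P{i}" for i in range(8)])
dict_num_parts = {"tmd": np.zeros((8, 4, 2))}  # sample axis first

# One tier: q_pos=0.7, q_neg=0.3
is_pos = targets >= np.quantile(targets, 0.7)
is_neg = (targets <= np.quantile(targets, 0.3)) & ~is_pos
keep = is_pos | is_neg

# Positional selection on copies; the original inputs are left untouched
df_tier = df_parts.iloc[np.flatnonzero(keep)].copy()
dict_tier = {part: arr[keep].copy() for part, arr in dict_num_parts.items()}
labels = np.where(is_pos[keep], label_test, label_ref)

print(list(df_tier.index))     # ['P0', 'P1', 'P2', 'P5', 'P6', 'P7']
print(dict_tier["tmd"].shape)  # (6, 4, 2)
print(labels.tolist())         # [0, 0, 0, 1, 1, 1]
```

Because the same mask indexes the DataFrame rows, the first axis of each array, and the labels, the three outputs stay aligned row by row.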