SequenceFeature.get_labels_tiered

static SequenceFeature.get_labels_tiered(targets=None, q_pos=0.8, list_q_neg=(0.8, 0.5, 0.3), df_parts=None, dict_num_parts=None, label_test=1, label_ref=0)

Build tiered binary labels sharing a fixed positive set, with row-matched parts.

Holds the positive set fixed at targets >= Q(q_pos) and sweeps a series of negative cuts targets <= Q(q_neg) for each q_neg in list_q_neg, dropping the middle band each time. This compares CPP settings against the same positives while the negatives move toward more extreme low values. Like get_labels_ovo(), each tier drops samples, so this method applies the selection for you: pass df_parts (for CPP.run()) and/or dict_num_parts (for CPP.run_num()) and it returns, per tier, the row-matched copies alongside the binary labels.
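The selection rule described above can be sketched in plain NumPy. This is an illustrative re-implementation, not the library's code: the function name sketch_tiered_masks is hypothetical, and it assumes that Q(·) behaves like np.quantile with its default linear interpolation.

```python
import numpy as np

def sketch_tiered_masks(targets, q_pos=0.8, list_q_neg=(0.8, 0.5, 0.3)):
    """Hypothetical sketch of the tiering rule, for illustration only."""
    targets = np.asarray(targets, dtype=float)
    # Fixed positive set: targets >= Q(q_pos) (assumes np.quantile semantics)
    is_pos = targets >= np.quantile(targets, q_pos)
    tiers = {}
    for q_neg in list_q_neg:
        # Negatives lie at or below the cut; positives take precedence on ties
        is_neg = (targets <= np.quantile(targets, q_neg)) & ~is_pos
        keep = is_pos | is_neg  # samples outside both groups form the dropped middle band
        labels = np.where(is_pos[keep], 1, 0)  # label_test=1, label_ref=0
        tiers[q_neg] = (keep, labels)
    return tiers

targets = np.linspace(0, 1, 40)
tiers = sketch_tiered_masks(targets)
# q_neg -> (n_selected, n_positive)
summary = {q: (int(keep.sum()), int(labels.sum())) for q, (keep, labels) in tiers.items()}
print(summary)
```

Under these assumptions the number of positives stays at 8 in every tier, while the number of selected samples shrinks as q_neg decreases.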

Added in version 1.1.0.

Parameters:
  • targets (array-like, shape (n_samples,)) – Continuous target values for samples.

  • q_pos (float, default=0.8) – Quantile in (0, 1) defining the fixed positive cut: positives are targets >= Q(q_pos).

  • list_q_neg (sequence of float, default=(0.8, 0.5, 0.3)) – Quantiles in (0, 1); for each, negatives are targets <= Q(q_neg) (positives take precedence on ties) and the middle band is dropped.

  • df_parts (pd.DataFrame, optional) – Parts table (CPP.run value source) aligned row-wise with targets. When given, the returned per-tier copy is subset to that tier’s samples.

  • dict_num_parts (dict, optional) – Per-part numerical tensors (CPP.run_num value source, {part: array} with sample axis first) aligned row-wise with targets. Subset per tier like df_parts. At least one of df_parts / dict_num_parts is required.

  • label_test (int, default=1) – Value assigned to positive samples.

  • label_ref (int, default=0) – Value assigned to negative samples.

Returns:

dict_labels – Dictionary mapping each q_neg to a tuple (df_parts_tier, dict_num_parts_tier, labels_tier): the row-matched df_parts copy (or None if not supplied), the row-matched dict_num_parts copy (or None if not supplied), and the binary label array over that tier’s samples. The original inputs are never modified.

Return type:

dict

Notes

  • Raises ValueError if any tier yields only one class (e.g. q_neg above q_pos leaving no negatives).

  • The selection is applied positionally; df_parts_tier.index records which original rows the tier retained.

  • Complexity: O(n_samples log n_samples × n_tiers).
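A minimal sketch of what positional, row-matched selection means for the two value sources. The mask keep and the toy data are hypothetical stand-ins for a single tier, and the code only mirrors the documented behaviour with pandas and NumPy; it does not call the library.

```python
import numpy as np
import pandas as pd

df_parts = pd.DataFrame({"tmd": ["AAA", "CCC", "GGG", "TTT", "LLL"]})
dict_num_parts = {"tmd": np.arange(15, dtype=float).reshape(5, 3)}  # sample axis first
keep = np.array([True, False, False, True, True])  # hypothetical tier mask

# The same boolean mask is applied to every value source, so rows stay matched
df_tier = df_parts.loc[keep].copy()
dict_tier = {part: arr[keep] for part, arr in dict_num_parts.items()}

print(df_tier.index.tolist())   # original row positions retained by the tier
print(dict_tier["tmd"].shape)   # (n_selected, ...) along the sample axis
print(len(df_parts))            # the input table is left untouched
```

Because the subset keeps the original index, df_tier.index can be used to map tier results back to rows of the full table.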

Examples

get_labels_tiered keeps the positive set fixed (targets >= Q(q_pos)) and sweeps stepwise-lower negative cuts (targets <= Q(q_neg)), dropping the middle band each tier. Pass the value source and each tier returns its row-matched df_parts / dict_num_parts subset plus the binary labels:

import numpy as np
import pandas as pd
import aaanalysis as aa

targets = np.linspace(0, 1, 40)
df_parts = pd.DataFrame({"tmd": [f"AC{i:02d}" for i in range(40)]})
tiers = aa.SequenceFeature.get_labels_tiered(targets, q_pos=0.8, list_q_neg=[0.8, 0.5, 0.3], df_parts=df_parts)
{q: (len(dfp), int(y.sum())) for q, (dfp, _, y) in tiers.items()}  # q_neg -> (n_selected, n_positive)
{0.8: (40, 8), 0.5: (28, 8), 0.3: (20, 8)}

As with get_labels_ovo(), each tier already carries its row-matched df_parts, so you can build one CPP per tier directly:

df_seq = aa.load_dataset(name="DOM_GSEC", n=12)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
targets = np.linspace(0, 1, len(df_parts))

for q_neg, (df_tier, _, binary) in aa.SequenceFeature.get_labels_tiered(
        targets, q_pos=0.7, list_q_neg=[0.5, 0.3], df_parts=df_parts).items():
    df_feat = aa.CPP(df_parts=df_tier).run(labels=binary, n_filter=3)
    print(f"q_neg={q_neg}: {len(df_tier)} samples -> {len(df_feat)} features")
1. CPP creates 580140 features for 19 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=520556 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
q_neg=0.5: 19 samples -> 3 features
1. CPP creates 580140 features for 14 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=520556 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
q_neg=0.3: 14 samples -> 3 features

What can go wrong? A tier that leaves no negatives, for example because the targets are constant, raises a ValueError. A value source is supplied here so that the single-class check is reached:

try:
    df_parts = pd.DataFrame({"tmd": list("ACGT")})
    aa.SequenceFeature.get_labels_tiered([5.0, 5.0, 5.0, 5.0], q_pos=0.8, list_q_neg=[0.3], df_parts=df_parts)
except ValueError as e:
    print("ValueError:", e)
ValueError: tier q_neg=0.3 yields a single class (n_pos=4, n_neg=0); choose q_pos/q_neg that keep both groups non-empty.
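To avoid this error, the class counts of a candidate tier can be checked before calling the method. The helper below is a hypothetical sketch, not part of aaanalysis; it assumes the documented rule (positives are targets >= Q(q_pos), negatives are targets <= Q(q_neg), positives win ties) and np.quantile's default interpolation.

```python
import numpy as np

def count_tier_classes(targets, q_pos, q_neg):
    """Hypothetical helper: return (n_pos, n_neg) under the documented rule."""
    targets = np.asarray(targets, dtype=float)
    is_pos = targets >= np.quantile(targets, q_pos)
    # Samples tied with the positive cut count as positives, not negatives
    is_neg = (targets <= np.quantile(targets, q_neg)) & ~is_pos
    return int(is_pos.sum()), int(is_neg.sum())

# Constant targets: every sample ties at the positive cut, so no negatives remain
constant = count_tier_classes([5.0, 5.0, 5.0, 5.0], q_pos=0.8, q_neg=0.3)
# Spread-out targets: both groups are non-empty
spread = count_tier_classes(np.linspace(0, 1, 40), q_pos=0.8, q_neg=0.3)
print(constant, spread)
```

For the constant case the sketch gives the same counts as the error message above (n_pos=4, n_neg=0).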