SequenceFeature.get_labels_ovo

static SequenceFeature.get_labels_ovo(labels=None, df_parts=None, dict_num_parts=None, label_test=1, label_ref=0)

Convert multi-class labels into one-vs-one (OvO) binary labels with row-matched parts.

One-vs-one (OvO) maps each unordered pair of classes (a, b) to the subset of samples belonging to either class, together with a binary label array over that subset (class a as test, class b as reference). Because samples of all other classes are discarded, each pair needs its own row subset of the value source. This method applies the selection for you: pass df_parts (for CPP.run()) and/or dict_num_parts (for CPP.run_num()), and it returns, per pair, the row-matched copies alongside the binary labels, ready to pass to a CPP instance built on that pair.
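The selection described above can be sketched in plain NumPy. This illustrates the idea only and is not the library's implementation: `ovo_binary_labels` is a hypothetical helper, and iterating pairs over the sorted classes is an assumption.

```python
from itertools import combinations

import numpy as np


def ovo_binary_labels(labels, label_test=1, label_ref=0):
    """Hypothetical helper: map each class pair to (row mask, binary labels)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)  # sorted distinct classes
    out = {}
    for a, b in combinations(classes.tolist(), 2):
        # Keep only the samples of class a or class b
        mask = np.isin(labels, [a, b])
        # Class a becomes the test label, class b the reference label
        binary = np.where(labels[mask] == a, label_test, label_ref)
        out[(a, b)] = (mask, binary)
    return out


pairs = ovo_binary_labels([0, 0, 1, 1, 2, 2])
mask_02, y_02 = pairs[(0, 2)]
print(mask_02.tolist())  # [True, True, False, False, True, True]
print(y_02.tolist())     # [1, 1, 0, 0]
```

The mask marks which rows of the value source would be kept for the pair; get_labels_ovo() returns the already subset copies instead, so no mask has to be handled by the caller.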

Added in version 1.1.0.

Parameters:
  • labels (array-like, shape (n_samples,)) – Multi-class labels for samples. Must be integers with more than one distinct value.

  • df_parts (pd.DataFrame, optional) – Parts table (CPP.run value source) aligned row-wise with labels. When given, the returned per-pair copy is subset to that pair’s samples.

  • dict_num_parts (dict, optional) – Per-part numerical tensors (CPP.run_num value source, {part: array} with sample axis first) aligned row-wise with labels. Subset per pair like df_parts. At least one of df_parts / dict_num_parts is required.

  • label_test (int, default=1) – Value assigned to the first class of each pair.

  • label_ref (int, default=0) – Value assigned to the second class of each pair.
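As a rough illustration of what "aligned row-wise" and "subset per pair" mean for the two value sources, the sketch below applies one boolean row mask to a toy df_parts and a toy dict_num_parts. The array shape (6, 4, 3) and the mask-based selection are assumptions made for illustration; the method performs the selection internally and its implementation may differ.

```python
import numpy as np
import pandas as pd

labels = np.array([0, 0, 1, 1, 2, 2])
df_parts = pd.DataFrame({"jmd_n": list("ACDEFG"), "tmd": list("HIKLMN"), "jmd_c": list("PQRSTV")})
# Toy numerical tensors: one array per part, sample axis first (shape is arbitrary)
dict_num_parts = {part: np.zeros((6, 4, 3)) for part in df_parts.columns}

# Rows belonging to the pair (1, 2); samples of class 0 are dropped
mask = np.isin(labels, [1, 2])
df_parts_pair = df_parts[mask]
dict_num_parts_pair = {part: arr[mask] for part, arr in dict_num_parts.items()}

print(df_parts_pair.index.tolist())      # [2, 3, 4, 5]
print(dict_num_parts_pair["tmd"].shape)  # (4, 4, 3)
print(len(df_parts))                     # 6, the input is not modified
```

Both sources are cut along the sample axis with the same mask, so row i of df_parts_pair and entry i of every array in dict_num_parts_pair still refer to the same sample.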

Returns:

dict_labels – Dictionary mapping each class pair (a, b) to a tuple (df_parts_pair, dict_num_parts_pair, labels_pair): the row-matched df_parts copy (or None if not supplied), the row-matched dict_num_parts copy (or None if not supplied), and the binary label array over that pair’s samples. The original inputs are never modified.

Return type:

dict

Notes

  • The selection is applied positionally; df_parts_pair.index records which original rows the pair retained.

  • Complexity is O(n_samples x n_classes^2): K classes produce K(K-1)/2 pairs (K=10 gives 45, K=20 gives 190), each needing its own CPP instance. Prefer OvO for small K (roughly K < 10) and get_labels_ovr() for larger problems.
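The pair counts quoted in the complexity note follow directly from K(K-1)/2. A small check, using a hypothetical helper `n_ovo_pairs` that is not part of the library:

```python
from itertools import combinations


def n_ovo_pairs(n_classes):
    """Number of unordered class pairs: K(K-1)/2."""
    return n_classes * (n_classes - 1) // 2


for k in (3, 10, 20):
    # Cross-check the closed form against explicit enumeration of pairs
    assert n_ovo_pairs(k) == len(list(combinations(range(k), 2)))
    print(k, n_ovo_pairs(k))
# 3 3
# 10 45
# 20 190
```

Since every pair needs its own CPP run, the number of runs grows quadratically with the number of classes.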

See also

  • aaanalysis.CPP: consumes each pair’s df_parts_pair / dict_num_parts_pair and labels_pair.

  • get_labels_ovr(): one-vs-rest alternative that keeps all samples.

Examples

One-vs-one (OvO) maps each class pair (a, b) to that pair's samples; samples of the other classes are dropped. Pass the value source, df_parts (for CPP.run) and/or dict_num_parts (for CPP.run_num), and the method returns, per pair, the row-matched copy plus the binary labels, ready to pass to CPP with no mask to apply yourself:

import numpy as np
import pandas as pd
import aaanalysis as aa

labels = [0, 0, 1, 1, 2, 2]
df_parts = pd.DataFrame({"jmd_n": list("ACDEFG"), "tmd": list("HIKLMN"), "jmd_c": list("PQRSTV")})
# each pair -> (df_parts_pair, dict_num_parts_pair, labels_pair)
{pair: (len(dfp), y.tolist()) for pair, (dfp, _, y) in
 aa.SequenceFeature.get_labels_ovo(labels, df_parts=df_parts).items()}
{(0, 1): (4, [1, 1, 0, 0]),
 (0, 2): (4, [1, 1, 0, 0]),
 (1, 2): (4, [1, 1, 0, 0])}

Full OvO workflow: each pair already carries its row-matched df_parts, so build a CPP on it directly, run, and collect:

df_seq = aa.load_dataset(name="DOM_GSEC", n=9)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
multiclass = np.array([i % 3 for i in range(len(df_parts))])

frames = []
for (a, b), (df_pair, _, binary) in aa.SequenceFeature.get_labels_ovo(multiclass, df_parts=df_parts).items():
    df_feat = aa.CPP(df_parts=df_pair).run(labels=binary, n_filter=3)
    df_feat["pair"] = f"({a},{b})"
    frames.append(df_feat)
pd.concat(frames, ignore_index=True)[["pair", "feature", "abs_auc"]].head()
1. CPP creates 580140 features for 12 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 12 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 12 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=516049 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
   pair   feature                                   abs_auc
0  (0,1)  JMD_N_TMD_N-Pattern(C,4,8)-PALJ810106     0.5
1  (0,1)  TMD-Pattern(N,5,9)-PALJ810106             0.5
2  (0,1)  TMD-Segment(4,12)-PALJ810114              0.5
3  (0,2)  JMD_N_TMD_N-Pattern(N,5,9)-AURR980115     0.5
4  (0,2)  TMD-Pattern(N,4,7,11)-QIAN880105          0.5

What can go wrong? As with OvR, labels must be integers with more than one class:

try:
    aa.SequenceFeature.get_labels_ovo([1, 1, 1])
except ValueError as e:
    print("ValueError:", e)
ValueError: 'labels' should contain more than one different value ({np.int64(1)}).