SequenceFeature.get_labels_ovo

static SequenceFeature.get_labels_ovo(labels, df_parts=None, dict_num_parts=None, label_test=1, label_ref=0)

Convert multi-class labels into one-vs-one (OvO) binary labels with row-matched parts.

One-vs-one (OvO) maps each unordered pair of classes (a, b) to the subset of samples belonging to either class, together with a binary label array over that subset (class a as test, class b as reference). Because samples of all other classes are discarded, each pair needs its own row subset of the value source. This method applies the selection for you: pass df_parts (for CPP.run()) and/or dict_num_parts (for CPP.run_num()), and it returns, per pair, the row-matched copies alongside the binary labels, ready to pass to a CPP instance built on that pair.
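The mapping described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the helper name ovo_pairs_sketch is hypothetical, and it returns the retained row positions in place of the row-matched df_parts / dict_num_parts copies.

```python
from itertools import combinations


def ovo_pairs_sketch(labels, label_test=1, label_ref=0):
    """Hypothetical helper: map each class pair (a, b) to (positions, binary_labels)."""
    classes = sorted(set(labels))
    result = {}
    for a, b in combinations(classes, 2):
        # Keep only samples of class a or b; all other classes are dropped
        positions = [i for i, y in enumerate(labels) if y in (a, b)]
        # Class a receives the test label, class b the reference label
        binary = [label_test if labels[i] == a else label_ref for i in positions]
        result[(a, b)] = (positions, binary)
    return result


pairs = ovo_pairs_sketch([0, 0, 1, 1, 2, 2])
for pair, (positions, binary) in pairs.items():
    print(pair, positions, binary)
```

The positions list plays the role that df_parts_pair.index plays in the real return value: it records which original rows the pair retained.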

Added in version 1.1.0.

Parameters:

labels (array-like, shape (n_samples,)) – Multi-class labels for samples. Must be integers with more than one distinct value.
df_parts (pd.DataFrame, optional) – Parts table (CPP.run value source) aligned row-wise with labels. When given, the returned per-pair copy is subset to that pair’s samples.
dict_num_parts (dict, optional) – Per-part numerical tensors (CPP.run_num value source, {part: array} with sample axis first) aligned row-wise with labels. Subset per pair like df_parts. At least one of df_parts / dict_num_parts is required.
label_test (int, default=1) – Value assigned to the first class of each pair.
label_ref (int, default=0) – Value assigned to the second class of each pair.

Returns:

dict_labels – Dictionary mapping each class pair (a, b) to a tuple (df_parts_pair, dict_num_parts_pair, labels_pair): the row-matched df_parts copy (or None if not supplied), the row-matched dict_num_parts copy (or None if not supplied), and the binary label array over that pair’s samples. The original inputs are never modified.

Return type:

dict

Notes

The selection is applied positionally; df_parts_pair.index records which original rows the pair retained.
Complexity: O(n_samples x n_classes^2). K classes produce K(K-1)/2 pairs (K=10 -> 45, K=20 -> 190), and each pair needs its own CPP instance. Prefer OvO for small K (roughly K < 10) and get_labels_ovr() for larger problems.
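The pair counts quoted in the note follow from K(K-1)/2 and can be checked with the standard library. The values of K below are chosen only for illustration.

```python
from math import comb

# Number of unordered class pairs for K classes: K * (K - 1) / 2
pair_counts = {k: comb(k, 2) for k in (3, 5, 10, 20)}
for k, n_pairs in pair_counts.items():
    print(f"K={k}: {n_pairs} pairs, one CPP instance each")
```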

See also

aaanalysis.CPP: consumes each pair’s df_parts_pair / dict_num_parts_pair and labels_pair.
get_labels_ovr(): one-vs-rest alternative that keeps all samples.

Examples

One-vs-one (OvO) maps each class pair (a, b) to that pair’s samples; the other classes are dropped. Pass the value source, df_parts (for CPP.run) and/or dict_num_parts (for CPP.run_num), and get_labels_ovo returns, per pair, the row-matched copy plus the binary labels, ready to pass to CPP with no mask to apply yourself:

import numpy as np
import pandas as pd
import aaanalysis as aa

labels = [0, 0, 1, 1, 2, 2]
df_parts = pd.DataFrame({"jmd_n": list("ACDEFG"), "tmd": list("HIKLMN"), "jmd_c": list("PQRSTV")})
# each pair -> (df_parts_pair, dict_num_parts_pair, labels_pair)
{pair: (len(dfp), y.tolist()) for pair, (dfp, _, y) in
 aa.SequenceFeature.get_labels_ovo(labels, df_parts=df_parts).items()}

{(0, 1): (4, [1, 1, 0, 0]),
 (0, 2): (4, [1, 1, 0, 0]),
 (1, 2): (4, [1, 1, 0, 0])}

Full OvO workflow: each pair already carries its row-matched df_parts, so build a CPP on it directly, run, and collect:

df_seq = aa.load_dataset(name="DOM_GSEC", n=9)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
multiclass = np.array([i % 3 for i in range(len(df_parts))])

frames = []
for (a, b), (df_pair, _, binary) in aa.SequenceFeature.get_labels_ovo(multiclass, df_parts=df_parts).items():
    df_feat = aa.CPP(df_parts=df_pair).run(labels=binary, n_filter=3)
    df_feat["pair"] = f"({a},{b})"
    frames.append(df_feat)
pd.concat(frames, ignore_index=True)[["pair", "feature", "abs_auc"]].head()

1. CPP creates 580140 features for 12 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 12 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 12 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=516049 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics

   pair   feature                                abs_auc
0  (0,1)  JMD_N_TMD_N-Pattern(C,4,8)-PALJ810106  0.5
1  (0,1)  TMD-Pattern(N,5,9)-PALJ810106          0.5
2  (0,1)  TMD-Segment(4,12)-PALJ810114           0.5
3  (0,2)  JMD_N_TMD_N-Pattern(N,5,9)-AURR980115  0.5
4  (0,2)  TMD-Pattern(N,4,7,11)-QIAN880105       0.5

What can go wrong? As with get_labels_ovr(), labels must be integers covering more than one class:

try:
    aa.SequenceFeature.get_labels_ovo([1, 1, 1])
except ValueError as e:
    print("ValueError:", e)

ValueError: 'labels' should contain more than one different value ({np.int64(1)}).

Further parameters. Besides df_parts, a numerical value source dict_num_parts (the per-part tensors consumed by CPP.run_num()) is also subset per pair, and label_test / label_ref set the values assigned to the first (test) and second (reference) class of each pair, 1 and 0 by default:

labels_multi = [0, 0, 1, 1, 2, 2]
df_parts_ex = pd.DataFrame({"tmd": list("HIKLMN")})
# per-part numerical tensor row-aligned to the labels: shape (n_samples, L_part, D)
dict_num_parts = {"tmd": np.random.default_rng(0).random((6, 4, 2))}
pairs = aa.SequenceFeature.get_labels_ovo(labels_multi, df_parts=df_parts_ex,
                                          dict_num_parts=dict_num_parts,
                                          label_test=1, label_ref=0)
# each pair -> subset (df_parts_pair, dict_num_parts_pair, labels_pair)
{pair: (dnp["tmd"].shape, y.tolist()) for pair, (_, dnp, y) in pairs.items()}

{(0, 1): ((4, 4, 2), [1, 1, 0, 0]),
 (0, 2): ((4, 4, 2), [1, 1, 0, 0]),
 (1, 2): ((4, 4, 2), [1, 1, 0, 0])}
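The shapes in the output above can be reproduced with plain NumPy. The sketch below assumes that the per-pair subset is equivalent to boolean-mask indexing along the sample axis; it is not the library's code, and the variable names are illustrative.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])
# Per-part tensor with the sample axis first: (n_samples, L_part, D)
tensor = np.random.default_rng(0).random((6, 4, 2))

a, b = 0, 2                                      # one class pair
mask = np.isin(labels, [a, b])                   # True for samples of class a or b
tensor_pair = tensor[mask]                       # subset along the sample axis only
labels_pair = np.where(labels[mask] == a, 1, 0)  # class a -> 1 (test), class b -> 0 (reference)

print(tensor_pair.shape, labels_pair.tolist())
```

Only the first axis shrinks (6 -> 4 samples); the per-sample dimensions (4, 2) are untouched, which matches the (4, 4, 2) shapes reported for every pair.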