SequenceFeature.get_labels_ovo
- static SequenceFeature.get_labels_ovo(labels=None, df_parts=None, dict_num_parts=None, label_test=1, label_ref=0)
Convert multi-class labels into one-vs-one (OvO) binary labels with row-matched parts.
One-vs-one (OvO) maps each unordered pair of classes (a, b) to the subset of samples belonging to either class, together with a binary label array over that subset (class a as test, class b as reference). Because the other classes are discarded, each pair needs its own row subset of the value source. This method applies the selection for you: pass df_parts (for CPP.run()) and/or dict_num_parts (for CPP.run_num()) and it returns, per pair, the row-matched copies alongside the binary labels, ready to pass to a CPP instance built on that pair.
Added in version 1.1.0.
- Parameters:
labels (array-like, shape (n_samples,)) – Multi-class labels for samples. Must be integers with more than one distinct value.
df_parts (pd.DataFrame, optional) – Parts table (CPP.run value source) aligned row-wise with labels. When given, the returned per-pair copy is subset to that pair’s samples.
dict_num_parts (dict, optional) – Per-part numerical tensors (CPP.run_num value source, {part: array} with sample axis first) aligned row-wise with labels. Subset per pair like df_parts. At least one of df_parts/dict_num_parts is required.
label_test (int, default=1) – Value assigned to the first class of each pair.
label_ref (int, default=0) – Value assigned to the second class of each pair.
- Returns:
dict_labels – Dictionary mapping each class pair (a, b) to a tuple (df_parts_pair, dict_num_parts_pair, labels_pair): the row-matched df_parts copy (or None if not supplied), the row-matched dict_num_parts copy (or None if not supplied), and the binary label array over that pair’s samples. The original inputs are never modified.
- Return type:
dict
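To make the per-pair subsetting of dict_num_parts concrete, here is a minimal NumPy-only sketch. It is not the library's implementation: the tensor shapes (n_samples, part_length, n_channels) and the variable names pair_parts and pair_labels are assumptions chosen for illustration. It shows that selecting rows along the sample axis with a boolean mask yields copies, so the original arrays stay untouched.

```python
import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])

# Hypothetical per-part tensors with the sample axis first
# (shape here: n_samples x part_length x n_channels; illustrative only).
dict_num_parts = {
    "tmd": np.arange(6 * 4 * 2, dtype=float).reshape(6, 4, 2),
    "jmd_n": np.arange(6 * 3 * 2, dtype=float).reshape(6, 3, 2),
}

a, b = 0, 2  # one class pair: a is the test class, b the reference class
mask = (labels == a) | (labels == b)

# Boolean-mask indexing returns a new array, not a view of the original.
pair_parts = {part: arr[mask] for part, arr in dict_num_parts.items()}
pair_labels = np.where(labels[mask] == a, 1, 0)

# Editing the per-pair copy leaves the original input unchanged.
pair_parts["tmd"][0, 0, 0] = -1.0

print(pair_parts["tmd"].shape)          # (4, 4, 2)
print(pair_labels.tolist())             # [1, 1, 0, 0]
print(dict_num_parts["tmd"][0, 0, 0])   # 0.0
```
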
Notes
The selection is applied positionally; df_parts_pair.index records which original rows the pair retained.
Complexity: O(n_samples x n_classes^2). K classes produce K(K-1)/2 pairs (K=10 -> 45, K=20 -> 190), each needing its own CPP instance. Prefer OvO for small K (fewer than about 10) and get_labels_ovr() for larger problems.
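The pair counts quoted in the complexity note can be checked with the standard library alone. This is a small sketch; the helper name n_ovo_pairs is hypothetical and not part of aaanalysis.

```python
from itertools import combinations
from math import comb


def n_ovo_pairs(n_classes):
    """Number of unordered class pairs: K * (K - 1) / 2."""
    return comb(n_classes, 2)


for k in (3, 10, 20):
    # Cross-check the closed form against explicit enumeration of pairs.
    assert n_ovo_pairs(k) == len(list(combinations(range(k), 2)))
    print(f"K={k} -> {n_ovo_pairs(k)} pairs")
```

Since each pair needs its own CPP run, the count grows quadratically with the number of classes, which is the reason the note suggests one-vs-rest for larger K.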
See also
aaanalysis.CPP: consumes each pair’s df_parts_pair/dict_num_parts_pair and labels_pair.
get_labels_ovr(): one-vs-rest alternative that keeps all samples.
Examples
One-vs-one (OvO) maps each class pair (a, b) to that pair’s samples (the other classes are dropped). Pass the value source, df_parts (for CPP.run) and/or dict_num_parts (for CPP.run_num), and OvO returns, per pair, the row-matched copy plus the binary labels, ready to pass to CPP (no mask to apply yourself):

    import numpy as np
    import pandas as pd
    import aaanalysis as aa

    labels = [0, 0, 1, 1, 2, 2]
    df_parts = pd.DataFrame({"jmd_n": list("ACDEFG"), "tmd": list("HIKLMN"), "jmd_c": list("PQRSTV")})
    # each pair -> (df_parts_pair, dict_num_parts_pair, labels_pair)
    dict_labels = aa.SequenceFeature.get_labels_ovo(labels, df_parts=df_parts)
    {pair: (len(dfp), y.tolist()) for pair, (dfp, _, y) in dict_labels.items()}

{(0, 1): (4, [1, 1, 0, 0]), (0, 2): (4, [1, 1, 0, 0]), (1, 2): (4, [1, 1, 0, 0])}
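What this selection amounts to can be sketched without aaanalysis. The function below is a hedged illustration, not the library's code: the name ovo_pairs_sketch is hypothetical, and ordering pairs by sorted class value is an assumption that happens to match the output shown above. It returns, for each pair, the retained row positions and the binary labels over those rows.

```python
from itertools import combinations

import numpy as np


def ovo_pairs_sketch(labels, label_test=1, label_ref=0):
    """Illustrative only: map each class pair (a, b) to (row positions, binary labels)."""
    labels = np.asarray(labels)
    classes = np.unique(labels).tolist()  # sorted distinct class values
    out = {}
    for a, b in combinations(classes, 2):
        mask = (labels == a) | (labels == b)   # keep samples of class a or b
        rows = np.flatnonzero(mask)            # positions of the retained rows
        binary = np.where(labels[mask] == a, label_test, label_ref)
        out[(a, b)] = (rows, binary)
    return out


result = ovo_pairs_sketch([0, 0, 1, 1, 2, 2])
for pair, (rows, binary) in result.items():
    print(pair, rows.tolist(), binary.tolist())
# (0, 1) [0, 1, 2, 3] [1, 1, 0, 0]
# (0, 2) [0, 1, 4, 5] [1, 1, 0, 0]
# (1, 2) [2, 3, 4, 5] [1, 1, 0, 0]
```

The row positions play the role that df_parts_pair.index plays in the real return value: they record which original samples each pair kept.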
Full OvO workflow: each pair already carries its row-matched df_parts, so build a CPP on it directly, run, and collect:

    df_seq = aa.load_dataset(name="DOM_GSEC", n=9)
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    multiclass = np.array([i % 3 for i in range(len(df_parts))])
    frames = []
    for (a, b), (df_pair, _, binary) in aa.SequenceFeature.get_labels_ovo(multiclass, df_parts=df_parts).items():
        df_feat = aa.CPP(df_parts=df_pair).run(labels=binary, n_filter=3)
        df_feat["pair"] = f"({a},{b})"
        frames.append(df_feat)
    pd.concat(frames, ignore_index=True)[["pair", "feature", "abs_auc"]].head()

1. CPP creates 580140 features for 12 samples
   1.1 Assigning scale values to parts
   |.........................| 100.0%
   1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 12 samples
   1.1 Assigning scale values to parts
   |.........................| 100.0%
   1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 12 samples
   1.1 Assigning scale values to parts
   |.........................| 100.0%
   1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=516049 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
    pair   feature                                abs_auc
0   (0,1)  JMD_N_TMD_N-Pattern(C,4,8)-PALJ810106  0.5
1   (0,1)  TMD-Pattern(N,5,9)-PALJ810106          0.5
2   (0,1)  TMD-Segment(4,12)-PALJ810114           0.5
3   (0,2)  JMD_N_TMD_N-Pattern(N,5,9)-AURR980115  0.5
4   (0,2)  TMD-Pattern(N,4,7,11)-QIAN880105       0.5

What can go wrong? As for OvR, labels must be integers with more than one class:
    try:
        aa.SequenceFeature.get_labels_ovo([1, 1, 1])
    except ValueError as e:
        print("ValueError:", e)
ValueError: 'labels' should contain more than one different value ({np.int64(1)}).