SequenceFeature.get_labels_ovr
- static SequenceFeature.get_labels_ovr(labels=None, label_test=1, label_ref=0)
Convert multi-class labels into one-vs-rest (OvR) binary label arrays.
One-vs-rest (OvR) maps each of the K classes to a full-length binary label array in which that class is the test group and all remaining classes are the reference group. Since no samples are dropped, the K arrays can be looped through a single CPP instance via CPP.run() (the df_parts is unchanged), yielding one binary feature set per class. To discard the other classes instead, use get_labels_ovo().
Added in version 1.1.0.
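The mapping described above can be sketched in plain NumPy. This is an illustrative re-implementation, not the library source: the helper name ovr_labels_sketch is hypothetical, and it skips the input validation that get_labels_ovr() performs.

```python
import numpy as np


def ovr_labels_sketch(labels, label_test=1, label_ref=0):
    """Illustrative one-vs-rest mapping (hypothetical helper, not the library source)."""
    labels = np.asarray(labels)
    # np.unique returns the distinct classes in sorted order
    classes = np.unique(labels)
    # One full-length array per class: that class -> label_test, all others -> label_ref
    return {int(c): np.where(labels == c, label_test, label_ref) for c in classes}


dict_sketch = ovr_labels_sketch([0, 0, 1, 1, 2, 2])
for cls, vec in dict_sketch.items():
    print(cls, vec)
```

Every array keeps the full length of the input, which is why the same df_parts can be reused for each class.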
- Parameters:
labels (array-like, shape (n_samples,)) – Multi-class labels for samples. Must be integers (more than one distinct value); for continuous targets discretize first with get_labels_quantile().
label_test (int, default=1) – Value assigned to the target class of each one-vs-rest array.
label_ref (int, default=0) – Value assigned to all remaining classes.
- Returns:
dict_labels – Dictionary mapping each class label to its one-vs-rest binary label array (numpy array of shape (n_samples,)), keyed in sorted class order.
- Return type:
dict
Notes
Each returned binary label array is directly usable as the labels argument of CPP.run() / CPP.run_num().
To aggregate the per-class results, run CPP per array and concatenate the returned df_feat frames, tagging each with its class key.
Complexity: O(n_samples × n_classes); scales linearly in both, so OvR stays cheap for large K.
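The O(n_samples × n_classes) cost corresponds to one cell per sample and class. A minimal sketch, assuming the default label_test=1 and label_ref=0 and using plain NumPy rather than the library call, stacks the K arrays into a matrix and checks that each sample is the test class in exactly one of them:

```python
import numpy as np

labels = np.array([2, 0, 1, 1, 0, 2, 2])
classes = np.unique(labels)  # sorted distinct classes, K = 3

# Stack the K one-vs-rest arrays into a (K, n_samples) matrix
ovr_matrix = np.stack([(labels == c).astype(int) for c in classes])

print(ovr_matrix.shape)        # (3, 7)
print(ovr_matrix.sum(axis=0))  # per sample: number of arrays in which it is the test class
print(ovr_matrix.sum(axis=1))  # per class: size of the test group
```

The column sums are all 1, so the K arrays together form a one-hot encoding of the original labels.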
See also
aaanalysis.CPP: consumes each returned binary label array via CPP.run().
get_labels_ovo(): pairwise (one-vs-one) alternative that subsets samples.
get_labels_quantile(): discretize a continuous target into binary labels.
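To make the contrast between the two schemes concrete, the sketch below builds both in plain NumPy. The return format of get_labels_ovo() is not shown on this page, so the (mask, labels) tuple and the choice of the first class of each pair as the test group are assumptions made purely for illustration.

```python
from itertools import combinations

import numpy as np

labels = np.array([0, 0, 1, 1, 2, 2])
classes = np.unique(labels)

# One-vs-rest: K full-length arrays, no samples dropped
ovr = {int(c): (labels == c).astype(int) for c in classes}

# One-vs-one (illustrative only): K*(K-1)/2 pairs, each keeping just two classes
ovo = {}
for a, b in combinations(classes, 2):
    mask = np.isin(labels, [a, b])  # samples kept for this pair
    ovo[(int(a), int(b))] = (mask, (labels[mask] == a).astype(int))

print(len(ovr), len(ovo))
print({pair: int(mask.sum()) for pair, (mask, _) in ovo.items()})
```

Because one-vs-one drops samples, each pair would need a correspondingly subsetted df_parts, whereas one-vs-rest reuses the same df_parts for every class.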
Examples
One-vs-rest (OvR) turns multi-class labels into one binary label array per class (that class = test, all others = reference). No samples are dropped, so every array runs through a single CPP instance:

    import numpy as np
    import pandas as pd
    import aaanalysis as aa

    labels = [0, 0, 1, 1, 2, 2]
    dict_labels = aa.SequenceFeature.get_labels_ovr(labels)
    dict_labels

{0: array([1, 1, 0, 0, 0, 0]), 1: array([0, 0, 1, 1, 0, 0]), 2: array([0, 0, 0, 0, 1, 1])}
Loop the per-class arrays through CPP.run and concatenate, tagging each df_feat with its class:

    df_seq = aa.load_dataset(name="DOM_GSEC", n=9)
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    multiclass = np.array([i % 3 for i in range(len(df_parts))])
    cpp = aa.CPP(df_parts=df_parts)
    frames = []
    for cls, vec in aa.SequenceFeature.get_labels_ovr(multiclass).items():
        df_feat = cpp.run(labels=vec, n_filter=3)
        df_feat["class"] = cls
        frames.append(df_feat)
    pd.concat(frames, ignore_index=True)[["class", "feature", "abs_auc"]].head()

    1. CPP creates 580140 features for 18 samples
       1.1 Assigning scale values to parts
       |.........................| 100.0%
       1.2 Streaming pre-filter stats (mask in stream)
       |.........................| 100.0%
    2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
    3. CPP filtering algorithm
    4. CPP returns df of 3 unique features with general information and statistics
    1. CPP creates 580140 features for 18 samples
       1.1 Assigning scale values to parts
       |.........................| 100.0%
       1.2 Streaming pre-filter stats (mask in stream)
       |.........................| 100.0%
    2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=516049 of 580140)
    3. CPP filtering algorithm
    4. CPP returns df of 3 unique features with general information and statistics
    1. CPP creates 580140 features for 18 samples
       1.1 Assigning scale values to parts
       |.........................| 100.0%
       1.2 Streaming pre-filter stats (mask in stream)
       |.........................| 100.0%
    2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=523492 of 580140)
    3. CPP filtering algorithm
    4. CPP returns df of 3 unique features with general information and statistics

       class  feature                                       abs_auc
    0  0      TMD_C_JMD_C-Pattern(C,1,4,8)-PALJ810115       0.500
    1  0      TMD_C_JMD_C-Pattern(C,4,8)-GEIM800102         0.500
    2  0      JMD_N_TMD_N-Pattern(N,3,6,10,14)-CHOP780215   0.500
    3  1      TMD-Pattern(C,4,7,11,15)-RACS820103           0.493
    4  1      JMD_N_TMD_N-Pattern(C,6,10,14)-AURR980109     0.486

What can go wrong?
Labels must be integers with more than one class, and label_test must differ from label_ref:

    for bad in [lambda: aa.SequenceFeature.get_labels_ovr([1, 1, 1]),
                lambda: aa.SequenceFeature.get_labels_ovr([0, 1], label_test=1, label_ref=1),
                lambda: aa.SequenceFeature.get_labels_ovr([0.0, 1.0, 2.0])]:
        try:
            bad()
        except ValueError as e:
            print("ValueError:", e)

    ValueError: 'labels' should contain more than one different value ({np.int64(1)}).
    ValueError: 'label_test' (1) should differ from 'label_ref' (1).
    ValueError: Labels in 'labels' should be type int, but contain: {<class 'numpy.float64'>}
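
The three failure cases above can be caught before calling the method. The pre-check below is a hypothetical helper (check_ovr_inputs_sketch is not part of aaanalysis); its error messages and the order of its checks are assumptions and differ from the library's own.

```python
import numpy as np


def check_ovr_inputs_sketch(labels, label_test=1, label_ref=0):
    """Hypothetical pre-check mirroring the three documented failure cases."""
    arr = np.asarray(labels)
    # Non-integer labels (e.g. floats) are rejected
    if not np.issubdtype(arr.dtype, np.integer):
        raise ValueError(f"labels must be integers, got dtype {arr.dtype}")
    # A single distinct value leaves no 'rest' group
    if np.unique(arr).size < 2:
        raise ValueError("labels must contain more than one distinct value")
    # Identical test/reference values would make every array constant
    if label_test == label_ref:
        raise ValueError("label_test must differ from label_ref")
    return arr


for kwargs in (
    {"labels": [1, 1, 1]},
    {"labels": [0, 1], "label_test": 1, "label_ref": 1},
    {"labels": [0.0, 1.0, 2.0]},
):
    try:
        check_ovr_inputs_sketch(**kwargs)
    except ValueError as e:
        print("ValueError:", e)
```

Float labels that represent a continuous target should be discretized first, as noted under Parameters, rather than cast to int.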