SequenceFeature.get_labels_ovr

static SequenceFeature.get_labels_ovr(labels=None, label_test=1, label_ref=0)

Convert multi-class labels into one-vs-rest (OvR) binary label arrays.

One-vs-rest (OvR) maps each of the K classes to a full-length binary label array in which that class is the test group and all remaining classes form the reference group. Because no samples are dropped, df_parts stays unchanged and the K arrays can be passed one after another to CPP.run() on a single CPP instance, yielding one binary feature set per class. To compare classes pairwise and discard the samples of all other classes instead, use get_labels_ovo().

Added in version 1.1.0.

Parameters:
  • labels (array-like, shape (n_samples,)) – Multi-class labels for samples. Must be integers with more than one distinct value. Continuous targets should be discretized first with get_labels_quantile().

  • label_test (int, default=1) – Value assigned to the target class of each one-vs-rest array.

  • label_ref (int, default=0) – Value assigned to all remaining classes.

Returns:

dict_labels – Dictionary mapping each class label to its one-vs-rest binary label array (numpy array of shape (n_samples,)), keyed in sorted class order.

Return type:

dict

Notes

  • Each returned binary label array is directly usable as the labels argument of CPP.run() / CPP.run_num().

  • To aggregate the per-class results, run CPP per array and concatenate the returned df_feat frames, tagging each with its class key.

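  • Complexity: building the label arrays takes O(n_samples × n_classes) time and memory, linear in both, so the label conversion itself stays cheap for large K. Each array still needs its own CPP.run() call.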
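
The mapping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: labels_ovr_sketch is a hypothetical helper, it returns lists rather than numpy arrays, and its error messages are made up for the example.

```python
def labels_ovr_sketch(labels, label_test=1, label_ref=0):
    """Illustrative one-vs-rest construction (hypothetical, not aaanalysis code)."""
    classes = sorted(set(labels))  # sorted class order, mirroring the documented dict keys
    if len(classes) < 2:
        raise ValueError("need more than one distinct class")
    if label_test == label_ref:
        raise ValueError("label_test must differ from label_ref")
    # One full-length array per class: K passes over n samples,
    # which is where the O(n_samples x n_classes) cost comes from.
    return {
        cls: [label_test if y == cls else label_ref for y in labels]
        for cls in classes
    }


dict_labels = labels_ovr_sketch([0, 0, 1, 1, 2, 2])
for cls, vec in dict_labels.items():
    print(cls, vec)
```

Every array has the same length as the input, which is why the same df_parts can be reused for each class.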

Examples

One-vs-rest (OvR) turns multi-class labels into one binary label array per class (that class = test, all others = reference). No samples are dropped, so every array runs through a single CPP instance:

import numpy as np
import pandas as pd
import aaanalysis as aa

labels = [0, 0, 1, 1, 2, 2]
dict_labels = aa.SequenceFeature.get_labels_ovr(labels)
dict_labels
{0: array([1, 1, 0, 0, 0, 0]),
 1: array([0, 0, 1, 1, 0, 0]),
 2: array([0, 0, 0, 0, 1, 1])}

Loop the per-class arrays through CPP.run and concatenate, tagging each df_feat with its class:

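# Small demo dataset (18 samples, as reported in the log output)
df_seq = aa.load_dataset(name="DOM_GSEC", n=9)
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
# Synthetic three-class labels (0, 1, 2 repeating), for illustration only
multiclass = np.array([i % 3 for i in range(len(df_parts))])

# One CPP instance is reused for all classes because df_parts does not change
cpp = aa.CPP(df_parts=df_parts)
frames = []
for cls, vec in aa.SequenceFeature.get_labels_ovr(multiclass).items():
    df_feat = cpp.run(labels=vec, n_filter=3)  # 3 features per class
    df_feat["class"] = cls                     # tag rows with the class key
    frames.append(df_feat)
pd.concat(frames, ignore_index=True)[["class", "feature", "abs_auc"]].head()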
1. CPP creates 580140 features for 18 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=510711 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 18 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=516049 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
1. CPP creates 580140 features for 18 samples
1.1 Assigning scale values to parts
   |.........................| 100.0%
1.2 Streaming pre-filter stats (mask in stream)
   |.........................| 100.0%
2. CPP pre-filters 29007 features (5.0%) with highest 'abs_mean_dif' and 'max_std_test' <= 0.2 (kept=523492 of 580140)
3. CPP filtering algorithm
4. CPP returns df of 3 unique features with general information and statistics
   class                                      feature  abs_auc
0      0      TMD_C_JMD_C-Pattern(C,1,4,8)-PALJ810115    0.500
1      0        TMD_C_JMD_C-Pattern(C,4,8)-GEIM800102    0.500
2      0  JMD_N_TMD_N-Pattern(N,3,6,10,14)-CHOP780215    0.500
3      1          TMD-Pattern(C,4,7,11,15)-RACS820103    0.493
4      1    JMD_N_TMD_N-Pattern(C,6,10,14)-AURR980109    0.486

What can go wrong? Labels must be integers with more than one class, and label_test must differ from label_ref:

for bad in [lambda: aa.SequenceFeature.get_labels_ovr([1, 1, 1]),
            lambda: aa.SequenceFeature.get_labels_ovr([0, 1], label_test=1, label_ref=1),
            lambda: aa.SequenceFeature.get_labels_ovr([0.0, 1.0, 2.0])]:
    try:
        bad()
    except ValueError as e:
        print("ValueError:", e)
ValueError: 'labels' should contain more than one different value ({np.int64(1)}).
ValueError: 'label_test' (1) should differ from 'label_ref' (1).
ValueError: Labels in 'labels' should be type int, but contain: {<class 'numpy.float64'>}
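
The last error typically shows up when class IDs arrive as floats, for example after loading from a CSV file. A minimal sketch of a pre-cast, assuming the float values really are whole-numbered class IDs; to_int_labels is a hypothetical helper and not part of aaanalysis:

```python
def to_int_labels(labels):
    """Cast integral floats to int; refuse values that would be truncated (hypothetical helper)."""
    out = []
    for y in labels:
        if float(y) != int(y):
            # e.g. 0.5 is a continuous value, not a class ID
            raise ValueError(f"non-integral label: {y!r}")
        out.append(int(y))
    return out


clean = to_int_labels([0.0, 1.0, 2.0])
print(clean)
```

The result is a list of plain Python ints, the same form as the labels list passed to get_labels_ovr in the first example. Truly continuous targets are a different case and should go through get_labels_quantile() instead of a cast.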