dPULearn.eval
- static dPULearn.eval(X, list_labels, names_datasets=None, X_neg=None, comp_kld=False, n_jobs=None)
Evaluates the quality of different sets of identified negatives.
Quality is assessed in two categories:
Homogeneity within the reliably identified negatives (0).
Dissimilarity between the reliably identified negatives and the groups of positive samples (‘pos’), unlabeled samples (‘unl’), and, if provided via X_neg, ground-truth negative samples (‘neg’).
Added in version 0.1.0.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
list_labels (array-like, shape (n_datasets, n_samples)) – List of arrays with dataset labels for samples in X, obtained by the dPULearn.fit() method. Label values should be either 0 (identified negative), 1 (positive), or 2 (unlabeled).
names_datasets (list, optional) – List of dataset names corresponding to list_labels.
X_neg (array-like, shape (n_samples_neg, n_features), optional) – Feature matrix where n_samples_neg is the number of ground-truth negative samples and n_features is the number of features. Features must correspond to X.
comp_kld (bool, default=False) – Whether to compute Kullback-Leibler Divergence (KLD) to assess the distribution alignment between identified negatives and other data groups. Disable (False) if X is sparse or has low covariance.
n_jobs (int, None, or -1, default=None) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
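The constraints on list_labels stated above (one label per row of X, values restricted to 0, 1, and 2) can be checked before calling eval. The following is a minimal sketch; check_list_labels is a hypothetical helper written for illustration and is not part of aaanalysis, which performs its own input validation.

```python
def check_list_labels(X, list_labels):
    """Hypothetical pre-check (not part of aaanalysis): verify label sets against X."""
    n_samples = len(X)
    allowed = {0, 1, 2}  # 0: identified negative, 1: positive, 2: unlabeled
    for i, labels in enumerate(list_labels):
        # Each label set must provide exactly one label per sample (row) in X
        if len(labels) != n_samples:
            raise ValueError(f"Label set {i} has {len(labels)} labels, expected {n_samples}")
        # Only the three documented label values are accepted
        unexpected = set(labels) - allowed
        if unexpected:
            raise ValueError(f"Label set {i} contains unexpected values: {sorted(unexpected)}")
    return True


X = [[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]]
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
labels_ok = check_list_labels(X, list_labels)
```

A check of this kind fails early with a message naming the offending label set, which can be easier to trace than an error raised inside the evaluation.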
- Returns:
df_eval – Evaluation results for each set of identified negatives from list_labels. For each set, statistical measures are averaged across all features.
- Return type:
pd.DataFrame
Notes
df_eval includes the following columns:
‘name’: Name of the dataset if names_datasets is provided (typically named by identification approach).
‘n_rel_neg’: Number of identified negatives.
‘avg_STD’: Average standard deviation (STD) assessing homogeneity of identified negatives. Lower values indicate greater homogeneity.
‘avg_IQR’: Average interquartile range (IQR) assessing homogeneity of identified negatives. Lower values suggest greater homogeneity.
‘avg_abs_AUC_pos’ / ‘avg_abs_AUC_unl’ / ‘avg_abs_AUC_neg’: Average absolute area under the curve (AUC) assessing the dissimilarity between the set of identified negatives and each other group (positives, unlabeled, ground-truth negatives). Higher values indicate greater dissimilarity.
‘avg_KLD_pos’ / ‘avg_KLD_unl’ / ‘avg_KLD_neg’: Average Kullback-Leibler Divergence (KLD) assessing the dissimilarity of distributions between the set of identified negatives and each other group. Higher values indicate greater dissimilarity. These columns are omitted if comp_kld is set to False.
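To give an intuition for the KLD columns and for the variance caveat attached to comp_kld, the sketch below computes the closed-form KL divergence between two univariate normal distributions fitted to the values of a single feature. This Gaussian approximation and the gaussian_kld helper are assumptions made for illustration only; they are not necessarily the estimator used by dPULearn.eval.

```python
import math
from statistics import mean, pstdev


def gaussian_kld(sample_p, sample_q):
    """KL(P || Q) for two univariate normal distributions fitted to the samples.

    Illustrative assumption only; not necessarily the estimator used by dPULearn.eval.
    """
    mu_p, mu_q = mean(sample_p), mean(sample_q)
    sd_p, sd_q = pstdev(sample_p), pstdev(sample_q)
    # Without any spread in a group, the fitted density degenerates
    if sd_p == 0 or sd_q == 0:
        raise ValueError("KLD is undefined when a group has zero variance")
    return math.log(sd_q / sd_p) + (sd_p**2 + (mu_p - mu_q) ** 2) / (2 * sd_q**2) - 0.5


kld_same = gaussian_kld([0.2, 0.5], [0.2, 0.5])     # identical groups
kld_shifted = gaussian_kld([0.2, 0.5], [0.6, 0.9])  # similar spread, shifted mean
```

Identical groups yield a divergence of zero, and the value grows as the distributions move apart. A group with a single sample or with constant values has zero variance, so the divergence cannot be computed in this sketch, which is consistent with the advice to disable comp_kld for data with little variance.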
See also
dPULearnPlot.eval(): the respective plotting method.
dPULearn: Learning From Unbalanced Data for details on different evaluation strategies.
Examples
Create a small example dataset for dPULearn containing positive (1) and unlabeled (2) samples as well as identified negatives (0):

    import aaanalysis as aa
    import pandas as pd
    import numpy as np

    aa.options["verbose"] = False
    X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
    # Three different sets of labels
    list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
Use the dPULearn().eval() method to obtain the evaluation for each label set:

    dpul = aa.dPULearn()
    df_eval = dpul.eval(X, list_labels=list_labels)
    aa.display_df(df_eval)
       name   n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
    1  Set 1  2          0.175000  0.175000  0.437500         0.250000
    2  Set 2  2          0.187500  0.187500  0.500000         0.250000
    3  Set 3  2          0.037500  0.037500  0.437500         0.500000

The dataset names given in the ‘name’ column can be customized, typically using the name of the identification method, e.g., ‘euclidean’ for Euclidean distance-based identification. This can be achieved by setting names_datasets:

    names_datasets = ["Dataset 1", "Dataset 2", "Dataset 3"]
    df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets)
    aa.display_df(df_eval)
       name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
    1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000
    2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000
    3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000

The df_eval DataFrame provides two categories of quality measures:
Homogeneity Within Negatives: Measured by ‘avg_STD’ and ‘avg_IQR’, indicating the uniformity and spread of identified negatives.
Dissimilarity With Other Groups: Represented here by ‘avg_abs_AUC_pos/unl’, comparing identified negatives with positives (‘pos’, label 1) and unlabeled samples (‘unl’, label 2).
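To make these measures concrete, the following standard-library sketch recomputes the four values of the first table row from the first label set. It assumes a population standard deviation, linearly interpolated quartiles, and an AUC that is shifted by 0.5 before taking the absolute value. These assumptions are inferred from the example values above rather than taken from the library source, and the helper names are hypothetical.

```python
from statistics import mean, pstdev, quantiles

X = [[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]]
labels = [1, 1, 2, 0, 0]  # first label set from the example above


def columns_for(label):
    """Return the per-feature value lists of all samples carrying `label`."""
    rows = [x for x, lab in zip(X, labels) if lab == label]
    return [list(col) for col in zip(*rows)]


def iqr(values):
    """Interquartile range with linear interpolation between data points."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def abs_adjusted_auc(group_a, group_b):
    """|AUC - 0.5|, where AUC is the share of (a, b) pairs with a > b (ties count 0.5)."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in group_a for b in group_b)
    return abs(wins / (len(group_a) * len(group_b)) - 0.5)


neg, pos, unl = columns_for(0), columns_for(1), columns_for(2)
# Each statistic is computed per feature and then averaged across features
avg_std = mean(pstdev(f) for f in neg)
avg_iqr = mean(iqr(f) for f in neg)
avg_abs_auc_pos = mean(abs_adjusted_auc(n, p) for n, p in zip(neg, pos))
avg_abs_auc_unl = mean(abs_adjusted_auc(n, u) for n, u in zip(neg, unl))
print(round(avg_std, 4), round(avg_iqr, 4), round(avg_abs_auc_pos, 4), round(avg_abs_auc_unl, 4))
# prints: 0.175 0.175 0.4375 0.25
```

Under these assumptions, an absolute AUC of 0.5 means that one group lies entirely above or below the other for a feature, while 0 means the two groups cannot be separated by that feature.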
For a more comprehensive analysis, include X_neg as a feature matrix of ground-truth negatives to assess their dissimilarity with the identified negatives:

    X_neg = [[0.5, 0.8], [0.4, 0.4]]
    df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg)
    aa.display_df(df_eval)
       name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl  avg_abs_AUC_neg
    1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000         0.187500
    2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000         0.187500
    3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000         0.500000

If the variance within the data is high enough, the Kullback-Leibler Divergence (KLD) can be computed to assess the dissimilarity of distributions between the identified negatives and the other groups:

    # Extend the unlabeled group by one sample to fulfill variance requirements
    X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8]])
    list_labels = [[1, 1, 2, 0, 0, 2], [1, 1, 0, 2, 0, 2], [1, 1, 0, 0, 2, 2]]
    df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets,
                        X_neg=X_neg, comp_kld=True)
    aa.display_df(df_eval)
       name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_KLD_pos  avg_abs_AUC_unl  avg_KLD_unl  avg_abs_AUC_neg  avg_KLD_neg
    1  Dataset 1  2          0.175000  0.175000  0.437500         1.414400     0.125000         0.003100     0.187500         0.181300
    2  Dataset 2  2          0.187500  0.187500  0.500000         1.366900     0.125000         0.003300     0.187500         0.104100
    3  Dataset 3  2          0.037500  0.037500  0.437500         1.016800     0.500000         30.317900    0.500000         12.020200

Further parameters.
dPULearn.eval also accepts:
n_jobs: Number of CPU cores (>=1) used for multiprocessing.