aaanalysis.dPULearn.eval
- static dPULearn.eval(X, list_labels=None, names_datasets=None, X_neg=None, comp_kld=False, n_jobs=None)
Evaluates the quality of different sets of identified negatives.
The quality is assessed using two groups of measures:
Homogeneity within the reliably identified negatives (0)
Dissimilarity between the reliably identified negatives and the groups of positive samples (‘pos’), unlabeled samples (‘unl’), and a ground-truth negative (‘neg’) sample group if provided by X_neg
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
list_labels (array-like, shape (n_datasets, n_samples)) – List of arrays with dataset labels for samples in X obtained by the dPULearn.fit() method. Label values should be either 0 (identified negative), 1 (positive) or 2 (unlabeled).
names_datasets (list, optional) – List of dataset names corresponding to list_labels.
X_neg (array-like, shape (n_samples_neg, n_features), optional) – Feature matrix where n_samples_neg is the number of ground-truth negative samples and n_features is the number of features. Features must correspond to X.
comp_kld (bool, default=False) – Whether to compute Kullback-Leibler Divergence (KLD) to assess the distribution alignment between identified negatives and other data groups. Disable (False) if X is sparse or has low covariance.
n_jobs (int, None, or -1, default=None) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
- Returns:
df_eval – Evaluation results for each set of identified negatives from list_labels. For each set, statistical measures are averaged across all features.
- Return type:
pd.DataFrame
Notes
df_eval includes the following columns:
‘name’: Name of the dataset if names_datasets is provided (typically named by identification approach).
‘n_rel_neg’: Number of identified negatives.
‘avg_STD’: Average standard deviation (STD) assessing homogeneity of identified negatives. Lower values indicate greater homogeneity.
‘avg_IQR’: Average interquartile range (IQR) assessing homogeneity of identified negatives. Lower values indicate greater homogeneity.
‘avg_abs_AUC_DATASET’: Average absolute area under the curve (AUC) assessing the dissimilarity between the set of identified negatives and the other groups (positives, unlabeled, ground-truth negatives). Separate columns are provided for each comparison. Higher values indicate greater dissimilarity.
‘avg_KLD_DATASET’: Average Kullback-Leibler Divergence (KLD) assessing the dissimilarity of distributions between the set of identified negatives and the other groups. Higher values indicate greater dissimilarity. These columns are omitted if comp_kld is set to False.
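The two homogeneity measures can be reproduced by hand for a small case. The following is a minimal sketch, not the library's implementation: it assumes the population standard deviation and linearly interpolated quartiles, computed per feature and then averaged, because these choices reproduce the value 0.175 reported for the first label set in the Examples. It uses only the Python standard library, and all variable and function names are illustrative.

```python
from statistics import mean, pstdev, quantiles

# Data and first label set from the Examples section
X = [[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]]
labels = [1, 1, 2, 0, 0]

# Keep only the rows labeled 0 (identified negatives), then transpose to features
X_rel_neg = [row for row, label in zip(X, labels) if label == 0]
features = list(zip(*X_rel_neg))


def iqr(values):
    """Interquartile range with linearly interpolated quartiles (assumed convention)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Per-feature spread, averaged across all features
avg_std = mean(pstdev(col) for col in features)  # population STD (assumed)
avg_iqr = mean(iqr(col) for col in features)

print(round(avg_std, 4), round(avg_iqr, 4))  # 0.175 0.175
```

With only two identified negatives per set, STD and IQR coincide here; on larger sets they generally differ, with IQR being less sensitive to outliers.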
See also
dPULearnPlot.eval(): the respective plotting method.
dPULearn: Learning From Unbalanced Data for details on different evaluation strategies.
Examples
Create a small example dataset for dPULearn containing positive (1) and unlabeled (2) data samples and the identified negatives (0):
    import aaanalysis as aa
    import pandas as pd
    import numpy as np
    aa.options["verbose"] = False
    X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
    # Three different sets of labels
    list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
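Before evaluation, it can help to confirm that every label set has one label per sample and uses only the values 0, 1, and 2. The sketch below is a hypothetical helper (summarize_label_sets is not part of aaanalysis) that performs this check in plain Python and counts the samples in each group:

```python
from collections import Counter

VALID_LABELS = {0, 1, 2}  # 0: identified negative, 1: positive, 2: unlabeled


def summarize_label_sets(list_labels, n_samples):
    """Hypothetical helper: validate each label set and count samples per group."""
    summaries = []
    for i, labels in enumerate(list_labels):
        if len(labels) != n_samples:
            raise ValueError(f"Label set {i} has {len(labels)} labels, expected {n_samples}")
        unexpected = set(labels) - VALID_LABELS
        if unexpected:
            raise ValueError(f"Label set {i} contains unexpected labels: {sorted(unexpected)}")
        counts = Counter(labels)
        summaries.append({"n_rel_neg": counts[0], "n_pos": counts[1], "n_unl": counts[2]})
    return summaries


# Same three label sets as in the example above (five samples each)
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
summaries = summarize_label_sets(list_labels, n_samples=5)
print(summaries[0])  # {'n_rel_neg': 2, 'n_pos': 2, 'n_unl': 1}
```

The count of label 0 per set corresponds to the ‘n_rel_neg’ column reported by the evaluation.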
Use the dPULearn().eval() method to obtain the evaluation for each label set:
    dpul = aa.dPULearn()
    df_eval = dpul.eval(X, list_labels=list_labels)
    aa.display_df(df_eval)
       name   n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
    1  Set 1  2          0.175000  0.175000  0.437500         0.250000
    2  Set 2  2          0.187500  0.187500  0.500000         0.250000
    3  Set 3  2          0.037500  0.037500  0.437500         0.500000
The dataset names given in the ‘name’ column can be customized, typically using the name of the identification method, e.g., ‘euclidean’ for a Euclidean distance-based approach. This can be achieved by setting names_datasets:
    names_datasets = ["Dataset 1", "Dataset 2", "Dataset 3"]
    df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets)
    aa.display_df(df_eval)
       name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
    1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000
    2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000
    3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000
The df_eval DataFrame provides two categories of quality measures:
Homogeneity Within Negatives: Measured by ‘avg_STD’ and ‘avg_IQR’, indicating the uniformity and spread of identified negatives.
Dissimilarity With Other Groups: Represented here by ‘avg_abs_AUC_pos/unl’, comparing identified negatives with positives (‘pos’, label 1) and unlabeled samples (‘unl’, label 2).
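The dissimilarity values can also be checked by hand. The sketch below is an assumption about how such a score can be computed, not the library's code: for each feature, the AUC is taken as the probability that an identified negative has a higher value than a sample from the other group (ties count half), shifted by 0.5 so that 0 means no separation; the absolute values are then averaged across features. Under these assumptions it reproduces the 0.4375 and 0.25 shown for the first label set. The helper names adjusted_auc and avg_abs_auc are hypothetical.

```python
from statistics import mean


def adjusted_auc(a, b):
    """P(a > b) + 0.5 * P(a == b) over all pairs, shifted to the range [-0.5, 0.5]."""
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    return wins / (len(a) * len(b)) - 0.5


def avg_abs_auc(X_a, X_b):
    """Average absolute adjusted AUC across features (columns) of two groups."""
    return mean(abs(adjusted_auc(col_a, col_b))
                for col_a, col_b in zip(zip(*X_a), zip(*X_b)))


# Data and first label set from the example above
X = [[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]]
labels = [1, 1, 2, 0, 0]

# Split rows by label: 0 = identified negatives, 1 = positives, 2 = unlabeled
groups = {value: [row for row, label in zip(X, labels) if label == value]
          for value in (0, 1, 2)}

auc_pos = avg_abs_auc(groups[0], groups[1])
auc_unl = avg_abs_auc(groups[0], groups[2])
print(auc_pos, auc_unl)  # 0.4375 0.25
```

Taking the absolute value makes the score direction-free: a feature on which the identified negatives are consistently lower than the other group counts as strongly as one on which they are consistently higher.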
For a more comprehensive analysis, include X_neg as a feature matrix of ground-truth negatives to assess their dissimilarity with the identified negatives:
    X_neg = [[0.5, 0.8], [0.4, 0.4]]
    df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg)
    aa.display_df(df_eval)
       name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl  avg_abs_AUC_neg
    1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000         0.187500
    2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000         0.187500
    3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000         0.500000
If the variance within the data is high enough, the Kullback-Leibler Divergence (KLD) can be computed to assess the dissimilarity of distributions between the identified negatives and the other groups:
    # Extend the unlabeled group by one sample to fulfill variance requirements
    X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8]])
    list_labels = [[1, 1, 2, 0, 0, 2], [1, 1, 0, 2, 0, 2], [1, 1, 0, 0, 2, 2]]
    df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg, comp_kld=True)
    aa.display_df(df_eval)
       name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_KLD_pos  avg_abs_AUC_unl  avg_KLD_unl  avg_abs_AUC_neg  avg_KLD_neg
    1  Dataset 1  2          0.175000  0.175000  0.437500         1.414400     0.125000         0.003100     0.187500         0.181300
    2  Dataset 2  2          0.187500  0.187500  0.500000         1.366900     0.125000         0.003300     0.187500         0.104100
    3  Dataset 3  2          0.037500  0.037500  0.437500         1.016800     0.500000         30.317900    0.500000         12.020200
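For reference when reading the KLD columns, the general definition of the Kullback-Leibler divergence between two densities p and q of a single feature is given below. This is the textbook formula only; which group plays the role of p, and how the densities are estimated from the samples, is not stated on this page and is left open here.

```latex
D_{\mathrm{KL}}(p \,\|\, q) = \int p(x) \, \log \frac{p(x)}{q(x)} \, dx
```

The divergence is non-negative, equals zero only when the two distributions coincide, is not symmetric in p and q, and has no upper bound. This is why, unlike the AUC-based columns, KLD values such as 30.3179 for Dataset 3 can far exceed 1.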