dPULearn.eval

static dPULearn.eval(X, list_labels, names_datasets=None, X_neg=None, comp_kld=False, n_jobs=None)

Evaluates the quality of different sets of identified negatives.

Quality is assessed in two categories:

Homogeneity within the reliably identified negatives (0)
Dissimilarity between the reliably identified negatives and the groups of positive samples (‘pos’), unlabeled samples (‘unl’), and a ground-truth negative (‘neg’) sample group if provided by X_neg

Added in version 0.1.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
list_labels (array-like, shape (n_datasets, n_samples)) – List of arrays with dataset labels for samples in X obtained by the dPULearn.fit() method. Label values should be either 0 (identified negative), 1 (positive) or 2 (unlabeled).
names_datasets (list, optional) – List of dataset names corresponding to list_labels.
X_neg (array-like, shape (n_samples_neg, n_features), optional) – Feature matrix where n_samples_neg is the number of ground-truth negative samples and n_features is the number of features. Features must correspond to X.
comp_kld (bool, default=False) – Whether to compute Kullback-Leibler Divergence (KLD) to assess the distribution alignment between identified negatives and other data groups. Leave disabled (False) if X is sparse or has low covariance.
n_jobs (int, None, or -1, default=None) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

df_eval – Evaluation results for each set of identified negatives from list_labels. For each set, the statistical measures are averaged across all features.

Return type:

pd.DataFrame

Notes

df_eval includes the following columns:

‘name’: Name of the dataset if names_datasets is provided (typically named by identification approach).
‘n_rel_neg’: Number of identified negatives.
‘avg_STD’: Average standard deviation (STD) assessing homogeneity of identified negatives. Lower values indicate greater homogeneity.
‘avg_IQR’: Average interquartile range (IQR) assessing homogeneity of identified negatives. Lower values suggest greater homogeneity.
‘avg_abs_AUC_pos’ / ‘avg_abs_AUC_unl’ / ‘avg_abs_AUC_neg’: Average absolute area under the curve (AUC) assessing the dissimilarity between the set of identified negatives and each other group (positives, unlabeled, ground-truth negatives). Higher values indicate greater dissimilarity.
‘avg_KLD_pos’ / ‘avg_KLD_unl’ / ‘avg_KLD_neg’: Average Kullback-Leibler Divergence (KLD) assessing the dissimilarity of distributions between the set of identified negatives and each other group. Higher values indicate greater dissimilarity. These columns are omitted if comp_kld is set to False.
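
The two homogeneity columns can be illustrated with plain NumPy. This is a minimal sketch, not the library's implementation: it assumes the population standard deviation (ddof=0) and NumPy's default linear-interpolation percentiles, each computed per feature over the identified negatives and then averaged. The helper name homogeneity_sketch is hypothetical. Under these assumptions it reproduces the avg_STD and avg_IQR values listed for 'Set 1' in the Examples section.

```python
import numpy as np

def homogeneity_sketch(X, labels, label_neg=0):
    """Average per-feature STD and IQR of the identified negatives (illustrative only)."""
    X = np.asarray(X, dtype=float)
    mask = np.asarray(labels) == label_neg
    X_rel_neg = X[mask]  # rows labeled as identified negatives
    # Assumption: population STD (ddof=0), averaged over features
    avg_std = np.std(X_rel_neg, axis=0).mean()
    # Assumption: IQR from NumPy's default (linear) percentiles
    q75, q25 = np.percentile(X_rel_neg, [75, 25], axis=0)
    avg_iqr = (q75 - q25).mean()
    return int(mask.sum()), float(avg_std), float(avg_iqr)

X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
n_rel_neg, avg_std, avg_iqr = homogeneity_sketch(X, [1, 1, 2, 0, 0])
print(n_rel_neg, round(avg_std, 4), round(avg_iqr, 4))  # 2 0.175 0.175
```

With only two identified negatives, both measures reduce to half the per-feature range, which is why the two columns coincide in the small examples.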

See also

dPULearnPlot.eval(): the respective plotting method.
dPULearn: Learning From Unbalanced Data for details on different evaluation strategies.

Examples

Create a small example dataset for dPULearn containing positive (1) and unlabeled (2) samples as well as identified negatives (0):

import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
# Three different sets of labels
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]

Use the dPULearn().eval() method to obtain the evaluation for each label set:

dpul = aa.dPULearn()
df_eval = dpul.eval(X, list_labels=list_labels)
aa.display_df(df_eval)

	name	n_rel_neg	avg_STD	avg_IQR	avg_abs_AUC_pos	avg_abs_AUC_unl
1	Set 1	2	0.175000	0.175000	0.437500	0.250000
2	Set 2	2	0.187500	0.187500	0.500000	0.250000
3	Set 3	2	0.037500	0.037500	0.437500	0.500000

The dataset names in the ‘name’ column can be customized, typically using the name of the identification method, e.g., ‘euclidean’ for Euclidean distance-based identification. This is achieved by setting names_datasets:

names_datasets = ["Dataset 1", "Dataset 2", "Dataset 3"]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets)
aa.display_df(df_eval)

	name	n_rel_neg	avg_STD	avg_IQR	avg_abs_AUC_pos	avg_abs_AUC_unl
1	Dataset 1	2	0.175000	0.175000	0.437500	0.250000
2	Dataset 2	2	0.187500	0.187500	0.500000	0.250000
3	Dataset 3	2	0.037500	0.037500	0.437500	0.500000

The df_eval DataFrame provides two categories of quality measures:

Homogeneity Within Negatives: Measured by ‘avg_STD’ and ‘avg_IQR’, indicating the uniformity and spread of identified negatives.
Dissimilarity With Other Groups: Represented here by ‘avg_abs_AUC_pos/unl’, comparing identified negatives with positives (‘pos’, label 1) and unlabeled samples (‘unl’, label 2).
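
To make the dissimilarity measure concrete, the following sketch computes an average absolute AUC in pure Python. It rests on an assumption that is not stated on this page: that the "absolute AUC" of a feature is |AUC - 0.5|, where AUC is the probability that a value from the identified negatives exceeds a value from the comparison group (ties counted as 0.5). The function names are hypothetical and this is not the library's code, but under this assumption the sketch reproduces the avg_abs_AUC_pos and avg_abs_AUC_unl values shown for the first label set.

```python
def abs_adjusted_auc(values_a, values_b):
    """|AUC - 0.5| for one feature; AUC = P(a > b) with ties counted as 0.5."""
    wins = 0.0
    for a in values_a:
        for b in values_b:
            if a > b:
                wins += 1.0
            elif a == b:
                wins += 0.5
    auc = wins / (len(values_a) * len(values_b))
    return abs(auc - 0.5)

def avg_abs_auc_sketch(X, labels, label_other, label_neg=0):
    """Average the per-feature |AUC - 0.5| between identified negatives and another group."""
    rows_neg = [row for row, lab in zip(X, labels) if lab == label_neg]
    rows_other = [row for row, lab in zip(X, labels) if lab == label_other]
    n_features = len(X[0])
    scores = [
        abs_adjusted_auc([r[j] for r in rows_neg], [r[j] for r in rows_other])
        for j in range(n_features)
    ]
    return sum(scores) / n_features

X = [[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]]
labels = [1, 1, 2, 0, 0]  # first label set from the example above
auc_pos = avg_abs_auc_sketch(X, labels, label_other=1)
auc_unl = avg_abs_auc_sketch(X, labels, label_other=2)
print(auc_pos, auc_unl)  # 0.4375 0.25
```

A value of 0 means the two groups are indistinguishable on average for that feature, while 0.5 means they are perfectly separated.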

For a more comprehensive analysis, include X_neg as a feature matrix of ground-truth negatives to assess their dissimilarity with the identified negatives:

X_neg = [[0.5, 0.8], [0.4, 0.4]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg)
aa.display_df(df_eval)

	name	n_rel_neg	avg_STD	avg_IQR	avg_abs_AUC_pos	avg_abs_AUC_unl	avg_abs_AUC_neg
1	Dataset 1	2	0.175000	0.175000	0.437500	0.250000	0.187500
2	Dataset 2	2	0.187500	0.187500	0.500000	0.250000	0.187500
3	Dataset 3	2	0.037500	0.037500	0.437500	0.500000	0.500000
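
One way to read such a table is to rank the label sets per measure. The sketch below rebuilds a subset of the columns from the output above as a plain pandas DataFrame (so it runs without aaanalysis) and adds two hypothetical helper columns, rank_homogeneity and rank_dissim_pos. The ranking scheme is an illustrative assumption and not part of dPULearn.eval.

```python
import pandas as pd

# Values copied from the evaluation output above
df_eval = pd.DataFrame({
    "name": ["Dataset 1", "Dataset 2", "Dataset 3"],
    "avg_STD": [0.1750, 0.1875, 0.0375],
    "avg_abs_AUC_pos": [0.4375, 0.5000, 0.4375],
    "avg_abs_AUC_neg": [0.1875, 0.1875, 0.5000],
})

# Rank 1 = most homogeneous identified negatives (lowest avg_STD)
df_eval["rank_homogeneity"] = df_eval["avg_STD"].rank(method="min").astype(int)
# Rank 1 = most dissimilar to the positives (highest avg_abs_AUC_pos)
df_eval["rank_dissim_pos"] = (
    df_eval["avg_abs_AUC_pos"].rank(ascending=False, method="min").astype(int)
)

most_homogeneous = df_eval.sort_values("avg_STD").iloc[0]["name"]
print(most_homogeneous)  # Dataset 3
```

How the individual measures should be weighted against each other depends on the application, so the ranks are best inspected side by side rather than collapsed into a single score.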

If the variance within the data is high enough, the Kullback-Leibler Divergence (KLD) can be computed to assess the dissimilarity of distributions between the identified negatives and the other groups:

# Extend the unlabeled group by one sample to fulfill variance requirements
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8]])
list_labels = [[1, 1, 2, 0, 0, 2], [1, 1, 0, 2, 0, 2], [1, 1, 0, 0, 2, 2]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg, comp_kld=True)
aa.display_df(df_eval)

	name	n_rel_neg	avg_STD	avg_IQR	avg_abs_AUC_pos	avg_KLD_pos	avg_abs_AUC_unl	avg_KLD_unl	avg_abs_AUC_neg	avg_KLD_neg
1	Dataset 1	2	0.175000	0.175000	0.437500	1.414400	0.125000	0.003100	0.187500	0.181300
2	Dataset 2	2	0.187500	0.187500	0.500000	1.366900	0.125000	0.003300	0.187500	0.104100
3	Dataset 3	2	0.037500	0.037500	0.437500	1.016800	0.500000	30.317900	0.500000	12.020200

The n_jobs parameter sets the number of CPU cores (>=1) used for multiprocessing. The results below use the extended X and list_labels from the previous example, so avg_abs_AUC_unl differs from the first tables:

# n_jobs sets the number of CPU cores used for the per-feature evaluation
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, n_jobs=1)
aa.display_df(df_eval)

	name	n_rel_neg	avg_STD	avg_IQR	avg_abs_AUC_pos	avg_abs_AUC_unl
1	Dataset 1	2	0.175000	0.175000	0.437500	0.125000
2	Dataset 2	2	0.187500	0.187500	0.500000	0.125000
3	Dataset 3	2	0.037500	0.037500	0.437500	0.500000