dPULearn.eval

static dPULearn.eval(X, list_labels, names_datasets=None, X_neg=None, comp_kld=False, n_jobs=None)

Evaluates the quality of different sets of identified negatives.

Quality is assessed in two categories:

  • Homogeneity within the reliably identified negatives (0)

  • Dissimilarity between the reliably identified negatives and the groups of positive samples (‘pos’), unlabeled samples (‘unl’), and a ground-truth negative (‘neg’) sample group if provided by X_neg

Added in version 0.1.0.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • list_labels (array-like, shape (n_datasets, n_samples)) – List of arrays with dataset labels for samples in X obtained by the dPULearn.fit() method. Label values should be either 0 (identified negative), 1 (positive) or 2 (unlabeled).

  • names_datasets (list, optional) – List of dataset names corresponding to list_labels.

  • X_neg (array-like, shape (n_samples_neg, n_features), optional) – Feature matrix where n_samples_neg is the number of ground-truth negative samples and n_features is the number of features. Features must correspond to those in X.

  • comp_kld (bool, default=False) – Whether to compute the Kullback-Leibler Divergence (KLD) to assess the distribution alignment between identified negatives and other data groups. Disable (False) if X is sparse or has low covariance.

  • n_jobs (int, None, or -1, default=None) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

df_eval – Evaluation results for each set of identified negatives from list_labels. For each set, statistical measures are averaged across all features.

Return type:

pd.DataFrame

Notes

df_eval includes the following columns:

  • ‘name’: Name of the dataset if names_datasets is provided (typically named by identification approach).

  • ‘n_rel_neg’: Number of identified negatives.

  • ‘avg_STD’: Average standard deviation (STD) assessing homogeneity of identified negatives. Lower values indicate greater homogeneity.

  • ‘avg_IQR’: Average interquartile range (IQR) assessing homogeneity of identified negatives. Lower values suggest greater homogeneity.

  • ‘avg_abs_AUC_pos’ / ‘avg_abs_AUC_unl’ / ‘avg_abs_AUC_neg’: Average absolute area under the curve (AUC) assessing the dissimilarity between the set of identified negatives and each other group (positives, unlabeled, ground-truth negatives). Higher values indicate greater dissimilarity.

  • ‘avg_KLD_pos’ / ‘avg_KLD_unl’ / ‘avg_KLD_neg’: Average Kullback-Leibler Divergence (KLD) assessing the dissimilarity of distributions between the set of identified negatives and each other group. Higher values indicate greater dissimilarity. These columns are omitted if comp_kld is set to False.

Examples

Create a small example dataset for dPULearn containing positive (1) and unlabeled (2) samples together with the identified negatives (0):

import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
# Three different sets of labels
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
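Before evaluation, each label set can be sanity-checked against the constraints stated for list_labels (one label per row of X, values restricted to 0, 1, or 2). The sketch below is a minimal illustration; check_list_labels is a hypothetical helper written for this example and is not part of aaanalysis.

```python
import numpy as np

def check_list_labels(X, list_labels):
    """Hypothetical helper (not part of aaanalysis): validate label sets
    and return the number of identified negatives (label 0) per set."""
    X = np.asarray(X, dtype=float)
    n_samples = X.shape[0]
    counts = []
    for i, labels in enumerate(list_labels):
        labels = np.asarray(labels)
        # Each label set needs exactly one label per sample in X
        if labels.shape != (n_samples,):
            raise ValueError(
                f"Label set {i} has shape {labels.shape}, expected ({n_samples},)"
            )
        # Only 0 (identified negative), 1 (positive), 2 (unlabeled) are allowed
        invalid = set(np.unique(labels).tolist()) - {0, 1, 2}
        if invalid:
            raise ValueError(f"Label set {i} contains invalid labels: {sorted(invalid)}")
        counts.append(int(np.sum(labels == 0)))
    return counts

X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
n_rel_neg = check_list_labels(X, list_labels)
print(n_rel_neg)  # [2, 2, 2]
```

The returned counts correspond to what the ‘n_rel_neg’ column reports for each label set.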

Use the dPULearn().eval() method to obtain the evaluation for each label set:

dpul = aa.dPULearn()
df_eval = dpul.eval(X, list_labels=list_labels)
aa.display_df(df_eval)
   name   n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
1  Set 1  2          0.175000  0.175000  0.437500         0.250000
2  Set 2  2          0.187500  0.187500  0.500000         0.250000
3  Set 3  2          0.037500  0.037500  0.437500         0.500000

The dataset names given in the ‘name’ column can be customized, typically using the name of the identification method (e.g., ‘euclidean’ for a Euclidean distance-based approach). This is done by setting names_datasets:

names_datasets = ["Dataset 1", "Dataset 2", "Dataset 3"]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets)
aa.display_df(df_eval)
   name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000
2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000
3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000

The df_eval DataFrame provides two categories of quality measures:

  1. Homogeneity Within Negatives: Measured by ‘avg_STD’ and ‘avg_IQR’, indicating the uniformity and spread of identified negatives.

  2. Dissimilarity With Other Groups: Represented here by ‘avg_abs_AUC_pos/unl’, comparing identified negatives with positives (‘pos’, label 1) and unlabeled samples (‘unl’, label 2).
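To make the homogeneity measures concrete, the sketch below computes a per-feature standard deviation and interquartile range over the identified negatives and averages them across features. It assumes a population standard deviation (ddof=0) and NumPy's default linear percentile interpolation; these choices are assumptions made for this example rather than a statement about the library's internals. homogeneity_of_negatives is a hypothetical helper name.

```python
import numpy as np

def homogeneity_of_negatives(X, labels, label_neg=0):
    """Illustrative helper (not part of aaanalysis): average per-feature
    STD and IQR over the samples labeled as identified negatives."""
    X = np.asarray(X, dtype=float)
    X_rel_neg = X[np.asarray(labels) == label_neg]
    # Population standard deviation (ddof=0) for each feature
    std_per_feature = X_rel_neg.std(axis=0)
    # Interquartile range for each feature (NumPy default: linear interpolation)
    q75, q25 = np.percentile(X_rel_neg, [75, 25], axis=0)
    iqr_per_feature = q75 - q25
    # Average both measures across all features
    return float(std_per_feature.mean()), float(iqr_per_feature.mean())

X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
results = [homogeneity_of_negatives(X, labels) for labels in list_labels]
for i, (avg_std, avg_iqr) in enumerate(results, start=1):
    print(f"Set {i}: avg_STD={avg_std:.4f}, avg_IQR={avg_iqr:.4f}")
```

Under these assumptions, the sketch yields 0.1750, 0.1875, and 0.0375 for the three label sets, in line with the ‘avg_STD’ and ‘avg_IQR’ columns shown for this toy dataset. With only two negatives per set, both measures coincide.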

For a more comprehensive analysis, include X_neg as a feature matrix of ground-truth negatives to assess their dissimilarity with the identified negatives:

X_neg = [[0.5, 0.8], [0.4, 0.4]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg)
aa.display_df(df_eval)
   name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl  avg_abs_AUC_neg
1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000         0.187500
2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000         0.187500
3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000         0.500000
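The dissimilarity columns can be illustrated with a small sketch. It assumes that the AUC is computed per feature as the probability that a value from the identified negatives exceeds a value from the comparison group (ties counted as 0.5), shifted by 0.5 so that it lies in [-0.5, 0.5], and that the absolute values are then averaged across features. This reading is an assumption chosen for this example, not a confirmed description of the library's implementation; avg_abs_adjusted_auc is a hypothetical helper name.

```python
import numpy as np

def avg_abs_adjusted_auc(X_rel_neg, X_other):
    """Illustrative helper (not part of aaanalysis): average absolute
    adjusted AUC (AUC - 0.5) across features, via pairwise comparisons."""
    X_rel_neg = np.asarray(X_rel_neg, dtype=float)
    X_other = np.asarray(X_other, dtype=float)
    scores = []
    # Iterate over features (columns) of both groups in parallel
    for a, b in zip(X_rel_neg.T, X_other.T):
        diff = a[:, None] - b[None, :]  # all pairwise differences
        # Fraction of pairs where the negative is larger; ties count half
        auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
        scores.append(abs(auc - 0.5))
    return float(np.mean(scores))

X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels = np.array([1, 1, 2, 0, 0])  # first label set
X_neg = np.array([[0.5, 0.8], [0.4, 0.4]])

X_rel_neg = X[labels == 0]
auc_pos = avg_abs_adjusted_auc(X_rel_neg, X[labels == 1])
auc_unl = avg_abs_adjusted_auc(X_rel_neg, X[labels == 2])
auc_neg = avg_abs_adjusted_auc(X_rel_neg, X_neg)
print(auc_pos, auc_unl, auc_neg)  # 0.4375 0.25 0.1875
```

For the first label set, these values agree with the ‘avg_abs_AUC_pos’, ‘avg_abs_AUC_unl’, and ‘avg_abs_AUC_neg’ entries reported for Dataset 1. A value of 0.5 means the two groups are perfectly separated in every feature, while 0 means they are indistinguishable by rank.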

If the variance within the data is high enough, the Kullback-Leibler Divergence (KLD) can be computed to assess the dissimilarity of distributions between the identified negatives and the other groups:

# Extend the unlabeled group by one sample to fulfill variance requirements
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8]])
list_labels = [[1, 1, 2, 0, 0, 2], [1, 1, 0, 2, 0, 2], [1, 1, 0, 0, 2, 2]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg, comp_kld=True)
aa.display_df(df_eval)
   name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_KLD_pos  avg_abs_AUC_unl  avg_KLD_unl  avg_abs_AUC_neg  avg_KLD_neg
1  Dataset 1  2          0.175000  0.175000  0.437500         1.414400     0.125000         0.003100     0.187500         0.181300
2  Dataset 2  2          0.187500  0.187500  0.500000         1.366900     0.125000         0.003300     0.187500         0.104100
3  Dataset 3  2          0.037500  0.037500  0.437500         1.016800     0.500000         30.317900    0.500000         12.020200
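For reference, the standard definition of the Kullback-Leibler divergence between two continuous densities p and q is given below. Which group takes the role of p and which of q, and how the densities are estimated from the samples, is not stated in this section, so both are left open here.

```latex
D_{\mathrm{KL}}(p \,\|\, q) = \int p(x)\,\log\frac{p(x)}{q(x)}\,\mathrm{d}x
```

The divergence is non-negative, zero only when the two densities coincide, and not symmetric in p and q. It grows large when q assigns very little density to regions where p has substantial mass, which is one way values far above 1 can arise for well-separated groups.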

Further parameters: dPULearn.eval also accepts n_jobs, the number of CPU cores (>=1) used for multiprocessing.