aaanalysis.dPULearn.eval

static dPULearn.eval(X, list_labels=None, names_datasets=None, X_neg=None, comp_kld=False, n_jobs=None)

Evaluates the quality of different sets of identified negatives.

Quality is assessed using two groups of measures:

  • Homogeneity within the reliably identified negatives (label 0)

  • Dissimilarity between the reliably identified negatives and the groups of positive samples (‘pos’), unlabeled samples (‘unl’), and a ground-truth negative (‘neg’) sample group if provided by X_neg

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • list_labels (array-like, shape (n_datasets, n_samples)) – List of arrays with dataset labels for samples in X obtained by the dPULearn.fit() method. Label values should be either 0 (identified negative), 1 (positive) or 2 (unlabeled).

  • names_datasets (list, optional) – List of dataset names corresponding to list_labels.

  • X_neg (array-like, shape (n_samples_neg, n_features), optional) – Feature matrix of ground-truth negatives, where n_samples_neg is the number of ground-truth negative samples and n_features is the number of features. Features must correspond to those in X.

  • comp_kld (bool, default=False) – Whether to compute the Kullback-Leibler Divergence (KLD) to assess the distribution alignment between identified negatives and the other data groups. Leave disabled (False) if X is sparse or has low covariance.

  • n_jobs (int, None, or -1, default=None) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
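The shape and value constraints on list_labels can be checked before calling eval(). The following is a minimal sketch only; check_list_labels is a hypothetical helper written for illustration and is not part of aaanalysis:

```python
import numpy as np

def check_list_labels(X, list_labels, allowed=(0, 1, 2)):
    """Hypothetical helper: verify that list_labels has shape
    (n_datasets, n_samples) and contains only the labels 0, 1, and 2."""
    X = np.asarray(X)
    list_labels = np.asarray(list_labels)
    if list_labels.ndim != 2 or list_labels.shape[1] != X.shape[0]:
        raise ValueError(
            f"list_labels must have shape (n_datasets, {X.shape[0]}), "
            f"got {list_labels.shape}"
        )
    unexpected = set(np.unique(list_labels).tolist()) - set(allowed)
    if unexpected:
        raise ValueError(f"Unexpected label values: {sorted(unexpected)}")
    return list_labels

X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]
checked = check_list_labels(X, list_labels)
print(checked.shape)  # (3, 5)
```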

Returns:

df_eval – Evaluation results for each set of identified negatives from list_labels. For each set, the statistical measures are averaged across all features.

Return type:

pd.DataFrame

Notes

df_eval includes the following columns:

  • ‘name’: Name of the dataset if names_datasets is provided (typically named by identification approach).

  • ‘n_rel_neg’: Number of identified negatives.

  • ‘avg_STD’: Average standard deviation (STD) assessing homogeneity of identified negatives. Lower values indicate greater homogeneity.

  • ‘avg_IQR’: Average interquartile range (IQR) assessing homogeneity of identified negatives. Lower values indicate greater homogeneity.

  • ‘avg_abs_AUC_DATASET’: Average absolute area under the curve (AUC) assessing the dissimilarity between the set of identified negatives and the other groups (positives, unlabeled, ground-truth negatives). A separate column is provided for each comparison, with DATASET replaced by ‘pos’, ‘unl’, or ‘neg’. Higher values indicate greater dissimilarity.

  • ‘avg_KLD_DATASET’: Average Kullback-Leibler Divergence (KLD) assessing the dissimilarity of distributions between the set of identified negatives and the other groups. Higher values indicate greater dissimilarity. These columns are omitted if comp_kld is set to False.
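The two homogeneity measures can be illustrated by hand. The sketch below assumes a population standard deviation (ddof=0) and NumPy's default linear percentile interpolation. This section does not state which estimators are used internally, but under these assumptions the result agrees with the avg_STD and avg_IQR values reported for the first label set in the Examples. The function name homogeneity is hypothetical:

```python
import numpy as np

# Feature matrix and first label set from the Examples section
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels = np.array([1, 1, 2, 0, 0])

def homogeneity(X, labels, label_neg=0):
    """Hypothetical helper: per-feature STD and IQR of the identified
    negatives, each averaged across features."""
    X_rel_neg = X[np.asarray(labels) == label_neg]
    std_per_feature = X_rel_neg.std(axis=0)  # assumption: population STD (ddof=0)
    q75, q25 = np.percentile(X_rel_neg, [75, 25], axis=0)
    return float(std_per_feature.mean()), float((q75 - q25).mean())

avg_std, avg_iqr = homogeneity(X, labels)
print(round(avg_std, 4), round(avg_iqr, 4))  # 0.175 0.175
```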

Examples

Create a small example dataset for dPULearn containing positive (1) and unlabeled (2) samples as well as identified negatives (0):

import aaanalysis as aa
import pandas as pd
import numpy as np
aa.options["verbose"] = False
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
# Three different sets of labels
list_labels = [[1, 1, 2, 0, 0], [1, 1, 0, 2, 0], [1, 1, 0, 0, 2]]

Use the dPULearn().eval() method to obtain the evaluation for each label set:

dpul = aa.dPULearn()
df_eval = dpul.eval(X, list_labels=list_labels)
aa.display_df(df_eval)
   name   n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
1  Set 1  2          0.175000  0.175000  0.437500         0.250000
2  Set 2  2          0.187500  0.187500  0.500000         0.250000
3  Set 3  2          0.037500  0.037500  0.437500         0.500000

The dataset names given in the ‘name’ column can be customized, typically using the name of the identification method (e.g., ‘euclidean’ for a Euclidean distance-based approach). To do so, set names_datasets:

names_datasets = ["Dataset 1", "Dataset 2", "Dataset 3"]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets)
aa.display_df(df_eval)
   name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl
1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000
2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000
3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000

The df_eval DataFrame provides two categories of quality measures:

  1. Homogeneity Within Negatives: Measured by ‘avg_STD’ and ‘avg_IQR’, indicating the uniformity and spread of identified negatives.

  2. Dissimilarity With Other Groups: Represented here by ‘avg_abs_AUC_pos/unl’, comparing identified negatives with positives (‘pos’, label 1) and unlabeled samples (‘unl’, label 2).
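The dissimilarity measure can be sketched in the same way. The code below assumes a rank-based AUC per feature (the probability that a value from the identified negatives exceeds a value from the comparison group, with ties counted as 0.5), shifted by 0.5 and averaged as an absolute value across features. The computation is not spelled out in this section, so this is an assumption, although it agrees with the ‘avg_abs_AUC_pos’ and ‘avg_abs_AUC_unl’ values of the first label set. The function names are hypothetical:

```python
import numpy as np

def auc_per_feature(a, b):
    """Probability that a value from a exceeds a value from b (ties count 0.5)."""
    a = np.asarray(a, dtype=float)[:, None]
    b = np.asarray(b, dtype=float)[None, :]
    return float(np.mean((a > b) + 0.5 * (a == b)))

def avg_abs_auc(X_rel_neg, X_other):
    """Hypothetical helper: mean over features of |AUC - 0.5|."""
    X_rel_neg = np.asarray(X_rel_neg, dtype=float)
    X_other = np.asarray(X_other, dtype=float)
    adjusted = [
        auc_per_feature(X_rel_neg[:, j], X_other[:, j]) - 0.5
        for j in range(X_rel_neg.shape[1])
    ]
    return float(np.mean(np.abs(adjusted)))

X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7]])
labels = np.array([1, 1, 2, 0, 0])  # first label set

auc_pos = avg_abs_auc(X[labels == 0], X[labels == 1])
auc_unl = avg_abs_auc(X[labels == 0], X[labels == 2])
print(auc_pos, auc_unl)  # 0.4375 0.25
```

A value of 0 means the two groups are indistinguishable on average, while 0.5 means they are perfectly separated in every feature.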

For a more comprehensive analysis, include X_neg as a feature matrix of ground-truth negatives to assess their dissimilarity with the identified negatives:

X_neg = [[0.5, 0.8], [0.4, 0.4]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg)
aa.display_df(df_eval)
   name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_abs_AUC_unl  avg_abs_AUC_neg
1  Dataset 1  2          0.175000  0.175000  0.437500         0.250000         0.187500
2  Dataset 2  2          0.187500  0.187500  0.500000         0.250000         0.187500
3  Dataset 3  2          0.037500  0.037500  0.437500         0.500000         0.500000

If the variance within the data is high enough, the Kullback-Leibler Divergence (KLD) can be computed to assess the dissimilarity of distributions between the identified negatives and the other groups:

# Extend the unlabeled group by one sample to fulfill variance requirements
X = np.array([[0.2, 0.1], [0.1, 0.15], [0.25, 0.2], [0.2, 0.3], [0.5, 0.7], [0.6, 0.8]])
list_labels = [[1, 1, 2, 0, 0, 2], [1, 1, 0, 2, 0, 2], [1, 1, 0, 0, 2, 2]]
df_eval = dpul.eval(X, list_labels=list_labels, names_datasets=names_datasets, X_neg=X_neg, comp_kld=True)
aa.display_df(df_eval)
   name       n_rel_neg  avg_STD   avg_IQR   avg_abs_AUC_pos  avg_KLD_pos  avg_abs_AUC_unl  avg_KLD_unl  avg_abs_AUC_neg  avg_KLD_neg
1  Dataset 1  2          0.175000  0.175000  0.437500         1.414400     0.125000         0.003100     0.187500         0.181300
2  Dataset 2  2          0.187500  0.187500  0.500000         1.366900     0.125000         0.003300     0.187500         0.104100
3  Dataset 3  2          0.037500  0.037500  0.437500         1.016800     0.500000         30.317900    0.500000         12.020200
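
This section does not specify how dPULearn.eval() estimates the underlying distributions, so the following only sketches the quantity itself: a KLD between two smoothed histograms of a single feature. The helper kld_histogram, its bins and eps parameters, and the sample values are all illustrative assumptions, and its results are not expected to match the ‘avg_KLD_*’ columns above:

```python
import numpy as np

def kld_histogram(a, b, bins=10, eps=1e-9):
    """Hypothetical helper: KL divergence D(P_a || P_b) between histogram
    estimates of two 1D samples on a shared grid (eps avoids log(0))."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    edges = np.linspace(min(a.min(), b.min()), max(a.max(), b.max()), bins + 1)
    p, _ = np.histogram(a, bins=edges)
    q, _ = np.histogram(b, bins=edges)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

# Illustrative values only (not taken from the example dataset)
same = kld_histogram([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])
different = kld_histogram([0.1, 0.2, 0.3], [0.7, 0.8, 0.9])
print(same, different)  # identical samples give 0; disjoint samples give a large value
```

As with the ‘avg_KLD_*’ columns, larger values indicate that the two distributions overlap less.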