comp_auc_adjusted

comp_auc_adjusted(X, labels, label_test=1, label_ref=0, n_jobs=None)

Compute an adjusted Area Under the Curve (AUC), ranging from -0.5 to 0.5, that assesses the similarity between two groups.

Introduced in [Breimann25], this adjusted AUC (denoted ‘AUC*’) is computed for each feature in the dataset X, comparing the two groups specified by labels. It is based on the AUC, a non-parametric measure of the difference between two groups. The adjustment subtracts 0.5 from the AUC, so AUC* ranges from -0.5 to 0.5. An AUC* of 0 indicates that the feature is equally distributed in both groups. This measure is useful for ranking features by their ability to distinguish between the two groups.
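The relationship between AUC and AUC* can be illustrated with a minimal NumPy sketch for a single feature. It assumes the underlying AUC is the usual rank-based probability that a test value exceeds a reference value (ties counted as one half). The helper auc_adjusted_1d is hypothetical and is not the implementation used by comp_auc_adjusted:

```python
import numpy as np

def auc_adjusted_1d(x_test, x_ref):
    """Illustrative AUC* for one feature: P(test > ref) + 0.5 * P(tie) - 0.5."""
    x_test = np.asarray(x_test, dtype=float)
    x_ref = np.asarray(x_ref, dtype=float)
    # Compare every test value with every reference value (n_test x n_ref matrix)
    diff = x_test[:, None] - x_ref[None, :]
    auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
    # Shift so that 0 means "no difference" instead of 0.5
    return auc - 0.5

print(auc_adjusted_1d([3, 4, 5], [0, 1, 2]))  # 0.5  (all test values greater)
print(auc_adjusted_1d([0, 1, 2], [3, 4, 5]))  # -0.5 (all test values smaller)
print(auc_adjusted_1d([1, 2, 3], [1, 2, 3]))  # 0.0  (identical groups)
```

The pairwise comparison is quadratic in the number of samples, so it is only meant to make the definition concrete for small inputs.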

Added in version 1.0.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. ‘Rows’ typically correspond to proteins and ‘columns’ to features.
labels (array-like, shape (n_samples,)) – Dataset labels of samples in X. Should contain only two different integer label values, representing test and reference group (typically, 1 and 0).
label_test (int, default=1) – Class label of test group in labels.
label_ref (int, default=0) – Class label of reference group in labels.
n_jobs (int, None, or -1, default=None) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.

Returns:

auc – Array with AUC* values for each feature, ranging from -0.5 to 0.5. A value of 0 indicates equal distributions between the two groups for that feature.

Return type:

array-like, shape (n_features,)
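Because the sign of AUC* only encodes which group has the larger values, one plausible way to rank features by discriminative power is to sort the returned array by absolute value. The sketch below uses made-up AUC* values (not computed from real data) in place of the output of comp_auc_adjusted:

```python
import numpy as np

# Hypothetical AUC* values for five features (made up for illustration)
auc = np.array([0.02, -0.41, 0.33, -0.05, 0.48])

# Sort feature indices by |AUC*| in descending order
order = np.argsort(-np.abs(auc))

for rank, idx in enumerate(order, start=1):
    # Positive AUC*: test values tend to be greater; negative: reference values do
    direction = "higher in test" if auc[idx] > 0 else "higher in reference"
    print(f"{rank}. feature {idx}: AUC* = {auc[idx]:+.2f} ({direction})")
```

Features with AUC* near 0 end up at the bottom of the ranking, regardless of sign.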

Examples

You can compare the similarity of two distributions (here two normal distributions, group_test and group_ref) using the adjusted Area Under the Curve (AUC*) measure, which ranges from -0.5 to 0.5, as introduced in [Breimann25]. Provide only the feature matrix X and the corresponding group labels to the comp_auc_adjusted function:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import aaanalysis as aa
# Generate random data for two groups
group_test = np.random.normal(-2, 0.5, 1000)  # Mean = -2, Std = 0.5, 1000 samples
group_ref = np.random.normal(2, 0.5, 1000)  # Mean = 2, Std = 0.5, 1000 samples

# Combine data into a single dataset and reshape it
X = np.hstack([group_test, group_ref]).reshape(-1, 1)  # Reshape to 2D array
labels = np.array([1]*1000 + [0]*1000)
auc_score = aa.comp_auc_adjusted(X=X, labels=labels)[0]

# Plot
aa.plot_settings()
sns.histplot(group_test, color="tab:red", kde=True, label='Test group', alpha=0.5)
sns.histplot(group_ref, color="tab:gray", kde=True, label='Reference group', alpha=0.5)
plt.title(f"AUC* = {auc_score} (All test samples are smaller)")
aa.plot_legend(dict_color=dict(Test="tab:red", Ref="tab:gray"), ncol=1, x=0.85, y=1)
sns.despine()
plt.show()

../_images/comp_auc_adjusted_1_output_1_0.png

The greater the overlap between both distributions, the closer the auc_score is to 0:

group_test = np.random.normal(-0.5, 0.5, 1000)
group_ref = np.random.normal(0.5, 0.5, 1000)
X = np.hstack([group_test, group_ref]).reshape(-1, 1)  # Reshape to 2D array
labels = np.array([1]*1000 + [0]*1000)
auc_score = aa.comp_auc_adjusted(X, labels)[0]

# Plot
aa.plot_settings()
sns.histplot(group_test, color="tab:red", kde=True, label='Test group', alpha=0.5)
sns.histplot(group_ref, color="tab:gray", kde=True, label='Reference group', alpha=0.5)
plt.title(f"AUC* = {auc_score} (Most test samples are smaller)")
sns.despine()
plt.show()

../_images/comp_auc_adjusted_2_output_3_0.png

An auc_score close to 0 indicates a near-perfect overlap:

group_test = np.random.normal(0, 0.5, 1000)
group_ref = np.random.normal(0, 0.5, 1000)
X = np.hstack([group_test, group_ref]).reshape(-1, 1)  # Reshape to 2D array
labels = np.array([1]*1000 + [0]*1000)
auc_score = aa.comp_auc_adjusted(X, labels)[0]

# Plot
aa.plot_settings()
sns.histplot(group_test, color="tab:red", kde=True, label='Test group', alpha=0.5)
sns.histplot(group_ref, color="tab:gray", kde=True, label='Reference group', alpha=0.5)
plt.title(f"AUC* = {auc_score} (Distributions are almost identical)")
sns.despine()
plt.show()

../_images/comp_auc_adjusted_3_output_5_0.png

If all values from the test group (label_test=1 by default) are greater than the values of the reference group, the auc_score is 0.5:

group_test = np.random.normal(2, 0.5, 1000)
group_ref = np.random.normal(-2, 0.5, 1000)
X = np.hstack([group_test, group_ref]).reshape(-1, 1)  # Reshape to 2D array
labels = np.array([1]*1000 + [0]*1000)
auc_score = aa.comp_auc_adjusted(X, labels)[0]

# Plot
aa.plot_settings()
sns.histplot(group_test, color="tab:red", kde=True, label='Test group', alpha=0.5)
sns.histplot(group_ref, color="tab:gray", kde=True, label='Reference group', alpha=0.5)
plt.title(f"AUC* = {auc_score} (All test samples are greater)")
sns.despine()
plt.show()

../_images/comp_auc_adjusted_4_output_7_0.png

By default, comp_auc_adjusted uses label_test=1 for the test group and label_ref=0 for the reference group; pass these parameters explicitly if labels uses a different encoding. n_jobs parallelizes the per-feature computation across CPU cores (-1 uses all available cores).

# Further parameters: name the test/reference classes in `labels` explicitly with
# label_test/label_ref, and parallelize the per-feature AUC* over CPU cores with n_jobs.
group_test = np.random.normal(-2, 0.5, 1000)
group_ref = np.random.normal(2, 0.5, 1000)
X = np.hstack([group_test, group_ref]).reshape(-1, 1)
labels = np.array([1]*1000 + [0]*1000)
auc_scores = aa.comp_auc_adjusted(X=X, labels=labels, label_test=1, label_ref=0, n_jobs=1)
print("AUC* =", round(float(auc_scores[0]), 3))

AUC* = -0.5
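Swapping the roles of test and reference group flips the sign of AUC*, which follows from the rank-based definition of the AUC. The sketch below checks this with a hypothetical NumPy stand-in (auc_adjusted_sketch, not part of aaanalysis) that mirrors the label_test and label_ref parameters and uses the label values 2 and 5 as an assumed non-default encoding:

```python
import numpy as np

def auc_adjusted_sketch(X, labels, label_test=1, label_ref=0):
    """Hypothetical stand-in: per-feature P(test > ref) + 0.5 * P(tie) - 0.5."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    X_test, X_ref = X[labels == label_test], X[labels == label_ref]
    # Pairwise differences, shape (n_test, n_ref, n_features)
    diff = X_test[:, None, :] - X_ref[None, :, :]
    n_pairs = X_test.shape[0] * X_ref.shape[0]
    n_greater = (diff > 0).sum(axis=(0, 1))
    n_ties = (diff == 0).sum(axis=(0, 1))
    return (n_greater + 0.5 * n_ties) / n_pairs - 0.5

rng = np.random.default_rng(0)
X = np.column_stack([
    np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)]),  # separated groups
    rng.normal(0, 0.5, 400),  # no group effect
])
labels = np.array([2] * 200 + [5] * 200)  # assumed encoding: 2 and 5 instead of 1 and 0

auc_a = auc_adjusted_sketch(X, labels, label_test=2, label_ref=5)
auc_b = auc_adjusted_sketch(X, labels, label_test=5, label_ref=2)
print(np.allclose(auc_a, -auc_b))  # True: swapping the groups flips the sign
```

In practice this means the choice of test group only affects the direction of the score, not its magnitude.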