aaanalysis.dPULearn.compare_sets_negatives

static dPULearn.compare_sets_negatives(list_labels=None, names_datasets=None, df_seq=None, remove_non_neg=True, return_upset_data=False)[source]

Create DataFrame for comparing sets of identified negatives.

Optionally, the data can be returned in a format suitable for Upset Plots, which are useful for visualizing the intersections and unique elements across these sets.

Parameters:
  • list_labels (array-like, shape (n_datasets,)) – List of dataset labels for samples in X obtained by the dPULearn.fit() method. Label values should be either 0 (identified negative), 1 (positive) or 2 (unlabeled). Must contain 0.

  • names_datasets (list, optional) – List of dataset names corresponding to list_labels.

  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with unique protein identifiers and a sequence column with full protein sequences, for the entries corresponding to the labels of list_labels.

  • remove_non_neg (bool, default=True) – If True, rows without an identified negative in any of the provided datasets are removed.

  • return_upset_data (bool, default=False) – Whether to return data formatted for an Upset Plot (if True) or a DataFrame for a general comparison of sets of negatives (if False).

Returns:

  • If return_upset_data=False (default): Returns a pd.DataFrame (df_neg_comp) that combines df_seq (if provided) with a comparison of the negative sets for a general analysis.

  • If return_upset_data=True: Returns a pd.Series (upset_data) formatted for generating Upset Plots, containing group size information for the intersection and unique elements across the label sets.

Return type:

pd.DataFrame or pd.Series

Examples

The dPULearn().compare_sets_negatives() method compares identified negatives across datasets. Given list_labels, in which identified negatives are labeled 0, it returns a DataFrame (typically named df_neg_comp) in which each row is a sample and each column is a dataset. Each cell indicates whether the sample was identified as a negative (True) or not (False) in the respective dataset:

import aaanalysis as aa
list_labels = [[1, 1, 0, 0, 2], [1, 1, 0, 2, 0], [1, 1, 2, 0, 0]]
dpul = aa.dPULearn()
df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels)
aa.display_df(df_neg_comp)
  Set 1 Set 2 Set 3
3 True True False
4 True False True
5 False True True

By default, only rows containing at least one identified negative are returned. To return all rows, set remove_non_neg=False:

df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels, remove_non_neg=False)
aa.display_df(df_neg_comp)
  Set 1 Set 2 Set 3
1 False False False
2 False False False
3 True True False
4 True False True
5 False True True
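The relationship between the two tables above can be reproduced with pandas alone. The following is a minimal sketch for illustration, not the library's implementation: the names df_all and df_neg_only are hypothetical, and the 1-based index is assumed from the printed outputs.

```python
import pandas as pd

list_labels = [[1, 1, 0, 0, 2], [1, 1, 0, 2, 0], [1, 1, 2, 0, 0]]

# One boolean column per dataset: True where the label is 0 (identified negative)
df_all = pd.DataFrame(
    {f"Set {i}": [label == 0 for label in labels]
     for i, labels in enumerate(list_labels, start=1)},
    index=range(1, len(list_labels[0]) + 1),  # 1-based index (assumption)
)

# Analogue of remove_non_neg=True: keep rows with at least one identified negative
df_neg_only = df_all[df_all.any(axis=1)]

print(df_all)       # all five samples, as with remove_non_neg=False
print(df_neg_only)  # only samples 3, 4, and 5
```

Under these assumptions, df_all corresponds to the remove_non_neg=False table and df_neg_only to the default output.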

Names of the datasets can be provided by the names_datasets argument:

names = ["Dataset 1", "Dataset 2", "Dataset 3"]
df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels, names_datasets=names)
aa.display_df(df_neg_comp)
  Dataset 1 Dataset 2 Dataset 3
3 True True False
4 True False True
5 False True True

A DataFrame with sequence information (df_seq), which must include an ‘entry’ column, can be provided. It is then merged with the df_neg_comp output DataFrame:

import pandas as pd
df_seq = pd.DataFrame([("entry1", "AA"), ("entry2", "BB"), ("entry3", "CC"), ("entry4", "DD"), ("entry5", "EE")], columns=["entry", "sequence"])
df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels, df_seq=df_seq)
aa.display_df(df_neg_comp)
  entry sequence Set 1 Set 2 Set 3
3 entry3 CC True True False
4 entry4 DD True False True
5 entry5 EE False True True
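The merged table above can be approximated by a column-wise concatenation in pandas. This sketch assumes that row i of df_seq corresponds to position i in each label list and that the result uses a 1-based index, as the printed output suggests. The names df_sets and df_merged are hypothetical, and the library may perform the merge differently.

```python
import pandas as pd

list_labels = [[1, 1, 0, 0, 2], [1, 1, 0, 2, 0], [1, 1, 2, 0, 0]]
df_seq = pd.DataFrame(
    [("entry1", "AA"), ("entry2", "BB"), ("entry3", "CC"),
     ("entry4", "DD"), ("entry5", "EE")],
    columns=["entry", "sequence"],
)

# Boolean membership table: True where a sample is an identified negative (label 0)
df_sets = pd.DataFrame(
    {f"Set {i}": [label == 0 for label in labels]
     for i, labels in enumerate(list_labels, start=1)}
)

# Place sequence information next to the set membership (positional alignment assumed)
df_merged = pd.concat([df_seq, df_sets], axis=1)
df_merged.index = range(1, len(df_merged) + 1)  # 1-based index (assumption)

# Keep only samples identified as negative in at least one dataset
df_merged = df_merged[df_sets.any(axis=1).values]
print(df_merged)
```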

Such overlaps are conveniently visualized using Venn diagrams, but Venn diagrams are limited to a maximum of three datasets. For comparing more than three datasets, an Upset Plot is a better choice. To facilitate this, set return_upset_data=True to generate a data structure directly compatible with Upset Plot visualizations:

from upsetplot import plot
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
list_labels = [[1, 1, 0, 2, 2], [1, 1, 2, 0, 0], [1, 1, 2, 0, 0], [1, 1, 0, 0, 0]]
upset_data = dpul.compare_sets_negatives(list_labels=list_labels, return_upset_data=True)
plot(upset_data, show_counts='%d')
plt.suptitle("Overlap of identified negatives in different datasets")
plt.show()
[Figure: Upset Plot titled "Overlap of identified negatives in different datasets"]
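To see what "group size information" means for the four label lists in the Upset Plot example, the sizes can be computed by hand with pandas. This is a sketch under assumptions: it produces a pd.Series with one boolean index level per dataset, which is the general shape the upsetplot package accepts, but the exact structure that compare_sets_negatives returns is not confirmed here. The names df_sets and group_sizes are hypothetical.

```python
import pandas as pd

list_labels = [[1, 1, 0, 2, 2], [1, 1, 2, 0, 0], [1, 1, 2, 0, 0], [1, 1, 0, 0, 0]]

# Boolean membership table: True where a sample is an identified negative (label 0)
df_sets = pd.DataFrame(
    {f"Set {i}": [label == 0 for label in labels]
     for i, labels in enumerate(list_labels, start=1)}
)

# Drop samples that are not an identified negative in any dataset
df_sets = df_sets[df_sets.any(axis=1)]

# Count the samples sharing each membership pattern; the result is a Series
# whose MultiIndex has one boolean level per dataset
group_sizes = df_sets.groupby(list(df_sets.columns)).size()
print(group_sizes)
```

In this sketch, two samples are negatives in Set 2, Set 3, and Set 4 only, and one sample is a negative in Set 1 and Set 4 only.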