aaanalysis.dPULearn.compare_sets_negatives
- static dPULearn.compare_sets_negatives(list_labels=None, names_datasets=None, df_seq=None, remove_non_neg=True, return_upset_data=False)[source]
Create DataFrame for comparing sets of identified negatives.
Optionally, data format can be created for Upset Plots, which are useful for visualizing the intersection and unique elements across these sets.
- Parameters:
list_labels (array-like, shape (n_datasets,)) – List of dataset labels for samples in
Xobtained by thedPULearn.fit()method. Label values should be either 0 (identified negative), 1 (positive) or 2 (unlabeled). Must contain 0.names_datasets (list, optional) – List of dataset names corresponding to
list_labels.df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an
entrycolumn with unique protein identifiers and asequencecolumn with full protein sequences, for the entries corresponding to thelabelsoflist_labels.remove_non_neg (bool, default=True) – If
True, all rows are removed that do not contain identified negatives in any provided dataset.return_upset_data (bool, default=False) – Whether to return a DataFrame for Upset Plot (if
True) or for a general comparison of sets of negatives.
- Returns:
If
return_upset_data=False(default): Returns a pd.DataFrame (df_neg_comp) that combinesdf_seq(if provided) with a comparison of the negative sets for a general analysis.If
return_upset_data=True: Returns a pd.Series DataFrame (upset_data) formatted for generating Upset Plots, containing group size information for the intersection and unique elements across the label sets.
- Return type:
pd.DataFrame or pd.Series
See also
dPULearn.fit()for details on how labels are generated.SequenceFeature.get_df_parts()for details on format ofdf_seq.Upset Plot documentation:
upsetplot.plot().
Examples
The
dPULearn().compare_sets_negatives()method facilitates the comparison of identified negative samples across datasets. Providing identified negatives represented by ‘0’ in thelist_labelsinput, it returns a DataFrame (typically nameddf_neg_comp) where each row is a sample and each column a dataset, indicating whether the sample is identified as a negative (True) or not (False) in the respective dataset:import aaanalysis as aa list_labels = [[1, 1, 0, 0, 2], [1, 1, 0, 2, 0], [1, 1, 2, 0, 0]] dpul = aa.dPULearn() df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels) aa.display_df(df_neg_comp)
Set 1 Set 2 Set 3 3 True True False 4 True False True 5 False True True By default, only rows containing at least one identified negative are returned. To return all rows, set
remove_non_neg=False:df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels, remove_non_neg=False) aa.display_df(df_neg_comp)
Set 1 Set 2 Set 3 1 False False False 2 False False False 3 True True False 4 True False True 5 False True True Names of the datasets can be provided by the
names_datasetsargument:names = ["Dataset 1", "Dataset 2", "Dataset 3"] df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels, names_datasets=names) aa.display_df(df_neg_comp)
Dataset 1 Dataset 2 Dataset 3 3 True True False 4 True False True 5 False True True A DataFrame with sequence information (
df_seq) and an required ‘entry’ column can be provdied, which is then merged with thedf_neg_compoutput DataFrame:import pandas as pd df_seq = pd.DataFrame([("entry1", "AA"), ("entry2", "BB"), ("entry3", "CC"), ("entry4", "DD"), ("entry5", "EE")], columns=["entry", "sequence"]) df_neg_comp = dpul.compare_sets_negatives(list_labels=list_labels, df_seq=df_seq) aa.display_df(df_neg_comp)entry sequence Set 1 Set 2 Set 3 3 entry3 CC True True False 4 entry4 DD True False True 5 entry5 EE False True True Such overlaps are conveniently visualized using Venn diagrams, but they are limited to a maximum of three datasets. For comparing more than three datasets, an Upset Plot is a better choice. To facilitate this, set
return_upset_data=Trueto generate a data structure directly compatible with the Upset Plot visualizations:from upsetplot import plot import matplotlib.pyplot as plt import warnings warnings.filterwarnings('ignore', category=FutureWarning) list_labels = [[1, 1, 0, 2, 2], [1, 1, 2, 0, 0], [1, 1, 2, 0, 0], [1, 1, 0, 0, 0]] upset_data = dpul.compare_sets_negatives(list_labels=list_labels, return_upset_data=True) plot(upset_data, show_counts='%d') plt.suptitle("Overlap of identified negatives in different datasets") plt.show()