aaanalysis.dPULearnPlot.pca

static dPULearnPlot.pca(df_pu=None, labels=None, figsize=(5, 5), pc_x=1, pc_y=2, show_pos_mean_x=True, show_pos_mean_y=True, colors=None, names=None, legend=True, legend_y=-0.15, kwargs_scatterplot=None)

Principal component analysis (PCA) plot for the set of identified negatives.

This method visualizes the differences between the set of identified negatives (labeled by 0) and the positive (1) and the unlabeled (2) sample groups. The selected principal components (PCs) represent a lower-dimensional feature space. Optionally, the average PC value for the positive samples can be shown, which was used for PCA-based identification of negatives.

Parameters:
  • df_pu (pd.DataFrame, shape (n_samples, pca_features)) – A DataFrame with the PCA-transformed features obtained from dPULearn.df_pu_.

  • figsize (tuple, default=(5, 5)) – Figure dimensions (width, height) in inches.

  • labels (array-like, shape (n_samples,)) – Dataset labels of samples in df_pu. Labels should contain 0 (identified negative) and 1 (positive). Unlabeled samples (2) can also be provided.

  • pc_x (int, default=1) – The number of the principal component (PC) to show on the x-axis.

  • pc_y (int, default=2) – The number of the principal component (PC) to show on the y-axis.

  • show_pos_mean_x (bool, default=True) – If True, the mean of the x-axis PC values across the positive sample group is shown on the plot.

  • show_pos_mean_y (bool, default=True) – If True, the mean of the y-axis PC values across the positive sample group is shown on the plot.

  • colors (list of str, optional) – List of colors for identified negatives (0), positive samples (1), and unlabeled samples (2).

  • names (list of str, optional) – List of dataset names for identified negatives, positive samples, and unlabeled samples.

  • legend (bool, default=True) – If True, the legend is shown below the plot.

  • legend_y (float, default=-0.15) – Vertical legend position relative to the plot's y-axis; applied if legend=True.

  • kwargs_scatterplot (dict, optional) – Dictionary with keyword arguments for adjusting scatter plot (matplotlib.pyplot.scatter()).

Returns:

ax – PCA plot axes object.

Return type:

plt.Axes

See also

Examples

To get insights into the identification process by the dPULearn().fit() method, you can create a Principal Component Analysis (PCA) plot for identified negative, positive, and unlabeled dataset groups. To this end, we load an example dataset and perform a PCA-based identification of negatives:

import matplotlib.pyplot as plt
import aaanalysis as aa
aa.options["verbose"] = False
# Dataset with positive (γ-secretase substrates)
# and unlabeled data (proteins with unknown substrate status)
df_seq = aa.load_dataset(name="DOM_GSEC_PU")
labels = df_seq["label"].to_numpy()
n_pos = sum([x == 1 for x in labels])
df_feat = aa.load_features(name="DOM_GSEC")
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
# PCA-based identification of 'n_pos' negatives
dpul = aa.dPULearn().fit(X, labels=labels, n_unl_to_neg=n_pos)
df_pu = dpul.df_pu_
labels = dpul.labels_
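
Before plotting, it can help to confirm that the labels contain the expected groups. The following is a minimal sketch using only the standard library; the helper name count_label_groups and the toy labels are illustrative assumptions and not part of aaanalysis:

```python
from collections import Counter

# Label convention taken from the parameter description of pca()
GROUP_NAMES = {0: "identified negative", 1: "positive", 2: "unlabeled"}


def count_label_groups(labels):
    """Count samples per group and check that labels 0 and 1 are present.

    Hypothetical helper for illustration; not part of aaanalysis.
    """
    counts = Counter(int(x) for x in labels)
    unknown = sorted(set(counts) - set(GROUP_NAMES))
    if unknown:
        raise ValueError(f"Unexpected label values: {unknown}")
    missing = [x for x in (0, 1) if x not in counts]
    if missing:
        raise ValueError(f"Labels should contain 0 and 1, missing: {missing}")
    return {GROUP_NAMES[k]: counts.get(k, 0) for k in GROUP_NAMES}


# Toy labels standing in for dpul.labels_
toy_labels = [1, 1, 0, 2, 2, 0, 2]
group_counts = count_label_groups(toy_labels)
print(group_counts)
```

With the real data, the same call could be made on dpul.labels_, since the helper converts each entry with int().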

To visualize all identified negatives within the compressed feature space represented by the first two Principal Components (PCs), you can use the dPULearnPlot().pca() method:

dpul_plot = aa.dPULearnPlot()
dpul_plot.pca(df_pu=df_pu, labels=labels)
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_1_output_3_0.png

The plot style can be easily adjusted using the aa.plot_settings() function:

aa.plot_settings(font_scale=0.8)
dpul_plot.pca(df_pu=df_pu, labels=labels)
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_2_output_5_0.png

The dashed lines indicate the mean values across the positive samples for PC1 and PC2. The unlabeled samples with the greatest distance from these means were identified as reliable negatives by dPULearn. This becomes clearer using boolean masks and the show_pos_mean_x and show_pos_mean_y parameters:

# Filter only positives and negatives selected based on PC1
mask1 = [x in ["PC1", None] for x in df_pu["selection_via"]]
mask2 = [x in [0, 1] for x in labels]
mask = [m1 and m2 for m1, m2 in zip(mask1, mask2)]
dpul_plot.pca(df_pu=df_pu[mask], labels=labels[mask], show_pos_mean_y=False)
plt.tight_layout()
plt.show()

# Filter only positives and negatives selected based on PC2
mask1 = [x in ["PC2", None] for x in df_pu["selection_via"]]
mask = [m1 and m2 for m1, m2 in zip(mask1, mask2)]
dpul_plot.pca(df_pu=df_pu[mask], labels=labels[mask], show_pos_mean_x=False, legend=False)
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_3_output_7_0.png ../_images/dpul_plot_pca_4_output_7_1.png
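
The selection idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the dPULearn implementation: it considers a single PC, uses plain Python lists with toy values, and the helper name farthest_unlabeled is hypothetical:

```python
from statistics import mean


def farthest_unlabeled(pc_values, labels, n_neg):
    """Pick the n_neg unlabeled samples (label 2) farthest from the positive mean.

    Simplified single-PC illustration; hypothetical helper, not dPULearn code.
    """
    # Mean PC value across the positive samples (label 1)
    pos_mean = mean(v for v, lab in zip(pc_values, labels) if lab == 1)
    # Absolute distance of each unlabeled sample to the positive mean
    dists = [
        (abs(v - pos_mean), i)
        for i, (v, lab) in enumerate(zip(pc_values, labels))
        if lab == 2
    ]
    # Largest distance first
    dists.sort(key=lambda t: t[0], reverse=True)
    return pos_mean, [i for _, i in dists[:n_neg]]


# Toy PC1 values: two positives followed by four unlabeled samples
toy_pc1 = [1.0, 3.0, -4.0, 2.5, 9.0, 1.5]
toy_labels = [1, 1, 2, 2, 2, 2]
pos_mean, neg_idx = farthest_unlabeled(toy_pc1, toy_labels, n_neg=2)
print(pos_mean, neg_idx)
```

In the plots, the dashed line corresponds to the positive mean, and the identified negatives are the points lying farthest from it along the respective axis.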

You can change the PCs shown on the x- and y-axis by providing integers to the pc_x and pc_y parameters:

# Filter only positives and negatives selected based on PC3
mask1 = [x in ["PC3", None] for x in df_pu["selection_via"]]
mask2 = [x in [0, 1] for x in labels]
mask = [m1 and m2 for m1, m2 in zip(mask1, mask2)]
dpul_plot.pca(df_pu=df_pu[mask], labels=labels[mask], pc_x=3, pc_y=4, show_pos_mean_y=False)
plt.tight_layout()
plt.show()

# Filter only positives and negatives selected based on PC4
mask1 = [x in ["PC4", None] for x in df_pu["selection_via"]]
mask2 = [x in [0, 1] for x in labels]
mask = [m1 and m2 for m1, m2 in zip(mask1, mask2)]
dpul_plot.pca(df_pu=df_pu[mask], labels=labels[mask], pc_x=3, pc_y=4, show_pos_mean_x=False, legend=False)
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_5_output_9_0.png ../_images/dpul_plot_pca_6_output_9_1.png

The colors and names lists must be aligned, that is, given in the same order (identified negatives, positive samples, unlabeled samples):

colors = ["r", "black", "b"]
names = ["Red group", "Black group", "Blue group"]
dpul_plot.pca(df_pu=df_pu, labels=labels, colors=colors, names=names)
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_7_output_11_0.png
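
The alignment can be checked before plotting by pairing both lists with the label values by position. This is a minimal sketch; the helper name align_groups is hypothetical and not part of aaanalysis, and it assumes the order 0, 1, 2 given in the parameter descriptions:

```python
def align_groups(colors, names, group_labels=(0, 1, 2)):
    """Pair each group label with its color and name by list position.

    Hypothetical helper for illustration; not part of aaanalysis.
    """
    if len(colors) != len(names):
        raise ValueError("'colors' and 'names' must have the same length")
    if len(colors) > len(group_labels):
        raise ValueError("More colors/names than group labels")
    return {
        lab: {"color": c, "name": n}
        for lab, c, n in zip(group_labels, colors, names)
    }


colors = ["r", "black", "b"]
names = ["Red group", "Black group", "Blue group"]
mapping = align_groups(colors, names)
for lab, entry in mapping.items():
    print(lab, entry["name"], entry["color"])
```

Printing the mapping makes it easy to spot a swapped entry, such as a name intended for the positives appearing next to label 0.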

The legend can be shifted along the y-axis using legend_y (default=-0.15), which is useful when figsize (default=(5, 5)) is changed:

dpul_plot.pca(df_pu=df_pu, labels=labels, figsize=(4, 4), legend_y=-0.3)
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_8_output_13_0.png

The scatter plot can be customized using the kwargs_scatterplot parameter, a dictionary of keyword arguments passed to the internally called plt.scatter() function:

dpul_plot.pca(df_pu=df_pu, labels=labels, kwargs_scatterplot={"s": 25, "edgecolor": "black"})
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_9_output_15_0.png
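
How such a dictionary might be combined with plot defaults can be sketched as follows. The default values and the helper name merge_scatter_kwargs are assumptions for illustration only; the defaults actually used inside pca() are not stated here:

```python
def merge_scatter_kwargs(kwargs_scatterplot=None, defaults=None):
    """Merge user-supplied scatter keyword arguments over a set of defaults.

    Hypothetical helper; the default values below are assumed, not taken
    from aaanalysis.
    """
    merged = {"s": 40, "alpha": 1.0} if defaults is None else dict(defaults)
    # User-supplied values take precedence over the defaults
    merged.update(kwargs_scatterplot or {})
    return merged


merged = merge_scatter_kwargs({"s": 25, "edgecolor": "black"})
print(merged)
```

The pattern keeps unspecified settings at their defaults while letting entries such as "s" or "edgecolor" override them.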

To change the legend, just disable it (setting legend=False) and re-create it using the aa.plot_legend() function:

DICT_COLOR = aa.plot_get_cdict()
dict_color = {"Neg": DICT_COLOR["SAMPLES_REL_NEG"], "Pos": DICT_COLOR["SAMPLES_POS"], "Unl": DICT_COLOR["SAMPLES_UNL"]}
dpul_plot.pca(df_pu=df_pu, labels=labels, legend=False)
aa.plot_legend(dict_color=dict_color, y=1.2, handlelength=1, marker="o")
plt.tight_layout()
plt.show()
../_images/dpul_plot_pca_10_output_17_0.png