aaanalysis.CPPPlot.feature

CPPPlot.feature(feature=None, df_seq=None, labels=None, label_test=1, label_ref=0, ax=None, figsize=(5.6, 4.8), names_to_show=None, name_test='TEST', name_ref='REF', color_test='tab:green', color_ref='tab:gray', show_seq=False, histplot=False, fontsize_mean_dif=15, fontsize_name_test=13, fontsize_name_ref=13, fontsize_names_to_show=11, alpha_hist=0.1, alpha_dif=0.2)

Plot distributions of CPP feature values for test and reference datasets, highlighting their mean difference.

Introduced in [Breimann24a], a CPP feature is defined as a Part-Split-Scale combination. For a sample, a feature value is computed in three steps:

  1. Part Selection: Identify a specific sequence part.

  2. Part-Splitting: Divide the selected part into subsequences, creating a Part-Split combination.

  3. Scale Value Assignment: For each amino acid in the Part-Split subsequence, assign its corresponding scale value and calculate the average, which is termed the feature value.
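The three steps above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `select_segment` and `feature_value`, the toy scale values, and the rule used to cut a part into `n` contiguous segments are all assumptions made for this example.

```python
def select_segment(part_seq, i, n):
    """Return the i-th (1-based) of n contiguous segments of part_seq.

    Assumption: segments are cut at integer boundaries i * len / n.
    """
    if not 1 <= i <= n <= len(part_seq):
        raise ValueError("need 1 <= i <= n <= len(part_seq)")
    start = (i - 1) * len(part_seq) // n
    end = i * len(part_seq) // n
    return part_seq[start:end]


def feature_value(part_seq, i, n, scale):
    """Average the scale values over the selected Part-Split subsequence."""
    segment = select_segment(part_seq, i, n)           # step 2: splitting
    return sum(scale[aa] for aa in segment) / len(segment)  # step 3: averaging


# Hypothetical scale values for four amino acids (not taken from AAindex)
toy_scale = {"A": 0.0, "L": 1.0, "G": 0.2, "V": 0.8}
tmd = "ALGVLLAV"  # step 1: the selected sequence part

# Segment(1,2): first half of the part, here "ALGV"
value = feature_value(tmd, i=1, n=2, scale=toy_scale)
print(round(value, 3))
```

With `i=1, n=1` the segment is the entire part, so the feature value is the scale average over the whole part.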

Parameters:
  • feature (str) – Name of the feature for which test and reference set distributions and difference should be plotted.

  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – DataFrame containing an entry column with unique protein identifiers and sequence information in a distinct Position-based, Part-based, Sequence-based, or Sequence-TMD-based format.

  • labels (array-like, shape (n_samples,)) – Class labels for samples in sequence DataFrame (typically, test=1, reference=0).

  • label_test (int, default=1) – Class label of test group in labels.

  • label_ref (int, default=0) – Class label of reference group in labels.

  • ax (plt.Axes, optional) – Pre-defined Axes object to plot on. If None, a new Axes object is created.

  • figsize (tuple, default=(5.6, 4.8)) – Figure dimensions (width, height) in inches.

  • names_to_show (list of str, optional) – Names of specific samples from df_seq to highlight on the plot. If names_to_show is not None, df_seq must contain a ‘name’ column.

  • name_test (str, default="TEST") – Name for the test dataset.

  • name_ref (str, default="REF") – Name for the reference dataset.

  • color_test (str, default="tab:green") – Color for the test dataset.

  • color_ref (str, default="tab:gray") – Color for the reference dataset.

  • show_seq (bool, default=False) – If True, show sequence of samples selected via names_to_show.

  • histplot (bool, default=False) – If True, plot a histogram. If False, plot a kernel density estimate (KDE) plot.

  • fontsize_mean_dif (int or float, default=15) – Font size (>0) for displayed mean difference text.

  • fontsize_name_test (int or float, default=13) – Font size (>0) for the name of the test dataset.

  • fontsize_name_ref (int or float, default=13) – Font size (>0) for the name of the reference dataset.

  • fontsize_names_to_show (int or float, default=11) – Font size (>0) for the names selected via names_to_show.

  • alpha_hist (int or float, default=0.1) – The transparency alpha value [0-1] for the histogram distributions.

  • alpha_dif (int or float, default=0.2) – The transparency alpha value [0-1] for the mean difference area.

Returns:

ax – CPP feature plot axes object.

Return type:

plt.Axes


Examples

To demonstrate the CPPPlot().feature() method, we load the DOM_GSEC_PU example dataset (see [Breimann25a]):

import matplotlib.pyplot as plt
import aaanalysis as aa
aa.options["verbose"] = False

df_seq = aa.load_dataset(name="DOM_GSEC_PU")
labels = df_seq["label"].to_list()
labels = [0 if x == 2 else x for x in labels]  # Map label 2 to the reference label (0)

For any feature, we can display the distribution of feature values for a test and a reference dataset, provided by the df_seq and the corresponding labels parameters. The feature has to be a valid Part-Split-Scale combination (scales are given by their AAindex id):

# This feature is the average volume over the entire TMD sequence
feature = "TMD-Segment(1,1)-GRAR740103"
cpp_plot = aa.CPPPlot()
aa.plot_settings(font_scale=1)
cpp_plot.feature(feature=feature, df_seq=df_seq, labels=labels)
plt.title("Average TMD Volume (GRAR740103)")
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_1_output_3_0.png
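To make the Part-Split-Scale structure of a feature id concrete, the following sketch breaks an id into its components. The function `parse_feature_id` is hypothetical and not part of aaanalysis; it assumes the three components are separated by hyphens and that the split is written as a name followed by comma-separated arguments in parentheses, as in the ids shown on this page.

```python
import re


def parse_feature_id(feature):
    """Split a feature id into (part, split_type, split_args, scale)."""
    components = feature.split("-")
    if len(components) != 3:
        raise ValueError(f"expected 'Part-Split-Scale', got {feature!r}")
    part, split, scale = components
    match = re.fullmatch(r"(\w+)\((.*)\)", split)
    if match is None:
        raise ValueError(f"unexpected split format: {split!r}")
    split_type = match.group(1)
    # Keep the arguments as strings; their meaning depends on the split type
    split_args = tuple(arg.strip() for arg in match.group(2).split(","))
    return part, split_type, split_args, scale


print(parse_feature_id("TMD-Segment(1,1)-GRAR740103"))
```

For the feature used above, this separates the part (TMD), the split (Segment with arguments 1 and 1), and the AAindex scale id (GRAR740103).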

We can now load the respective feature set for the DOM_GSEC_PU dataset:

sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_feat = aa.load_features(name="DOM_GSEC")
aa.display_df(df_feat, n_rows=15)
  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std
1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.970400 | 1.438918
2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.000000 | 0.000000
3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 1.554800 | 2.109848
4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 3.111200 | 3.109955
5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 0.000000 | 0.000000
6 | TMD_C_JMD_C-Seg...2,3)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.222000 | 0.058671 | 0.058671 | 0.064895 | 0.069547 | 0.000000 | 0.000001 | 27,28,29,30,31,32,33 | 0.000000 | 0.000000
7 | TMD_C_JMD_C-Seg...4,5)-FAUJ880109 | Energy | Isoelectric point | Number hydrogen bond donors | Number of hydro...e et al., 1988) | 0.215000 | 0.146661 | 0.146661 | 0.174609 | 0.188034 | 0.000000 | 0.000004 | 33,34,35,36 | 1.032400 | 1.510722
8 | TMD_C_JMD_C-Seg...3,4)-JANJ780101 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Average accessi...n et al., 1978) | 0.215000 | 0.124317 | 0.124317 | 0.166309 | 0.153364 | 0.000000 | 0.000004 | 31,32,33,34,35 | 1.080400 | 1.296094
9 | TMD_C_JMD_C-Seg...,10)-WILM950103 | Polarity | Hydrophobicity (interface) | Hydrophobicity (interface) | Hydrophobicity ...e et al., 1995) | 0.212000 | 0.141305 | -0.141305 | 0.168603 | 0.217235 | 0.000000 | 0.000005 | 33,34 | 1.747200 | 2.150664
10 | TMD_C_JMD_C-Seg...6,9)-AURR980110 | Conformation | α-helix | α-helix (middle) | Normalized posi...ora-Rose, 1998) | 0.211000 | 0.125350 | 0.125350 | 0.160819 | 0.174121 | 0.000000 | 0.000005 | 32,33 | 1.788800 | 2.700803
11 | TMD_C_JMD_C-Seg...2,3)-AURR980110 | Conformation | α-helix | α-helix (middle) | Normalized posi...ora-Rose, 1998) | 0.211000 | 0.077355 | 0.077355 | 0.102965 | 0.107453 | 0.000000 | 0.000005 | 27,28,29,30,31,32,33 | 3.048800 | 3.623912
12 | TMD_C_JMD_C-Seg...3,4)-JANJ790102 | Energy | Free energy (unfolding) | Transfer free e...(TFE) to inside | Transfer free e...y (Janin, 1979) | 0.206000 | 0.111462 | -0.111462 | 0.159718 | 0.144989 | 0.000000 | 0.000009 | 31,32,33,34,35 | 0.000000 | 0.000000
13 | TMD_C_JMD_C-Seg...6,9)-CHOC760103 | ASA/Volume | Buried | Buried | Proportion of r...(Chothia, 1976) | 0.205000 | 0.125868 | -0.125868 | 0.172165 | 0.188333 | 0.000000 | 0.000009 | 32,33 | 0.000000 | 0.000000
14 | TMD_C_JMD_C-Seg...4,5)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.204000 | 0.105513 | 0.105513 | 0.132849 | 0.145219 | 0.000000 | 0.000009 | 33,34,35,36 | 1.992000 | 2.929460
15 | TMD_C_JMD_C-Seg...6,9)-DESM900102 | Polarity | Amphiphilicity (α-helix) | Membrane preference | Average membran...i et al., 1990) | 0.200000 | 0.132693 | -0.132693 | 0.184359 | 0.209008 | 0.000000 | 0.000015 | 32,33 | 0.000000 | 0.000000

We can plot the feature value distributions of the test and reference datasets for the top-ranked feature (the first row of df_feat) using the CPPPlot().feature() method. You need to provide the CPP feature id (Part-Split-Scale combination), the df_seq DataFrame, and its respective labels:

list_features = df_feat["feature"].to_list()
cpp_plot = aa.CPPPlot()
aa.plot_settings(font_scale=1)
cpp_plot.feature(feature=list_features[0], df_seq=df_seq, labels=labels)
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_2_output_7_0.png

Test vs Reference Dataset

The distributions for the test dataset (TEST, green) and the reference dataset (REF, gray) are compared by highlighting the difference between their mean values (called Mean difference).
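As a rough illustration of the quantity being highlighted, the mean difference can be computed from per-sample feature values and their labels. The function `mean_difference` and the toy values are assumptions for this sketch; it takes the difference as test mean minus reference mean, so that a positive value corresponds to higher values in the test group.

```python
from statistics import mean


def mean_difference(values, labels, label_test=1, label_ref=0):
    """Mean of the test group minus mean of the reference group."""
    test = [v for v, lab in zip(values, labels) if lab == label_test]
    ref = [v for v, lab in zip(values, labels) if lab == label_ref]
    return mean(test) - mean(ref)


# Toy feature values for four samples (two test, two reference)
values = [0.75, 0.5, 0.25, 0.5]
labels = [1, 1, 0, 0]
print(mean_difference(values, labels))  # 0.25
```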

Set the feature name as title using the SequenceFeature().get_feature_names() method:

sf = aa.SequenceFeature()
feature_names = sf.get_feature_names(features=list_features)
cpp_plot.feature(feature=list_features[0], df_seq=df_seq, labels=labels)
plt.title(feature_names[0])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_3_output_10_0.png

You can use the ax parameter to create subplots for displaying multiple features:

aa.plot_settings(font_scale=0.8)
# Create subplots
fig, axes = plt.subplots(1, 2, figsize=(8, 4))
cpp_plot.feature(ax=axes[0], feature=list_features[2], df_seq=df_seq, labels=labels)
cpp_plot.feature(ax=axes[1], feature=list_features[12], df_seq=df_seq, labels=labels)
axes[0].set_title(feature_names[2])
axes[1].set_title(feature_names[12])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_4_output_12_0.png

Positive vs Negative Mean Difference

The mean difference of feature values can be either positive or negative:

Positive (indicated in red) means that the feature values of the test class are higher on average (e.g., left plot).

Negative (indicated in blue) means that the feature values of the test class are lower on average (e.g., right plot).

You can customize the plot by changing the figsize, dataset names (via name_test and name_ref), or dataset colors (via color_test and color_ref):

aa.plot_settings()
cpp_plot.feature(feature=list_features[2], df_seq=df_seq, labels=labels,
                 figsize=(5, 4), name_test="Test data", name_ref="Reference data",
                 color_test="tab:orange", color_ref="tab:blue")
plt.title(feature_names[2])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_5_output_15_0.png

You can change the density plot to a histogram by setting histplot=True:

cpp_plot.feature(feature=list_features[2], df_seq=df_seq, labels=labels, histplot=True)
plt.title(feature_names[2])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_6_output_17_0.png
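To clarify the difference between the two display modes, the sketch below contrasts histogram bin counts with a Gaussian kernel density estimate on the same toy values. It is a conceptual illustration only: the helper names, the fixed bandwidth, and the bin layout are assumptions and do not reflect how the plotting backend chooses them.

```python
import math


def histogram_counts(values, n_bins, lo, hi):
    """Count values per equal-width bin on the interval [lo, hi]."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        # Values on the upper edge are placed in the last bin
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts


def gaussian_kde(values, x, bandwidth):
    """Evaluate a Gaussian kernel density estimate at position x."""
    norm = len(values) * bandwidth * math.sqrt(2 * math.pi)
    return sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values) / norm


values = [0.2, 0.25, 0.3, 0.7]
print(histogram_counts(values, n_bins=2, lo=0.0, hi=1.0))  # [3, 1]
print(gaussian_kde(values, x=0.25, bandwidth=0.1))  # density near the cluster
```

The histogram reports discrete counts per bin, whereas the KDE yields a smooth curve that is highest where the values cluster.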

Adjust the transparency (alpha value) of the histogram and the mean difference using the alpha_hist (default=0.1) and alpha_dif (default=0.2) parameters:

cpp_plot.feature(feature=list_features[2], df_seq=df_seq, labels=labels, alpha_dif=0.5, alpha_hist=0.7)
plt.title(feature_names[2])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_7_output_19_0.png

To highlight samples within the distributions, the df_seq DataFrame needs to contain a name column. Selected names from this column are displayed if provided via the names_to_show parameter:

df_seq["name"] = [f"Protein {i}" for i in range(len(df_seq))]
cpp_plot.feature(feature=list_features[2], df_seq=df_seq, labels=labels, names_to_show=["Protein 1", "Protein 100"])
plt.title(feature_names[2])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_8_output_21_0.png

You can show the sub-sequence for the Part-Split combination of the selected proteins by setting show_seq=True:

df_seq["name"] = [f"Protein {i}" for i in range(len(df_seq))]
cpp_plot.feature(feature=list_features[2], df_seq=df_seq, labels=labels, names_to_show=["Protein 1", "Protein 100"], show_seq=True)
plt.title(feature_names[2])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_9_output_23_0.png

The following parameters adjust the font sizes: fontsize_mean_dif (default=15), fontsize_name_test (default=13), fontsize_name_ref (default=13), and fontsize_names_to_show (default=11):

cpp_plot.feature(feature=list_features[2], df_seq=df_seq, labels=labels, names_to_show=["Protein 1", "Protein 100"], show_seq=True,
                 fontsize_mean_dif=10, fontsize_name_test=15, fontsize_name_ref=13, fontsize_names_to_show=15)
# Adjust the feature name for the TMD length
feature_names = sf.get_feature_names(list_features[2], tmd_len=23)
plt.title(feature_names[0])
plt.tight_layout()
plt.show()
../_images/cpp_plot_feature_10_output_25_0.png