aaanalysis.CPPPlot.eval

static CPPPlot.eval(df_eval=None, figsize=(6, 4), dict_xlims=None, legend=True, legend_y=-0.3, dict_color=None, list_cat=None)[source]

Plot evaluation output of CPP comparing multiple sets of identified feature sets.

Evaluation measures are categorized into two groups:

  • Discriminative Power measures (‘range_ABS_AUC’ and ‘avg_MEAN_DIF’), which assess the effectiveness of the feature set in distinguishing between the test and reference datasets.

  • Redundancy measures (‘n_clusters’, ‘avg_n_feat_per_clust’, and ‘std_n_feat_per_clust’), which evaluate the internal redundancy of a feature set using Pearson correlation-based clustering.

Parameters:
  • df_eval (pd.DataFrame, shape (n_feature_sets, n_metrics)) –

    DataFrame with evaluation measures for sets of identified features. Each row corresponds to a specific feature set. required ‘columns’ are:

    • ’name’: Name of the feature set.

    • ’n_features’: Number of features per scale category given as list.

    • ’range_ABS_AUC’: Quintile range of absolute AUC among all features (min, 25%, median, 75%, max).

    • ’avg_MEAN_DIF’: Two mean differences averaged across all features (for positive and negative ‘mean_dif’)

    • ’n_clusters’: Optimal number of clusters.

    • ’avg_n_feat_per_clust’: Average number of features per cluster.

    • ’std_n_feat_per_clust’: Standard deviation of feature number per cluster

  • figsize (tuple, default=(6, 4)) – Figure dimensions (width, height) in inches.

  • dict_xlims (dict, optional) – A dictionary containing x-axis limits for subplots. Keys should be the subplot axis number ({0, 1, 2, 4}) and values should be tuple specifying (xmin, xmax). If None, x-axis limits are auto-scaled.

  • legend (bool, default=True) – If True, scale category legend is set under number of features measures.

  • legend_y (float, default=-0.3) – Legend position regarding the plot y-axis applied if legend=True.

  • dict_color (dict, optional) – Color dictionary of scale categories for legend. Default from plot_get_cdict() with name='DICT_CAT'.

  • list_cat (list of str, optional) – List of scale categories for which feature numbers are shown.

Returns:

  • fig (plt.Figure) – Figure object for evaluation plot

  • axes (array of plt.Axes) – Array of Axes objects, each representing a subplot within the figure.

Notes

  • Altering figsize height could result in unappropriated legend spacing. This can be adjusted by the legend_y parameter together with using the matplotlib.pyplot.subplots_adjust() function, here used with (wspace=0.25, hspace=0, bottom=0.35) parameter settings.

See also

Examples

To demonstrate the CPPPlot().eval() method, we first create a list of features sets using the DOM_GSEC_PU feature data set (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC", n=50)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_feat = aa.load_features()
# Feature sets with varying length
list_df_feat = [df_feat, df_feat.head(100), df_feat.head(50), df_feat.head(25)]
# Feature sets with varying scale categories
list_cat = ['ASA/Volume', 'Conformation', 'Energy','Polarity']
list_df_feat.extend([df_feat[df_feat["category"].isin(list_cat)],
                     df_feat[df_feat["category"].isin(list_cat[0:2])]])

We can now create the df_eval DataFrame, which contains different evaluation measures for comparing the feature datasets, by using the CPP().eval() method:

cpp = aa.CPP(df_parts=df_parts)
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels)
aa.display_df(df_eval)
  name n_features avg_ABS_AUC range_ABS_AUC avg_MEAN_DIF n_clusters avg_n_feat_per_clust std_n_feat_per_clust
1 Set 1 (150, [18, 0, 56, 27, 0, 16, 17, 16]) 0.164000 [0.126, 0.142, 0.162, 0.181, 0.244] (np.float64(0.083), np.float64(-0.08)) 21 7.140000 5.140000
2 Set 2 (100, [12, 0, 33, 21, 0, 11, 13, 10]) 0.178000 [0.149, 0.162, 0.174, 0.188, 0.244] (np.float64(0.086), np.float64(-0.087)) 13 7.690000 5.910000
3 Set 3 (50, [6, 0, 18, 13, 0, 5, 4, 4]) 0.195000 [0.174, 0.181, 0.188, 0.205, 0.244] (np.float64(0.096), np.float64(-0.095)) 10 5.000000 3.690000
4 Set 4 (25, [4, 0, 7, 8, 0, 2, 3, 1]) 0.209000 [0.189, 0.197, 0.205, 0.215, 0.244] (np.float64(0.1), np.float64(-0.108)) 2 12.500000 6.500000
5 Set 5 (117, [18, 0, 56, 27, 0, 16, 0, 0]) 0.165000 [0.126, 0.142, 0.164, 0.182, 0.244] (np.float64(0.084), np.float64(-0.079)) 20 5.850000 3.120000
6 Set 6 (74, [18, 0, 56, 0, 0, 0, 0, 0]) 0.161000 [0.126, 0.137, 0.156, 0.18, 0.243] (np.float64(0.085), np.float64(-0.073)) 13 5.690000 3.220000

df_eval can now be supplied to the CPPPlot().eval() method to visualize the evaluation:

import matplotlib.pyplot as plt
cpp_plot = aa.CPPPlot()
cpp_plot.eval(df_eval=df_eval)
plt.show()
../_images/cpp_plot_eval_1_output_5_0.png

Ticks and the fontsize can be adjusted using the plot_settings() functions.

aa.plot_settings(font_scale=0.7, short_ticks_x=True, no_ticks_y=True, weight_bold=False)
cpp_plot.eval(df_eval=df_eval)
plt.show()
../_images/cpp_plot_eval_2_output_7_0.png

Explanation of Evaluation Plot

The visualization provides insights into feature set characteristics, displayed from left to right:

  1. Number of Features per Scale Category: Displayed as a stacked bar chart.

  2. Range of Absolute AUC Values: Includes minimum, 25%, median, 75%, and maximum values.

  3. Mean Difference: The average difference between the test and the reference dataset, with positive values in red and negative values in blue.

  4. Number of Clusters: Based on Pearson correlation between the features.

  5. Average Number of Features per Cluster.

Metrics (b) and (c) gauge the Discriminative Power of the feature sets (higher is better), while (d) and (e) assess the Numerical Redundancy, with fewer features per cluster indicating greater diversity.

The figsize can be adjusted:

cpp_plot.eval(df_eval=df_eval, figsize=(7, 3))
plt.show()
../_images/cpp_plot_eval_3_output_10_0.png

This might lead to an unappropriated legend spacing. The y position of the legend can be controlled using the legend_y (default=-0.3) parameter. In addition, we recommend to use the :func:matplotlib.pyplot.subplots_adjust to adjust the bottom spacing of the figure:

cpp_plot.eval(df_eval=df_eval, figsize=(7, 3), legend_y=-0.4)
plt.subplots_adjust(wspace=0.25, hspace=0, bottom=0.4)
plt.show()
../_images/cpp_plot_eval_4_output_12_0.png

Alternatively, you can simply remove the legend setting legend=False

cpp_plot.eval(df_eval=df_eval, figsize=(7, 3), legend=False)
plt.show()
../_images/cpp_plot_eval_5_output_14_0.png

Customize the the x-limits of each subplot using the dict_xlims parameter:

dict_xlims = {0: (0, 50), 3: (0, 35)} # Adjust first and fourth subplot
cpp_plot.eval(df_eval=df_eval, figsize=(7, 3), legend=False, dict_xlims=dict_xlims)
plt.show()
../_images/cpp_plot_eval_6_output_16_0.png

You can change the colors of the scale categories using the dict_color argument. We first get the default scale category color dictionary using the plt_get_cdict function with name='DICT_CAT' and a new color list using plot_get_clist:

dict_color = aa.plot_get_cdict(name="DICT_CAT")
list_colors = aa.plot_get_clist(n_colors=10)
# New color dict
dict_color = dict(zip(dict_color, list_colors))
cpp_plot.eval(df_eval=df_eval, dict_color=dict_color)
plt.show()
../_images/cpp_plot_eval_7_output_18_0.png
cpp = aa.CPP(df_parts=df_parts)
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, list_cat=list_cat[0:4])
cpp_plot.eval(df_eval=df_eval, list_cat=list_cat[0:4])
plt.show()
../_images/cpp_plot_eval_8_output_19_0.png

To adjust the set names, provide the customized names via the names_feature_sets argument to the CPP().eval() method:

cpp = aa.CPP(df_parts=df_parts)
names_feature_sets = [f"Feature set {i}" for i in range(1, 7)]
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, names_feature_sets=names_feature_sets)
cpp_plot.eval(df_eval=df_eval)
plt.show()
../_images/cpp_plot_eval_9_output_21_0.png

If only a subset of scale categories should be considered, the list_cat argument should be adjusted in both the CPP().eval() and the CPPPlot().eval() method: