aaanalysis.CPPPlot.feature_map
- CPPPlot.feature_map(df_feat=None, col_cat='subcategory', col_val='mean_dif', col_imp='feat_importance', name_test='TEST', name_ref='REF', figsize=(8, 8), add_imp_bar_top=True, imp_bar_th=None, imp_bar_label_type='long', imp_ths=(0.2, 0.5, 1), imp_marker_sizes=(3, 5, 8), start=1, tmd_len=20, tmd_seq=None, jmd_n_seq=None, jmd_c_seq=None, tmd_color='mediumspringgreen', jmd_color='blue', tmd_seq_color='black', jmd_seq_color='white', seq_size=None, fontsize_tmd_jmd=None, weight_tmd_jmd='normal', fontsize_titles=11, fontsize_labels=12, fontsize_annotations=11, fontsize_imp_bar=9, add_xticks_pos=False, grid_linewidth=0.01, grid_linecolor=None, border_linewidth=2, facecolor_dark=False, vmin=None, vmax=None, cmap=None, cmap_n_colors=101, cbar_pct=True, cbar_kws=None, cbar_xywh=(0.7, None, 0.2, None), dict_color=None, legend_kws=None, legend_xy=(-0.1, -0.01), legend_imp_xy=(1.25, 0), xtick_size=11.0, xtick_width=2.0, xtick_length=5.0)
Plot CPP feature map showing feature value mean difference and feature importance per scale subcategory (y-axis) and residue position (x-axis).
- Parameters:
df_feat (pd.DataFrame, shape (n_feature, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature. Must also include a feature importance column (col_imp).
col_cat ({'category', 'subcategory', 'scale_name'}, default='subcategory') – Column name in df_feat for scale information (y-axis).
col_val ({'mean_dif', 'abs_mean_dif', 'abs_auc'}, default='mean_dif') – Column name in df_feat for numerical values to display.
col_imp ({feat_importance, feat_importance_'name'}, default='feat_importance') – Column name in df_feat for feature importance (group-, subgroup- or sample-level).
name_test (str, default="TEST") – Name for the test dataset.
name_ref (str, default="REF") – Name for the reference dataset.
figsize (tuple, default=(8, 8)) – Figure dimensions (width, height) in inches.
add_imp_bar_top (bool, default=True) – If True, add bars for cumulative feature importance per position (top).
imp_bar_th (int or float, optional) – Threshold for cumulative feature importance per scale (right bars). If None, determined automatically.
imp_bar_label_type ({'long', 'short', None}, default='long') – Label type for cumulative feature importance bar chart. If None, no label is shown.
imp_ths (tuple, default=(0.2, 0.5, 1)) – Three ascending thresholds for feature importance (scale- and position-specific).
imp_marker_sizes (tuple, default=(3, 5, 8)) – Sizes of the three feature importance markers defined by imp_ths.
start (int, default=1) – Position label of first residue position (starting at N-terminus).
tmd_len (int, default=20) – Length of TMD to be depicted (>0). Must match all features in df_feat.
tmd_seq (str, optional) – TMD sequence for specific sample.
jmd_n_seq (str, optional) – JMD N-terminal sequence for specific sample. Length must match with ‘jmd_n_len’ attribute.
jmd_c_seq (str, optional) – JMD C-terminal sequence for specific sample. Length must match with ‘jmd_c_len’ attribute.
tmd_color (str, default='mediumspringgreen') – Color for TMD.
jmd_color (str, default='blue') – Color for JMDs.
tmd_seq_color (str, default='black') – Color for TMD sequence.
jmd_seq_color (str, default='white') – Color for JMD sequence.
seq_size (int or float, optional) – Font size (>=0) for sequence characters. If None, optimized automatically.
fontsize_tmd_jmd (int or float, optional) – Font size (>=0) for the part labels: ‘JMD-N’, ‘TMD’, ‘JMD-C’. If None, optimized automatically.
weight_tmd_jmd ({'normal', 'bold'}, default='normal') – Font weight for the part labels: ‘JMD-N’, ‘TMD’, ‘JMD-C’.
fontsize_titles (int or float, default=11) – Font size (>= 0) for figure titles. If None, determined automatically.
fontsize_labels (int or float, default=12) – Font size (>= 0) for figure labels. If None, determined automatically.
fontsize_annotations (int or float, default=11) – Font size (>= 0) for figure annotations. If None, determined automatically.
fontsize_imp_bar (int or float, default=9) – Font size (>= 0) for feature importance in bars. If None, determined automatically.
add_xticks_pos (bool, default=False) – If True, include x-tick positions when TMD-JMD sequence is given.
grid_linewidth (int or float, default=0.01) – Line width for the grid.
grid_linecolor (str, optional) – Color for the grid lines. If None, automatically determined based on facecolor_dark.
border_linewidth (int or float, default=2) – Line width for the TMD-JMD border.
facecolor_dark (bool, optional) – Sets background of heatmap to black (if True) or white. If None, automatically determined from the shap_plot setting. Affects grid cells for missing or near-zero data based on col_val.
vmin (int or float, optional) – Minimum col_val value setting the lower end of the colormap. If None, determined automatically.
vmax (int or float, optional) – Maximum col_val value setting the upper end of the colormap. If None, determined automatically.
cmap (matplotlib colormap name or object, optional) – Name of the colormap to use. If None, automatically determined from the col_val data and the ‘shap_plot’ setting.
cmap_n_colors (int, default=101) – Number of discrete steps (>1) in diverging or sequential colormap.
cbar_pct (bool, default=True) – If True, colorbar is represented in percentage and the col_val values are converted accordingly by multiplying with 100 if necessary.
cbar_kws (dict of key, value mappings, optional) – Keyword arguments for colorbar passed to matplotlib.figure.Figure.colorbar().
cbar_xywh (tuple, default=(0.7, None, 0.2, None)) – Colorbar position and size: x-axis (left), y-axis (bottom), width, height. Values are optimized if None.
dict_color (dict, optional) – Color dictionary of scale categories classifying scales shown on y-axis. Default from plot_get_cdict() with name='DICT_CAT'.
legend_kws (dict, optional) – Keyword arguments for the legend passed to plot_legend().
legend_xy (tuple, default=(-0.1, -0.01)) – Position for scale category legend: x- and y-axis coordinates. Values are set to default if None.
legend_imp_xy (tuple, default=(1.25, 0)) – Position for feature importance legend: x- and y-axis coordinates (relative to cbar).
xtick_size (int or float, default=11.0) – Size of x-tick labels (>0).
xtick_width (int or float, default=2.0) – Width of the x-ticks (>0).
xtick_length (int or float, default=5.0) – Length of the x-ticks (>0).
- Returns:
fig (plt.Figure) – The Figure object for the CPP feature map.
ax (plt.Axes) – Array of Axes objects for the CPP feature map.
Notes
tmd_seq_color and jmd_seq_color are applicable only when tmd_seq, jmd_n_seq, and jmd_c_seq are provided.
See also
CPP.run() for details on CPP statistical measures of the df_feat DataFrame.
SequenceFeature for definition of sequence Parts.
CPPPlot.feature() for visualization of mean differences for specific features.
seaborn.heatmap() for seaborn heatmap.
matplotlib.figure.Figure.colorbar() for colorbar arguments.
Matplotlib Colormaps for further cmap options.
plot_legend() used for setting scale category legend.
Examples
To demonstrate the CPPPlot().feature_map() method, we first load the example DOM_GSEC dataset and its respective features (see [Breimann25a]):

import matplotlib.pyplot as plt
import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC")
df_feat = aa.load_features(name="DOM_GSEC")
df_feat = df_feat.sort_values(by="feat_importance", ascending=False).reset_index(drop=True)
aa.display_df(df_feat, show_shape=True, n_rows=7)

DataFrame shape: (150, 15)
  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std
1 | TMD_C_JMD_C-Seg...,11)-LIFS790102 | Conformation | β-strand | β-strand | Conformational ...n-Sander, 1979) | 0.189000 | 0.125674 | 0.125674 | 0.183876 | 0.218813 | 0.000001 | 0.000039 | 28,29 | 4.729200 | 4.776785
2 | TMD_C_JMD_C-Seg...2,3)-CHOP780212 | Conformation | β-sheet (C-term) | β-turn (1st residue) | Frequency of th...-Fasman, 1978b) | 0.199000 | 0.065983 | -0.065983 | 0.087814 | 0.105835 | 0.000000 | 0.000016 | 27,28,29,30,31,32,33 | 4.106000 | 5.236574
3 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 3.111200 | 3.109955
4 | TMD_C_JMD_C-Seg...2,3)-AURR980110 | Conformation | α-helix | α-helix (middle) | Normalized posi...ora-Rose, 1998) | 0.211000 | 0.077355 | 0.077355 | 0.102965 | 0.107453 | 0.000000 | 0.000005 | 27,28,29,30,31,32,33 | 3.048800 | 3.623912
5 | TMD_C_JMD_C-Pat...4,8)-JANJ790102 | Energy | Free energy (unfolding) | Transfer free e...(TFE) to inside | Transfer free e...y (Janin, 1979) | 0.187000 | 0.144354 | -0.144354 | 0.181777 | 0.233103 | 0.000001 | 0.000049 | 33,37 | 2.833600 | 3.640617
6 | TMD_C_JMD_C-Pat...4,8)-KANM800103 | Conformation | α-helix | α-helix | Average relativ...sa-Tsong, 1980) | 0.176000 | 0.087846 | 0.087846 | 0.140464 | 0.157561 | 0.000004 | 0.000113 | 24,28 | 2.704000 | 4.076269
7 | TMD_C_JMD_C-Pat...,10)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.149000 | 0.073526 | 0.073526 | 0.133612 | 0.157088 | 0.000090 | 0.000714 | 31,34,38 | 2.050800 | 2.338278

CPP Analysis (group-level)
The group-level feature value difference per scale subcategory (y-axis) and residue position (x-axis) can be visualized by providing the df_feat DataFrame:

# Plot CPP feature map at group-level (as originally introduced without feature importance bars on top)
cpp_plot = aa.CPPPlot()
aa.plot_settings(font_scale=0.7, weight_bold=False)
cpp_plot.feature_map(df_feat=df_feat, add_imp_bar_top=False)
plt.show()

Version 1.0.2 introduced an enhanced CPP feature map that includes cumulative feature importance per residue, shown as a bar plot above the heatmap. This feature is enabled by default and can be disabled by setting add_imp_bar_top=False (shown before).

# Plot CPP feature map (v1.0.2+: with importance bars on top)
aa.plot_settings(font_scale=0.7, weight_bold=False)
cpp_plot.feature_map(df_feat=df_feat)
plt.show()

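The cumulative importance per residue can be pictured as a sum over all features covering a given position. The following is a minimal sketch of that idea using the positions and feat_importance values of the first three rows of the table above. It assumes that a feature's importance is divided equally over its positions, which may differ from what aaanalysis does internally, and cumulative_importance is a hypothetical helper, not part of the library.

```python
from collections import defaultdict

# (positions, feat_importance) copied from the first three rows of df_feat above
features = [
    ("28,29", 4.7292),
    ("27,28,29,30,31,32,33", 4.1060),
    ("31,32,33,34,35", 3.1112),
]


def cumulative_importance(features):
    """Sum feature importance per residue position.

    Assumption (not confirmed by the documentation): the importance of a
    feature is split equally across all positions listed for it.
    """
    per_position = defaultdict(float)
    for positions, importance in features:
        pos_list = [int(p) for p in positions.split(",")]
        share = importance / len(pos_list)
        for pos in pos_list:
            per_position[pos] += share
    return dict(sorted(per_position.items()))


cum_imp = cumulative_importance(features)
for pos, imp in cum_imp.items():
    print(f"position {pos}: {imp:.2f}")
```

With equal splitting, the per-position values add up to the total importance of the included features, so the bars can be read as shares of the overall importance.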
You can select a subset of features by filtering df_feat:

# Plot top 15 features
df_top15 = df_feat.head(15)
cpp_plot.feature_map(df_feat=df_top15)
plt.show()

Adjust the scale classification level (y-axis) using the col_cat parameter. Choose from the ‘category’, ‘subcategory’ (default), and ‘scale_name’ columns of df_feat:

# Show feature map with scales classified by categories
cpp_plot.feature_map(df_feat=df_feat, col_cat="category")
plt.show()

The numerical value shown in the feature map can be adjusted by the col_val parameter, which specifies one of the following df_feat columns: ‘mean_dif’ (default), ‘abs_mean_dif’, ‘abs_auc’, or ‘feat_importance’:

# Show feature map with absolute feature value difference
cpp_plot.feature_map(df_feat=df_feat, col_val="abs_mean_dif")
plt.show()

Adjust the names of the test and reference datasets using the name_test (default=‘TEST’) and name_ref (default=‘REF’) parameters:

# Adjust dataset names shown in colorbar
cpp_plot.feature_map(df_feat=df_feat, name_test="Target group", name_ref="Control group")
plt.show()

To visualize a subset of features, adjust the figsize (default=(8, 8)). Change the annotation threshold for the cumulative feature importance (right bar chart) using the imp_bar_th parameter and the respective font size using fontsize_imp_bar (default=9):

# Show only top 15 features
df_top15 = df_feat.head(15)
cpp_plot.feature_map(df_feat=df_top15, figsize=(8, 4), imp_bar_th=7, fontsize_imp_bar=8)
plt.show()

You can adjust the start position and the tmd_len (default=20) by providing them as parameters. Change the lengths of the JMD-N and JMD-C using the jmd_n_len and jmd_c_len arguments of the CPPPlot object:

# Start at residue position 10 and adjust the length of each part
cpp_plot = aa.CPPPlot(jmd_n_len=15, jmd_c_len=15)
cpp_plot.feature_map(df_feat=df_feat, start=10, tmd_len=30)
plt.show()

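To see which position labels result from a given combination of start and part lengths, the arithmetic can be sketched as follows. part_positions is a hypothetical helper written for illustration only. The JMD defaults of 10 are an assumption based on the 10-residue JMD sequences of the DOM_GSEC example, and the sketch assumes that parts are laid out contiguously from N- to C-terminus.

```python
def part_positions(start=1, jmd_n_len=10, tmd_len=20, jmd_c_len=10):
    """Return (first, last) position labels for JMD-N, TMD, and JMD-C.

    Assumption: parts are contiguous and labels count upward from 'start'.
    """
    jmd_n_first = start
    tmd_first = jmd_n_first + jmd_n_len
    jmd_c_first = tmd_first + tmd_len
    return {
        "JMD-N": (jmd_n_first, tmd_first - 1),
        "TMD": (tmd_first, jmd_c_first - 1),
        "JMD-C": (jmd_c_first, jmd_c_first + jmd_c_len - 1),
    }


# Same settings as in the example above
layout = part_positions(start=10, jmd_n_len=15, tmd_len=30, jmd_c_len=15)
print(layout)
# {'JMD-N': (10, 24), 'TMD': (25, 54), 'JMD-C': (55, 69)}
```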
CPP Analysis (sample-level)
You can visualize how the general feature value difference is translated onto the sequence of a specific sample. To this end, provide the corresponding sequence parameters jmd_n_seq, tmd_seq, and jmd_c_seq:

# Get sequence parts of first sample
cpp_plot = aa.CPPPlot()
jmd_n_seq, tmd_seq, jmd_c_seq = df_seq.loc[0, ["jmd_n", "tmd", "jmd_c"]]
args_seq = dict(jmd_n_seq=jmd_n_seq, tmd_seq=tmd_seq, jmd_c_seq=jmd_c_seq)
print("Sequence parts of first sample")
print(jmd_n_seq, tmd_seq, jmd_c_seq)
# Plot CPP feature map for first sample
cpp_plot.feature_map(df_feat=df_feat, **args_seq)
plt.show()

Sequence parts of first sample
FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH

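Because the JMD sequences must match the jmd_n_len and jmd_c_len attributes, it can help to check their lengths before plotting. The following is a small sketch of such a check. check_jmd_lengths is a hypothetical helper, not part of aaanalysis, and the default length of 10 is an assumption based on the sequences printed above.

```python
def check_jmd_lengths(jmd_n_seq, jmd_c_seq, jmd_n_len=10, jmd_c_len=10):
    """Raise ValueError if a JMD sequence does not match its expected length."""
    checks = [
        ("jmd_n_seq", jmd_n_seq, jmd_n_len),
        ("jmd_c_seq", jmd_c_seq, jmd_c_len),
    ]
    for name, seq, expected in checks:
        if len(seq) != expected:
            raise ValueError(f"{name} has length {len(seq)}, expected {expected}")


# Sequence parts of the first sample, as printed above
jmd_n_seq, tmd_seq, jmd_c_seq = "FAEDVGSNKG", "AIIGLMVGGVVIATVIVITLVML", "KKKQYTSIHH"
check_jmd_lengths(jmd_n_seq, jmd_c_seq)  # passes silently
print(len(jmd_n_seq), len(tmd_seq), len(jmd_c_seq))
```

Note that the TMD of this sample is longer than the default tmd_len=20, and the example above still passes it to feature_map() unchanged, so the sketch only checks the JMD parts.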
You can customize the following color parameters: tmd_color (default=‘mediumspringgreen’), jmd_color (default=‘blue’), tmd_seq_color (default=‘black’), and jmd_seq_color (default=‘white’):

# Change default TMD-JMD colors
cpp_plot.feature_map(df_feat=df_feat, **args_seq, tmd_color="orange", jmd_color="white", tmd_seq_color="blue", jmd_seq_color="blue")
plt.show()

The fontsize of the sequence is optimized automatically. Set verbose=True to see the optimized size. You can set it manually using the seq_size parameter:

# Change sequence size manually
cpp_plot.feature_map(df_feat=df_feat, **args_seq, seq_size=8)
plt.show()

This might result in suboptimal spacing among sequence characters. Adjust the font size of the part labels (‘JMD-N’, ‘TMD’, ‘JMD-C’) using fontsize_tmd_jmd, which is set by default to the optimized sequence size. Change its weight using weight_tmd_jmd (default=‘normal’):

# Adjust the fontsize of the TMD-JMD characters
cpp_plot.feature_map(df_feat=df_feat, **args_seq, fontsize_tmd_jmd=16, weight_tmd_jmd="bold")
plt.show()

Display the xtick positions in addition to the sequence by setting add_xticks_pos=True (default=False):

# Add the xticks indicating the sequence positions
cpp_plot.feature_map(df_feat=df_feat, **args_seq, add_xticks_pos=True)
plt.show()

CPP Analysis
Use fontsize_labels (default=12) to change the fontsize of the scale category legend and the colorbar:

# Modify label size of legends and colorbar
cpp_plot.feature_map(df_feat=df_feat, fontsize_labels=14)
plt.show()

Change the fontsize of the titles (feature information on upper part of feature map) using fontsize_titles (default=11):

# Modify fontsize of feature titles
cpp_plot.feature_map(df_feat=df_feat, fontsize_titles=14)
plt.show()

The fontsize of the feature importance percentages (excluding color bar) can be changed using the fontsize_annotations (default=11) parameter:

# Modify fontsize of feature importance annotations
cpp_plot.feature_map(df_feat=df_feat, fontsize_annotations=14)
plt.show()

Adjust the feature map grid using the grid_linewidth (default=0.01) and grid_linecolor (set by default based on facecolor_dark) parameters:

# Adjust feature map grid
cpp_plot.feature_map(df_feat=df_feat, grid_linewidth=1, grid_linecolor="orange")
plt.show()

The TMD part borders are highlighted by an extra line, whose width can be customized by border_linewidth (default=2):

# Increase width of TMD border
cpp_plot.feature_map(df_feat=df_feat, border_linewidth=5)
plt.show()

The background is set automatically based on shap_plot. You can set it to black with facecolor_dark=True:

# Set background to black
cpp_plot.feature_map(df_feat=df_feat, facecolor_dark=True)
plt.show()

Adjust the lower and upper end of the colormap using the vmin and vmax parameters:

# Change minimum and maximum values
cpp_plot.feature_map(df_feat=df_feat, vmin=-10, vmax=20)
plt.show()

You can provide any colormap from Matplotlib Colormaps using the cmap parameter. The number of discrete steps can be adjusted by cmap_n_colors (default=101):

# Use matplotlib color map with 7 color steps
cpp_plot.feature_map(df_feat=df_feat, cmap="viridis", cmap_n_colors=7)
plt.show()

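To get a feeling for what a number of discrete color steps means, the sketch below spaces n levels evenly between a lower and an upper bound. It is plain arithmetic and says nothing about how aaanalysis builds its colormap; color_levels is a hypothetical helper. One observation that may be useful for diverging colormaps: with a symmetric range, only an odd number of steps places a level exactly at zero.

```python
def color_levels(vmin, vmax, n_colors):
    """Return n_colors evenly spaced levels from vmin to vmax (inclusive)."""
    if n_colors < 2:
        raise ValueError("n_colors must be > 1")
    span = vmax - vmin
    return [vmin + span * i / (n_colors - 1) for i in range(n_colors)]


levels_7 = color_levels(-1, 1, 7)  # odd count: middle level is exactly 0.0
levels_6 = color_levels(-1, 1, 6)  # even count: no level at 0.0
print(levels_7)
print(levels_6)
```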
Customize the colorbar using cbar_kws. You can adjust its position (x-axis, y-axis), width, and height by cbar_xywh (default=(0.7, None, 0.2, None)), where default values are adopted if None is provided. The position of the feature importance legend is set by the legend_imp_xy parameter relative to the color bar:

# Change colorbar orientation, position, width and height
cbar_kws = dict(orientation="vertical")
fig, ax = cpp_plot.feature_map(df_feat=df_feat, cbar_kws=cbar_kws, cbar_xywh=(0.88, 0.25, 0.01, 0.5), legend_imp_xy=(1, -0.3))
# Plot must be adjusted by plt.subplots_adjust and not by plt.tight_layout
plt.subplots_adjust(right=0.84)
plt.show()

Change the thresholds of the feature importance to be highlighted using imp_ths (default=(0.2, 0.5, 1)). The respective marker sizes can be adjusted using the imp_marker_sizes (default=(3, 5, 8)) parameter:

# Change threshold for highlighting feature importance
cpp_plot.feature_map(df_feat=df_feat, imp_ths=(0.2, 1, 2), imp_marker_sizes=(2, 6, 10))
plt.show()

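How three ascending thresholds pair with three marker sizes can be sketched with a simple binning function. This is an illustration under stated assumptions, not the library's implementation: marker_size is a hypothetical helper, values below the first threshold are assumed to get no marker, and thresholds are assumed to be inclusive.

```python
from bisect import bisect_right


def marker_size(importance, imp_ths=(0.2, 0.5, 1), imp_marker_sizes=(3, 5, 8)):
    """Map an importance value to a marker size, or None below the first threshold.

    Assumptions: thresholds are ascending and inclusive (value >= threshold).
    """
    n_reached = bisect_right(imp_ths, importance)  # number of thresholds reached
    if n_reached == 0:
        return None
    return imp_marker_sizes[n_reached - 1]


for value in (0.1, 0.2, 0.7, 4.7):
    print(value, "->", marker_size(value))
```

With the adjusted settings from the example above, imp_ths=(0.2, 1, 2) and imp_marker_sizes=(2, 6, 10), a value of 0.7 would fall into the smallest marker class under these assumptions.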
Adjust the scale legend by the legend_kws parameter and its position using legend_xy (default=(-0.1, -0.01)):

# Adjust legend, colors can be changed by 'dict_color'
legend_kws = dict(fontsize=13, fontsize_title=15, weight_title="bold")
cpp_plot.feature_map(df_feat=df_feat, legend_kws=legend_kws, legend_xy=(None, 0.05))
plt.show()

The following x-tick parameters can be adjusted: xtick_size (default=11.0), xtick_width (default=2.0), and xtick_length (default=5.0):

# Adjust x-ticks
cpp_plot.feature_map(df_feat=df_feat, xtick_size=16, xtick_width=5, xtick_length=10)
plt.show()

X-ticks can be removed by setting xtick_size=0:

# Remove x-ticks
cpp_plot.feature_map(df_feat=df_feat, xtick_size=0)
plt.show()

CPP Analysis (sample-level)
To visualize the sample-specific feature value difference, we create the feature matrix for the DOM_GSEC example dataset (see [Breimann25a]) using the SequenceFeature().feature_matrix() method:

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

Next, we include the sample-specific feature value differences (relative to the reference set) in df_feat for all samples using the ShapModel class:

# Include sample-specific feature value differences with reference set
labels = df_seq["label"].to_list()
sm = aa.ShapModel()
df_feat = sm.add_sample_mean_dif(X, labels=labels, df_feat=df_feat)
aa.display_df(df_feat, n_rows=5, n_cols=15, show_shape=True)

DataFrame shape: (150, 141)
  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std
1 | TMD_C_JMD_C-Seg...,11)-LIFS790102 | Conformation | β-strand | β-strand | Conformational ...n-Sander, 1979) | 0.189000 | 0.125674 | 0.125674 | 0.183876 | 0.218813 | 0.000001 | 0.000039 | 28,29 | 4.729200 | 4.776785
2 | TMD_C_JMD_C-Seg...2,3)-CHOP780212 | Conformation | β-sheet (C-term) | β-turn (1st residue) | Frequency of th...-Fasman, 1978b) | 0.199000 | 0.065983 | -0.065983 | 0.087814 | 0.105835 | 0.000000 | 0.000016 | 27,28,29,30,31,32,33 | 4.106000 | 5.236574
3 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 3.111200 | 3.109955
4 | TMD_C_JMD_C-Seg...2,3)-AURR980110 | Conformation | α-helix | α-helix (middle) | Normalized posi...ora-Rose, 1998) | 0.211000 | 0.077355 | 0.077355 | 0.102965 | 0.107453 | 0.000000 | 0.000005 | 27,28,29,30,31,32,33 | 3.048800 | 3.623912
5 | TMD_C_JMD_C-Pat...4,8)-JANJ790102 | Energy | Free energy (unfolding) | Transfer free e...(TFE) to inside | Transfer free e...y (Janin, 1979) | 0.187000 | 0.144354 | -0.144354 | 0.181777 | 0.233103 | 0.000001 | 0.000049 | 33,37 | 2.833600 | 3.640617

Finally, we can visualize the sample-specific feature value difference for a selected sample by providing the respective column name in col_val together with its sequence parameters:

# Plot CPP feature map for selected protein
args_seq = dict(tmd_seq=tmd_seq, jmd_n_seq=jmd_n_seq, jmd_c_seq=jmd_c_seq)
cpp_plot.feature_map(df_feat=df_feat, col_val="mean_dif_Protein0", name_test="Protein0", **args_seq)
plt.show()

The sample-specific feature importance (absolute feature impact) can be obtained as follows:
# Fit SHAP explainer to obtain SHAP values
sm.fit(X, labels=labels)
# Include feature impact for all samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True)
# Convert feature impact to sample-specific feature importance
feat_imp_prot0 = df_feat["feat_impact_Protein0"].abs()
df_feat.insert(loc=10, column='feat_importance_Protein0', value=feat_imp_prot0)
# Plot CPP feature map for selected protein
cpp_plot.feature_map(df_feat=df_feat, col_val="mean_dif_Protein0", col_imp="feat_importance_Protein0", name_test="Protein0", **args_seq)
plt.show()