aaanalysis.CPPPlot.profile
- CPPPlot.profile(df_feat=None, shap_plot=False, col_imp='feat_importance', normalize=True, ax=None, figsize=(7, 5), start=1, tmd_len=20, tmd_seq=None, jmd_n_seq=None, jmd_c_seq=None, tmd_color='mediumspringgreen', jmd_color='blue', tmd_seq_color='black', jmd_seq_color='white', seq_size=None, fontsize_tmd_jmd=None, weight_tmd_jmd='normal', add_xticks_pos=False, highlight_tmd_area=True, highlight_alpha=0.15, add_legend_cat=False, dict_color=None, legend_kws=None, bar_width=0.75, edge_color=None, grid_axis=None, ylim=None, xtick_size=11.0, xtick_width=2.0, xtick_length=5.0, ytick_size=None, ytick_width=None, ytick_length=5.0)[source]
Plot CPP/-SHAP profile showing feature importance/impact per residue position.
- Parameters:
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature. Must also include either
feat_importanceorfeat_impactcolumn.shap_plot (bool, default=False) –
Set the analysis type: CPP Analysis (if
False) for group-level or CPP-SHAP Analysis for sample-level (or subgroup-level) results:CPP Analysis
col_imp: Refers to the group-level feat_importance column (shown in gray), depicted by gray bars for each residue position.
CPP-SHAP Analysis
col_imp: Enables the selection of specific feature impacts from a feat_impact_’name’ column for an individual sample, where positive (red) and negative (blue) feature impacts are visualized by +/- bars.
col_imp (str or None, default='feat_importance') – Column name in
df_featfor feature importance/impact values. Must match with theshap_plotsetting. IfNone, the number of features per residue position will be shown.normalize (bool, default=True) – If
True, normalizes aggregated numerical values to a total of 100%.ax (plt.Axes, optional) – Pre-defined Axes object to plot on. If
None, a new Axes object is created.figsize (tuple, default=(7, 5)) – Figure dimensions (width, height) in inches.
start (int, default=1) – Position label of first residue position (starting at N-terminus).
tmd_len (int, default=20) – Length of TMD to be depicted (>0). Must match with all feature from
df_feat.tmd_seq (str, optional) – TMD sequence for specific sample.
jmd_n_seq (str, optional) – JMD N-terminal sequence for specific sample. Length must match with ‘jmd_n_len’ attribute.
jmd_c_seq (str, optional) – JMD C-terminal sequence for specific sample. Length must match with ‘jmd_c_len’ attribute.
tmd_color (str, default='mediumspringgreen') – Color for TMD.
jmd_color (str, default='blue') – Color for JMDs.
tmd_seq_color (str, default='black') – Color for TMD sequence.
jmd_seq_color (str, default='white') – Color for JMD sequence.
seq_size (int or float, optional) – Font size (>=0) for sequence characters. If
None, optimized automatically.fontsize_tmd_jmd (int or float, optional) – Font size (>=0) for the part labels: ‘JMD-N’, ‘TMD’, ‘JMD-C’. If
None, optimized automatically.weight_tmd_jmd ({'normal', 'bold'}, default='normal') – Font weight for the part labels: ‘JMD-N’, ‘TMD’, ‘JMD-C’.
add_xticks_pos (bool, default=False) – If
True, include x-tick positions when TMD-JMD sequence is given.highlight_tmd_area (bool, default=True) – If
True, highlights the TMD area on the plot.highlight_alpha (float, default=0.15) – The transparency alpha value [0-1] for TMD area highlighting.
add_legend_cat (bool, default=False) – If
True, the scale categories are indicated as stacked bars and a legend is added. IfTrue, ensure thatshap_plot=False.dict_color (dict, optional) – Color dictionary of scale categories for legend. Default from
plot_get_cdict()withname='DICT_CAT'.legend_kws (dict, optional) – Keyword arguments for the legend passed to
plot_legend().edge_color (str, optional) – Color of the bar edges.
grid_axis ({'x', 'y', 'both', None}, default=None) – Axis on which the grid is drawn if not
None.ylim (tuple, optional) – Y-axis limits. If
None, y-axis limits are set automatically.xtick_size (int or float, default=11.0) – Size of x-tick labels (>0).
xtick_width (int or float, default=2.0) – Width of the x-ticks (>0).
xtick_length (int or float, default=5.0) – Length of the x-ticks (>0).
ytick_size (int or float, optional) – Size of y-tick labels (>0).
ytick_width (int or float, default=2.0) – Width of the y-ticks (>0).
ytick_length (int or float, default=5.0) – Length of the y-ticks (>0).
- Returns:
fig (plt.Figure) – The Figure object for the CPP profile plot.
ax (plt.Axes) – CPP profile plot axes object.
Notes
tmd_seq_colorandjmd_seq_colorare applicable only whentmd_seq,jmd_n_seq, andjmd_c_seqare provided.
Warning
If
ylimdoes not match with minimum and/or maximum of aggregate numerical values across all residue position, aUserWarningis raised andylimwill be adjusted automatically.
Examples
To demonstrate the
CPPPlot().profile()method, we first load the exampleDOM_GSECdataset and its respective features (see [Breimann25a]):import matplotlib.pyplot as plt import aaanalysis as aa aa.options["verbose"] = False df_seq = aa.load_dataset(name="DOM_GSEC") df_feat = aa.load_features(name="DOM_GSEC") df_feat = df_feat.sort_values(by="feat_importance", ascending=False).reset_index(drop=True) aa.display_df(df_feat, show_shape=True, n_rows=7)
DataFrame shape: (150, 15)
feature category subcategory scale_name scale_description abs_auc abs_mean_dif mean_dif std_test std_ref p_val_mann_whitney p_val_fdr_bh positions feat_importance feat_importance_std 1 TMD_C_JMD_C-Seg...,11)-LIFS790102 Conformation β-strand β-strand Conformational ...n-Sander, 1979) 0.189000 0.125674 0.125674 0.183876 0.218813 0.000001 0.000039 28,29 4.729200 4.776785 2 TMD_C_JMD_C-Seg...2,3)-CHOP780212 Conformation β-sheet (C-term) β-turn (1st residue) Frequency of th...-Fasman, 1978b) 0.199000 0.065983 -0.065983 0.087814 0.105835 0.000000 0.000016 27,28,29,30,31,32,33 4.106000 5.236574 3 TMD_C_JMD_C-Seg...3,4)-HUTJ700102 Energy Entropy Entropy Absolute entrop...Hutchens, 1970) 0.229000 0.098224 0.098224 0.106865 0.124608 0.000000 0.000001 31,32,33,34,35 3.111200 3.109955 4 TMD_C_JMD_C-Seg...2,3)-AURR980110 Conformation α-helix α-helix (middle) Normalized posi...ora-Rose, 1998) 0.211000 0.077355 0.077355 0.102965 0.107453 0.000000 0.000005 27,28,29,30,31,32,33 3.048800 3.623912 5 TMD_C_JMD_C-Pat...4,8)-JANJ790102 Energy Free energy (unfolding) Transfer free e...(TFE) to inside Transfer free e...y (Janin, 1979) 0.187000 0.144354 -0.144354 0.181777 0.233103 0.000001 0.000049 33,37 2.833600 3.640617 6 TMD_C_JMD_C-Pat...4,8)-KANM800103 Conformation α-helix α-helix Average relativ...sa-Tsong, 1980) 0.176000 0.087846 0.087846 0.140464 0.157561 0.000004 0.000113 24,28 2.704000 4.076269 7 TMD_C_JMD_C-Pat...,10)-LEVM760105 Shape Side chain length Side chain length Radius of gyrat... (Levitt, 1976) 0.149000 0.073526 0.073526 0.133612 0.157088 0.000090 0.000714 31,34,38 2.050800 2.338278 CPP Analysis
The group-level feature impact per residue position can be visualized by providing the
df_featDataFrame:# Plot CPP profile at group-level cpp_plot = aa.CPPPlot() aa.plot_settings() cpp_plot.profile(df_feat=df_feat) plt.tight_layout() plt.show()
You can show the number of features per residue position instead of the feature importance by setting
col_imp=None(default=‘feat_importance’):# Show number of features per position cpp_plot.profile(df_feat=df_feat, col_imp=None) plt.tight_layout() plt.show()
The feature importance displayed is normalized by default, meaning that all values sum up to a total of 100%. You can turn off this normalization by setting
normalize=False(default=‘True’), useful when showing a feature subset. We set a similarylimto keep the results comparable:# Show top 10 features df_top10 = df_feat.head(10) cpp_plot.profile(df_feat=df_top10, normalize=False, ylim=(0, 11)) plt.tight_layout() plt.show()
You can adjust the
figsize(default=(7, 5)),start(default=1) position, andtmd_len(default=20) as follows:# Increase width of figure, start at 11 position and double tmd length cpp_plot.profile(df_feat=df_feat, figsize=(8, 4), start=11, tmd_len=40) plt.tight_layout() plt.show()
CPP Analysis (sample-level)
You can visualize how the general feature importance is translated onto the sequence of a specific sample. To this end, you need to provide the corresponding sequence parameters:
jmd_n_seq,tmd_seq, andjmd_c_seq:# Get sequence parts of first sample jmd_n_seq, tmd_seq, jmd_c_seq = df_seq.loc[0, ["jmd_n", "tmd", "jmd_c"]] args_seq = dict(jmd_n_seq=jmd_n_seq, tmd_seq=tmd_seq, jmd_c_seq=jmd_c_seq) print("Sequence parts of first sample") print(jmd_n_seq, tmd_seq, jmd_c_seq) # Plot CPP profile for first sample cpp_plot.profile(df_feat=df_feat, **args_seq) plt.show()Sequence parts of first sample FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
You can customize the following color parameters:
tmd_color(default=‘mediumspringgreen’),jmd_color(default=‘blue’),tmd_seq_color(default=‘black’), andjmd_seq_color(default=‘white’):# Change default TMD-JMD colors cpp_plot.profile(df_feat=df_feat, **args_seq, tmd_color="orange", jmd_color="white", tmd_seq_color="blue", jmd_seq_color="blue") plt.show()
The fontsize of the sequence is optimized automatically. Set
verbose=Trueto see the optimized size. You can set it manually using theseq_sizeparameter:# Change sequence size manually cpp_plot.profile(df_feat=df_feat, **args_seq, seq_size=8) plt.show()
However, this can lead to lead to non-optimal spacing between the sequence characters. Adjust the font size of the part labels (‘JMD-N’, ‘TMD’, ‘JMD-C’) by setting
fontsize_tmd_jmd, which is by default set to the optimized sequence size:cpp_plot.profile(df_feat=df_feat, **args_seq, fontsize_tmd_jmd=11) plt.show()
The part labels can only be changed globally using
optionsas follows:# Adjust part names globally aa.options["name_jmd_n"] = "Part 1" aa.options["name_tmd"] = "Part 2" aa.options["name_jmd_c"] = "Part 3" cpp_plot.profile(df_feat=df_feat, **args_seq) plt.show()
You can focus on only the Target Middle Domain (TMD) by setting the size of the JMDs to 0. Disable the highlight of the TMD by setting
highlight_tmd_area=False(default=True):# Show only features from TMD and JMD-N cpp_plot = aa.CPPPlot(jmd_n_len=0, jmd_c_len=10) mask = ~df_feat["feature"].str.contains("JMD_N") cpp_plot.profile(df_feat=df_feat[mask], tmd_seq=tmd_seq, jmd_c_seq=jmd_c_seq, highlight_tmd_area=False) plt.show()
Display the xtick positions in addition to the sequence by setting
add_xticks_pos=True(default=False):# Set parts back to default aa.options["name_jmd_n"] = "JMD-N" aa.options["name_tmd"] = "TMD" aa.options["name_jmd_c"] = "JMD-C" cpp_plot = aa.CPPPlot() cpp_plot.profile(df_feat=df_feat, **args_seq, add_xticks_pos=True) plt.show()
Change the transparency of the TMD highlighting area using the
highlight_alpha(default=0.15) parameter:# Change transparency of TMD area cpp_plot.profile(df_feat=df_feat, **args_seq, highlight_alpha=0.75) plt.show()
CPP Analysis
You can represent the scale category for each feature at each residue position as a stacked bar chart. To do this, set
add_legend_cat, which also automatically includes the corresponding legend:# Add scale classification cpp_plot.profile(df_feat=df_feat, add_legend_cat=True) plt.show()
Adjust the legend by the
legend_kwsparameter:# Adjust legend, colors can be changed by 'dict_color' legend_kws = dict(n_cols=1, fontsize=13, fontsize_title=13) cpp_plot.profile(df_feat=df_feat, add_legend_cat=True, legend_kws=legend_kws) plt.show()
Adjust the
bar_width(default=0.75) and the baredge_color(dfault=None) as follows:# Make edges black cpp_plot.profile(df_feat=df_feat, bar_width=0.5, edge_color="black") plt.show()
Show
grid_axis(default=None, disabled) and setylim, as exemplified here:# Adjust ylim cpp_plot.profile(df_feat=df_feat, grid_axis="y", ylim=(0, 15)) plt.show()
Following x-tick parameters can be adjusted:
xtick_size(default=11.0),xtick_width(default=2.0),xtick_length(default=5.0):# Adjust x-ticks cpp_plot.profile(df_feat=df_feat, xtick_size=20, xtick_width=4, xtick_length=10) plt.show()
Or the following y-tick parameters:
ytick_size(adheres to global settings),ytick_width(default=2.0), andytick_length(default=5.0):# Modify y-ticks cpp_plot.profile(df_feat=df_feat, ytick_size=20, ytick_width=4, ytick_length=10) plt.show()
CPP-SHAP analysis
Set
shap_plot=Truefor visualizing the sample-specific feature impact instead of the overall feature importance. To demonstrate this, we create the feature matrix for the DOM_GSEC example dataset (see [Breimann25a]) using theSequenceFeature().feature_matrix()method:# Create feature matrix sf = aa.SequenceFeature() df_parts = sf.get_df_parts(df_seq=df_seq) X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
Next, we must include the feature impact into the
df_featfor all samples using theShapModelmodel:labels = df_seq["label"].to_list() # Fit SHAP explainer to obtain SHAP values sm = aa.ShapModel() sm.fit(X, labels=labels) # Include feature impact for all samples df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True) aa.display_df(df_feat, n_rows=5, n_cols=15)
feature category subcategory scale_name scale_description abs_auc abs_mean_dif mean_dif std_test std_ref p_val_mann_whitney p_val_fdr_bh positions feat_impact_Protein0 feat_impact_Protein1 1 TMD_C_JMD_C-Seg...,11)-LIFS790102 Conformation β-strand β-strand Conformational ...n-Sander, 1979) 0.189000 0.125674 0.125674 0.183876 0.218813 0.000001 0.000039 28,29 2.790000 2.820000 2 TMD_C_JMD_C-Seg...2,3)-CHOP780212 Conformation β-sheet (C-term) β-turn (1st residue) Frequency of th...-Fasman, 1978b) 0.199000 0.065983 -0.065983 0.087814 0.105835 0.000000 0.000016 27,28,29,30,31,32,33 3.300000 3.780000 3 TMD_C_JMD_C-Seg...3,4)-HUTJ700102 Energy Entropy Entropy Absolute entrop...Hutchens, 1970) 0.229000 0.098224 0.098224 0.106865 0.124608 0.000000 0.000001 31,32,33,34,35 1.930000 2.380000 4 TMD_C_JMD_C-Seg...2,3)-AURR980110 Conformation α-helix α-helix (middle) Normalized posi...ora-Rose, 1998) 0.211000 0.077355 0.077355 0.102965 0.107453 0.000000 0.000005 27,28,29,30,31,32,33 3.030000 1.430000 5 TMD_C_JMD_C-Pat...4,8)-JANJ790102 Energy Free energy (unfolding) Transfer free e...(TFE) to inside Transfer free e...y (Janin, 1979) 0.187000 0.144354 -0.144354 0.181777 0.233103 0.000001 0.000049 33,37 1.130000 1.350000 Finally, we can visualize the feature impact for a selected sample by providing the respective column name in
col_impand its sequence parameters together with settingshap_plot=True:# Plot CPP-SHAP profile for selected protein cpp_plot.profile(df_feat=df_feat, shap_plot=True, col_imp="feat_impact_Protein0", tmd_seq=tmd_seq, jmd_n_seq=jmd_n_seq, jmd_c_seq=jmd_c_seq) plt.show()
We recommend adjusting
ylimto ensure that 0 is centered in the middle of the y-axis:# Center y-axis cpp_plot.profile(df_feat=df_feat, shap_plot=True, col_imp="feat_impact_Protein0", tmd_seq=tmd_seq, jmd_n_seq=jmd_n_seq, jmd_c_seq=jmd_c_seq, ylim=(-15, 15)) plt.show()