ShapModel: Explaining with single-residue resolution

To enable explanations of sample-specific predictions at the single-residue level, we’ve developed a wrapper model around SHAP (SHapley Additive exPlanations). This explainable AI framework adopts a game-theoretic approach to elucidate the output of any machine learning model. Our ShapModel model, introduced in [Breimann25a], fits multiple SHAP explainers and integrates their output to provide a robust estimate of feature impacts.

Obtaining Feature Impact

To illustrate, we’ll use an example dataset comprising γ-secretase substrates (n=63) and non-substrates (n=63) from [Breimann25a]. We’ll explain the model’s prediction output for the Alzheimer’s disease-associated amyloid precursor protein (APP):

import aaanalysis as aa
aa.options["verbose"] = False
aa.options["random_state"] = 42

# Load dataset and respective features
df_seq = aa.load_dataset(name="DOM_GSEC")
labels = list(df_seq["label"])
df_feat = aa.load_features(name="DOM_GSEC")

# Show APP
aa.display_df(df=df_seq, n_rows=10, char_limit=25)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 P05067 MLPGLALLLLAA...NPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
2 P14925 MAGRARSGLLLL...YSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
3 P70180 MRSLLLFTFSAC...REDSIRSHFSVA 1 477 499 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER
4 Q03157 MGPTSPAARGQG...ENPTYRFLEERP 1 585 607 APSGTGVSRE ALSGLLIMGAGGGSLIVLSLLLL RKKKPYGTIS
5 Q06481 MAATGTAAAAAT...NPTYKYLEQMQI 1 694 716 LREDFSLSSS ALIGLLVIAVAIATVIVISLVML RKRQYGTISH
6 P35613 MAAALFVLLGFA...DKGKNVRQRNSS 1 323 345 IITLRVRSHL AALWPFLGIVAEVLVLVTIIFIY EKRRKPEDVL
7 P35070 MDRAARCSGASS...PINEDIEETNIA 1 119 141 LFYLRGDRGQ ILVICLIAVMVVFIILVIGVCTC CHPLRKRRKR
8 P09803 MGARCRSFSALL...KLADMYGGGEDD 1 711 733 GIVAAGLQVP AILGILGGILALLILILLLLLFL RRRTVVKEPL
9 P19022 MCRIAGALRTLL...KKLADMYGGGDD 1 724 746 RIVGAGLGTG AIIAILLCIIILLILVLMFVVWM KRRDKERQAK
10 P16070 MDKFWWHAAWGL...RNLQNVDMKIGV 1 650 672 GPIRTPQIPE WLIILASLLALALILAVCIAVNS RRRCGQKKKL

To fit the ShapModel, we fist need to create the feature matrix using the SequenceFeature.feat_matrix() method:

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(df_parts=df_parts, features=df_feat["feature"])

The difference of feature values between a selected protein (e.g., APP) and the reference dataset can be included into the feature DataFrame using the ShapModel.add_sample_mean_dif() method. This will create a new mean_dif_'name' column (e.g., ‘mean_dif_APP’):

se = aa.ShapModel()
se.fit(X, labels=labels)

# Add the feature value difference for the first protein (APP)
df_feat = se.add_sample_mean_dif(X, labels=labels, df_feat=df_feat,
                                 sample_positions=0, names="APP")

To add the sample-specific feature impact, use the ShapModel.add_feat_impact() method. This will create a new feat_impact_'name' column (e.g., ‘feat_impact_APP’):

# Add feature impacts for the first protein (APP)
df_feat = se.add_feat_impact(df_feat=df_feat, sample_positions=0, names="APP")
aa.display_df(df=df_feat, n_rows=5)
  feature category subcategory scale_name scale_description abs_auc abs_mean_dif mean_dif std_test std_ref p_val_mann_whitney p_val_fdr_bh positions feat_importance feat_importance_std mean_dif_APP feat_impact_APP
1 TMD_C_JMD_C-Seg...3,4)-KLEP840101 Energy Charge Charge Net charge (Kle...n et al., 1984) 0.244000 0.103666 0.103666 0.106692 0.110506 0.000000 0.000000 31,32,33,34,35 0.970400 1.438918 0.220635 0.980000
2 TMD_C_JMD_C-Seg...3,4)-FINA910104 Conformation α-helix (C-cap) α-helix termination Helix terminati...n et al., 1991) 0.243000 0.085064 0.085064 0.098774 0.096946 0.000000 0.000000 31,32,33,34,35 0.000000 0.000000 0.193990 1.320000
3 TMD_C_JMD_C-Seg...6,9)-LEVM760105 Shape Side chain length Side chain length Radius of gyrat... (Levitt, 1976) 0.233000 0.137044 0.137044 0.161683 0.176964 0.000000 0.000001 32,33 1.554800 2.109848 0.283275 1.710000
4 TMD_C_JMD_C-Seg...3,4)-HUTJ700102 Energy Entropy Entropy Absolute entrop...Hutchens, 1970) 0.229000 0.098224 0.098224 0.106865 0.124608 0.000000 0.000001 31,32,33,34,35 3.111200 3.109955 0.162838 2.870000
5 TMD_C_JMD_C-Seg...6,9)-RADA880106 ASA/Volume Volume Accessible surface area (ASA) Accessible surf...olfenden, 1988) 0.223000 0.095071 0.095071 0.114758 0.132829 0.000000 0.000002 32,33 0.000000 0.000000 0.189680 1.140000

Visualizing Feature Impact

Features can have either a positive (red) or negative (blue) impact, indicating whether they increase or decrease the model’s prediction output, respectively. Before visualizing impact of features, we need to obtain its sequence parts:

import matplotlib.pyplot as plt

# Get sequences parts for APP
_df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=["tmd", "jmd_c", "jmd_n"])
_args_seq = _df_parts.loc["P05067"].to_dict()   # Accession number of APP
args_seq = {key + "_seq": _args_seq[key] for key in _args_seq}

To explain the feature impact with single-residue resolution, AAanalysis offers the following four types of visualizations: CPP-SHAP ranking plot, CPP-SHAP profile, CPP heatmap, CPP-SHAP heatmap. An overview is provided under Explainable AI Usage Principles

We can first show the feature ranking for a selected protein (‘Protein0’) using the CPPPlot.ranking() method, plotting SHAP analysis results by setting shap_plot=True:

# CPP-SHAP ranking plot
cpp_plot = aa.CPPPlot()
aa.plot_settings(short_ticks=True, weight_bold=False)
cpp_plot.ranking(df_feat=df_feat, shap_plot=True,
                 col_dif="mean_dif_APP", col_imp="feat_impact_APP",
                 name_test="APP")
plt.tight_layout()
plt.show()
../_images/tutorial5a_shap_model_1_output_11_0.png

Show the specific CPP-SHAP Profile for the first Protein using the CPPPlot.profile() method with setting shap_plot=True:

# CPP-SHAP profile
aa.plot_settings(font_scale=0.9)
cpp_plot.profile(df_feat=df_feat, shap_plot=True,
                 col_imp="feat_impact_APP", **args_seq)
plt.title("CPP-SHAP profile for APP")
plt.tight_layout()
plt.show()
../_images/tutorial5a_shap_model_2_output_13_0.png

With the CPPPlot.heatmap() method we can visualize the sample-specific feature value difference and the feature impact per scale subcategory and residue position. Set shap_plot=True and provide the respective column with the mean difference and the feature impact:

# CPP heatmap (sample level)
fs = aa.plot_gcfs()
aa.plot_settings(font_scale=0.65, weight_bold=False)
cpp_plot.heatmap(df_feat=df_feat, shap_plot=True,
                 col_val="mean_dif_APP", **args_seq, name_test="APP")
plt.title("CPP heatmap for APP", fontsize=fs+5, weight="bold")
plt.show()

# CPP-SHAP heatmap (sample level)
cpp_plot.heatmap(df_feat=df_feat, shap_plot=True,
                 col_val="feat_impact_APP", **args_seq)
plt.title("CPP-SHAP heatmap for APP", fontsize=fs+5, weight="bold")
plt.show()
../_images/tutorial5a_shap_model_3_output_15_0.png ../_images/tutorial5a_shap_model_4_output_15_1.png

You can create as well the feature map for individual samples. We first have the convert the feature impact into feature importance by simply taking their absolute values.

# Add feature importance
df_feat.insert(len(df_feat.T), "feat_importance_APP", [abs(x) for x in df_feat["feat_impact_APP"]])

The feature map can be created using the CPPPlot.feature_map() method with providing the respective sequence parts and data columns:

# CPP feature map (sample level)
aa.plot_settings(font_scale=0.65, weight_bold=False)
cpp_plot.feature_map(df_feat=df_feat, col_val="mean_dif_APP", col_imp="feat_importance_APP", **args_seq)
plt.show()
../_images/tutorial5a_shap_model_5_output_19_0.png

Details on the ShapModel class can be found in the ShapModel API. More information on explainable AI and how CPP and SHAP were combined are provided in the Explainable AI Usage Principles section.