ShapModel.add_feat_impact
- ShapModel.add_feat_impact(df_feat, drop=False, samples=None, names=None, normalize=True, group_average=False, shap_feat_importance=False, df_seq=None, sample_positions=None)
Compute SHapley Additive exPlanations (SHAP) feature impact (or importance) from SHAP values and add to the feature DataFrame.
Three different scenarios for computing the feature impact are possible:
Single sample: Computes the feature impact for a selected sample.
Multiple samples: Computes the feature impact for multiple samples (all by default).
Group of samples: Computes the average feature impact for a group of samples (+ standard deviation).
The calculated feature impacts are added to df_feat as new columns named feat_impact_'name(s)', corresponding to each sample or group. Additionally, the SHAP value-based feature importance can be included as feat_importance column.
Added in version 0.1.0.
- Parameters:
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
drop (bool, default=False) – If True, allow dropping of already existing feature impact and feature importance columns from df_feat before inserting.
samples (int, list of int, str, list of str, or None) – Sample(s) to compute the feature impact for, given either as row position(s) in shap_values or as entry name(s) (str) from the entry column of df_seq (resolved to the matching row(s)). If None, the impact for each sample will be returned.
names (str or list of str, optional) – Unique name(s) used for the feature impact columns. When provided, they should align with samples as follows:
  Single sample: names as string and samples as integer or entry name.
  Multiple samples: names as list of strings and samples as corresponding list.
  Group: names as string and samples as list, each indicating a group sample.
  If samples is None (all samples are considered), names must be a list with one name per sample. When samples is given as entry name(s) and names is None, the entry name(s) are used.
normalize (bool, default=True) – If True, normalize the feature impact to percentage.
group_average (bool, default=False) – If True, compute the average of samples given by samples.
shap_feat_importance (bool, default=False) – If True, add the feature importance (i.e., absolute average SHAP values) to df_feat instead of the feature impact.
df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to the fitted samples. Required only when samples is given as entry name(s).
sample_positions (int, list of int, str, list of str, or None) – Deprecated alias for samples (removed in 1.2.0).
- Returns:
df_feat – Feature DataFrame including feature impact. If the feature impact is computed for multiple samples, n=number of samples; n=1, otherwise.
- Return type:
pd.DataFrame, shape (n_features, n_feature_info+n)
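The samples parameter accepts row positions or entry names. The following is a minimal sketch of how entry names could be resolved to row positions, assuming df_seq is row-aligned with the fitted samples. resolve_samples is a hypothetical helper written for illustration and is not part of aaanalysis; the entries are those of the DOM_GSEC example in this section.

```python
import pandas as pd


def resolve_samples(samples, df_seq=None):
    """Hypothetical helper: map positions and/or entry names to row positions."""
    items = samples if isinstance(samples, (list, tuple)) else [samples]
    entry_to_pos = {}
    if df_seq is not None:
        # Assumption: the row order of df_seq matches the row order of shap_values
        entry_to_pos = {entry: pos for pos, entry in enumerate(df_seq["entry"])}
    positions = []
    for item in items:
        if isinstance(item, str):
            if item not in entry_to_pos:
                raise ValueError(f"Entry '{item}' not found; pass df_seq with a matching 'entry' column")
            positions.append(entry_to_pos[item])
        else:
            positions.append(int(item))
    return positions


df_seq = pd.DataFrame({"entry": ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]})
print(resolve_samples("P05067", df_seq=df_seq))
print(resolve_samples([0, "Q86UE4"], df_seq=df_seq))
```

Integer positions pass through unchanged, so mixed input is allowed in this sketch; whether the actual method accepts mixed lists is not stated here.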
Notes
Feature impact (sample-level): The feature impact quantifies the positive or negative contribution of a feature to increase or decrease the model output for a specific sample (typically, prediction score). For each sample, the impact of an individual feature is represented by its corresponding SHAP value. These values are normalized such that the sum of their absolute values equals 100%.
Feature impact (group-level): The feature impact calculated for individual samples can be averaged to determine the feature impact for a group. This reflects how features influence the model’s output on average within that group.
Feature importance (SHAP value-based): The average of the absolute feature impact across all samples is termed SHAP value-based ‘feature importance’. This quantifies the overall contribution of each feature across the entire dataset.
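The sample-level normalization can be sketched with NumPy. This is a minimal illustration and not the library's implementation: the array holds rounded SHAP values for one negative and one positive sample from the example further down, so the results differ slightly from the feat_impact columns shown there, which are computed from unrounded values.

```python
import numpy as np

# Rounded SHAP values for two samples (rows) and five features (columns)
shap_values = np.array([
    [-0.10, -0.09, -0.09, -0.09, -0.07],
    [0.13, 0.12, 0.06, 0.09, 0.03],
])

# Scale each row so that its absolute values sum to 100 (signs are kept)
abs_sum = np.abs(shap_values).sum(axis=1, keepdims=True)
feat_impact = shap_values / abs_sum * 100

print(feat_impact.round(2))
print(np.abs(feat_impact).sum(axis=1))
```

Because each row is divided by its own absolute sum, the impact of a feature is comparable across samples as a share of that sample's total explanation.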
Warning
If group_average=True, take care when the standard deviation of a feature’s impact significantly exceeds its mean impact, as this may indicate an unreliable grouping.
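A check of this kind could look as follows. This is a sketch under assumptions: the function name and the factor threshold are hypothetical, since this section does not define when the standard deviation counts as significantly exceeding the mean. The first call uses the group values reported for samples 0 and 1 in the example; the second uses made-up numbers.

```python
import numpy as np


def flag_unreliable(mean_impact, std_impact, factor=1.0):
    """Hypothetical check: True where std exceeds factor * |mean|."""
    mean_impact = np.asarray(mean_impact, dtype=float)
    std_impact = np.asarray(std_impact, dtype=float)
    return std_impact > factor * np.abs(mean_impact)


# Group mean and standard deviation from the example (samples 0 and 1)
group_mean = [-22.96, -21.98, -19.23, -20.23, -15.59]
group_std = [2.414326, 2.132446, 0.709827, 0.305787, 0.265505]
flags_example = flag_unreliable(group_mean, group_std)

# Made-up values: the second feature varies far more than its mean impact
flags_toy = flag_unreliable([30.0, 1.5], [4.0, 12.0])
print(flags_example, flags_toy)
```

None of the example features is flagged, which fits a group of two samples from the same class.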
See also
ShapModel.add_sample_mean_dif(): for computing the raw feature value difference between samples and a reference group.
Examples
To demonstrate the ShapModel().add_feat_impact() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

    import aaanalysis as aa
    aa.options["verbose"] = False # Disable verbosity
    df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(5)
    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
    aa.display_df(df_seq)

       entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
    1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
    2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
    3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
    4  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
    5  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
    6  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

We can now create a ShapModel object and fit it to create the shap_values, which are saved internally:

    sm = aa.ShapModel()
    sm.fit(X, labels=labels)
    shap_values = sm.shap_values
    # Print SHAP values
    print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
    print(shap_values.round(2))

    SHAP values explain the feature impact for 3 negative and 3 positive samples
    [[-0.1  -0.09 -0.09 -0.09 -0.07]
     [-0.12 -0.11 -0.09 -0.1  -0.08]
     [-0.14 -0.14 -0.03 -0.09 -0.03]
     [ 0.13  0.12  0.06  0.09  0.03]
     [ 0.12  0.11  0.08  0.09  0.08]
     [ 0.12  0.11  0.08  0.1   0.07]]

We can now include the feature impact (i.e., SHAP values normalized such that their absolute values sum up to 100%) by providing df_feat to the ShapModel().add_feat_impact() method:

    # Add feature impact of each sample (Protein0 to Protein5)
    df_feat = sm.add_feat_impact(df_feat=df_feat)
    aa.display_df(df_feat)

       feature                            category      subcategory        scale_name                     scale_description
    1  TMD_C_JMD_C-Seg...3,4)-KLEP840101  Energy        Charge             Charge                         Net charge (Kle...n et al., 1984)
    2  TMD_C_JMD_C-Seg...3,4)-FINA910104  Conformation  α-helix (C-cap)    α-helix termination            Helix terminati...n et al., 1991)
    3  TMD_C_JMD_C-Seg...6,9)-LEVM760105  Shape         Side chain length  Side chain length              Radius of gyrat... (Levitt, 1976)
    4  TMD_C_JMD_C-Seg...3,4)-HUTJ700102  Energy        Entropy            Entropy                        Absolute entrop...Hutchens, 1970)
    5  TMD_C_JMD_C-Seg...6,9)-RADA880106  ASA/Volume    Volume             Accessible surface area (ASA)  Accessible surf...olfenden, 1988)

       abs_auc   abs_mean_dif  mean_dif  std_test  std_ref   p_val_mann_whitney  p_val_fdr_bh  positions
    1  0.244000  0.103666      0.103666  0.106692  0.110506  0.000000            0.000000      31,32,33,34,35
    2  0.243000  0.085064      0.085064  0.098774  0.096946  0.000000            0.000000      31,32,33,34,35
    3  0.233000  0.137044      0.137044  0.161683  0.176964  0.000000            0.000001      32,33
    4  0.229000  0.098224      0.098224  0.106865  0.124608  0.000000            0.000001      31,32,33,34,35
    5  0.223000  0.095071      0.095071  0.114758  0.132829  0.000000            0.000002      32,33

       feat_importance  feat_importance_std  feat_impact_Protein0  feat_impact_Protein1  feat_impact_Protein2  feat_impact_Protein3  feat_impact_Protein4  feat_impact_Protein5
    1  0.970400         1.438918             -21.500000            -24.310000            -32.990000            29.290000             24.670000             25.240000
    2  0.000000         0.000000             -20.770000            -23.100000            -31.680000            28.370000             23.240000             23.770000
    3  1.554800         2.109848             -20.860000            -17.740000            -7.870000             12.850000             17.160000             17.480000
    4  3.111200         3.109955             -20.850000            -19.670000            -21.570000            21.510000             19.320000             19.710000
    5  0.000000         0.000000             -16.030000            -15.190000            -5.890000             7.970000              15.610000             13.790000

To include the impact of a specific sample, use the samples parameter to give the position index of the sample within the shap_values attribute (the same position as in the labels provided to the ShapModel().fit() method). You need to set drop=True to overwrite the existing feature impact columns:

    # First protein
    df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=0)
    aa.display_df(df_feat, n_cols=-1)

       feat_impact_Protein0
    1  -21.500000
    2  -20.770000
    3  -20.860000
    4  -20.850000
    5  -16.030000

You can provide a specific name for the corresponding sample via names:

    # Single sample
    df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=0, names="Selected_sample")
    aa.display_df(df_feat, n_cols=-1)

       feat_impact_Selected_sample
    1  -21.500000
    2  -20.770000
    3  -20.860000
    4  -20.850000
    5  -16.030000

Samples can also be selected by accession instead of position. Provide df_seq and pass entry name(s) to samples; they are resolved to the matching row(s), and the impact column is named after the accession unless names is given:

    # Select a sample by its entry/accession via df_seq
    entry = df_seq["entry"].iloc[0]
    df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, df_seq=df_seq, samples=entry)
    aa.display_df(df_feat, n_cols=-1)

       feat_impact_Q14802
    1  -21.500000
    2  -20.770000
    3  -20.860000
    4  -20.850000
    5  -16.030000

Computing feature impact
Three different scenarios are possible:
Single sample: Compute the feature impact for a single sample (above).
Multiple samples: Compute the feature impact for multiple samples (all by default).
Group of samples: Compute the average feature impact and standard deviation for a group.
To focus on specific samples, specify their indices in samples. If names is provided, its length should match samples.

    # Multiple samples
    df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=[0, 1], names=["Sample 1", "Sample 2"])
    aa.display_df(df_feat, n_cols=-2)

       feat_impact_Sample 1  feat_impact_Sample 2
    1  -21.500000            -24.310000
    2  -20.770000            -23.100000
    3  -20.860000            -17.740000
    4  -20.850000            -19.670000
    5  -16.030000            -15.190000

To calculate the group average, set group_average=True and specify the sample indices in samples. Provide a name for the group via names, or ‘Group’ will be used by default:

    # Group of samples
    df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=[0, 1], group_average=True)
    aa.display_df(df_feat, n_cols=-2)

       feat_impact_Group  feat_impact_std_Group
    1  -22.960000         2.414326
    2  -21.980000         2.132446
    3  -19.230000         0.709827
    4  -20.230000         0.305787
    5  -15.590000         0.265505

Setting shap_feat_importance=True computes the SHAP value-based feature importance:

    # SHAP value-based feature importance
    df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, shap_feat_importance=True)
    aa.display_df(df_feat, n_cols=-1)

       feat_importance
    1  26.180000
    2  25.000000
    3  15.830000
    4  20.390000
    5  12.610000
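
The importance values can be approximated from the SHAP values printed earlier in this example. The sketch assumes that the importance is the mean absolute SHAP value per feature, rescaled so that all features sum to 100%. This reading follows the descriptions of shap_feat_importance and normalize but is not taken from the library's source. Because the input is rounded to two decimals, the result only approximates the feat_importance column.

```python
import numpy as np

# Rounded SHAP values from the example (6 samples x 5 features)
shap_values = np.array([
    [-0.10, -0.09, -0.09, -0.09, -0.07],
    [-0.12, -0.11, -0.09, -0.10, -0.08],
    [-0.14, -0.14, -0.03, -0.09, -0.03],
    [0.13, 0.12, 0.06, 0.09, 0.03],
    [0.12, 0.11, 0.08, 0.09, 0.08],
    [0.12, 0.11, 0.08, 0.10, 0.07],
])

# Assumption: importance = mean |SHAP| per feature, normalized to percent
mean_abs = np.abs(shap_values).mean(axis=0)
feat_importance = mean_abs / mean_abs.sum() * 100

# Rank features (1-based, as in the displayed index) from most to least important
ranking = np.argsort(-feat_importance) + 1

print(feat_importance.round(2))
print(ranking)
```

The resulting ranking (features 1, 2, 4, 3, 5) agrees with the order of the feat_importance column above.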