ShapModel.add_feat_impact

ShapModel.add_feat_impact(df_feat, drop=False, samples=None, names=None, normalize=True, group_average=False, shap_feat_importance=False, df_seq=None, sample_positions=None)

Compute SHapley Additive exPlanations (SHAP) feature impact (or importance) from SHAP values and add to the feature DataFrame.

Three different scenarios for computing the feature impact are possible:

  1. Single sample: Computes the feature impact for a selected sample.

  2. Multiple samples: Computes the feature impact for multiple samples (all by default).

  3. Group of samples: Computes the average feature impact for a group of samples (+ standard deviation).

The calculated feature impacts are added to df_feat as new columns named feat_impact_'name(s)', one per sample or group. Additionally, the SHAP value-based feature importance can be included as a feat_importance column.

Added in version 0.1.0.

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • drop (bool, default=False) – If True, allow dropping of already existing feature impact and feature importance columns from df_feat before inserting.

  • samples (int, list of int, str, list of str, or None) – Sample(s) to compute the feature impact for, given either as row position(s) in shap_values or as entry name(s) (str) from the entry column of df_seq (resolved to the matching row(s)). If None, the impact for each sample will be returned.

  • names (str or list of str, optional) –

    Unique name(s) used for the feature impact columns. When provided, they should align with samples as follows:

    • Single sample: names as a string and samples as an integer or entry name.

    • Multiple samples: names as a list of strings and samples as the corresponding list.

    • Group: names as a string and samples as a list, each entry indicating a sample of the group.

    If samples is None (all samples are considered), names must be a list with one name per sample. When samples is given as entry name(s) and names is None, the entry name(s) are used.

  • normalize (bool, default=True) – If True, normalize the feature impact to percentage.

  • group_average (bool, default=False) – If True, compute the average of samples given by samples.

  • shap_feat_importance (bool, default=False) – If True, add the feature importance (i.e., average absolute SHAP values) to df_feat instead of the feature impact.

  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to the fitted samples. Required only when samples is given as entry name(s).

  • sample_positions (int, list of int, str, list of str, or None) – Deprecated alias for samples (removed in 1.2.0).

Returns:

df_feat – Feature DataFrame including feature impact. If the feature impact is computed for multiple samples, n=number of samples; n=1, otherwise.

Return type:

pd.DataFrame, shape (n_features, n_feature_info+n)

Notes

Feature impact (sample-level): The feature impact quantifies the positive or negative contribution of a feature to increase or decrease the model output for a specific sample (typically, prediction score). For each sample, the impact of an individual feature is represented by its corresponding SHAP value. These values are normalized such that the sum of their absolute values equals 100%.

Feature impact (group-level): The feature impact calculated for individual samples can be averaged to determine the feature impact for a group. This reflects how features influence the model’s output on average within that group.

Feature importance (SHAP value-based): The average of the feature impact across all samples is termed the SHAP value-based ‘feature importance’. This quantifies the overall contribution of each feature across the entire dataset.

Warning

  • If group_average=True, a warning is issued when the standard deviation of a feature’s impact significantly exceeds its mean impact; this may indicate an unreliable grouping.

Examples

To demonstrate the ShapModel().add_feat_impact() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)
   entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
4  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
5  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
6  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

We can now create a ShapModel object and fit it to create the shap_values, which are saved internally:

sm = aa.ShapModel()
sm.fit(X, labels=labels)

shap_values = sm.shap_values

# Print SHAP values
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))
SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.1  -0.09 -0.09 -0.09 -0.07]
 [-0.12 -0.11 -0.09 -0.1  -0.08]
 [-0.14 -0.14 -0.03 -0.09 -0.03]
 [ 0.13  0.12  0.06  0.09  0.03]
 [ 0.12  0.11  0.08  0.09  0.08]
 [ 0.12  0.11  0.08  0.1   0.07]]

We can now include the feature impact (i.e., SHAP values normalized such that their absolute values sum up to 100%) by providing df_feat to the ShapModel().add_feat_impact() method:

# Add feature impact of each sample (Protein0 to Protein5)
df_feat = sm.add_feat_impact(df_feat=df_feat)
aa.display_df(df_feat)
  | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std | feat_impact_Protein0 | feat_impact_Protein1 | feat_impact_Protein2 | feat_impact_Protein3 | feat_impact_Protein4 | feat_impact_Protein5
1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.970400 | 1.438918 | -21.500000 | -24.310000 | -32.990000 | 29.290000 | 24.670000 | 25.240000
2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.000000 | 0.000000 | -20.770000 | -23.100000 | -31.680000 | 28.370000 | 23.240000 | 23.770000
3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 1.554800 | 2.109848 | -20.860000 | -17.740000 | -7.870000 | 12.850000 | 17.160000 | 17.480000
4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 3.111200 | 3.109955 | -20.850000 | -19.670000 | -21.570000 | 21.510000 | 19.320000 | 19.710000
5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 0.000000 | 0.000000 | -16.030000 | -15.190000 | -5.890000 | 7.970000 | 15.610000 | 13.790000
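
The normalization mentioned above (absolute values summing to 100%) can be sketched in plain Python. This is a minimal illustration and not the library's internal implementation: the helper name normalize_impact is hypothetical, and the input is the first row of the rounded SHAP values printed earlier, so the result only approximates the feat_impact_Protein0 column.

```python
def normalize_impact(shap_row):
    """Scale one sample's SHAP values so their absolute values sum to 100."""
    total = sum(abs(v) for v in shap_row)
    if total == 0:
        # No feature contributes: return zeros instead of dividing by zero
        return [0.0 for _ in shap_row]
    return [100 * v / total for v in shap_row]


# First row of the rounded SHAP values printed above (illustrative input)
shap_row = [-0.10, -0.09, -0.09, -0.09, -0.07]
impact = normalize_impact(shap_row)
print([round(v, 2) for v in impact])
```

The sign of each SHAP value is preserved, so a negative impact still marks a feature that pushes the model output down for that sample.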

To include the impact of a specific sample, pass its position index within the shap_values attribute to the samples parameter (the order is the same as in the labels provided to the ShapModel().fit() method). Set drop=True to overwrite the existing feature impact columns:

# First protein
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=0)
aa.display_df(df_feat, n_cols=-1)
  feat_impact_Protein0
1 -21.500000
2 -20.770000
3 -20.860000
4 -20.850000
5 -16.030000

You can provide a specific name for the sample via names:

# Single sample
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)
  feat_impact_Selected_sample
1 -21.500000
2 -20.770000
3 -20.860000
4 -20.850000
5 -16.030000

Samples can also be selected by accession instead of position. Provide df_seq and pass entry name(s) to samples; they are resolved to the matching row(s), and the impact column is named after the accession unless names is given:

# Select a sample by its entry/accession via df_seq
entry = df_seq["entry"].iloc[0]
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, df_seq=df_seq, samples=entry)
aa.display_df(df_feat, n_cols=-1)
  feat_impact_Q14802
1 -21.500000
2 -20.770000
3 -20.860000
4 -20.850000
5 -16.030000
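
How entry names could be mapped to row positions can be sketched as follows. This is an assumption-laden illustration, not the library's code: the helper resolve_samples is hypothetical, and a plain list stands in for the entry column of df_seq.

```python
def resolve_samples(entries, samples):
    """Map entry names to row positions; pass integer positions through."""
    single = isinstance(samples, (int, str))
    items = [samples] if single else list(samples)
    positions = []
    for item in items:
        if isinstance(item, str):
            if item not in entries:
                raise ValueError(f"Entry '{item}' not found in df_seq")
            positions.append(entries.index(item))
        else:
            positions.append(int(item))
    return positions[0] if single else positions


# Entries in the row order of the df_seq shown above
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]
print(resolve_samples(entries, "Q14802"))
print(resolve_samples(entries, ["P05067", 1]))
```

Because the lookup relies on row order, df_seq has to be row-aligned with the samples used for fitting, as stated in the parameter description.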

Computing feature impact

Three different scenarios are possible:

  1. Single sample: Compute the feature impact for a single sample (above).

  2. Multiple samples: Compute the feature impact for multiple samples (all by default).

  3. Group of samples: Compute the average feature impact and standard deviation for a group.

To focus on specific samples, specify their indices in samples. If names is provided, its length should match samples.

# Multiple samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)
  feat_impact_Sample 1 feat_impact_Sample 2
1 -21.500000 -24.310000
2 -20.770000 -23.100000
3 -20.860000 -17.740000
4 -20.850000 -19.670000
5 -16.030000 -15.190000

To calculate the group average, set group_average=True and specify the sample indices in samples. Provide a name for the group via names, or ‘Group’ will be used by default:

# Group of samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-2)
  feat_impact_Group feat_impact_std_Group
1 -22.960000 2.414326
2 -21.980000 2.132446
3 -19.230000 0.709827
4 -20.230000 0.305787
5 -15.590000 0.265505
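
To judge whether a grouping looks reliable (see the Warning note), the two group columns can be compared on the user side. The sketch below is only an illustration: the helper unreliable_features and its factor parameter are hypothetical, and the criterion the library itself uses for "significantly exceeds" is not specified here. The numbers are copied from the output above.

```python
def unreliable_features(means, stds, factor=1.0):
    """Return 0-based indices where std exceeds factor * |mean|."""
    return [i for i, (m, s) in enumerate(zip(means, stds))
            if s > factor * abs(m)]


# feat_impact_Group and feat_impact_std_Group from the output above
means = [-22.96, -21.98, -19.23, -20.23, -15.59]
stds = [2.414326, 2.132446, 0.709827, 0.305787, 0.265505]

flagged = unreliable_features(means, stds)
print(flagged)
```

For this group no feature is flagged with factor=1.0, since every standard deviation is far smaller than the corresponding absolute mean impact.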

Setting shap_feat_importance=True will compute the SHAP value-based feature importance:

# SHAP value-based feature importance
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, shap_feat_importance=True)
aa.display_df(df_feat, n_cols=-1)
  feat_importance
1 26.180000
2 25.000000
3 15.830000
4 20.390000
5 12.610000
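
The idea behind the SHAP value-based feature importance can be sketched in plain Python. This sketch assumes the importance is the mean absolute SHAP value per feature, normalized to sum to 100%; the helper shap_feature_importance is hypothetical and the library's exact computation may differ. The input is the rounded SHAP matrix printed earlier, so the result is only expected to be close to the feat_importance column above.

```python
def shap_feature_importance(values):
    """Mean absolute SHAP value per feature, normalized to sum to 100."""
    n_samples = len(values)
    n_features = len(values[0])
    mean_abs = [sum(abs(row[j]) for row in values) / n_samples
                for j in range(n_features)]
    total = sum(mean_abs)
    return [100 * v / total for v in mean_abs]


# Rounded SHAP values printed above (rows: samples, columns: features)
shap_values = [
    [-0.10, -0.09, -0.09, -0.09, -0.07],
    [-0.12, -0.11, -0.09, -0.10, -0.08],
    [-0.14, -0.14, -0.03, -0.09, -0.03],
    [0.13, 0.12, 0.06, 0.09, 0.03],
    [0.12, 0.11, 0.08, 0.09, 0.08],
    [0.12, 0.11, 0.08, 0.10, 0.07],
]
importance = shap_feature_importance(shap_values)
print([round(v, 2) for v in importance])
```

Taking absolute values before averaging keeps positive and negative contributions from cancelling, so a feature that strongly affects both classes still ranks high.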