aaanalysis.ShapModel.add_feat_impact

ShapModel.add_feat_impact(df_feat=None, drop=False, sample_positions=None, names=None, normalize=True, group_average=False, shap_feat_importance=False)

Compute SHAP feature impact (or importance) from SHAP values and add to the feature DataFrame.

Three different scenarios for computing the feature impact are possible:

  1. Single sample: Computes the feature impact for a selected sample.

  2. Multiple samples: Computes the feature impact for multiple samples (all by default).

  3. Group of samples: Computes the average feature impact for a group of samples (+ standard deviation).

The calculated feature impacts are added to df_feat as new columns named feat_impact_'name(s)', one per sample or group. Additionally, the SHAP value-based feature importance can be included as a feat_importance column.

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • drop (bool, default=False) – If True, allow dropping of already existing feature impact and feature importance columns from df_feat before inserting.

  • sample_positions (int, list of int, or None) – Position index/indices for the sample(s) in shap_values. If None, the impact for each sample will be returned.

  • names (str or list of str, optional) –

    Unique name(s) used for the feature impact columns. When provided, they should align with sample_positions as follows:

    • Single sample: names as a string and sample_positions as an integer.

    • Multiple samples: names as a list of strings and sample_positions as the corresponding list of integers.

    • Group: names as a string and sample_positions as a list of integers, each indicating a sample of the group.

    If sample_positions is None (all samples are considered), names must be a list with one name per sample.

  • normalize (bool, default=True) – If True, normalize the feature impact to percentage.

  • group_average (bool, default=False) – If True, compute the average of samples given by sample_positions.

  • shap_feat_importance (bool, default=False) – If True, add the feature importance (i.e., absolute average SHAP values) to df_feat instead of the feature impact.
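The alignment rules between names and sample_positions described above can be sketched as a small helper. This is an illustration only: resolve_names and its n_samples argument are hypothetical and not part of aaanalysis. The default names ("Protein0", "Protein1", ... and "Group") are taken from the example outputs on this page.

```python
def resolve_names(sample_positions, names=None, group_average=False, n_samples=None):
    """Return column name suffixes for the requested samples (hypothetical helper)."""
    # Normalize sample_positions to a list of integers
    if sample_positions is None:
        if n_samples is None:
            raise ValueError("n_samples is required when sample_positions is None")
        positions = list(range(n_samples))
    elif isinstance(sample_positions, int):
        positions = [sample_positions]
    else:
        positions = list(sample_positions)

    if group_average:
        # Group: one name covers all positions
        if names is None:
            return ["Group"]
        if not isinstance(names, str):
            raise ValueError("For a group, 'names' should be a single string")
        return [names]

    if names is None:
        # Default names as seen in the example output (Protein0, Protein1, ...)
        return [f"Protein{i}" for i in positions]
    if isinstance(names, str):
        names = [names]
    if len(names) != len(positions):
        raise ValueError("'names' should have one entry per sample position")
    if len(set(names)) != len(names):
        raise ValueError("'names' should be unique")
    return list(names)


# Column names for two explicitly named samples
columns = [f"feat_impact_{n}" for n in resolve_names([0, 1], ["Sample 1", "Sample 2"])]
print(columns)
```

The helper only mirrors the documented conventions; the library's own validation may differ in detail.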

Returns:

df_feat – Feature DataFrame including feature impact. If the feature impact is computed for multiple samples, n=number of samples; n=1, otherwise.

Return type:

pd.DataFrame, shape (n_features, n_feature_info+n)

Notes

Feature impact (sample-level): The feature impact quantifies the positive or negative contribution of a feature to increase or decrease the model output for a specific sample (typically, prediction score). For each sample, the impact of an individual feature is represented by its corresponding SHAP value. These values are normalized such that the sum of their absolute values equals 100%.
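The normalization described above can be sketched with NumPy. The array holds the rounded SHAP values printed in the Examples section of this page; normalize_impact is a hypothetical helper, not the library's implementation. Because the inputs are rounded, the results differ slightly from the values the method reports.

```python
import numpy as np

# Rounded SHAP values from the Examples section: rows = samples, columns = features
shap_values = np.array([
    [-0.11, -0.10, -0.07, -0.10, -0.07],
    [-0.13, -0.12, -0.07, -0.09, -0.08],
    [-0.14, -0.13, -0.03, -0.09, -0.02],
    [0.13, 0.13, 0.04, 0.09, 0.04],
    [0.13, 0.13, 0.08, 0.10, 0.07],
    [0.13, 0.13, 0.08, 0.09, 0.06],
])


def normalize_impact(values):
    """Scale each sample's SHAP values so that their absolute values sum to 100."""
    abs_sum = np.abs(values).sum(axis=1, keepdims=True)
    return values / abs_sum * 100


feat_impact = normalize_impact(shap_values)
print(feat_impact.round(2))
```

The sign of each SHAP value is preserved, so a negative impact still indicates a contribution that lowers the model output.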

Feature impact (group-level): The feature impact calculated for individual samples can be averaged to determine the feature impact for a group. This reflects how features influence the model’s output on average within that group.

Feature importance (SHAP value-based): The average of the absolute feature impact across all samples is termed SHAP value-based ‘feature importance’. This quantifies the overall contribution of each feature across the entire dataset.
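A minimal NumPy sketch of the group-level impact and the SHAP value-based importance, using the rounded SHAP values printed in the Examples section. The order of averaging and normalizing, and the standard deviation settings, are assumptions made for illustration, so the numbers are not expected to reproduce the method's output exactly.

```python
import numpy as np

# Rounded SHAP values from the Examples section: rows = samples, columns = features
shap_values = np.array([
    [-0.11, -0.10, -0.07, -0.10, -0.07],
    [-0.13, -0.12, -0.07, -0.09, -0.08],
    [-0.14, -0.13, -0.03, -0.09, -0.02],
    [0.13, 0.13, 0.04, 0.09, 0.04],
    [0.13, 0.13, 0.08, 0.10, 0.07],
    [0.13, 0.13, 0.08, 0.09, 0.06],
])

# Per-sample feature impact: absolute values of each row sum to 100
feat_impact = shap_values / np.abs(shap_values).sum(axis=1, keepdims=True) * 100

# Group-level impact (assumption: average the per-sample impacts of the group)
group = [0, 1]
group_mean = feat_impact[group].mean(axis=0)
group_std = feat_impact[group].std(axis=0)  # population std (ddof=0) is an assumption

# Importance (assumption: mean absolute SHAP value per feature, rescaled to sum to 100)
mean_abs = np.abs(shap_values).mean(axis=0)
feat_importance = mean_abs / mean_abs.sum() * 100

print(group_mean.round(2), group_std.round(2), feat_importance.round(2))
```

Unlike the feature impact, the importance is unsigned: it ranks features by the magnitude of their contribution, not by its direction.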

Warning

  • If group_average=True, a warning is issued when the standard deviation of a feature’s impact significantly exceeds its mean impact; this may indicate an unreliable grouping.
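The kind of check behind this warning can be sketched as follows. The documentation does not state what "significantly exceeds" means, so the ratio threshold and the helper name warn_if_unreliable are hypothetical and chosen only for illustration.

```python
import warnings

import numpy as np


def warn_if_unreliable(group_mean, group_std, ratio=2.0):
    """Warn for features whose std exceeds `ratio` times the absolute mean impact.

    `ratio` is a hypothetical threshold, not a value taken from the library.
    """
    group_mean = np.asarray(group_mean, dtype=float)
    group_std = np.asarray(group_std, dtype=float)
    flagged = np.flatnonzero(group_std > ratio * np.abs(group_mean))
    if flagged.size > 0:
        warnings.warn(
            f"Std exceeds {ratio}x |mean| for feature indices {flagged.tolist()}; "
            "the grouping may be unreliable.",
            UserWarning,
        )
    return flagged.tolist()


# The second feature has a mean near zero but a large spread within the group
flagged = warn_if_unreliable(group_mean=[-25.0, 0.5, 15.0], group_std=[2.0, 12.0, 1.0])
print(flagged)
```

A large spread relative to the mean typically arises when a feature pushes the prediction in opposite directions for different members of the group.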

Examples

To demonstrate the ShapModel().add_feat_impact() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)
  entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
1 Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
2 Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
3 Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
4 P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
5 P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
6 P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

We can now create a ShapModel object and fit it to create the shap_values, which are saved internally:

sm = aa.ShapModel()
sm.fit(X, labels=labels)

shap_values = sm.shap_values

# Print SHAP values
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))
SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.11 -0.1  -0.07 -0.1  -0.07]
 [-0.13 -0.12 -0.07 -0.09 -0.08]
 [-0.14 -0.13 -0.03 -0.09 -0.02]
 [ 0.13  0.13  0.04  0.09  0.04]
 [ 0.13  0.13  0.08  0.1   0.07]
 [ 0.13  0.13  0.08  0.09  0.06]]

We can now include the feature impact (i.e., SHAP values normalized such that their absolute values sum up to 100%) by providing df_feat to the ShapModel().add_feat_impact() method:

# Add feature impact of each sample (Protein0 to Protein5)
df_feat = sm.add_feat_impact(df_feat=df_feat)
aa.display_df(df_feat)
  feature category subcategory scale_name scale_description abs_auc abs_mean_dif mean_dif std_test std_ref p_val_mann_whitney p_val_fdr_bh positions feat_importance feat_importance_std feat_impact_Protein0 feat_impact_Protein1 feat_impact_Protein2 feat_impact_Protein3 feat_impact_Protein4 feat_impact_Protein5
1 TMD_C_JMD_C-Seg...3,4)-KLEP840101 Energy Charge Charge Net charge (Kle...n et al., 1984) 0.244000 0.103666 0.103666 0.106692 0.110506 0.000000 0.000000 31,32,33,34,35 0.970400 1.438918 -24.200000 -26.170000 -33.820000 30.300000 25.460000 26.020000
2 TMD_C_JMD_C-Seg...3,4)-FINA910104 Conformation α-helix (C-cap) α-helix termination Helix terminati...n et al., 1991) 0.243000 0.085064 0.085064 0.098774 0.096946 0.000000 0.000000 31,32,33,34,35 0.000000 0.000000 -22.480000 -23.820000 -31.800000 30.130000 25.180000 25.680000
3 TMD_C_JMD_C-Seg...6,9)-LEVM760105 Shape Side chain length Side chain length Radius of gyrat... (Levitt, 1976) 0.233000 0.137044 0.137044 0.161683 0.176964 0.000000 0.000001 32,33 1.554800 2.109848 -16.030000 -15.130000 -7.370000 8.980000 15.660000 15.900000
4 TMD_C_JMD_C-Seg...3,4)-HUTJ700102 Energy Entropy Entropy Absolute entrop...Hutchens, 1970) 0.229000 0.098224 0.098224 0.106865 0.124608 0.000000 0.000001 31,32,33,34,35 3.111200 3.109955 -21.450000 -19.400000 -22.280000 21.480000 19.200000 19.260000
5 TMD_C_JMD_C-Seg...6,9)-RADA880106 ASA/Volume Volume Accessible surface area (ASA) Accessible surf...olfenden, 1988) 0.223000 0.095071 0.095071 0.114758 0.132829 0.000000 0.000002 32,33 0.000000 0.000000 -15.850000 -15.480000 -4.730000 9.120000 14.500000 13.140000
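As a quick sanity check of the normalization, the absolute feature impacts of each sample should sum to roughly 100. The sketch below uses pandas on the values copied from the displayed output (rounded to two decimals), so small deviations from 100 come from rounding.

```python
import pandas as pd

# Feature impact values copied from the displayed output above
df_impact = pd.DataFrame({
    "feat_impact_Protein0": [-24.20, -22.48, -16.03, -21.45, -15.85],
    "feat_impact_Protein1": [-26.17, -23.82, -15.13, -19.40, -15.48],
    "feat_impact_Protein2": [-33.82, -31.80, -7.37, -22.28, -4.73],
    "feat_impact_Protein3": [30.30, 30.13, 8.98, 21.48, 9.12],
    "feat_impact_Protein4": [25.46, 25.18, 15.66, 19.20, 14.50],
    "feat_impact_Protein5": [26.02, 25.68, 15.90, 19.26, 13.14],
})

# Sum of absolute impacts per sample (one value per column)
abs_sums = df_impact.abs().sum()
print(abs_sums.round(2))
```

All impacts are negative for the three negative samples (Protein0 to Protein2) and positive for the three positive samples (Protein3 to Protein5), matching the signs of the SHAP values.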

To include the impact of a specific sample, use the sample_positions parameter to indicate the position index of the sample within the shap_values attribute (the same as in the labels provided to the ShapModel().fit() method). You need to set drop=True to overwrite the existing feature impact columns:

# First protein
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=0)
aa.display_df(df_feat, n_cols=-1)
  feat_impact_Protein0
1 -24.200000
2 -22.480000
3 -16.030000
4 -21.450000
5 -15.850000

You can provide a specific name for the sample via the names parameter:

# Single sample
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)
  feat_impact_Selected_sample
1 -24.200000
2 -22.480000
3 -16.030000
4 -21.450000
5 -15.850000

Computing feature impact

Three different scenarios are possible:

  1. Single sample: Compute the feature impact for a single sample (above).

  2. Multiple samples: Compute the feature impact for multiple samples (all by default).

  3. Group of samples: Compute the average feature impact and standard deviation for a group.

To focus on specific samples, specify their indices in sample_positions. If names is provided, its length should match the length of sample_positions.

# Multiple samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)
  feat_impact_Sample 1 feat_impact_Sample 2
1 -24.200000 -26.170000
2 -22.480000 -23.820000
3 -16.030000 -15.130000
4 -21.450000 -19.400000
5 -15.850000 -15.480000

To calculate the group average, set group_average=True and specify the sample indices in sample_positions. Provide a name for the group via names, or ‘Group’ will be used by default:

# Group of samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, sample_positions=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-2)
  feat_impact_Group feat_impact_std_Group
1 -25.220000 1.975320
2 -23.180000 1.578870
3 -15.560000 0.162996
4 -20.390000 0.219933
5 -15.660000 0.432344

Setting shap_feat_importance=True will compute the SHAP value-based feature importance:

# SHAP value-based feature importance
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, shap_feat_importance=True)
aa.display_df(df_feat, n_cols=-1)
  feat_importance
1 27.500000
2 26.370000
3 13.360000
4 20.440000
5 12.320000