ShapModel.add_feat_impact

ShapModel.add_feat_impact(df_feat, drop=False, samples=None, names=None, normalize=True, group_average=False, shap_feat_importance=False, df_seq=None, sample_positions=None)

Compute SHapley Additive exPlanations (SHAP) feature impact (or importance) from SHAP values and add to the feature DataFrame.

Three different scenarios for computing the feature impact are possible:

Single sample: Computes the feature impact for a selected sample.

Multiple samples: Computes the feature impact for multiple samples (all by default).

Group of samples: Computes the average feature impact for a group of samples (+ standard deviation).

The calculated feature impacts are added to df_feat as new columns named feat_impact_'name(s)', one per sample or group. Additionally, the SHAP value-based feature importance can be included as a feat_importance column.

Added in version 0.1.0.

Parameters:

df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
drop (bool, default=False) – If True, allow dropping of already existing feature impact and feature importance columns from df_feat before inserting.
samples (int, list of int, str, list of str, or None) – Sample(s) to compute the feature impact for, given either as row position(s) in shap_values or as entry name(s) (str) from the entry column of df_seq (resolved to the matching row(s)). If None, the impact for each sample will be returned.
names (str or list of str, optional) –
Unique name(s) used for the feature impact columns. When provided, they should align with samples as follows:
- Single sample: names as a string and samples as an integer or entry name.
- Multiple samples: names as a list of strings and samples as the corresponding list.
- Group: names as a string and samples as a list, each element indicating a sample of the group.
If samples is None (all samples are considered), names must be a list with one name per sample. When samples is given as entry name(s) and names is None, the entry name(s) are used.
normalize (bool, default=True) – If True, normalize the feature impact to percentage.
group_average (bool, default=False) – If True, compute the average of samples given by samples.
shap_feat_importance (bool, default=False) – If True, add the SHAP value-based feature importance (i.e., the average absolute SHAP values) to df_feat instead of the feature impact.
df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to the fitted samples. Required only when samples is given as entry name(s).
sample_positions (int, list of int, str, list of str, or None) – Deprecated alias for samples (removed in 1.2.0).

Returns:

df_feat – Feature DataFrame including feature impact. If the feature impact is computed for multiple samples, n=number of samples; n=1, otherwise.

Return type:

pd.DataFrame, shape (n_features, n_feature_info+n)

Notes

Feature impact (sample-level): The feature impact quantifies the positive or negative contribution of a feature to increase or decrease the model output for a specific sample (typically, prediction score). For each sample, the impact of an individual feature is represented by its corresponding SHAP value. These values are normalized such that the sum of their absolute values equals 100%.
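
As a rough illustration of the normalization described above, the following minimal sketch rescales one row of SHAP values so that the absolute values sum to 100 while the signs are kept. The helper normalize_impact is hypothetical and not part of aaanalysis; the input row is copied from the rounded SHAP values printed in the Examples section, so the result only approximates the feat_impact_Protein0 column shown there.

```python
# One row of SHAP values (first sample, rounded to two decimals as printed in the Examples).
shap_row = [-0.10, -0.09, -0.09, -0.09, -0.07]


def normalize_impact(values):
    """Hypothetical helper: scale values so their absolute values sum to 100."""
    total = sum(abs(v) for v in values)
    if total == 0:
        # An all-zero row carries no impact; avoid dividing by zero.
        return [0.0 for _ in values]
    return [100.0 * v / total for v in values]


impact = normalize_impact(shap_row)
print([round(v, 2) for v in impact])
```

The signs are preserved, so a negative impact still indicates that the feature pushes the model output down for that sample.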

Feature impact (group-level): The feature impact calculated for individual samples can be averaged to determine the feature impact for a group. This reflects how features influence the model’s output on average within that group.
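
The averaging described above can be sketched feature by feature. This is a simplified illustration, not the library's implementation: it averages the already normalized per-sample impacts (taken from the feat_impact_Protein0 and feat_impact_Protein1 columns in the Examples) and uses the population standard deviation, which is an assumption. The group values reported in the Examples differ slightly (for instance -22.96 for the first feature), so the library evidently orders the averaging and normalization differently or uses another standard deviation convention.

```python
from statistics import mean, pstdev

# Per-sample feature impacts in percent (two samples, five features).
impact_by_sample = [
    [-21.50, -20.77, -20.86, -20.85, -16.03],  # feat_impact_Protein0
    [-24.31, -23.10, -17.74, -19.67, -15.19],  # feat_impact_Protein1
]

# zip(*rows) transposes the data so that each tuple holds one feature across samples.
group_mean = [mean(col) for col in zip(*impact_by_sample)]
# Population standard deviation is an assumption made for this sketch.
group_std = [pstdev(col) for col in zip(*impact_by_sample)]

print([round(m, 2) for m in group_mean])
print([round(s, 2) for s in group_std])
```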

Feature importance (SHAP value-based): The average of the absolute feature impact across all samples is termed SHAP value-based ‘feature importance’. It quantifies the overall contribution of each feature across the entire dataset.
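
A minimal sketch of this importance measure, assuming it is the mean absolute SHAP value per feature rescaled so that all features sum to 100 (the rescaling is an assumption based on normalize=True). The matrix is the rounded shap_values output from the Examples, so the percentages only approximate the feat_importance column shown there.

```python
# Rounded SHAP values from the Examples: six samples (rows) by five features (columns).
shap_values = [
    [-0.10, -0.09, -0.09, -0.09, -0.07],
    [-0.12, -0.11, -0.09, -0.10, -0.08],
    [-0.14, -0.14, -0.03, -0.09, -0.03],
    [0.13, 0.12, 0.06, 0.09, 0.03],
    [0.12, 0.11, 0.08, 0.09, 0.08],
    [0.12, 0.11, 0.08, 0.10, 0.07],
]

n_samples = len(shap_values)
# Mean absolute SHAP value per feature (column-wise).
mean_abs = [sum(abs(v) for v in col) / n_samples for col in zip(*shap_values)]
# Rescale so that the importance of all features sums to 100 percent.
total = sum(mean_abs)
importance = [100.0 * m / total for m in mean_abs]
# Feature indices ordered from most to least important.
ranking = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)

print([round(v, 2) for v in importance])
print(ranking)
```

Taking absolute values before averaging prevents positive and negative contributions from cancelling out across samples.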

Warning

If group_average=True, a warning is issued when the standard deviation of a feature’s impact significantly exceeds its mean impact, since this may indicate an unreliable grouping.
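
The kind of check behind such a warning could look like the sketch below. The helper warn_if_unreliable and its ratio threshold are hypothetical; the source does not state which criterion the library uses for "significantly exceeds". The first feature uses rounded group values from the Examples, the second is an invented case that triggers the flag.

```python
import warnings


def warn_if_unreliable(group_mean, group_std, ratio=1.0):
    """Hypothetical check: flag features whose std exceeds ratio * |mean|."""
    flagged = [
        i
        for i, (m, s) in enumerate(zip(group_mean, group_std))
        if s > ratio * abs(m)
    ]
    if flagged:
        warnings.warn(
            f"Standard deviation exceeds mean impact for feature indices {flagged}; "
            "the grouping may be unreliable."
        )
    return flagged


# First feature: rounded values from the group example; second feature: invented.
flagged = warn_if_unreliable(group_mean=[-22.96, 1.50], group_std=[2.41, 6.00])
print(flagged)
```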

See also

ShapModel.add_sample_mean_dif(): for computing the raw feature value difference between samples and a reference group.

Examples

To demonstrate the ShapModel().add_feat_impact() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)

	entry	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	0	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	0	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P05067	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	701	723	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
5	P14925	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	868	890	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
6	P70180	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	1	477	499	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER

We can now create a ShapModel object and fit it to create the shap_values, which are saved internally:

sm = aa.ShapModel()
sm.fit(X, labels=labels)

shap_values = sm.shap_values

# Print SHAP values
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))

SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.1  -0.09 -0.09 -0.09 -0.07]
 [-0.12 -0.11 -0.09 -0.1  -0.08]
 [-0.14 -0.14 -0.03 -0.09 -0.03]
 [ 0.13  0.12  0.06  0.09  0.03]
 [ 0.12  0.11  0.08  0.09  0.08]
 [ 0.12  0.11  0.08  0.1   0.07]]

We can now include the feature impact (i.e., SHAP values normalized such that their absolute values sum up to 100%) by providing df_feat to the ShapModel().add_feat_impact() method:

# Add feature impact of each sample (Protein0 to Protein5)
df_feat = sm.add_feat_impact(df_feat=df_feat)
aa.display_df(df_feat)

	feature	category	subcategory	scale_name	scale_description	abs_auc	abs_mean_dif	mean_dif	std_test	std_ref	p_val_fdr_bh	positions	feat_importance	feat_importance_std	feat_impact_Protein0	feat_impact_Protein1	feat_impact_Protein2	feat_impact_Protein3	feat_impact_Protein4	feat_impact_Protein5
1	TMD_C_JMD_C-Seg...3,4)-KLEP840101	Energy	Charge	Charge	Net charge (Kle...n et al., 1984)	0.244000	0.103666	0.103666	0.106692	0.110506	0.000000	31,32,33,34,35	0.970400	1.438918	-21.500000	-24.310000	-32.990000	29.290000	24.670000	25.240000
2	TMD_C_JMD_C-Seg...3,4)-FINA910104	Conformation	α-helix (C-cap)	α-helix termination	Helix terminati...n et al., 1991)	0.243000	0.085064	0.085064	0.098774	0.096946	0.000000	31,32,33,34,35	0.000000	0.000000	-20.770000	-23.100000	-31.680000	28.370000	23.240000	23.770000
3	TMD_C_JMD_C-Seg...6,9)-LEVM760105	Shape	Side chain length	Side chain length	Radius of gyrat... (Levitt, 1976)	0.233000	0.137044	0.137044	0.161683	0.176964	0.000001	32,33	1.554800	2.109848	-20.860000	-17.740000	-7.870000	12.850000	17.160000	17.480000
4	TMD_C_JMD_C-Seg...3,4)-HUTJ700102	Energy	Entropy	Entropy	Absolute entrop...Hutchens, 1970)	0.229000	0.098224	0.098224	0.106865	0.124608	0.000001	31,32,33,34,35	3.111200	3.109955	-20.850000	-19.670000	-21.570000	21.510000	19.320000	19.710000
5	TMD_C_JMD_C-Seg...6,9)-RADA880106	ASA/Volume	Volume	Accessible surface area (ASA)	Accessible surf...olfenden, 1988)	0.223000	0.095071	0.095071	0.114758	0.132829	0.000002	32,33	0.000000	0.000000	-16.030000	-15.190000	-5.890000	7.970000	15.610000	13.790000

To include the impact of a specific sample, pass its row position within the shap_values attribute to the samples parameter (the order is the same as in the labels provided to the ShapModel().fit() method). Set drop=True to overwrite the existing feature impact columns:

# First protein
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=0)
aa.display_df(df_feat, n_cols=-1)

	feat_impact_Protein0
1	-21.500000
2	-20.770000
3	-20.860000
4	-20.850000
5	-16.030000

You can provide a specific name for the corresponding sample via names:

# Single sample
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=0, names="Selected_sample")
aa.display_df(df_feat, n_cols=-1)

	feat_impact_Selected_sample
1	-21.500000
2	-20.770000
3	-20.860000
4	-20.850000
5	-16.030000

Samples can also be selected by accession instead of position. Provide df_seq and pass entry name(s) to samples; they are resolved to the matching row(s), and the impact column is named after the accession unless names is given:

# Select a sample by its entry/accession via df_seq
entry = df_seq["entry"].iloc[0]
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, df_seq=df_seq, samples=entry)
aa.display_df(df_feat, n_cols=-1)

	feat_impact_Q14802
1	-21.500000
2	-20.770000
3	-20.860000
4	-20.850000
5	-16.030000
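
The resolution of entry names to row positions described above can be pictured with the following sketch. The helper resolve_samples is hypothetical and only mimics the documented behavior on a plain list of accessions (the entry column of the df_seq shown earlier); it is not the library's code.

```python
# Accessions in row order, as in the entry column of the example df_seq.
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]


def resolve_samples(samples, entries):
    """Hypothetical helper: map entry names to row positions; keep integers as given."""
    position_of = {entry: i for i, entry in enumerate(entries)}
    single = isinstance(samples, (int, str))
    items = [samples] if single else list(samples)
    positions = []
    for item in items:
        if isinstance(item, str):
            if item not in position_of:
                raise ValueError(f"Unknown entry: {item!r}")
            positions.append(position_of[item])
        else:
            positions.append(int(item))
    # Return a scalar for scalar input and a list for list input.
    return positions[0] if single else positions


print(resolve_samples("Q14802", entries))
print(resolve_samples(["P05067", "P70180"], entries))
```

Because the lookup relies on row order, df_seq has to be row-aligned with the samples used for fitting, as stated in the parameter description.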

Computing feature impact

Three different scenarios are possible:

Single sample: Compute the feature impact for a single sample (above).
Multiple samples: Compute the feature impact for multiple samples (all by default).
Group of samples: Compute the average feature impact and standard deviation for a group.

To focus on specific samples, specify their indices in samples. If names is provided, its length should match samples.

# Multiple samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=[0, 1], names=["Sample 1", "Sample 2"])
aa.display_df(df_feat, n_cols=-2)

	feat_impact_Sample 1	feat_impact_Sample 2
1	-21.500000	-24.310000
2	-20.770000	-23.100000
3	-20.860000	-17.740000
4	-20.850000	-19.670000
5	-16.030000	-15.190000

To calculate the group average, set group_average=True and specify the sample indices in samples. Provide a name for the group via names; otherwise, ‘Group’ is used by default:

# Group of samples
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, samples=[0, 1], group_average=True)
aa.display_df(df_feat, n_cols=-2)

	feat_impact_Group	feat_impact_std_Group
1	-22.960000	2.414326
2	-21.980000	2.132446
3	-19.230000	0.709827
4	-20.230000	0.305787
5	-15.590000	0.265505
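
The column names produced in the three scenarios above follow a simple pattern, summarized in the sketch below. The helper impact_column_names is hypothetical and inferred only from the outputs shown in this section (default names Protein0, Protein1, ... and Group); it handles integer positions only and is not part of aaanalysis.

```python
def impact_column_names(samples, names=None, group_average=False):
    """Hypothetical helper mirroring the column names seen in the example outputs."""
    if group_average:
        # A group yields one mean column and one standard deviation column.
        name = names if names is not None else "Group"
        return [f"feat_impact_{name}", f"feat_impact_std_{name}"]
    sample_list = [samples] if isinstance(samples, int) else list(samples)
    if names is None:
        # Default names as they appear in the outputs: Protein<position>.
        name_list = [f"Protein{i}" for i in sample_list]
    else:
        name_list = [names] if isinstance(names, str) else list(names)
    if len(name_list) != len(sample_list):
        raise ValueError("names must contain one name per sample")
    return [f"feat_impact_{name}" for name in name_list]


print(impact_column_names(0))
print(impact_column_names([0, 1], names=["Sample 1", "Sample 2"]))
print(impact_column_names([0, 1], group_average=True))
```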

Setting shap_feat_importance=True computes the SHAP value-based feature importance instead:

# SHAP value-based feature importance
df_feat = sm.add_feat_impact(df_feat=df_feat, drop=True, shap_feat_importance=True)
aa.display_df(df_feat, n_cols=-1)

	feat_importance
1	26.180000
2	25.000000
3	15.830000
4	20.390000
5	12.610000