ShapModel.fit
- ShapModel.fit(X, labels, label_target_class=1, n_rounds=5, is_selected=None, fuzzy_labeling=False, n_background_data=None, df_seq=None, fuzzy_labels=None)
Obtain SHapley Additive exPlanations (SHAP) values aggregated across prediction models and training rounds.
For each round and feature-selection subset, the method trains each model in list_model_classes and applies the configured SHAP explainer [Lundberg20] to compute per-sample feature attributions. All SHAP values are averaged across rounds, feature selections, and models, then stored in shap_values. Pass the result to ShapModel.add_feat_impact() to attach impact scores to a feature DataFrame.
Added in version 0.1.0.
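The averaging step described above can be sketched with plain NumPy. This is a minimal illustration, not the library's implementation: the helper mock_shap_values is hypothetical and returns random numbers in place of a trained model and explainer, and the sizes (3 rounds, 2 models, 2 feature selections) are chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 6, 5
n_rounds, n_models = 3, 2

# Two feature selections given as boolean masks (top 2 and top 4 features)
is_selected = np.array([[1, 1, 0, 0, 0],
                        [1, 1, 1, 1, 0]], dtype=bool)


def mock_shap_values(mask):
    """Hypothetical stand-in for 'train a model, then run the explainer'.

    Returns random attributions for selected features and zeros elsewhere.
    """
    values = np.zeros((n_samples, n_features))
    values[:, mask] = rng.normal(size=(n_samples, int(mask.sum())))
    return values


# One SHAP matrix per (round, feature selection, model) combination
all_values = np.array([
    mock_shap_values(mask)
    for _ in range(n_rounds)
    for mask in is_selected
    for _ in range(n_models)
])

# Average over all combinations to get one (n_samples, n_features) matrix
shap_values = all_values.mean(axis=0)
print(shap_values.shape)
```

In this sketch, a feature that is never selected (the last column) keeps an averaged attribution of exactly zero, which matches the all-zero last column in the pre-selection example further down.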
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).
label_target_class (int, default=1) – The label of the class for which SHAP values are computed in a classification task. For binary classification, 0 represents the negative class and 1 the positive class.
n_rounds (int, default=5) – The number of rounds (>=1) in which the models are fitted and SHAP values are obtained from the explainer.
is_selected (array-like, shape (n_selection_round, n_features), optional) – 2D boolean array in which each row indicates one feature selection.
fuzzy_labeling (bool, default=False) – If True, fuzzy labeling is applied to approximate SHAP values for samples with uncertain/partial memberships (e.g., labels between 0 and 1 in binary classification).
n_background_data (None or int, optional) – The number of samples (< n_samples) in the background dataset used by the KernelExplainer to reduce computation time. The dataset is obtained by k-means clustering. If None, the full dataset X is used.
df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to X. Required when fuzzy_labels is given, to map entry names to the corresponding rows of X.
fuzzy_labels (dict, optional) – Soft labels keyed by entry, e.g. {'P05067': 0.6}. Each value (in [0, 1]) overrides the label of the matching entry in labels and enables fuzzy labeling. Aligned to X via df_seq, this avoids the manual row-index lookup and array mutation otherwise needed to set a soft label.
- Returns:
The fitted ShapModel instance.
- Return type:
ShapModel
Notes
Fuzzy Labeling
Aim: Compute SHAP values for datasets with uncertain or ambiguous labels. This is especially useful for explaining newly predicted samples, where the class label is set to the respective prediction probability.
Approach: Uses probabilistic labels to represent degrees of membership.
Idea: Adjusts label thresholds dynamically in Monte Carlo estimation to better represent label uncertainties.
Background: Inspired by fuzzy logic, replacing binary true/false with degrees of truth.
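One way to read the threshold idea is sketched below. This is an assumption made for illustration and not necessarily how ShapModel implements it: in each Monte Carlo round a threshold is drawn uniformly from [0, 1) and the soft labels are binarized against it, so a sample labeled 0.6 is treated as positive in roughly 60% of the rounds.

```python
import numpy as np

rng = np.random.default_rng(42)
soft_labels = np.array([0.0, 0.6, 1.0, 0.25])
n_rounds = 10_000

# Draw one threshold per round; uniform thresholds are an assumption of this sketch
thresholds = rng.uniform(0.0, 1.0, size=n_rounds)

# Binarize every soft label against the threshold of each round
hard_labels = (soft_labels[None, :] > thresholds[:, None]).astype(int)

# Fraction of rounds in which each sample was treated as positive
positive_rate = hard_labels.mean(axis=0)
print(positive_rate.round(2))
```

Under this reading, clearly labeled samples (0 or 1) keep their class in every round, while the share of positive rounds for a soft-labeled sample approaches its label value as the number of rounds grows.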
Setting soft labels
There are two equivalent ways to provide soft labels, both enabling fuzzy labeling:
Pass a float labels array directly (e.g. 0.6 in place of a binary 0/1) with fuzzy_labeling=True.
Pass binary (or float) labels together with df_seq and fuzzy_labels keyed by entry. The fuzzy_labels values override the matching entries in labels; fuzzy labeling is enabled automatically. This is the recommended path, as it refers to proteins by accession rather than row index.
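The equivalence of the two paths can be illustrated in plain Python. The helper apply_fuzzy_labels below is hypothetical and only mimics the described override. Its error handling for unknown entries and out-of-range values is an assumption of this sketch, not documented library behavior.

```python
def apply_fuzzy_labels(entries, labels, fuzzy_labels):
    """Return a float copy of labels with soft labels set by entry (hypothetical helper)."""
    unknown = set(fuzzy_labels) - set(entries)
    if unknown:
        raise ValueError(f"Entries not found: {sorted(unknown)}")
    for entry, value in fuzzy_labels.items():
        if not 0 <= value <= 1:
            raise ValueError(f"Soft label for '{entry}' must be in [0, 1], got {value}")
    # Entries without a soft label keep their original label
    return [float(fuzzy_labels.get(entry, label)) for entry, label in zip(entries, labels)]


# Entries and labels as in the DOM_GSEC example (n=3 per class)
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]
labels = [0, 0, 0, 1, 1, 1]

soft = apply_fuzzy_labels(entries, labels, {"P05067": 0.6})
print(soft)
```

The resulting list is what the first path would require the caller to build by hand, after looking up the row index of P05067.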
See also
[Breimann25] introduces fuzzy labeling to compute Monte Carlo estimates of SHAP values for samples with not clearly defined class membership.
Examples
To demonstrate the ShapModel().fit() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

    import shap
    import aaanalysis as aa

    aa.options["verbose"] = False  # Disable verbosity
    df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(5)

    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
    aa.display_df(df_seq)
   entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
4  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
5  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
6  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

We can now create a ShapModel object and fit it to obtain the SHAP values and the expected value using the shap_values and exp_value (expected/base value) attributes:

    sm = aa.ShapModel()
    sm.fit(X, labels=labels)
    shap_values = sm.shap_values
    exp_value = sm.exp_value

    # Print SHAP values and expected value
    print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
    print(shap_values.round(2))
    print("\nThe expected value approximates the expected model output (average prediction score).")
    print("For a binary classification with balanced datasets, it is around 0.5:")
    print(exp_value)
SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.1  -0.1  -0.08 -0.09 -0.06]
 [-0.12 -0.12 -0.09 -0.1  -0.07]
 [-0.13 -0.14 -0.04 -0.09 -0.01]
 [ 0.13  0.13  0.05  0.09  0.04]
 [ 0.13  0.12  0.08  0.09  0.07]
 [ 0.13  0.12  0.08  0.09  0.06]]

The expected value approximates the expected model output (average prediction score).
For a binary classification with balanced datasets, it is around 0.5:
0.49566666666666687
SHAP values are computed with respect to the classification class, which can be adjusted using the label_target_class parameter (default=1, standing for the positive class):

    sm = aa.ShapModel()
    # Reverse sign of SHAP values by setting class to 0
    sm.fit(X, labels=labels, label_target_class=0)
    shap_values = sm.shap_values
    exp_value = sm.exp_value

    print("Reverse sign of SHAP values by changing reference class from 1 to 0")
    print(shap_values.round(2))
    print("\nBase value stays around 0.5:")
    print(exp_value)
Reverse sign of SHAP values by changing reference class from 1 to 0
[[ 0.11  0.1   0.1   0.09  0.06]
 [ 0.13  0.12  0.1   0.08  0.07]
 [ 0.15  0.14  0.04  0.08  0.01]
 [-0.14 -0.13 -0.07 -0.08 -0.04]
 [-0.13 -0.12 -0.09 -0.09 -0.07]
 [-0.13 -0.12 -0.09 -0.08 -0.06]]

Base value stays around 0.5:
0.5036666666666669
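The near-mirrored output can be reproduced in an idealized setting. The sketch below does not use ShapModel: it assumes a linear class-1 score with illustrative weights and independent features, for which SHAP values have the closed form w_i * (x_i - mean(x_i)). The class-0 score is taken as the complement of the class-1 score.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 5))
w = np.array([0.10, 0.08, 0.05, 0.03, 0.01])  # illustrative weights (assumption)
b = 0.5                                        # average class-1 score


def linear_shap(X, weights):
    """Closed-form SHAP values of a linear model, assuming independent features."""
    return weights * (X - X.mean(axis=0))


# Class-1 score and its complement for class 0
f1 = b + (X - X.mean(axis=0)) @ w
f0 = 1.0 - f1

shap_1 = linear_shap(X, w)    # attributions toward class 1
shap_0 = linear_shap(X, -w)   # the class-0 score has negated weights
base_1, base_0 = f1.mean(), f0.mean()

print(np.allclose(shap_0, -shap_1))
print(round(base_0 + base_1, 6))
```

In this idealized case the class-0 attributions are the exact negatives of the class-1 attributions and the two base values sum to 1. The values printed above are only approximately mirrored, presumably because the two calls fit their models independently.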
To obtain Monte Carlo estimates of both, the ShapModel().fit() method performs 5 rounds of model fitting by default and averages shap_values and exp_value across all rounds. The number of rounds can be adjusted using the n_rounds parameter (default=5):

    sm = aa.ShapModel()
    sm = sm.fit(X, labels=labels, n_rounds=10)
Pre-selection of features can be provided using the is_selected parameter:

    # Create pre-selection arrays (top 2 and top 4 features will be selected)
    is_selected = [[1, 1, 0, 0, 0], [1, 1, 1, 1, 0]]
    sm = aa.ShapModel()
    sm = sm.fit(X, labels=labels, is_selected=is_selected)

    print("Impact of feature pre-selection")
    print(sm.shap_values.round(2))
Impact of feature pre-selection
[[-0.16 -0.19 -0.05 -0.05  0.  ]
 [-0.18 -0.21 -0.05 -0.05  0.  ]
 [-0.18 -0.21 -0.02 -0.05  0.  ]
 [ 0.18  0.21  0.03  0.05  0.  ]
 [ 0.18  0.2   0.05  0.05  0.  ]
 [ 0.18  0.21  0.05  0.05  0.  ]]
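For more than a handful of features, writing the selection masks by hand becomes tedious. A minimal NumPy sketch for building "top k" masks is shown below. It assumes that the feature columns are already ordered by importance (best first), and the variable top_k is introduced only for this example.

```python
import numpy as np

n_features = 5
top_k = [2, 4]  # keep the top 2 and the top 4 features

# Row i is True for the first top_k[i] columns
# (assumes columns are sorted by importance, best first)
is_selected = np.arange(n_features) < np.array(top_k)[:, None]
print(is_selected.astype(int))
```

The result has shape (n_selection_round, n_features) and reproduces the two hand-written masks used above.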
Obtain a reliable SHAP value estimate for a fuzzy-labeled sample (0 < label < 1) by setting fuzzy_labeling=True:

    # Create fuzzy label
    labels[0] = 0.5
    sm = aa.ShapModel()
    sm = sm.fit(X, labels=labels, is_selected=is_selected, fuzzy_labeling=True)

    print("First sample is labeled as 0.5 between negative (0) and positive (1)")
    print(sm.shap_values.round(2))
First sample is labeled as 0.5 between negative (0) and positive (1)
[[ 0.04  0.04 -0.03 -0.04  0.  ]
 [-0.24 -0.25 -0.05 -0.05  0.  ]
 [-0.22 -0.22 -0.02 -0.04  0.  ]
 [ 0.15  0.16  0.03  0.04  0.  ]
 [ 0.14  0.16  0.04  0.04  0.  ]
 [ 0.14  0.16  0.04  0.04  0.  ]]
Instead of locating the sample’s row and mutating a label copy by hand, soft labels can be provided by accession using fuzzy_labels together with df_seq. Each value (in [0, 1]) overrides the matching entry in labels and enables fuzzy labeling, aligned to the rows of X via the entry column:

    # Soft label keyed by entry/accession (no manual row-index lookup)
    entry = df_seq["entry"].iloc[0]
    sm = aa.ShapModel()
    sm = sm.fit(X, labels=df_seq["label"].to_list(), df_seq=df_seq,
                is_selected=is_selected, fuzzy_labels={entry: 0.5})

    print(f"Sample '{entry}' is labeled as 0.5 between negative (0) and positive (1)")
    print(sm.shap_values.round(2))
Sample 'Q14802' is labeled as 0.5 between negative (0) and positive (1)
[[ 0.03  0.04 -0.02 -0.04  0.  ]
 [-0.25 -0.24 -0.04 -0.05  0.  ]
 [-0.23 -0.22 -0.01 -0.04  0.  ]
 [ 0.15  0.15  0.02  0.04  0.  ]
 [ 0.15  0.15  0.04  0.05  0.  ]
 [ 0.15  0.15  0.04  0.05  0.  ]]
If the model-agnostic KernelExplainer is used, a background subset of the given dataset can be obtained by internal clustering, with one representative sample selected per cluster. The number of samples can be set by n_background_data (default=None, i.e., disabled):

    from sklearn.svm import SVC

    # Use KernelExplainer to obtain SHAP values for any prediction model
    sm = aa.ShapModel(explainer_class=shap.KernelExplainer, list_model_classes=[SVC])