aaanalysis.ShapModel.fit

ShapModel.fit(X, labels=None, label_target_class=1, n_rounds=5, is_selected=None, fuzzy_labeling=False, n_background_data=None)

Obtain SHAP values aggregated across prediction models and training rounds.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).

  • label_target_class (int, default=1) – The label of the class for which SHAP values are computed in classification tasks. For binary classification, ‘0’ represents the negative class and ‘1’ the positive class.

  • n_rounds (int, default=5) – The number of rounds (>=1) in which the models are fitted and SHAP values are obtained by the explainer.

  • is_selected (array-like, shape (n_selection_round, n_features)) – 2D boolean array indicating different feature selections, one row per selection round.

  • fuzzy_labeling (bool, default=False) – If True, fuzzy labeling is applied to approximate SHAP values for samples with uncertain or partial class membership (e.g., labels strictly between 0 and 1 in binary classification).

  • n_background_data (None or int, optional) – The number of samples (< ‘n_samples’) in the background dataset used for the KernelExplainer to reduce computation time. The dataset is obtained by k-means clustering. If None, the full dataset ‘X’ is used.

Returns:

The fitted ShapModel model instance.

Return type:

ShapModel

Notes

Fuzzy Labeling

  • Aim: Compute SHAP values for datasets with uncertain or ambiguous labels. Especially useful for explaining newly predicted samples, where the class label is set to the respective prediction probability.

  • Approach: Uses probabilistic labels to represent degrees of membership.

  • Idea: Adjusts label thresholds dynamically in Monte Carlo estimation to better represent label uncertainties.

  • Background: Inspired by fuzzy logic, replacing binary true/false with degrees of truth.
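
The idea can be sketched in a few lines of plain Python. This is a minimal illustration under an assumption, not the aaanalysis implementation: in each Monte Carlo round a threshold is drawn uniformly from [0, 1), and every label above it is treated as positive. The helper name `binarize_fuzzy_labels` and the uniform sampling scheme are hypothetical.

```python
import random


def binarize_fuzzy_labels(labels, threshold):
    """Map probabilistic labels to 0/1 using a single threshold (hypothetical helper)."""
    return [1 if label > threshold else 0 for label in labels]


rng = random.Random(0)            # fixed seed for reproducibility
labels = [0.5, 0, 0, 1, 1, 1]     # first sample has an uncertain label
n_rounds = 2000

# Count how often each sample is treated as positive across rounds
positive_counts = [0] * len(labels)
for _ in range(n_rounds):
    threshold = rng.random()      # assumed sampling scheme: uniform in [0, 1)
    crisp = binarize_fuzzy_labels(labels, threshold)
    positive_counts = [c + b for c, b in zip(positive_counts, crisp)]

positive_fraction = [c / n_rounds for c in positive_counts]
print(positive_fraction)
```

Crisp labels (0 or 1) are unaffected by the threshold, while the sample labeled 0.5 is treated as positive in roughly half of the rounds, so an explanation averaged over rounds reflects both classes.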

See also

  • [Breimann25a] introduces fuzzy labeling to compute Monte Carlo estimates of SHAP values for samples whose class membership is not clearly defined.

Examples

To demonstrate the ShapModel().fit() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

import shap
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)
# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)
  entry sequence label tmd_start tmd_stop jmd_n tmd jmd_c
1 Q14802 MQKVTLGLLVFLAGF...PGETPPLITPGSAQS 0 37 59 NSPFYYDWHS LQVGGLICAGVLCAMGIIIVMSA KCKCKFGQKS
2 Q86UE4 MAARSWQDELAQQAE...SPKQIKKKKKARRET 0 50 72 LGLEPKRYPG WVILVGTGALGLLLLFLLGYGWA AACAGARKKR
3 Q969W9 MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL 0 41 63 FQSMEITELE FVQIIIIVVVMMVMVVVITCLLS HYKLSARSFI
4 P05067 MLPGLALLLLAAWTA...GYENPTYKFFEQMQN 1 701 723 FAEDVGSNKG AIIGLMVGGVVIATVIVITLVML KKKQYTSIHH
5 P14925 MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS 1 868 890 KLSTEPGSGV SVVLITTLLVIPVLVLLAIVMFI RWKKSRAFGD
6 P70180 MRSLLLFTFSACVLL...RELREDSIRSHFSVA 1 477 499 PCKSSGGLEE SAVTGIVVGALLGAGLLMAFYFF RKKYRITIER

We can now create a ShapModel object and fit it to obtain the SHAP values and the expected value using the shap_values and exp_value (expected/base value) attributes:

sm = aa.ShapModel()
sm.fit(X, labels=labels)

shap_values = sm.shap_values
exp_value = sm.exp_value

# Print SHAP values and expected value
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))

print("\nThe expected value approximates the expected model output (average prediction score).")
print("For a binary classification with balanced datasets, it is around 0.5:")
print(exp_value)
SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.11 -0.1  -0.09 -0.08 -0.08]
 [-0.13 -0.12 -0.09 -0.09 -0.07]
 [-0.15 -0.14 -0.04 -0.08 -0.03]
 [ 0.14  0.13  0.06  0.08  0.03]
 [ 0.13  0.12  0.08  0.09  0.07]
 [ 0.13  0.12  0.08  0.09  0.06]]

The expected value approximates the expected model output (average prediction score).
For a binary classification with balanced datasets, it is around 0.5:
0.4988333333333335
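
SHAP values are additive: for each sample, the expected value plus the sum of its SHAP values approximates the model output. The sketch below applies this to the rounded numbers printed above. It assumes the property carries over to values averaged across models and rounds, so the reconstructed scores are only approximate.

```python
# Rounded values copied from the output above
shap_values = [
    [-0.11, -0.10, -0.09, -0.08, -0.08],
    [-0.13, -0.12, -0.09, -0.09, -0.07],
    [-0.15, -0.14, -0.04, -0.08, -0.03],
    [0.14, 0.13, 0.06, 0.08, 0.03],
    [0.13, 0.12, 0.08, 0.09, 0.07],
    [0.13, 0.12, 0.08, 0.09, 0.06],
]
exp_value = 0.4988

# Additivity: score ~ expected value + sum of per-feature contributions
scores = [exp_value + sum(row) for row in shap_values]
for i, score in enumerate(scores, start=1):
    print(f"sample {i}: {score:.2f}")
```

The three negative samples land near 0 and the three positive samples near 1; small deviations come from rounding the SHAP values to two decimals.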

SHAP values are computed with respect to the classification class, which can be adjusted using the label_target_class parameter (default=1, standing for the positive class):

sm = aa.ShapModel()
# Reverse sign of SHAP values by setting class to 0
sm.fit(X, labels=labels, label_target_class=0)

shap_values = sm.shap_values
exp_value = sm.exp_value

print("Reverse sign of SHAP values by changing reference class from 1 to 0")
print(shap_values.round(2))
print("\nBase value stays around 0.5:")
print(exp_value)
Reverse sign of SHAP values by changing reference class from 1 to 0
[[ 0.11  0.09  0.08  0.1   0.07]
 [ 0.12  0.12  0.08  0.09  0.08]
 [ 0.15  0.13  0.03  0.09  0.02]
 [-0.13 -0.12 -0.05 -0.09 -0.05]
 [-0.12 -0.12 -0.08 -0.09 -0.08]
 [-0.13 -0.12 -0.08 -0.09 -0.07]]

Base value stays around 0.5:
0.5026666666666669
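
For a binary problem, switching the target class should approximately mirror the explanation. The following quick check compares the two rounded outputs shown above (values copied by hand); remaining differences reflect separate stochastic fits and rounding.

```python
# SHAP values for label_target_class=1 (copied from the first example)
shap_class1 = [
    [-0.11, -0.10, -0.09, -0.08, -0.08],
    [-0.13, -0.12, -0.09, -0.09, -0.07],
    [-0.15, -0.14, -0.04, -0.08, -0.03],
    [0.14, 0.13, 0.06, 0.08, 0.03],
    [0.13, 0.12, 0.08, 0.09, 0.07],
    [0.13, 0.12, 0.08, 0.09, 0.06],
]
# SHAP values for label_target_class=0 (copied from the output above)
shap_class0 = [
    [0.11, 0.09, 0.08, 0.10, 0.07],
    [0.12, 0.12, 0.08, 0.09, 0.08],
    [0.15, 0.13, 0.03, 0.09, 0.02],
    [-0.13, -0.12, -0.05, -0.09, -0.05],
    [-0.12, -0.12, -0.08, -0.09, -0.08],
    [-0.13, -0.12, -0.08, -0.09, -0.07],
]

# If the explanations mirror each other, element-wise sums are close to zero
max_abs_sum = max(
    abs(a + b)
    for row1, row0 in zip(shap_class1, shap_class0)
    for a, b in zip(row1, row0)
)
print(f"largest |class1 + class0| entry: {max_abs_sum:.2f}")
```

Every entry changes sign, and no pair of entries deviates from a perfect mirror by more than about 0.02.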

To obtain Monte Carlo estimates of both, the ShapModel().fit() method performs 5 rounds of model fitting by default and averages shap_values and exp_value across all rounds. The number of rounds can be adjusted using the n_rounds parameter (default=5):

sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, n_rounds=10)
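
Averaging across rounds amounts to an element-wise mean over the per-round SHAP matrices. The following is a minimal sketch with made-up per-round values; the helper `average_rounds` is hypothetical and not part of aaanalysis.

```python
def average_rounds(matrices):
    """Element-wise mean of equally shaped 2D lists, one per fitting round."""
    n_rounds = len(matrices)
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / n_rounds for j in range(n_cols)]
        for i in range(n_rows)
    ]


# Made-up SHAP matrices from three rounds (2 samples x 2 features)
rounds = [
    [[0.10, -0.20], [0.30, 0.00]],
    [[0.20, -0.10], [0.10, 0.20]],
    [[0.30, -0.30], [0.20, 0.10]],
]
mean_shap = average_rounds(rounds)
print([[round(v, 2) for v in row] for row in mean_shap])
```

More rounds reduce the run-to-run variation of the averaged estimates, at the cost of proportionally more model fits.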

Pre-selection of features can be provided using the is_selected parameter:

# Create pre-selection arrays (top 2 and top 4 features will be selected)
is_selected = [[1, 1, 0, 0, 0],
               [1, 1, 1, 1, 0]]
sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, is_selected=is_selected)

print("Impact of feature pre-selection")
print(sm.shap_values.round(2))
Impact of feature pre-selection
[[-0.18 -0.17 -0.05 -0.05  0.  ]
 [-0.19 -0.19 -0.05 -0.05  0.  ]
 [-0.2  -0.2  -0.02 -0.05  0.  ]
 [ 0.2   0.2   0.03  0.05  0.  ]
 [ 0.2   0.2   0.05  0.05  0.  ]
 [ 0.2   0.2   0.05  0.05  0.  ]]
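
Selection arrays like the one above can be generated rather than typed by hand. The sketch below assumes the feature columns are already sorted by importance, so that "top k" means the first k columns; `top_k_masks` is a hypothetical helper, not an aaanalysis function.

```python
def top_k_masks(n_features, ks):
    """Return one boolean row per k, selecting the first k of n_features columns."""
    return [[j < k for j in range(n_features)] for k in ks]


is_selected = top_k_masks(n_features=5, ks=[2, 4])
for row in is_selected:
    print([int(flag) for flag in row])

# Features that no selection round includes
never_selected = [j for j in range(5) if not any(row[j] for row in is_selected)]
print("never selected:", never_selected)
```

The fifth feature is not part of any selection, which matches its SHAP value of 0 in the output above.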

Obtain a reliable SHAP value estimate for a fuzzy-labeled sample (0 < label < 1) by setting fuzzy_labeling=True:

# Create fuzzy label
labels[0] = 0.5
sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, is_selected=is_selected, fuzzy_labeling=True)

print("First sample is labeled as 0.5 between negative (0) and positive (1)")
print(sm.shap_values.round(2))
First sample is labeled as 0.5 between negative (0) and positive (1)
[[ 0.04  0.03 -0.03 -0.03  0.  ]
 [-0.24 -0.26 -0.04 -0.04  0.  ]
 [-0.21 -0.24 -0.02 -0.03  0.  ]
 [ 0.15  0.16  0.02  0.04  0.  ]
 [ 0.14  0.16  0.04  0.04  0.  ]
 [ 0.14  0.16  0.04  0.04  0.  ]]

If the model-agnostic KernelExplainer is used, a subset of the given dataset can serve as background data. It is obtained by internal clustering, selecting one representative sample per cluster. The number of samples is set by n_background_data (default=None, i.e., disabled):

from sklearn.svm import SVC

# Use KernelExplainer to obtain SHAP values for any prediction model
se = aa.ShapModel(explainer_class=shap.KernelExplainer, list_model_classes=[SVC])