aaanalysis.ShapModel.fit
- ShapModel.fit(X, labels=None, label_target_class=1, n_rounds=5, is_selected=None, fuzzy_labeling=False, n_background_data=None)
Obtain SHAP values aggregated across prediction models and training rounds.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).
label_target_class (int, default=1) – The label of the class for which SHAP values are computed in classification tasks. For binary classification, ‘0’ represents the negative class and ‘1’ the positive class.
n_rounds (int, default=5) – The number of rounds (>=1) to fit the models and obtain the SHAP values by explainer.
is_selected (array-like, shape (n_selection_round, n_features)) – 2D boolean array in which each row indicates one feature selection.
fuzzy_labeling (bool, default=False) – If True, fuzzy labeling is applied to approximate SHAP values for samples with uncertain/partial memberships (e.g., labels between 0 and 1 in binary classification scenarios).
n_background_data (None or int, optional) – The number of samples (< ‘n_samples’) in the background dataset used for the KernelExplainer to reduce computation time. The dataset is obtained by k-means clustering. If None, the full dataset ‘X’ is used.
- Returns:
The fitted ShapModel instance.
- Return type:
ShapModel
Notes
Fuzzy Labeling
Aim: Compute SHAP values for datasets with uncertain or ambiguous labels. This is especially useful for explaining newly predicted samples, where the class label is set to the respective prediction probability.
Approach: Uses probabilistic labels to represent degrees of membership.
Idea: Adjusts label thresholds dynamically in Monte Carlo estimation to better represent label uncertainties.
Background: Inspired by fuzzy logic, replacing binary true/false with degrees of truth.
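The idea of adjusting label thresholds during Monte Carlo estimation can be illustrated with a minimal sketch. It is not the AAanalysis implementation: the helper binarize_fuzzy_labels is hypothetical and assumes that one random threshold is drawn per round and that each probabilistic label becomes 1 if it exceeds that threshold. Under this assumption, a label of 0.5 is treated as positive in roughly half of the rounds, while labels of 0 and 1 never change.

```python
import numpy as np

def binarize_fuzzy_labels(labels, n_rounds=1000, seed=0):
    """Hypothetical helper: draw one threshold per round and binarize the labels."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels, dtype=float)
    # One uniform threshold in [0, 1) per round
    thresholds = rng.uniform(0.0, 1.0, size=n_rounds)
    # Row r holds the hard (0/1) labels that would be used in round r
    return (labels[None, :] > thresholds[:, None]).astype(int)

# First sample has a fuzzy label of 0.5, the others are clear-cut
hard_labels = binarize_fuzzy_labels([0.5, 0, 0, 1, 1, 1])

# Fraction of rounds in which each sample is treated as positive
print(hard_labels.mean(axis=0).round(2))
```

Averaging the per-round SHAP values would then weight the positive and negative interpretation of a sample according to its degree of membership.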
See also
[Breimann25a] introduces fuzzy labeling to compute Monte Carlo estimates of SHAP values for samples without clearly defined class membership.
Examples
To demonstrate the ShapModel().fit() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

    import shap
    import aaanalysis as aa
    aa.options["verbose"] = False  # Disable verbosity

    df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(5)

    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
    aa.display_df(df_seq)
       entry   sequence                         label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
    1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
    2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
    3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
    4  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
    5  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
    6  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

We can now create a ShapModel object and fit it to obtain the SHAP values and the expected value using the shap_values and exp_value (expected/base value) attributes:

    sm = aa.ShapModel()
    sm.fit(X, labels=labels)
    shap_values = sm.shap_values
    exp_value = sm.exp_value

    # Print SHAP values and expected value
    print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
    print(shap_values.round(2))
    print("\nThe expected value approximates the expected model output (average prediction score).")
    print("For a binary classification with balanced datasets, it is around 0.5:")
    print(exp_value)

    SHAP values explain the feature impact for 3 negative and 3 positive samples
    [[-0.11 -0.1  -0.09 -0.08 -0.08]
     [-0.13 -0.12 -0.09 -0.09 -0.07]
     [-0.15 -0.14 -0.04 -0.08 -0.03]
     [ 0.14  0.13  0.06  0.08  0.03]
     [ 0.13  0.12  0.08  0.09  0.07]
     [ 0.13  0.12  0.08  0.09  0.06]]

    The expected value approximates the expected model output (average prediction score).
    For a binary classification with balanced datasets, it is around 0.5:
    0.4988333333333335
SHAP values are computed with respect to the classification class, which can be adjusted using the label_target_class parameter (default=1, standing for the positive class):

    sm = aa.ShapModel()
    # Reverse sign of SHAP values by setting class to 0
    sm.fit(X, labels=labels, label_target_class=0)
    shap_values = sm.shap_values
    exp_value = sm.exp_value
    print("Reverse sign of SHAP values by changing reference class from 1 to 0")
    print(shap_values.round(2))
    print("\nBase value stays around 0.5:")
    print(exp_value)

    Reverse sign of SHAP values by changing reference class from 1 to 0
    [[ 0.11  0.09  0.08  0.1   0.07]
     [ 0.12  0.12  0.08  0.09  0.08]
     [ 0.15  0.13  0.03  0.09  0.02]
     [-0.13 -0.12 -0.05 -0.09 -0.05]
     [-0.12 -0.12 -0.08 -0.09 -0.08]
     [-0.13 -0.12 -0.08 -0.09 -0.07]]

    Base value stays around 0.5:
    0.5026666666666669
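The sign reversal follows from the two class scores summing to one in binary classification. The sketch below illustrates this with a hypothetical linear scoring function rather than the models fitted by ShapModel. It assumes independent features, in which case the SHAP value of feature i for a linear function is w_i * (x_i - mean(x_i)). The helper linear_shap and all weights and data are made up for illustration.

```python
import numpy as np

def linear_shap(X, weights, bias):
    """SHAP values and base value of f(x) = weights @ x + bias (independent features)."""
    X = np.asarray(X, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Contribution of each feature relative to its mean
    shap_values = weights * (X - X.mean(axis=0))
    # Base value: model output at the feature means
    base_value = float(weights @ X.mean(axis=0) + bias)
    return shap_values, base_value

rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(6, 5))  # toy feature matrix
w, b = np.array([0.3, 0.2, -0.1, 0.1, 0.05]), 0.2

# Score for class 1 and its complement for class 0: f0(x) = 1 - f1(x)
shap_1, base_1 = linear_shap(X_toy, w, b)
shap_0, base_0 = linear_shap(X_toy, -w, 1 - b)

print(np.allclose(shap_0, -shap_1))      # SHAP values flip sign
print(np.isclose(base_0 + base_1, 1.0))  # base values sum to 1
```

In the ShapModel output above, the reversal is only approximate, presumably because each call to fit retrains the models over several rounds.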
To obtain Monte Carlo estimates of both, the ShapModel().fit() method performs 5 rounds of model fitting by default and averages the shap_values and exp_value across all rounds. The number of rounds can be adjusted using the n_rounds parameter (default=5):

    sm = aa.ShapModel()
    sm = sm.fit(X, labels=labels, n_rounds=10)
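The averaging across rounds can be sketched as follows. This is a simplified illustration, not the library's code: average_over_rounds and toy_round are hypothetical, and toy_round merely stands in for one round of model fitting and SHAP computation by returning a fixed signal plus round-specific noise.

```python
import numpy as np

def average_over_rounds(compute_round, n_rounds=5):
    """Average (shap_values, exp_value) pairs returned by a per-round callable."""
    if n_rounds < 1:
        raise ValueError("n_rounds should be >= 1")
    results = [compute_round(i) for i in range(n_rounds)]
    # Element-wise mean over rounds keeps the (n_samples, n_features) shape
    shap_values = np.mean([sv for sv, _ in results], axis=0)
    exp_value = float(np.mean([ev for _, ev in results]))
    return shap_values, exp_value

def toy_round(i, n_samples=6, n_features=5):
    """Stand-in for one model fit: constant signal plus round-specific noise."""
    rng = np.random.default_rng(i)
    signal = np.full((n_samples, n_features), 0.1)
    noisy_shap = signal + rng.normal(0, 0.05, size=signal.shape)
    noisy_exp = 0.5 + rng.normal(0, 0.01)
    return noisy_shap, noisy_exp

shap_values, exp_value = average_over_rounds(toy_round, n_rounds=10)
print(shap_values.shape, round(exp_value, 2))
```

More rounds reduce the noise of the averaged estimate at the cost of proportionally more model fits.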
Pre-selection of features can be provided using the is_selected parameter:

    # Create pre-selection arrays (top 2 and top 4 features will be selected)
    is_selected = [[1, 1, 0, 0, 0], [1, 1, 1, 1, 0]]
    sm = aa.ShapModel()
    sm = sm.fit(X, labels=labels, is_selected=is_selected)
    print("Impact of feature pre-selection")
    print(sm.shap_values.round(2))

    Impact of feature pre-selection
    [[-0.18 -0.17 -0.05 -0.05  0.  ]
     [-0.19 -0.19 -0.05 -0.05  0.  ]
     [-0.2  -0.2  -0.02 -0.05  0.  ]
     [ 0.2   0.2   0.03  0.05  0.  ]
     [ 0.2   0.2   0.05  0.05  0.  ]
     [ 0.2   0.2   0.05  0.05  0.  ]]
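Instead of writing the selection rows by hand, they can be derived from a feature ranking. The sketch below is one possible way to do this. The helper top_k_masks and the importance scores are hypothetical and only chosen so that the result matches the hand-written is_selected array above.

```python
import numpy as np

def top_k_masks(importance, list_k):
    """Hypothetical helper: one boolean row per k, marking the k top-ranked features."""
    importance = np.asarray(importance, dtype=float)
    # Feature indices ordered from most to least important
    order = np.argsort(-importance)
    masks = np.zeros((len(list_k), importance.size), dtype=bool)
    for row, k in enumerate(list_k):
        masks[row, order[:k]] = True
    return masks

# Made-up importance scores, already sorted in descending order
is_selected = top_k_masks(importance=[0.9, 0.8, 0.5, 0.4, 0.1], list_k=[2, 4])
print(is_selected.astype(int))
```

The resulting array has shape (n_selection_round, n_features), matching the shape documented for is_selected.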
Obtain a reliable SHAP value estimation for a fuzzy labeled sample (0 < label < 1) by setting fuzzy_labeling=True:

    # Create fuzzy label
    labels[0] = 0.5
    sm = aa.ShapModel()
    sm = sm.fit(X, labels=labels, is_selected=is_selected, fuzzy_labeling=True)
    print("First sample is labeled as 0.5 between negative (0) and positive (1)")
    print(sm.shap_values.round(2))

    First sample is labeled as 0.5 between negative (0) and positive (1)
    [[ 0.04  0.03 -0.03 -0.03  0.  ]
     [-0.24 -0.26 -0.04 -0.04  0.  ]
     [-0.21 -0.24 -0.02 -0.03  0.  ]
     [ 0.15  0.16  0.02  0.04  0.  ]
     [ 0.14  0.16  0.04  0.04  0.  ]
     [ 0.14  0.16  0.04  0.04  0.  ]]
If the model-agnostic KernelExplainer is used, a subset of the given dataset can be used as background data. This subset is obtained by internal clustering and selecting a representative sample per cluster. The number of samples can be set by n_background_data (default=None, i.e., disabled):

    from sklearn.svm import SVC

    # Use KernelExplainer to obtain SHAP values for any prediction model
    se = aa.ShapModel(explainer_class=shap.KernelExplainer, list_model_classes=[SVC])
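The clustering-based reduction of the background dataset described above can be sketched in plain NumPy. This is an illustration under assumptions, not the internal AAanalysis routine: the helper select_background is hypothetical, uses a basic k-means loop, and picks the real sample closest to each cluster center as its representative.

```python
import numpy as np

def select_background(X, n_background_data, n_iter=20, seed=0):
    """Hypothetical helper: k-means, then one real sample per cluster."""
    X = np.asarray(X, dtype=float)
    if not 0 < n_background_data < len(X):
        raise ValueError("n_background_data should be > 0 and < n_samples")
    rng = np.random.default_rng(seed)
    # Initialize centers with randomly chosen distinct samples
    centers = X[rng.choice(len(X), size=n_background_data, replace=False)]
    for _ in range(n_iter):
        # Assign each sample to its nearest center
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assigned = dist.argmin(axis=1)
        # Move each center to the mean of its samples (skip empty clusters)
        for k in range(n_background_data):
            members = X[assigned == k]
            if len(members) > 0:
                centers[k] = members.mean(axis=0)
    # Representative of a cluster: the sample closest to its final center
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    idx = np.unique(dist.argmin(axis=0))
    return X[idx]

rng = np.random.default_rng(1)
X_toy = rng.normal(size=(30, 4))  # toy feature matrix
X_background = select_background(X_toy, n_background_data=3)
print(X_background.shape)
```

A smaller background dataset reduces the number of model evaluations the KernelExplainer needs, which is where the saving in computation time comes from.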