ShapModel.fit

ShapModel.fit(X, labels, label_target_class=1, n_rounds=5, is_selected=None, fuzzy_labeling=False, n_background_data=None, df_seq=None, fuzzy_labels=None)

Obtain SHapley Additive exPlanations (SHAP) values aggregated across prediction models and training rounds.

For each round and feature-selection subset, the method trains each model in list_model_classes and applies the configured SHAP explainer [Lundberg20] to compute per-sample feature attributions. All SHAP values are averaged across rounds, feature selections, and models, then stored in shap_values. Pass the result to ShapModel.add_feat_impact() to attach impact scores to a feature DataFrame.
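The aggregation step can be pictured with a small NumPy sketch. It is not the library's implementation: the per-fit arrays are random placeholders, and the helper name aggregate_shap is hypothetical. It only illustrates that every fit yields one (n_samples, n_features) array and that the stored result is their element-wise mean.

```python
import numpy as np

def aggregate_shap(list_shap):
    """Average a list of (n_samples, n_features) SHAP arrays element-wise."""
    return np.mean(np.stack(list_shap, axis=0), axis=0)

rng = np.random.default_rng(0)
n_samples, n_features = 6, 5
n_rounds, n_selections, n_models = 2, 2, 3  # illustrative sizes only

# Placeholder arrays standing in for one explainer output per fit
list_shap = [rng.normal(size=(n_samples, n_features))
             for _ in range(n_rounds * n_selections * n_models)]

shap_values = aggregate_shap(list_shap)
print(shap_values.shape)
```

The averaged array keeps the shape of a single fit, so downstream code does not need to know how many rounds, selections, or models were used.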

Added in version 0.1.0.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • labels (array-like, shape (n_samples,)) – Class labels for samples in X (typically 1=positive, 0=negative). Float values in [0, 1] can be used as soft labels when fuzzy labeling is enabled.

  • label_target_class (int, default=1) – Label of the class for which SHAP values are computed in a classification task. For binary classification, 0 represents the negative class and 1 the positive class.

  • n_rounds (int, default=5) – Number of rounds (>=1) in which the models are fitted and SHAP values are computed by the explainer.

  • is_selected (array-like, shape (n_selection_round, n_features), optional) – 2D boolean array in which each row indicates one feature selection.

  • fuzzy_labeling (bool, default=False) – If True, fuzzy labeling is applied to approximate SHAP values for samples with uncertain or partial class membership (e.g., labels strictly between 0 and 1 in binary classification).

  • n_background_data (None or int, optional) – Number of samples (< n_samples) in the background dataset used by the KernelExplainer to reduce computation time. The background dataset is obtained by k-means clustering. If None, the full dataset X is used.

  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to X. Required when fuzzy_labels is given, to map entry names to the corresponding rows of X.

  • fuzzy_labels (dict, optional) – Soft labels keyed by entry, e.g. {'P05067': 0.6}. Each value (in [0, 1]) overrides the label of the matching entry in labels and enables fuzzy labeling. Aligned to X via df_seq, this avoids the manual row-index lookup and array mutation otherwise needed to set a soft label.

Returns:

The fitted ShapModel instance.

Return type:

ShapModel

Notes

Fuzzy Labeling

  • Aim: Compute SHAP values for datasets with uncertain or ambiguous labels. This is especially useful for explaining newly predicted samples, where the class label is set to the respective prediction probability.

  • Approach: Uses probabilistic labels to represent degrees of membership.

  • Idea: Adjusts label thresholds dynamically in Monte Carlo estimation to better represent label uncertainties.

  • Background: Inspired by fuzzy logic, replacing binary true/false with degrees of truth.
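One plausible reading of the threshold idea above is sketched below. It is an assumption for illustration, not the confirmed procedure of the library or of [Breimann25]: in every Monte Carlo round, each soft label is compared against a freshly drawn random threshold, so a sample labeled 0.6 is treated as positive in roughly 60% of the rounds. The helper name sample_hard_labels is hypothetical.

```python
import numpy as np

def sample_hard_labels(soft_labels, rng):
    """Draw one set of 0/1 labels from soft labels in [0, 1] (illustrative only)."""
    soft = np.asarray(soft_labels, dtype=float)
    is_fuzzy = (soft > 0) & (soft < 1)
    thresholds = rng.random(soft.shape)  # one random threshold per sample
    # Fuzzy samples become positive when the threshold falls below their soft label;
    # crisp samples (exactly 0 or 1) keep their class in every round.
    hard = np.where(is_fuzzy, thresholds < soft, soft >= 0.5)
    return hard.astype(int)

rng = np.random.default_rng(42)
soft_labels = [0.6, 0, 0, 1, 1, 1]
rounds = np.array([sample_hard_labels(soft_labels, rng) for _ in range(2000)])

# Fraction of rounds in which each sample was treated as positive
print(rounds.mean(axis=0).round(2))
```

Averaging SHAP values over rounds drawn this way weights the "positive" and "negative" explanations of a fuzzy sample by its degree of membership.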

Setting soft labels

There are two equivalent ways to provide soft labels, both enabling fuzzy labeling:

  • Pass a float labels array directly (e.g. 0.6 in place of a binary 0/1) with fuzzy_labeling=True.

  • Pass binary (or float) labels together with df_seq and fuzzy_labels keyed by entry. The fuzzy_labels values override the matching entries in labels; fuzzy labeling is enabled automatically. This is the recommended path, as it refers to proteins by accession rather than row index.
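The second path amounts to a lookup from entry to row position followed by an override. A minimal sketch of that mapping is given below, using a plain list of entries in place of the entry column of df_seq. The helper apply_fuzzy_labels and its validation rules are hypothetical and only illustrate the alignment described above.

```python
import numpy as np

def apply_fuzzy_labels(labels, entries, fuzzy_labels):
    """Return a float copy of labels with soft labels set by entry (illustrative only)."""
    out = np.asarray(labels, dtype=float).copy()
    position = {entry: i for i, entry in enumerate(entries)}
    missing = [entry for entry in fuzzy_labels if entry not in position]
    if missing:
        raise ValueError(f"Entries not found: {missing}")
    for entry, value in fuzzy_labels.items():
        if not 0 <= value <= 1:
            raise ValueError(f"Soft label for '{entry}' must be in [0, 1]")
        out[position[entry]] = value
    return out

# Entries in the same row order as X (taken from the DOM_GSEC example below)
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]
labels = [0, 0, 0, 1, 1, 1]

soft_labels = apply_fuzzy_labels(labels, entries, {"P05067": 0.6})
print(soft_labels)
```

The original labels list is left untouched, which is the practical advantage over editing a label array by row index.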

See also

  • [Breimann25] introduces fuzzy labeling to compute Monte Carlo estimates of SHAP values for samples with not clearly defined class membership.

Examples

To demonstrate the ShapModel().fit() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import shap
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)
# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)
   entry   sequence                           label  tmd_start  tmd_stop  jmd_n       tmd                      jmd_c
1  Q14802  MQKVTLGLLVFLAGF...PGETPPLITPGSAQS  0      37         59        NSPFYYDWHS  LQVGGLICAGVLCAMGIIIVMSA  KCKCKFGQKS
2  Q86UE4  MAARSWQDELAQQAE...SPKQIKKKKKARRET  0      50         72        LGLEPKRYPG  WVILVGTGALGLLLLFLLGYGWA  AACAGARKKR
3  Q969W9  MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL  0      41         63        FQSMEITELE  FVQIIIIVVVMMVMVVVITCLLS  HYKLSARSFI
4  P05067  MLPGLALLLLAAWTA...GYENPTYKFFEQMQN  1      701        723       FAEDVGSNKG  AIIGLMVGGVVIATVIVITLVML  KKKQYTSIHH
5  P14925  MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS  1      868        890       KLSTEPGSGV  SVVLITTLLVIPVLVLLAIVMFI  RWKKSRAFGD
6  P70180  MRSLLLFTFSACVLL...RELREDSIRSHFSVA  1      477        499       PCKSSGGLEE  SAVTGIVVGALLGAGLLMAFYFF  RKKYRITIER

We can now create a ShapModel object and fit it. The SHAP values and the expected (base) value are then available via the shap_values and exp_value attributes:

sm = aa.ShapModel()
sm.fit(X, labels=labels)

shap_values = sm.shap_values
exp_value = sm.exp_value

# Print SHAP values and expected value
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))

print("\nThe expected value approximates the expected model output (average prediction score).")
print("For a binary classification with balanced datasets, it is around 0.5:")
print(exp_value)
SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.1  -0.1  -0.08 -0.09 -0.06]
 [-0.12 -0.12 -0.09 -0.1  -0.07]
 [-0.13 -0.14 -0.04 -0.09 -0.01]
 [ 0.13  0.13  0.05  0.09  0.04]
 [ 0.13  0.12  0.08  0.09  0.07]
 [ 0.13  0.12  0.08  0.09  0.06]]

The expected value approximates the expected model output (average prediction score).
For a binary classification with balanced datasets, it is around 0.5:
0.49566666666666687
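The relation between the expected value and the SHAP values can be checked on a toy model. The sketch below assumes a linear score with hypothetical weights, for which the SHAP value of feature j under feature independence is w_j * (x_j - mean(x_j)). It does not use ShapModel; it only illustrates the additivity property, in which the base value plus a sample's SHAP values reproduces that sample's model output.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))          # toy feature matrix (6 samples, 5 features)
w = rng.normal(scale=0.1, size=5)    # hypothetical linear weights
b = 0.5                              # hypothetical intercept

pred = X @ w + b                     # toy model output per sample
exp_value = pred.mean()              # base value: average output over the data
shap_values = w * (X - X.mean(axis=0))  # linear-model SHAP values

# Additivity: base value + row sum of SHAP values equals the model output
reconstructed = exp_value + shap_values.sum(axis=1)
print(np.allclose(reconstructed, pred))
```

Read this way, each row of shap_values explains how far a sample's prediction lies above or below the base value, split by feature.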

SHAP values are computed with respect to a target class, which can be changed using the label_target_class parameter (default=1, standing for the positive class):

sm = aa.ShapModel()
# Reverse sign of SHAP values by setting class to 0
sm.fit(X, labels=labels, label_target_class=0)

shap_values = sm.shap_values
exp_value = sm.exp_value

print("Reverse sign of SHAP values by changing reference class from 1 to 0")
print(shap_values.round(2))
print("\nBase value stays around 0.5:")
print(exp_value)
Reverse sign of SHAP values by changing reference class from 1 to 0
[[ 0.11  0.1   0.1   0.09  0.06]
 [ 0.13  0.12  0.1   0.08  0.07]
 [ 0.15  0.14  0.04  0.08  0.01]
 [-0.14 -0.13 -0.07 -0.08 -0.04]
 [-0.13 -0.12 -0.09 -0.09 -0.07]
 [-0.13 -0.12 -0.09 -0.08 -0.06]]

Base value stays around 0.5:
0.5036666666666669
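The sign flip follows from the two class scores of a binary classifier summing to one. The toy sketch below uses a hypothetical linear score rather than the models trained by ShapModel: explaining the negative-class score 1 - f(x) negates every SHAP value, and the two base values add up to 1. In the fitted outputs above the mirror image is only approximate, presumably because the models are refit for each call.

```python
import numpy as np

def linear_shap(weights, X):
    """SHAP values of a linear score under feature independence."""
    return weights * (X - X.mean(axis=0))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
w, b = rng.normal(scale=0.05, size=5), 0.5  # hypothetical weights and intercept

score_pos = X @ w + b          # toy score for class 1
score_neg = 1.0 - score_pos    # toy score for class 0, equal to X @ (-w) + (1 - b)

shap_pos = linear_shap(w, X)
shap_neg = linear_shap(-w, X)
base_pos, base_neg = score_pos.mean(), score_neg.mean()

print(np.allclose(shap_neg, -shap_pos), np.isclose(base_pos + base_neg, 1.0))
```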

To obtain Monte Carlo estimates of both, the ShapModel().fit() method performs 5 rounds of model fitting by default and averages the shap_values and exp_value across all rounds. The number of rounds can be adjusted using the n_rounds parameter (default=5):

sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, n_rounds=10)
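Why more rounds help can be illustrated with a small simulation. The sketch below assumes that each round yields a noisy estimate of one SHAP value with independent Gaussian noise. The true value and noise level are made-up numbers, and real fits need not behave this way. Under that assumption, the spread of the averaged estimate shrinks roughly with the square root of n_rounds.

```python
import numpy as np

rng = np.random.default_rng(1)
true_value, noise = 0.13, 0.05  # hypothetical SHAP value and per-round noise

def estimate(n_rounds):
    """Average of n_rounds noisy per-round estimates."""
    return rng.normal(true_value, noise, size=n_rounds).mean()

# Spread of the averaged estimate over 2000 repeated experiments
spread = {n: float(np.std([estimate(n) for _ in range(2000)])) for n in (1, 5, 10)}
for n, s in spread.items():
    print(f"n_rounds={n:>2}: std of estimate = {s:.3f}")
```

The trade-off is runtime: every additional round refits all models and reruns the explainer.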

Pre-selection of features can be provided using the is_selected parameter:

# Create pre-selection arrays (top 2 and top 4 features will be selected)
is_selected = [[1, 1, 0, 0, 0],
               [1, 1, 1, 1, 0]]
sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, is_selected=is_selected)

print("Impact of feature pre-selection")
print(sm.shap_values.round(2))
Impact of feature pre-selection
[[-0.16 -0.19 -0.05 -0.05  0.  ]
 [-0.18 -0.21 -0.05 -0.05  0.  ]
 [-0.18 -0.21 -0.02 -0.05  0.  ]
 [ 0.18  0.21  0.03  0.05  0.  ]
 [ 0.18  0.2   0.05  0.05  0.  ]
 [ 0.18  0.21  0.05  0.05  0.  ]]
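The all-zero last column above is consistent with features that are never selected receiving a zero attribution. The sketch below shows one way such a result could arise and is an assumption about the mechanics, not the library's code: SHAP values computed on the selected columns are written back into a full-width array of zeros, and the full-width arrays are then averaged over the selections. The per-selection SHAP arrays are random placeholders.

```python
import numpy as np

is_selected = np.array([[1, 1, 0, 0, 0],
                        [1, 1, 1, 1, 0]], dtype=bool)
n_samples = 6
rng = np.random.default_rng(0)

list_full = []
for mask in is_selected:
    # Placeholder for explainer output on the selected columns only
    shap_sub = rng.normal(size=(n_samples, int(mask.sum())))
    shap_full = np.zeros((n_samples, mask.size))
    shap_full[:, mask] = shap_sub  # unselected features stay at zero
    list_full.append(shap_full)

shap_values = np.mean(list_full, axis=0)
print(shap_values.round(2))
```

Under this scheme, a feature chosen in only one of two selections has its attribution halved by the averaging, which would explain the smaller magnitudes in the third and fourth columns.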

Obtain a reliable SHAP value estimate for a fuzzy-labeled sample (0 < label < 1) by setting fuzzy_labeling=True:

# Create fuzzy label
labels[0] = 0.5
sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, is_selected=is_selected, fuzzy_labeling=True)

print("First sample is labeled as 0.5 between negative (0) and positive (1)")
print(sm.shap_values.round(2))
First sample is labeled as 0.5 between negative (0) and positive (1)
[[ 0.04  0.04 -0.03 -0.04  0.  ]
 [-0.24 -0.25 -0.05 -0.05  0.  ]
 [-0.22 -0.22 -0.02 -0.04  0.  ]
 [ 0.15  0.16  0.03  0.04  0.  ]
 [ 0.14  0.16  0.04  0.04  0.  ]
 [ 0.14  0.16  0.04  0.04  0.  ]]

Instead of locating the sample’s row and mutating a label copy by hand, soft labels can be provided by accession using fuzzy_labels together with df_seq. Each value (in [0, 1]) overrides the matching entry in labels and enables fuzzy labeling, aligned to the rows of X via the entry column:

# Soft label keyed by entry/accession (no manual row-index lookup)
entry = df_seq["entry"].iloc[0]
sm = aa.ShapModel()
sm = sm.fit(X, labels=df_seq["label"].to_list(), df_seq=df_seq,
            is_selected=is_selected, fuzzy_labels={entry: 0.5})

print(f"Sample '{entry}' is labeled as 0.5 between negative (0) and positive (1)")
print(sm.shap_values.round(2))
Sample 'Q14802' is labeled as 0.5 between negative (0) and positive (1)
[[ 0.03  0.04 -0.02 -0.04  0.  ]
 [-0.25 -0.24 -0.04 -0.05  0.  ]
 [-0.23 -0.22 -0.01 -0.04  0.  ]
 [ 0.15  0.15  0.02  0.04  0.  ]
 [ 0.15  0.15  0.04  0.05  0.  ]
 [ 0.15  0.15  0.04  0.05  0.  ]]

If the model-agnostic KernelExplainer is used, the background dataset can be reduced to a subset of the given dataset, obtained by internal clustering and selecting a representative sample per cluster. The number of background samples is set by n_background_data (default=None, i.e., disabled):

from sklearn.svm import SVC

# Use KernelExplainer to obtain SHAP values for any prediction model
sm = aa.ShapModel(explainer_class=shap.KernelExplainer, list_model_classes=[SVC])