ShapModel.fit

ShapModel.fit(X, labels, label_target_class=1, n_rounds=5, is_selected=None, fuzzy_labeling=False, fuzzy_aggregation='interpolate', n_background_data=None, df_seq=None, fuzzy_labels=None)

Obtain SHapley Additive exPlanations (SHAP) values aggregated across prediction models and training rounds.

For each round and feature-selection subset, the method trains each model in list_model_classes and applies the configured SHAP explainer [Lundberg20] to compute per-sample feature attributions. All SHAP values are averaged across rounds, feature selections, and models, then stored in shap_values. Pass the result to ShapModel.add_feat_impact() to attach impact scores to a feature DataFrame.
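The aggregation step described above can be pictured as a plain mean over all fits. The following is a minimal sketch, not the library's implementation: the per-fit SHAP matrices are random placeholders, and the array layout and variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

n_rounds, n_selections, n_models = 5, 2, 3
n_samples, n_features = 6, 5

# Placeholder for the per-fit SHAP matrices (one per round, selection, and model).
# In a real run, each matrix would come from a fitted model and its explainer.
shap_per_fit = rng.normal(
    size=(n_rounds, n_selections, n_models, n_samples, n_features)
)

# Average across rounds, feature selections, and models
shap_values = shap_per_fit.mean(axis=(0, 1, 2))

print(shap_values.shape)
```

The result keeps one row per sample and one column per feature, which matches the shape of the matrices printed in the examples below.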

Added in version 0.1.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples,)) – Class labels for samples in X (typically, 1=positive, 0=negative).
label_target_class (int, default=1) – The label of the class for which SHAP values are computed in a classification task. For binary classification, ‘0’ represents the negative class and ‘1’ the positive class.
n_rounds (int, default=5) – The number of rounds (>=1) in which the models are fitted and the SHAP values are computed by the explainer. For fuzzy_aggregation='interpolate', each round re-seeds the fit, so n_rounds is a speed/stability dial (see Notes): 1 is the fast exact two-fit estimate, the default 5 adds Monte-Carlo averaging, and a stable mean is reached around 15-20.
is_selected (array-like, shape (n_selection_round, n_features), optional) – 2D boolean array in which each row indicates one feature selection.
fuzzy_labeling (bool, default=False) – If True, fuzzy labeling is applied to approximate SHAP values for samples with uncertain/partial memberships (e.g., labels strictly between 0 and 1 in binary classification).
fuzzy_aggregation (str, default='interpolate') –
Strategy to turn a soft label p into a SHAP estimate when fuzzy labeling is active (see Notes):
- 'interpolate' (default, new in 1.1): blend p * S1 + (1 - p) * S0 from a fit at 0 and at 1 (unbiased, exact p; with n_rounds=1 only two fits per fuzzy sample).
- 'threshold': hard-label the fuzzy sample over a threshold grid and average — the biased sweep of [Breimann25]; kept for backward-compatible results.
n_background_data (None or int, optional) – The number of samples (< ‘n_samples’) in the background dataset used by the KernelExplainer to reduce computation time. The background dataset is obtained by k-means clustering. If None, the full dataset ‘X’ is used.
df_seq (pd.DataFrame, shape (n_samples, n_seq_info), optional) – DataFrame containing an entry column with a unique protein identifier per row, row-aligned to X. Required when fuzzy_labels is given, to map entry names to the corresponding rows of X.
fuzzy_labels (dict, optional) – Soft labels keyed by entry, e.g. {'P05067': 0.6}. Each value (in [0, 1]) overrides the label of the matching entry in labels and enables fuzzy labeling. Aligned to X via df_seq, this avoids the manual row-index lookup and array mutation otherwise needed to set a soft label.

Returns:

The fitted ShapModel model instance.

Return type:

ShapModel

Notes

Fuzzy Labeling

Aim: Compute SHAP values for datasets with uncertain or ambiguous labels. Especially useful for explaining newly predicted samples, where the class label is set to the respective prediction probability.
Approach: Uses probabilistic labels to represent degrees of membership.
Idea: Adjusts label thresholds dynamically in Monte Carlo estimation to better represent label uncertainties.
Background: Inspired by fuzzy logic, replacing binary true/false with degrees of truth.

Fuzzy aggregation strategies

The fuzzy_aggregation parameter selects between two estimators:

'interpolate' (default): The fuzzy sample is weighted by exactly p by fitting the model twice (fuzzy sample at 0 -> S0, at 1 -> S1) and blending p * S1 + (1 - p) * S0 (the exp_value is blended the same way). This is unbiased. Each fuzzy protein is explained independently against the fixed balanced 0/1 core, with the other fuzzy proteins excluded from its training data.
'threshold': Over an n_rounds x n_selection grid the fuzzy sample is hard-labeled 1 when a per-cell threshold <= p and the per-cell SHAP matrices are averaged — the sweep of [Breimann25]. Because the grid is non-uniform on (0, 1], the effective positive-fraction is a biased approximation of p.
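The 'interpolate' blend can be written out in a few lines. This is a minimal sketch of the formula p * S1 + (1 - p) * S0 only: the helper name interpolate_shap and the two SHAP vectors are hypothetical, and no model fitting is performed.

```python
import numpy as np


def interpolate_shap(s0, s1, p):
    """Blend two SHAP vectors for a soft label p in [0, 1] (hypothetical helper)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be in [0, 1]")
    return p * np.asarray(s1) + (1.0 - p) * np.asarray(s0)


# Made-up SHAP vectors of one fuzzy sample from the two fits
s0 = np.array([-0.20, -0.10, 0.00])  # fit with the fuzzy sample labeled 0
s1 = np.array([0.20, 0.30, 0.10])    # fit with the fuzzy sample labeled 1

blended = interpolate_shap(s0, s1, p=0.6)
print(blended.round(2))
```

With p=0 the blend returns S0 and with p=1 it returns S1, so crisp labels are recovered as special cases. According to the description above, exp_value is blended with the same weights.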

Per-round seeding (interpolate only)

The constructor random_state is the initial seed, and 'interpolate' re-seeds each round with random_state + round (round 0 -> random_state, round 1 -> random_state + 1, …). So every round fits a different model and n_rounds averages a Monte-Carlo mean over model variance, yet a fixed random_state gives the identical seed sequence and therefore an exactly reproducible result; random_state=None draws fresh entropy each round (truly-random, non-reproducible). The 'threshold' estimator and the non-fuzzy Monte-Carlo path do not re-seed per round — they bake random_state in once, so their per-round variation comes from the threshold grid, not from the model seed.
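The seed sequence described above can be sketched as follows. The helper round_seeds is hypothetical and only mirrors the stated rule (random_state + round, or None for fresh entropy); it is not part of the library.

```python
def round_seeds(random_state, n_rounds):
    """Return the per-round seeds for the 'interpolate' path (illustrative only)."""
    if random_state is None:
        # No fixed seed: every round draws fresh entropy
        return [None] * n_rounds
    return [random_state + i for i in range(n_rounds)]


print(round_seeds(42, 5))    # [42, 43, 44, 45, 46]
print(round_seeds(None, 3))  # [None, None, None]
```

Two runs with the same random_state therefore see the same seed list, which is why the averaged result is reproducible even though each round fits a different model.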

Choosing n_rounds for ‘interpolate’

Because each round re-seeds, n_rounds is a speed/stability dial:

n_rounds=1 – the exact two-fit point estimate; fastest, but a single model draw (run-to-run spread ~20% across seeds).
n_rounds=5 (default) – adds light averaging (spread ~10%).
n_rounds≈15-20 – the averaged estimate stabilizes (run-to-run spread and distance to the converged mean fall below ~5% on the bundled DOM_GSEC gamma-secretase data, ~1/sqrt(n_rounds) decay). Use this for a stable mean; with a fixed random_state any single run is exactly reproducible regardless.
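As a rough worked example of the ~1/sqrt(n_rounds) decay, the spread can be extrapolated from the single-round value. The sketch assumes an idealized decay and takes the ~20% baseline from the list above; the function name expected_spread is hypothetical, and real runs will deviate from these figures.

```python
import math


def expected_spread(n_rounds, spread_single_round=0.20):
    """Idealized run-to-run spread after averaging n_rounds independent fits."""
    return spread_single_round / math.sqrt(n_rounds)


for n_rounds in (1, 5, 15, 20):
    print(f"n_rounds={n_rounds:>2}: expected spread ~{expected_spread(n_rounds):.1%}")
```

Under this assumption the spread is about 8.9% at n_rounds=5 and about 5.2% to 4.5% at n_rounds=15 to 20, which is roughly in line with the values quoted above.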

Setting soft labels

There are two equivalent ways to provide soft labels, both enabling fuzzy labeling:

Pass a float labels array directly (e.g. 0.6 in place of a binary 0/1) with fuzzy_labeling=True.
Pass binary (or float) labels together with df_seq and fuzzy_labels keyed by entry. The fuzzy_labels values override the matching entries in labels; fuzzy labeling is enabled automatically. This is the recommended path, as it refers to proteins by accession rather than row index.
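Conceptually, the second path maps each entry to its row and overrides the label there. The sketch below illustrates that mapping with plain Python lists; apply_fuzzy_labels is a hypothetical helper, not the library's code, and the entries list stands in for the entry column of df_seq.

```python
def apply_fuzzy_labels(labels, entries, fuzzy_labels):
    """Return a copy of labels with soft labels set for the matching entries."""
    row_of = {entry: i for i, entry in enumerate(entries)}
    out = [float(v) for v in labels]
    for entry, p in fuzzy_labels.items():
        if entry not in row_of:
            raise KeyError(f"entry {entry!r} not found")
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"soft label for {entry!r} must be in [0, 1]")
        out[row_of[entry]] = float(p)
    return out


# Stand-in for df_seq["entry"], row-aligned to X
entries = ["Q14802", "Q86UE4", "Q969W9", "P05067", "P14925", "P70180"]
labels = [0, 0, 0, 1, 1, 1]

soft_labels = apply_fuzzy_labels(labels, entries, {"P05067": 0.6})
print(soft_labels)  # [0.0, 0.0, 0.0, 0.6, 1.0, 1.0]
```

The original labels list is left unchanged, which avoids the manual array mutation required by the first path.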

See also

[Breimann25] introduces fuzzy labeling to compute Monte Carlo estimates of SHAP values for samples with not clearly defined class membership.

Examples

To demonstrate the ShapModel().fit() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import shap
import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC", n=3)
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(5)
# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

aa.display_df(df_seq)

	entry	sequence	label	tmd_start	tmd_stop	jmd_n	tmd	jmd_c
1	Q14802	MQKVTLGLLVFLAGF...PGETPPLITPGSAQS	0	37	59	NSPFYYDWHS	LQVGGLICAGVLCAMGIIIVMSA	KCKCKFGQKS
2	Q86UE4	MAARSWQDELAQQAE...SPKQIKKKKKARRET	0	50	72	LGLEPKRYPG	WVILVGTGALGLLLLFLLGYGWA	AACAGARKKR
3	Q969W9	MHRLMGVNSTAAAAA...AIWSKEKDKQKGHPL	0	41	63	FQSMEITELE	FVQIIIIVVVMMVMVVVITCLLS	HYKLSARSFI
4	P05067	MLPGLALLLLAAWTA...GYENPTYKFFEQMQN	1	701	723	FAEDVGSNKG	AIIGLMVGGVVIATVIVITLVML	KKKQYTSIHH
5	P14925	MAGRARSGLLLLLLG...EEEYSAPLPKPAPSS	1	868	890	KLSTEPGSGV	SVVLITTLLVIPVLVLLAIVMFI	RWKKSRAFGD
6	P70180	MRSLLLFTFSACVLL...RELREDSIRSHFSVA	1	477	499	PCKSSGGLEE	SAVTGIVVGALLGAGLLMAFYFF	RKKYRITIER

We can now create a ShapModel object and fit it to obtain the SHAP values and the expected value using the shap_values and exp_value (expected/base value) attributes:

sm = aa.ShapModel()
sm.fit(X, labels=labels)

shap_values = sm.shap_values
exp_value = sm.exp_value

# Print SHAP values and expected value
print("SHAP values explain the feature impact for 3 negative and 3 positive samples")
print(shap_values.round(2))

print("\nThe expected value approximates the expected model output (average prediction score).")
print("For a binary classification with balanced datasets, it is around 0.5:")
print(exp_value)

SHAP values explain the feature impact for 3 negative and 3 positive samples
[[-0.11 -0.1  -0.08 -0.09 -0.07]
 [-0.12 -0.12 -0.08 -0.09 -0.07]
 [-0.14 -0.15 -0.03 -0.09 -0.02]
 [ 0.13  0.13  0.05  0.09  0.04]
 [ 0.13  0.12  0.08  0.09  0.07]
 [ 0.13  0.13  0.08  0.09  0.06]]

The expected value approximates the expected model output (average prediction score).
For a binary classification with balanced datasets, it is around 0.5:
0.4940000000000003
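SHAP values are additive: for a single model, the expected value plus the sum of a sample's SHAP values gives the model output for that sample. Assuming this carries over to the averaged values, the rounded numbers printed above can be turned back into approximate prediction scores. This is a sketch based on the rounded output, so the scores are only approximate.

```python
import numpy as np

# Rounded values copied from the output above
shap_values = np.array([
    [-0.11, -0.10, -0.08, -0.09, -0.07],
    [-0.12, -0.12, -0.08, -0.09, -0.07],
    [-0.14, -0.15, -0.03, -0.09, -0.02],
    [0.13, 0.13, 0.05, 0.09, 0.04],
    [0.13, 0.12, 0.08, 0.09, 0.07],
    [0.13, 0.13, 0.08, 0.09, 0.06],
])
exp_value = 0.494

# Base value plus the per-sample sum of feature attributions
approx_scores = exp_value + shap_values.sum(axis=1)
print(approx_scores.round(2))
```

The three negative samples end up close to 0 and the three positive samples close to 1, as expected for a model evaluated on its own training data.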

SHAP values are computed with respect to a target class, which can be set using the label_target_class parameter (default=1, the positive class):

sm = aa.ShapModel()
# Reverse sign of SHAP values by setting class to 0
sm.fit(X, labels=labels, label_target_class=0)

shap_values = sm.shap_values
exp_value = sm.exp_value

print("Reverse sign of SHAP values by changing reference class from 1 to 0")
print(shap_values.round(2))
print("\nBase value stays around 0.5:")
print(exp_value)

Reverse sign of SHAP values by changing reference class from 1 to 0
[[ 0.12  0.09  0.07  0.1   0.07]
 [ 0.13  0.12  0.08  0.09  0.08]
 [ 0.15  0.13  0.04  0.09  0.02]
 [-0.14 -0.12 -0.04 -0.09 -0.03]
 [-0.13 -0.12 -0.08 -0.09 -0.07]
 [-0.13 -0.12 -0.08 -0.09 -0.06]]

Base value stays around 0.5:
0.4970000000000002

To obtain Monte Carlo estimates of both, the ShapModel().fit() method performs 5 rounds of model fitting by default and averages shap_values and exp_value across all rounds. The number of rounds can be adjusted using the n_rounds parameter (default=5):

sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, n_rounds=10)

Pre-selection of features can be provided using the is_selected parameter:

# Create pre-selection arrays (top 2 and top 4 features will be selected)
is_selected = [[1, 1, 0, 0, 0],
               [1, 1, 1, 1, 0]]
sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, is_selected=is_selected)

print("Impact of feature pre-selection")
print(sm.shap_values.round(2))

Impact of feature pre-selection
[[-0.18 -0.17 -0.05 -0.06  0.  ]
 [-0.2  -0.18 -0.05 -0.06  0.  ]
 [-0.2  -0.19 -0.02 -0.06  0.  ]
 [ 0.2   0.19  0.03  0.06  0.  ]
 [ 0.2   0.19  0.05  0.06  0.  ]
 [ 0.2   0.19  0.05  0.06  0.  ]]
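Selection arrays like the one used above can also be generated instead of typed by hand. The sketch below assumes the feature columns are already ordered by rank (best first); top_k_masks is a hypothetical helper, not part of the library.

```python
import numpy as np


def top_k_masks(n_features, list_k):
    """Build one boolean selection row per k, selecting the first k features."""
    idx = np.arange(n_features)
    return np.array([idx < k for k in list_k])


# Top 2 and top 4 of 5 features, as in the example above
is_selected = top_k_masks(n_features=5, list_k=[2, 4])
print(is_selected.astype(int))
```

The resulting array has shape (n_selection_round, n_features) and a boolean dtype, matching the description of the is_selected parameter.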

Obtain a reliable SHAP value estimate for a fuzzy-labeled sample (0 < label < 1) by setting fuzzy_labeling=True:

# Create fuzzy label
labels[0] = 0.5
sm = aa.ShapModel()
sm = sm.fit(X, labels=labels, is_selected=is_selected, fuzzy_labeling=True)

print("First sample is labeled as 0.5 between negative (0) and positive (1)")
print(sm.shap_values.round(2))

First sample is labeled as 0.5 between negative (0) and positive (1)
[[-0.04 -0.04 -0.02 -0.02  0.  ]
 [-0.23 -0.24 -0.04 -0.05  0.  ]
 [-0.22 -0.23 -0.02 -0.03  0.  ]
 [ 0.17  0.17  0.02  0.04  0.  ]
 [ 0.17  0.17  0.03  0.04  0.  ]
 [ 0.17  0.17  0.03  0.04  0.  ]]

By default fuzzy_aggregation='interpolate' weights the fuzzy sample by exactly p (the cell above used it: two fits blended as p*S1 + (1-p)*S0). The published threshold sweep stays available via fuzzy_aggregation='threshold'. For interpolate, n_rounds is a speed/stability dial: n_rounds=1 is the fast exact estimate, the default 5 adds light Monte-Carlo averaging, and a stable mean is reached around n_rounds ~ 15-20:

# The published threshold-sweep estimator, available via fuzzy_aggregation="threshold"
sm = aa.ShapModel(random_state=42)
sm = sm.fit(X, labels=labels, is_selected=is_selected,
            fuzzy_labeling=True, fuzzy_aggregation="threshold")
print("Threshold-sweep estimate (first, fuzzy sample):")
print(sm.shap_values[0].round(2))

# Stable interpolate mean: n_rounds=1 is the fast exact blend, ~15-20 the converged mean
sm = aa.ShapModel(random_state=42)
sm = sm.fit(X, labels=labels, is_selected=is_selected,
            fuzzy_labeling=True, n_rounds=15)
print("\nConverged interpolate mean over 15 rounds (first, fuzzy sample):")
print(sm.shap_values[0].round(2))

Threshold-sweep estimate (first, fuzzy sample):
[ 0.03  0.04 -0.03 -0.03  0.  ]

Converged interpolate mean over 15 rounds (first, fuzzy sample):
[-0.03 -0.04 -0.02 -0.02  0.  ]

The constructor random_state seeds the estimator. For 'interpolate' it is the initial seed and each round re-seeds with random_state + round, so n_rounds>1 averages genuinely different model fits yet stays exactly reproducible for a fixed seed (random_state=None instead draws fresh entropy each round). The 'threshold' and non-fuzzy paths do not re-seed per round:

# Same random_state -> identical result (reproducible), even with averaging over rounds
a = aa.ShapModel(random_state=42).fit(X, labels=labels, is_selected=is_selected,
                                      fuzzy_labeling=True, n_rounds=5).shap_values
b = aa.ShapModel(random_state=42).fit(X, labels=labels, is_selected=is_selected,
                                      fuzzy_labeling=True, n_rounds=5).shap_values
print("Reproducible for a fixed random_state (per-round seed = random_state + round):",
      bool((a == b).all()))

Reproducible for a fixed random_state (per-round seed = random_state + round): True

Instead of locating the sample’s row and mutating a label copy by hand, soft labels can be provided by accession using fuzzy_labels together with df_seq. Each value (in [0, 1]) overrides the matching entry in labels and enables fuzzy labeling, aligned to the rows of X via the entry column:

# Soft label keyed by entry/accession (no manual row-index lookup)
entry = df_seq["entry"].iloc[0]
sm = aa.ShapModel()
sm = sm.fit(X, labels=df_seq["label"].to_list(), df_seq=df_seq,
            is_selected=is_selected, fuzzy_labels={entry: 0.5})

print(f"Sample '{entry}' is labeled as 0.5 between negative (0) and positive (1)")
print(sm.shap_values.round(2))

Sample 'Q14802' is labeled as 0.5 between negative (0) and positive (1)
[[-0.04 -0.04 -0.03 -0.02  0.  ]
 [-0.24 -0.22 -0.05 -0.05  0.  ]
 [-0.23 -0.21 -0.02 -0.03  0.  ]
 [ 0.18  0.17  0.03  0.04  0.  ]
 [ 0.17  0.17  0.04  0.04  0.  ]
 [ 0.17  0.17  0.04  0.04  0.  ]]

If the model-agnostic KernelExplainer is used, the background dataset can be reduced to a subset of the given dataset, obtained by internal clustering and selecting one representative sample per cluster. The number of background samples is set by n_background_data (default=None, i.e., disabled):

from sklearn.svm import SVC

# Use KernelExplainer to obtain SHAP values for any prediction model
sm = aa.ShapModel(explainer_class=shap.KernelExplainer, list_model_classes=[SVC])
# n_background_data clusters the samples and keeps one representative per cluster as the
# KernelExplainer background dataset (here 2 background samples), speeding up computation:
labels_int = df_seq["label"].to_list()  # crisp 0/1 labels (an earlier cell set a fuzzy label)
sm = sm.fit(X, labels=labels_int, n_background_data=2)
print("SHAP values via KernelExplainer with 2 background samples:")
print(sm.shap_values.round(2))

SHAP values via KernelExplainer with 2 background samples:
[[ 0.    0.   -0.28 -0.28 -0.28]
 [-0.17 -0.17 -0.17 -0.17 -0.17]
 [-0.21 -0.21 -0.21 -0.21  0.  ]
 [ 0.03  0.03  0.03  0.03  0.03]
 [ 0.04  0.    0.04  0.04  0.04]
 [ 0.04  0.    0.04  0.04  0.04]]