explain_features

explain_features(df_feat, df_seq, labels, list_model_classes=None, label_target_class=1, samples=None, name_test='TEST', name_ref='REF', plot=True, random_state=None, n_jobs=None, verbose=False)

Explain a feature set in one call: compute per-sample SHAP impact and draw the SHAP feature map.

A thin, stateless pro facade over the explicit primitive path. It rebuilds the feature matrix X from the feature identifiers in df_feat (via SequenceFeature.get_df_parts() + SequenceFeature.feature_matrix()), fits a ShapModel, attaches the per-sample SHAP feature impact to df_feat (via ShapModel.add_feat_impact()), and draws the SHAP-coloured feature map (CPPPlot.feature_map() with shap_plot=True). The defaults are byte-identical to writing those calls by hand.

By default a single sample is explained: the sample of class label_target_class that the models predict most confidently, i.e. the most representative correct prediction. Pass samples (an entry name, a row position, or a list of either) to explain chosen sample(s) instead; the feature map is then coloured by the impact of the first requested sample, and the impacts of all requested samples are added to df_feat.
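The default selection described above can be pictured with a small standard-library sketch. It is not the library's implementation: the helper name pick_most_confident and the toy probabilities are hypothetical, and it assumes that "most confidently" means the highest predicted probability for label_target_class among the samples that carry that label.

```python
def pick_most_confident(proba_target, labels, label_target_class=1):
    """Return the row position of the target-class sample with the
    highest predicted probability (hypothetical helper, for illustration)."""
    # Only samples that truly belong to the target class are candidates
    candidates = [i for i, label in enumerate(labels) if label == label_target_class]
    if not candidates:
        raise ValueError(f"no sample has label {label_target_class!r}")
    return max(candidates, key=lambda i: proba_target[i])


# Toy data: predicted probability of class 1 for five samples
labels = [1, 0, 1, 1, 0]
proba_target = [0.71, 0.93, 0.88, 0.64, 0.12]

best = pick_most_confident(proba_target, labels)
print(best)  # 2: sample 1 scores higher but is labelled 0, so it is skipped
```

Restricting the candidates to the target class is what makes the selected sample a correct prediction as well as a confident one.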

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a feature column of feature identifiers (e.g. from CPP.run(), aaanalysis.pipe.find_features(), or load_features()).

  • df_seq (pd.DataFrame, shape (n_samples, n_seq_info)) – Sequence DataFrame with the sequence/parts information, row-aligned to labels. The feature matrix is rebuilt from it via SequenceFeature.get_df_parts().

  • labels (array-like, shape (n_samples,)) – Class labels for the samples (typically, 1=positive, 0=negative).

  • list_model_classes (list of Type[BaseEstimator], optional) – Prediction model classes passed to ShapModel. If None, the ShapModel default ([RandomForestClassifier, ExtraTreesClassifier]) is used.

  • label_target_class (int, default=1) – Class label for which SHAP values are computed and from which the explained sample is auto-selected.

  • samples (int, str, list of int, list of str, or None) – Sample(s) to explain, given as row position(s) in the feature matrix or entry name(s) from df_seq. If None, the most confidently predicted label_target_class sample is selected automatically.

  • name_test (str, default="TEST") – Name of the test (positive) group, shown on the feature map.

  • name_ref (str, default="REF") – Name of the reference (negative) group, shown on the feature map.

  • plot (bool, default=True) – If True, draw the SHAP-coloured feature map and return its Axes; if False, skip the plot and return None in the figure slot.

  • random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes (the SHAP estimation and sample selection) are reproducible.

  • n_jobs (int, optional) – Number of CPU cores (>=1) for building the feature matrix. If None, the optimized number is used.

  • verbose (bool, default=False) – If True, verbose progress information is printed.
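The forms accepted by samples (a row position, an entry name, or a list of either) can be sketched as a normalization to row positions. This is an illustrative sketch only: resolve_samples is a hypothetical helper, not part of aaanalysis, and the order of the toy entry list is made up.

```python
def resolve_samples(samples, entries):
    """Map samples (int, str, or a list of either) to a list of row positions.
    Hypothetical helper; the library's own validation may differ."""
    if isinstance(samples, (int, str)):
        samples = [samples]  # a single value is treated as a one-element list
    positions = []
    for s in samples:
        if isinstance(s, bool):
            # bool is a subclass of int, so reject it explicitly
            raise TypeError("samples must be int or str, not bool")
        if isinstance(s, int):
            if not 0 <= s < len(entries):
                raise IndexError(f"row position {s} is out of range")
            positions.append(s)
        elif isinstance(s, str):
            positions.append(entries.index(s))  # raises ValueError if unknown
        else:
            raise TypeError(f"unsupported sample identifier: {s!r}")
    return positions


# Toy entry column (order is illustrative, not the real dataset order)
entries = ["Q14802", "Q86UE4", "Q06481"]

print(resolve_samples(0, entries))                # [0]
print(resolve_samples("Q86UE4", entries))         # [1]
print(resolve_samples(["Q06481", 0], entries))    # [2, 0]
```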

Returns:

  • df_feat_shap (pd.DataFrame, shape (n_features, n_feature_info+n)) – df_feat with the per-sample SHAP feature impact added as feat_impact_'name' column(s).

  • ax (matplotlib.axes.Axes or None) – The Axes of the SHAP-coloured feature map, or None if plot=False.

  • evals (None) – Always None: explanation performs no evaluation. The slot is kept so that the return value matches the uniform (results, figs, evals) pipeline shape.

See also

  • ShapModel for the underlying Monte Carlo SHAP estimation and feature impact.

  • CPPPlot.feature_map() for the SHAP-coloured feature map (shap_plot=True).

  • aaanalysis.pipe.find_features() for obtaining df_feat (the feature discovery step).

Warning

  • This pipeline requires SHAP, which is automatically installed via pip install aaanalysis[pro].

Examples

The aaanalysis.pipe (aap) module provides high-level golden pipelines: stateless, one-call wrappers over the AAanalysis primitives. aap.explain_features is the pro explanation pipeline: given an existing df_feat and the sequences, it rebuilds the feature matrix, fits a ShapModel, attaches the per-sample SHAP feature impact to df_feat, and draws the SHAP-coloured feature map. It returns the triple (df_feat_shap, ax, None), since explanation does no evaluation. It requires aaanalysis[pro] (SHAP).

import matplotlib.pyplot as plt
import aaanalysis as aa
import aaanalysis.pipe as aap

aa.options["verbose"] = False
aa.plot_settings()

# Sequences (df_seq) + labels and a discovered feature set (df_feat, e.g. from aap.find_features)
df_seq = aa.load_dataset(name="DOM_GSEC", n=20)
labels = df_seq["label"].to_list()
df_feat = aa.load_features().head(25)

aa.display_df(df_feat, n_rows=10, show_shape=True)
DataFrame shape: (25, 15)
|    | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std |
|----|---------|----------|-------------|------------|-------------------|---------|--------------|----------|----------|---------|--------------------|--------------|-----------|-----------------|---------------------|
| 1  | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.970400 | 1.438918 |
| 2  | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 0.000000 | 0.000000 |
| 3  | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 1.554800 | 2.109848 |
| 4  | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 3.111200 | 3.109955 |
| 5  | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 0.000000 | 0.000000 |
| 6  | TMD_C_JMD_C-Seg...2,3)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.222000 | 0.058671 | 0.058671 | 0.064895 | 0.069547 | 0.000000 | 0.000001 | 27,28,29,30,31,32,33 | 0.000000 | 0.000000 |
| 7  | TMD_C_JMD_C-Seg...4,5)-FAUJ880109 | Energy | Isoelectric point | Number hydrogen bond donors | Number of hydro...e et al., 1988) | 0.215000 | 0.146661 | 0.146661 | 0.174609 | 0.188034 | 0.000000 | 0.000004 | 33,34,35,36 | 1.032400 | 1.510722 |
| 8  | TMD_C_JMD_C-Seg...3,4)-JANJ780101 | ASA/Volume | Accessible surface area (ASA) | ASA (folded protein) | Average accessi...n et al., 1978) | 0.215000 | 0.124317 | 0.124317 | 0.166309 | 0.153364 | 0.000000 | 0.000004 | 31,32,33,34,35 | 1.080400 | 1.296094 |
| 9  | TMD_C_JMD_C-Seg...,10)-WILM950103 | Polarity | Hydrophobicity (interface) | Hydrophobicity (interface) | Hydrophobicity ...e et al., 1995) | 0.212000 | 0.141305 | -0.141305 | 0.168603 | 0.217235 | 0.000000 | 0.000005 | 33,34 | 1.747200 | 2.150664 |
| 10 | TMD_C_JMD_C-Seg...6,9)-AURR980110 | Conformation | α-helix | α-helix (middle) | Normalized posi...ora-Rose, 1998) | 0.211000 | 0.125350 | 0.125350 | 0.160819 | 0.174121 | 0.000000 | 0.000005 | 32,33 | 1.788800 | 2.700803 |

By default a single sample is explained: the sample of class label_target_class that the models predict most confidently, i.e. the most representative correct prediction. The per-sample SHAP impact is added to df_feat as a feat_impact_'entry' column, and the feature map colours each feature by that signed impact. random_state makes the SHAP estimation and the sample selection reproducible; n_jobs parallelizes the feature-matrix build:

df_shap, ax, evals = aap.explain_features(df_feat, df_seq, labels,
                                          random_state=42, n_jobs=1)

impact_cols = [c for c in df_shap.columns if c.startswith("feat_impact_")]
print("impact columns:", impact_cols, "| evals:", evals)
plt.tight_layout()
plt.show()
impact columns: ['feat_impact_Q06481'] | evals: None
[Figure: SHAP-coloured feature map for the auto-selected sample (Q06481)]

Pass samples to explain chosen sample(s) instead of auto-selecting — an entry name, a row position, or a list of them. label_target_class sets the class SHAP targets, list_model_classes overrides the prediction models, and plot=False skips the figure (ax is then None). Here we explain two named proteins with an explicit model list:

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

entries = df_seq["entry"].iloc[:2].to_list()
df_shap, ax, _ = aap.explain_features(df_feat, df_seq, labels,
                                      samples=entries,
                                      label_target_class=1,
                                      list_model_classes=[RandomForestClassifier, ExtraTreesClassifier],
                                      plot=False, random_state=42, n_jobs=1, verbose=False)

cols = ["feature"] + [c for c in df_shap.columns if c.startswith("feat_impact_")]
aa.display_df(df_shap[cols], n_rows=10, show_shape=True)
DataFrame shape: (25, 3)
|    | feature | feat_impact_Q14802 | feat_impact_Q86UE4 |
|----|---------|--------------------|--------------------|
| 1  | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | -0.110000 | -2.300000 |
| 2  | TMD_C_JMD_C-Seg...3,4)-FINA910104 | -0.520000 | -2.370000 |
| 3  | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | -7.850000 | -4.720000 |
| 4  | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | -5.370000 | -4.380000 |
| 5  | TMD_C_JMD_C-Seg...6,9)-RADA880106 | -9.310000 | -5.250000 |
| 6  | TMD_C_JMD_C-Seg...2,3)-KLEP840101 | 0.090000 | -2.490000 |
| 7  | TMD_C_JMD_C-Seg...4,5)-FAUJ880109 | -6.880000 | -7.960000 |
| 8  | TMD_C_JMD_C-Seg...3,4)-JANJ780101 | -1.560000 | -0.650000 |
| 9  | TMD_C_JMD_C-Seg...,10)-WILM950103 | -7.300000 | -6.520000 |
| 10 | TMD_C_JMD_C-Seg...6,9)-AURR980110 | -11.270000 | -0.040000 |
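A common next step is to rank features by the magnitude of their impact for one sample. The sketch below does this with the standard library only, using the ten feat_impact_Q14802 values printed above; the feature names are kept in their truncated display form, and the variable names are illustrative. With the real DataFrame, the same ranking could be obtained by sorting df_shap on the absolute value of the impact column.

```python
# (feature as displayed above, feat_impact_Q14802) for the ten rows shown
rows = [
    ("TMD_C_JMD_C-Seg...3,4)-KLEP840101", -0.11),
    ("TMD_C_JMD_C-Seg...3,4)-FINA910104", -0.52),
    ("TMD_C_JMD_C-Seg...6,9)-LEVM760105", -7.85),
    ("TMD_C_JMD_C-Seg...3,4)-HUTJ700102", -5.37),
    ("TMD_C_JMD_C-Seg...6,9)-RADA880106", -9.31),
    ("TMD_C_JMD_C-Seg...2,3)-KLEP840101", 0.09),
    ("TMD_C_JMD_C-Seg...4,5)-FAUJ880109", -6.88),
    ("TMD_C_JMD_C-Seg...3,4)-JANJ780101", -1.56),
    ("TMD_C_JMD_C-Seg...,10)-WILM950103", -7.30),
    ("TMD_C_JMD_C-Seg...6,9)-AURR980110", -11.27),
]

# Rank by absolute impact, largest first; the sign is kept for reporting
ranked = sorted(rows, key=lambda row: abs(row[1]), reverse=True)
top3 = ranked[:3]

for name, impact in top3:
    print(f"{name}: {impact:+.2f}")
# TMD_C_JMD_C-Seg...6,9)-AURR980110: -11.27
# TMD_C_JMD_C-Seg...6,9)-RADA880106: -9.31
# TMD_C_JMD_C-Seg...6,9)-LEVM760105: -7.85
```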

With plot=True the SHAP-coloured feature map is drawn for the first requested sample; name_test and name_ref label the two groups. The returned ax is the feature-map Axes:

df_shap, ax, _ = aap.explain_features(df_feat, df_seq, labels,
                                      samples=df_seq["entry"].iloc[0],
                                      name_test="SUBSTRATE", name_ref="NON-SUB",
                                      plot=True, random_state=42, n_jobs=1)

plt.tight_layout()
plt.show()
[Figure: SHAP-coloured feature map for the first requested sample, with the groups labelled SUBSTRATE and NON-SUB]