aaanalysis.TreeModel

class aaanalysis.TreeModel(list_model_classes=None, list_model_kwargs=None, is_preselected=None, verbose=True, random_state=None)[source]

Bases: object

Tree Model class: A wrapper for tree-based models to obtain Monte Carlo estimates of feature importance and predictions [Breimann25a].

Monte Carlo estimates are derived by averaging feature importance or prediction probabilities across various tree-based models and training rounds, enhancing the robustness and reproducibility of these estimates. Additionally, the class supports feature selection through recursive feature elimination (RFE) and offers comprehensive evaluation of feature selections.

Added in version 0.1.3.
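
The averaging idea described above can be illustrated with plain scikit-learn. This is a minimal sketch, not the TreeModel implementation: the synthetic data, the number of rounds, the per-round seeding, and the variable names mean_importance and std_importance are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic data for illustration only (not an aaanalysis dataset)
X, y = make_classification(n_samples=120, n_features=8, n_informative=3, random_state=0)

model_classes = [RandomForestClassifier, ExtraTreesClassifier]
n_rounds = 3  # hypothetical number of training rounds

# Collect one importance vector per (round, model) combination
importances = []
for i in range(n_rounds):
    for model_class in model_classes:
        # Assumption: a different seed per round gives a different fit each round
        model = model_class(n_estimators=25, random_state=i)
        model.fit(X, y)
        importances.append(model.feature_importances_)
importances = np.asarray(importances)  # shape (n_rounds * n_models, n_features)

# Monte Carlo estimate: mean and spread across all rounds and models
mean_importance = importances.mean(axis=0)
std_importance = importances.std(axis=0)
print(mean_importance.round(3))
print(std_importance.round(3))
```

Averaging over several fits smooths out the run-to-run variation of a single randomized tree ensemble, and the standard deviation indicates how stable each feature's importance is.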

list_models_

List with fitted tree-based models for every round after calling the fit method.

Type:

Nested list with objects, shape (n_rounds, n_models)

feat_importance

An array containing importance of each feature averaged across all rounds and trained models from list_model_classes.

Type:

array-like, shape (n_features)

feat_importance_std

An array containing the standard deviation of feature importance across all rounds and trained models from list_model_classes. Same order as feat_importance.

Type:

array-like, shape (n_features)

is_selected_

2D array indicating for each round whether a feature was selected by recursive feature elimination (True) or not (False). Same order as feat_importance.

Type:

array-like, shape (n_rounds, n_features)
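
To make the shape of such a per-round selection mask concrete, the following sketch builds a comparable boolean array with scikit-learn's RFE class. It is a stand-in under stated assumptions, not the selection procedure used by TreeModel: the synthetic data, the fixed target of 4 features, and the name is_selected are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data for illustration only
X, y = make_classification(n_samples=120, n_features=8, n_informative=3, random_state=0)

n_rounds = 3  # hypothetical number of rounds
masks = []
for i in range(n_rounds):
    estimator = RandomForestClassifier(n_estimators=25, random_state=i)
    # Remove the least important feature one at a time until 4 remain
    rfe = RFE(estimator, n_features_to_select=4, step=1).fit(X, y)
    masks.append(rfe.support_)  # boolean mask of the kept features

is_selected = np.array(masks)  # one row per round, one column per feature
print(is_selected.shape)        # (3, 8)
print(is_selected.sum(axis=1))  # [4 4 4]
```

Each row marks the features kept in one round, so comparing rows shows how consistently a feature survives elimination.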

Methods

add_feat_importance([df_feat, drop])

Include feature importance and its standard deviation to feature DataFrame.

eval(X[, labels, list_is_selected, ...])

Evaluate the prediction performance for different feature selections.

fit(X[, labels, n_rounds, use_rfe, n_cv, ...])

Fit tree-based models and compute average feature importance [Breimann25a].

predict_proba(X)

Obtain Monte Carlo estimate of class prediction probabilities for the positive class in X.
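
A Monte Carlo estimate of positive-class probabilities can be sketched by averaging the predict_proba output of several fitted scikit-learn models. This is an illustration only and makes no claim about the return value or internals of TreeModel.predict_proba(); the data split and the names pred and pred_std are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Synthetic binary data for illustration only
X, y = make_classification(n_samples=120, n_features=8, n_informative=3, random_state=0)
X_train, y_train, X_new = X[:100], y[:100], X[100:]

# Fit every model class once per round (3 hypothetical rounds)
fitted = []
for i in range(3):
    for model_class in (RandomForestClassifier, ExtraTreesClassifier):
        fitted.append(model_class(n_estimators=25, random_state=i).fit(X_train, y_train))

# Column 1 of predict_proba holds the probability of the positive class (label 1)
probas = np.array([m.predict_proba(X_new)[:, 1] for m in fitted])

pred = probas.mean(axis=0)     # averaged probability per sample
pred_std = probas.std(axis=0)  # disagreement between the fitted models
print(pred.round(2))
print(pred_std.round(2))
```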

__init__(list_model_classes=None, list_model_kwargs=None, is_preselected=None, verbose=True, random_state=None)[source]
Parameters:
  • list_model_classes (list of Type[ClassifierMixin or BaseEstimator], default=[RandomForestClassifier, ExtraTreesClassifier]) – A list of tree-based model classes to be used for feature importance analysis.

  • list_model_kwargs (list of dict, optional) – A list of dictionaries containing keyword arguments for each model in list_model_classes.

  • is_preselected (array-like, shape (n_features)) – Boolean array indicating which features are preselected before recursive feature elimination (RFE) is applied. True marks a preselected feature, False a feature that is not preselected.

  • verbose (bool, default=True) – If True, verbose outputs are enabled.

  • random_state (int, optional) – The seed used by the random number generator. If a positive integer, stochastic processes yield the same results across runs, enabling reproducibility. If None, no fixed seed is set and results can differ between runs.

Notes

  • All attributes are set during fitting via the TreeModel.fit() method and can be directly accessed.

Warning

  • This class belongs to the explainable AI module requiring SHAP, which is automatically installed via pip install aaanalysis[pro].

Examples

The TreeModel object can be instantiated without providing any parameter:

import aaanalysis as aa
tm = aa.TreeModel()

You can provide a list of tree-based models and their respective arguments using the list_model_classes and list_model_kwargs parameters:

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

# Default classes (RandomForestClassifier, ExtraTreesClassifier) plus GradientBoostingClassifier
list_model_classes = [RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier]
print("Default model arguments: ", tm._list_model_kwargs)

# Adjust default parameters
list_model_kwargs = [dict(n_estimators=64)] * 3
tm = aa.TreeModel(list_model_classes=list_model_classes, list_model_kwargs=list_model_kwargs)
print("New model arguments: ", tm._list_model_kwargs)
Default model arguments:  [{'random_state': None}, {'random_state': None}]
New model arguments:  [{'n_estimators': 64, 'random_state': None}, {'n_estimators': 64, 'random_state': None}, {'n_estimators': 64, 'random_state': None}]

You can set the random_state and verbose parameters:

# Set random seed and disable verbosity
tm = aa.TreeModel(random_state=42, verbose=False)
print("New model arguments: ", tm._list_model_kwargs)
New model arguments:  [{'random_state': 42}, {'random_state': 42}]

You can compare different feature pre-filtering strategies using the is_preselected parameter. We demonstrate this with the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

import numpy as np
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

# Pre-select top 10 and top 50 features
mask_top10 = np.asarray(df_feat.index < 10)
mask_top50 = np.asarray(df_feat.index < 50)

We can now compare the prediction performance for these preselected feature sets using the TreeModel().eval() method:

df_eval = tm.eval(X, labels=labels, list_is_selected=[np.array([mask_top10]), np.array([mask_top50])])
aa.display_df(df_eval)
   name   accuracy  precision  recall    f1
1  Set 1  0.762200  0.769900   0.769200  0.762600
2  Set 2  0.842200  0.838600   0.875000  0.849000
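
The eval call above wraps each 1D mask in np.array([...]). The NumPy-only sketch below shows what this does to the shapes. It assumes that df_feat has a default integer index (0 to 99), which is why np.arange stands in for df_feat.index. The names sel_top10 and sel_top50 are hypothetical.

```python
import numpy as np

n_features = 100               # matches the 100 features kept via .head(100) above
index = np.arange(n_features)  # assumed stand-in for df_feat.index

mask_top10 = index < 10  # True for the first 10 features
mask_top50 = index < 50  # True for the first 50 features

# Wrapping a 1D mask in a list adds a leading axis of length 1
sel_top10 = np.array([mask_top10])
sel_top50 = np.array([mask_top50])

print(mask_top10.shape, sel_top10.shape)  # (100,) (1, 100)
print(sel_top10.sum(), sel_top50.sum())   # 10 50
```

The resulting shape (1, n_features) has the same layout as is_selected_, shape (n_rounds, n_features), with a single round, so each list entry appears to represent one feature selection to evaluate.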