TreeModel

class TreeModel(list_model_classes=None, list_model_kwargs=None, is_preselected=None, verbose=True, random_state=None)[source]

Bases: Wrapper

Tree Model class: A wrapper for tree-based models to obtain Monte Carlo estimates of feature importance and predictions [Breimann25].

As a Wrapper, it implements the .fit / .eval model contract.

Monte Carlo estimates are derived by averaging feature importance or prediction probabilities across various tree-based models and training rounds, enhancing the robustness and reproducibility of these estimates. Additionally, the class supports feature selection through recursive feature elimination (RFE) and offers comprehensive evaluation of feature selections.
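The averaging described above can be sketched with plain scikit-learn. This is a minimal illustration rather than the TreeModel implementation: the toy data, the per-round seeding scheme, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier

# Toy data (assumption: any binary classification matrix would do)
X, y = make_classification(n_samples=120, n_features=8, n_informative=3, random_state=0)

model_classes = [RandomForestClassifier, ExtraTreesClassifier]
n_rounds = 3

# Fit every model class once per round, each round with its own seed (assumed scheme)
importances = []
for round_idx in range(n_rounds):
    for model_class in model_classes:
        model = model_class(n_estimators=25, random_state=round_idx).fit(X, y)
        importances.append(model.feature_importances_)
importances = np.asarray(importances)  # shape (n_rounds * n_models, n_features)

# Monte Carlo estimate: mean and spread over all rounds and models
mean_importance = importances.mean(axis=0)
std_importance = importances.std(axis=0)
print(np.round(mean_importance, 3))
print(np.round(std_importance, 3))
```

Because each fitted model's importances sum to 1, the averaged vector does as well; the standard deviation indicates how stable each feature's rank is across rounds and model types.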

Added in version 0.1.3.

list_models_

List with fitted tree-based models for every round after calling the fit method.

Type: Nested list with objects, shape (n_rounds, n_models)

feat_importance

An array containing importance of each feature averaged across all rounds and trained models from list_model_classes. Note: unlike sklearn convention, this attribute does not carry a trailing underscore.

Type: array-like, shape (n_features)

feat_importance_std

An array containing standard deviation for feature importance across all rounds and trained models from list_model_classes. Same order as feat_importance. Note: unlike sklearn convention, this attribute does not carry a trailing underscore.

Type: array-like, shape (n_features)

is_selected_

2D array indicating, for each round, whether a feature was selected by recursive feature elimination (True) or not (False). Same feature order as feat_importance.

Type: array-like, shape (n_rounds, n_features)
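A boolean array of shape (n_rounds, n_features) can be summarized along either axis. The sketch below uses a made-up mask (not TreeModel output) to show per-round counts and per-feature selection frequencies; the variable names are illustrative only.

```python
import numpy as np

# Hypothetical selection mask with shape (n_rounds, n_features) = (3, 5)
is_selected = np.array([
    [True, True, False, False, True],
    [True, False, False, False, True],
    [True, True, False, True, True],
])

# Number of selected features in each round
n_selected_per_round = is_selected.sum(axis=1)

# Fraction of rounds in which each feature was selected
selection_frequency = is_selected.mean(axis=0)

# Features selected in every round
always_selected = is_selected.all(axis=0)

print(n_selected_per_round)  # [3 2 4]
print(np.round(selection_frequency, 2))
print(always_selected)
```

Note that calling .sum() without an axis counts selections over all rounds together, which is generally larger than the number of distinct selected features.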

Parameters:

list_model_classes (Optional[List[Type[Union[ClassifierMixin, BaseEstimator]]]])
list_model_kwargs (Optional[List[Dict]])
is_preselected (Union[Sequence[Union[int, float]], ndarray, Series, None])
verbose (bool)
random_state (Optional[int])

Methods

`add_feat_importance`(df_feat[, drop, sort])	Include feature importance and its standard deviation to feature DataFrame.
`eval`(X, labels, list_is_selected[, ...])	Evaluate the prediction performance for different feature selections.
`fit`(X, labels[, n_rounds, use_rfe, n_cv, ...])	Fit tree-based models and compute average feature importance [Breimann25].
`predict_proba`(X)	Obtain Monte Carlo estimate of class prediction probabilities for the positive class in X.
`select_features`(df_feat, strategy, param)	Select a subset of features from a feature DataFrame using tree-based feature importance.

__init__(list_model_classes=None, list_model_kwargs=None, is_preselected=None, verbose=True, random_state=None)[source]

Parameters:

list_model_classes (list of Type[ClassifierMixin or BaseEstimator], default=[RandomForestClassifier, ExtraTreesClassifier]) – A list of tree-based model classes to be used for feature importance analysis.
list_model_kwargs (list of dict, optional) – A list of dictionaries containing keyword arguments for each model in list_model_classes.
is_preselected (array-like, shape (n_features), optional) – Boolean array indicating which features are preselected before recursive feature elimination is applied. True marks a preselected feature, False a feature that is not preselected.
verbose (bool, default=True) – If True, verbose outputs are enabled.
random_state (int, optional) – The seed used by the random number generator. If a positive integer, results of stochastic processes are consistent, enabling reproducibility. If None, no fixed seed is used and results vary between runs.

Notes

All attributes are set during fitting via the TreeModel.fit() method and can be directly accessed.

See also

sklearn.ensemble.RandomForestClassifier for random forest model.
sklearn.ensemble.ExtraTreesClassifier for extra trees model.

Examples

The TreeModel object can be instantiated without providing any parameter:

import aaanalysis as aa
tm = aa.TreeModel()

You can provide a list of tree-based models and their respective arguments using the list_model_classes and list_model_kwargs parameters:

from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier

# Default classes (RandomForestClassifier, ExtraTreesClassifier) plus GradientBoostingClassifier
list_model_classes = [RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier]
print("Default model arguments: ", tm._list_model_kwargs)

# Adjust default parameters
list_model_kwargs = [dict(n_estimators=64)] * 3
tm = aa.TreeModel(list_model_classes=list_model_classes, list_model_kwargs=list_model_kwargs)
print("New model arguments: ", tm._list_model_kwargs)

Default model arguments:  [{'random_state': None}, {'random_state': None}]
New model arguments:  [{'n_estimators': 64, 'random_state': None}, {'n_estimators': 64, 'random_state': None}, {'n_estimators': 64, 'random_state': None}]
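One Python detail about the expression `[dict(n_estimators=64)] * 3` used above: list multiplication repeats references to a single dictionary instead of copying it. Whether TreeModel copies these dictionaries internally is not covered here, so a list comprehension is the safer way to build independent per-model arguments. The sketch uses only built-in Python, and the variable names are illustrative.

```python
# List multiplication repeats one dict object three times
shared = [dict(n_estimators=64)] * 3
shared[0]["max_depth"] = 5
print(shared[2])  # {'n_estimators': 64, 'max_depth': 5}

# A comprehension creates one independent dict per model
independent = [dict(n_estimators=64) for _ in range(3)]
independent[0]["max_depth"] = 5
print(independent[2])  # {'n_estimators': 64}
```

This matters as soon as the models should receive different settings, for example a different n_estimators per class.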

You can set the random_state and verbose parameters:

# Set random seed and disable verbosity
tm = aa.TreeModel(random_state=42, verbose=False)
print("New model arguments: ", tm._list_model_kwargs)

New model arguments:  [{'random_state': 42}, {'random_state': 42}]

You can compare different feature pre-filtering strategies using the is_preselected parameter, which we will demonstrate with the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import numpy as np
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

# Pre-select top 10 and top 50 features
mask_top10 = np.asarray(df_feat.index < 10)
mask_top50 = np.asarray(df_feat.index < 50)

We can now compare the prediction performance for these preselected feature sets using the TreeModel().eval() method:

df_eval = tm.eval(X, labels=labels, list_is_selected=[np.array([mask_top10]), np.array([mask_top50])])
aa.display_df(df_eval)

	name	accuracy	precision	recall	f1
1	Set 1	0.758200	0.767700	0.761500	0.756700
2	Set 2	0.842200	0.838600	0.875000	0.849000
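To make the idea of scoring several feature subsets concrete, here is a rough scikit-learn sketch that cross-validates one classifier on each masked feature matrix and reports the same four metrics as the table above. It is not how eval() is implemented; the toy data, the choice of RandomForestClassifier, the 5-fold split, and the mask definitions are all assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Toy data standing in for a real feature matrix (assumption)
X, y = make_classification(n_samples=150, n_features=20, n_informative=5, random_state=0)

# Two hypothetical feature subsets expressed as boolean masks
masks = {"Set 1": np.arange(20) < 5, "Set 2": np.arange(20) < 15}
scoring = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, mask in masks.items():
    model = RandomForestClassifier(n_estimators=25, random_state=0)
    # Cross-validate on the columns kept by the mask
    cv = cross_validate(model, X[:, mask], y, cv=5, scoring=scoring)
    results[name] = {metric: float(cv[f"test_{metric}"].mean()) for metric in scoring}

for name, scores in results.items():
    print(name, {metric: round(value, 3) for metric, value in scores.items()})
```

Each row of the printed summary corresponds to one feature subset, mirroring the layout of the evaluation table.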

Further parameters. The constructor’s is_preselected restricts every fitting round to a pre-filtered feature subset, letting you compare pre-filtering strategies against the full feature set.

# is_preselected restricts every fitting round to a pre-filtered feature subset (here the top 10)
tm = aa.TreeModel(is_preselected=mask_top10)
is_selected_top10 = tm.fit(X=X, labels=labels).is_selected_
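# is_selected_ has shape (n_rounds, n_features), so this sum counts selections over all rounds
print("Selections within the top-10 pre-selection (summed over rounds):", int(np.asarray(is_selected_top10).sum()))

Selections within the top-10 pre-selection (summed over rounds): 50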

property feature_importances_: ndarray[source]

Averaged feature importances in scikit-learn convention (1-D, summing to ~1).

Alias of feat_importance (expressed in percent) rescaled to fractions and exposed under the standard scikit-learn name, so the Monte-Carlo importances compose with tools that read feature_importances_ (e.g. ranking / selection utilities). Available after fit().

Returns: feature_importances_ – Mean importance per feature as a fraction (feat_importance / 100), same order as feat_importance.
Return type: array-like, shape (n_features,)
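The documented relation between the two attributes (feat_importance / 100) can be illustrated with a small NumPy sketch. The percentage values below are invented for the example and do not come from a fitted TreeModel.

```python
import numpy as np

# Hypothetical importances in percent, as feat_importance would hold them
feat_importance = np.array([40.0, 25.0, 20.0, 10.0, 5.0])

# Rescale to fractions, mirroring the documented feat_importance / 100 relation
feature_importances = feat_importance / 100

# Rank features from most to least important and keep the top three
ranking = np.argsort(feature_importances)[::-1]
top3 = ranking[:3]

print(round(float(feature_importances.sum()), 6))
print(top3)  # [0 1 2]
```

Tools that expect fractional importances summing to roughly 1 can then consume the rescaled vector directly.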