TreeModel.fit

TreeModel.fit(X, labels, n_rounds=5, use_rfe=False, n_cv=5, n_feat_min=25, n_feat_max=75, metric='accuracy', step=None)[source]

Fit tree-based models and compute average feature importance [Breimann25].

The feature importance is calculated across all models and rounds. In each round, the set of features can optionally be prefiltered using Recursive Feature Elimination (RFE) with a default RandomForestClassifier, if use_rfe is set to True. This RFE process iteratively reduces the number of features to enhance model performance, guided by the metric specified in metric. The reduction continues until reaching n_feat_min, with an upper limit of n_feat_max features considered.

Added in version 0.1.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).
n_rounds (int, default=5) – The number of rounds (>=1) to fit the model.
use_rfe (bool, default=False) – If True, recursive feature elimination (RFE) with a random forest model is used for feature selection.
n_cv (int, default=5) – The number of cross-validation folds for RFE, must be > 1 and ≤ the smallest class’s sample count.
n_feat_min (int, default=25) – The minimum number of features to select each round for RFE. Should be 0 < n_feat_min <= n_feat_max.
n_feat_max (int, default=75) – The maximum number of features to select each round for RFE. Should be >= n_feat_min.
metric (str, default="accuracy") – The name of the scoring function to use for cross-validation for RFE. Valid metrics are: {‘accuracy’, ‘balanced_accuracy’, ‘precision’, ‘recall’, ‘f1’, ‘roc_auc’}
step (int, optional) – Number of features to remove per RFE iteration. If None, removes all features with the lowest importance scores each iteration, offering a faster but less precise approach. Should be < n_features. Ideally, step should be much smaller than n_features, with 1-5% of n_features being a recommended range.

Returns:

The fitted TreeModel instance.

Return type:

TreeModel

See also

[Breimann25] describes the recursive feature elimination algorithm and the feature importance aggregation.
sklearn.ensemble.RandomForestClassifier for the random forest model used with default settings for recursive feature elimination.
sklearn.feature_selection.RFECV for a similar cross-validation-based recursive feature elimination algorithm, which does not provide an upper limit for the number of features to select.
sklearn.model_selection.cross_validate() for details on cross-validation.
Scikit-learn cross-validation documentation.
Scikit-learn classification metrics and scorings.

Examples

To demonstrate the TreeModel().fit() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(10)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create a TreeModel object and fit it to obtain the importance of each feature and their standard deviation using the feat_importance and feat_importance_std attributes:

tm = aa.TreeModel()
tm.fit(X, labels=labels)

feat_importance = tm.feat_importance
feat_importance_std = tm.feat_importance_std

print("Feature importance: ", feat_importance)
print("Their STD: ", feat_importance_std)

Feature importance:  [ 6.345  9.741 12.062 11.896  9.537  6.134  8.117  9.133 11.615 15.42 ]
Their STD:  [0.29  0.805 0.406 1.038 0.872 0.431 0.452 0.325 0.2   0.847]

To obtain Monte Carlo estimates of the feature importance, the TreeModel().fit() method performs 5 rounds of model fitting and averages the feature importance across all rounds. The number of rounds can be adjusted using the n_rounds (default=5) parameter:

tm = aa.TreeModel()
tm.fit(X, labels=labels, n_rounds=1)

feat_importance = tm.feat_importance
feat_importance_std = tm.feat_importance_std

print("Feature importance: ", feat_importance)
print("Their STD: ", feat_importance_std)

Feature importance:  [ 7.976 10.327 12.49  11.926  9.249  6.877  7.583  8.292 10.899 14.38 ]
Their STD:  [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
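The averaging across rounds can be sketched in plain Python. This is only an illustration of the idea, not the library's implementation: the per-round values are made up, the aggregate helper is hypothetical, and the use of the population standard deviation is an assumption (it is consistent with the all-zero STD printed above for n_rounds=1). Note that the printed importances in the outputs above sum to 100.

```python
from statistics import mean, pstdev

# Illustrative per-round importances (rows = rounds, columns = features).
# These numbers are made up for the sketch; they are not library output.
rounds = [
    [10.0, 30.0, 60.0],
    [14.0, 26.0, 60.0],
    [12.0, 31.0, 57.0],
]


def aggregate(importance_per_round):
    """Hypothetical helper: per-feature mean and population std across rounds."""
    columns = list(zip(*importance_per_round))  # one tuple of values per feature
    avg = [round(mean(col), 3) for col in columns]
    std = [round(pstdev(col), 3) for col in columns]
    return avg, std


avg, std = aggregate(rounds)
print("Mean over 3 rounds:", avg)
print("STD over 3 rounds: ", std)

# With a single round there is no spread, so every STD is zero.
avg_single, std_single = aggregate(rounds[:1])
print("STD over 1 round:  ", std_single)
```

With a single round, each feature has exactly one value, so the spread is zero by construction; more rounds are needed for the standard deviation to be informative.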

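Moreover, the features can be prefiltered in each round by a recursive feature elimination (RFE) algorithm, which is controlled by the use_rfe parameter. RFE is disabled by default (use_rfe=False), which is set explicitly here: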

tm.fit(X, labels=labels, use_rfe=False)
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 6.156  9.613 12.097 11.915  9.932  6.334  8.121  9.255 12.036 14.541]

When RFE is used, the number of features selected per round is bounded by the n_feat_min and n_feat_max parameters:

tm.fit(X, labels=labels, n_feat_min=1, n_feat_max=3)
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 6.882  9.79  11.114 12.435 10.012  6.128  7.932  9.254 11.346 15.107]
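To make the role of the two bounds more concrete, the following toy sketch shows one possible reading of a bounded elimination loop: features are dropped until n_feat_min remain, and only subsets with at most n_feat_max features are scored. The function bounded_rfe, the toy weights, and both callbacks are hypothetical and stand in for the random forest importance and the cross-validated metric; the actual procedure is described in [Breimann25] and may differ.

```python
def bounded_rfe(features, rank_fn, score_fn, n_feat_min, n_feat_max, step=1):
    """Hypothetical sketch: eliminate features down to n_feat_min and return
    the best-scoring subset among those with at most n_feat_max features."""
    if not 0 < n_feat_min <= n_feat_max:
        raise ValueError("Expected 0 < n_feat_min <= n_feat_max")
    current = list(features)
    best_subset, best_score = None, float("-inf")
    while True:
        # Only subsets within the upper bound are candidates for selection.
        if len(current) <= n_feat_max:
            score = score_fn(current)
            if score > best_score:
                best_subset, best_score = list(current), score
        if len(current) <= n_feat_min:
            break
        # Never remove so many features that we drop below n_feat_min.
        n_remove = min(step, len(current) - n_feat_min)
        importance = rank_fn(current)
        ranked = sorted(current, key=lambda f: importance[f])
        dropped = set(ranked[:n_remove])
        current = [f for f in current if f not in dropped]
    return best_subset, best_score


# Toy stand-ins (assumptions): fixed importances and a score that rewards
# important features but penalizes the subset size.
weights = {"f0": 5.0, "f1": 1.0, "f2": 4.0, "f3": 0.5, "f4": 3.0, "f5": 0.2}


def toy_rank(subset):
    return {f: weights[f] for f in subset}


def toy_score(subset):
    return sum(weights[f] for f in subset) - 1.5 * len(subset)


best_subset, best_score = bounded_rfe(
    list(weights), toy_rank, toy_score, n_feat_min=1, n_feat_max=4
)
print(best_subset, best_score)
```

In this toy setting the loop settles on three features, because adding a fourth weak feature costs more than it contributes to the score.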

The performance measure used for evaluation during each RFE iteration is set by the metric parameter (default="accuracy"):

tm.fit(X, labels=labels, metric="recall")
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 6.505 10.099 11.17  11.961  9.985  5.858  8.317  9.857 11.882 14.367]

The number of features eliminated in each RFE iteration is controlled by the step parameter. With step=None (the default according to the signature), all features with the lowest importance are removed in each iteration, which is faster but less precise:

tm.fit(X, labels=labels, step=None)
feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)

Feature importance:  [ 6.893  9.862 11.45  11.73  10.083  6.217  7.902  9.432 11.189 15.242]
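If an explicit step is preferred, a value in the recommended range of 1-5% of n_features can be derived from the width of the feature matrix. The helper below is a hypothetical convenience function (not part of aaanalysis) that also keeps step below n_features, as the parameter description requires:

```python
def suggest_step(n_features, fraction=0.02):
    """Hypothetical helper: pick an RFE step of roughly `fraction` of n_features."""
    if n_features < 2:
        raise ValueError("n_features must be at least 2 so that step < n_features")
    if not 0.01 <= fraction <= 0.05:
        raise ValueError("fraction should lie in the recommended 1-5% range")
    # At least one feature is removed per iteration, and never all of them.
    step = max(1, round(n_features * fraction))
    return min(step, n_features - 1)


for n_features in (10, 100, 1000):
    print(n_features, "->", suggest_step(n_features))
```

The result could then be passed as step=suggest_step(X.shape[1]), assuming X is a two-dimensional array.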

The number of cross-validation folds used for RFE is set by the n_cv parameter (default=5), which must be > 1 and ≤ the smallest class’s sample count:

# Set the number of cross-validation folds for RFE to 3
tm.fit(X, labels=labels, n_cv=3)
feat_importance = tm.feat_importance
print("Feature importance:", feat_importance)

Feature importance: [ 6.075 10.037 11.122 12.446 10.511  6.141  8.13   9.118 11.416 15.005]
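The constraint on n_cv can be checked before fitting. The following sketch uses a hypothetical helper, check_n_cv, and toy labels; it only mirrors the documented rule (n_cv > 1 and ≤ the smallest class's sample count) and does not reproduce the library's own validation:

```python
from collections import Counter


def check_n_cv(labels, n_cv):
    """Hypothetical pre-check mirroring the documented n_cv constraint."""
    smallest_class = min(Counter(labels).values())
    if not 1 < n_cv <= smallest_class:
        raise ValueError(
            f"n_cv={n_cv} must be > 1 and <= {smallest_class} (smallest class size)"
        )
    return n_cv


# Toy labels (assumption): four positive and three negative samples.
toy_labels = [1, 1, 1, 1, 0, 0, 0]

check_n_cv(toy_labels, n_cv=3)  # valid: the smallest class has 3 samples

try:
    check_n_cv(toy_labels, n_cv=5)  # invalid: more folds than negatives
except ValueError as err:
    print(err)
```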