aaanalysis.TreeModel.fit
- TreeModel.fit(X, labels=None, n_rounds=5, use_rfe=False, n_cv=5, n_feat_min=25, n_feat_max=75, metric='accuracy', step=None)
Fit tree-based models and compute average feature importance [Breimann25a].
The feature importance is calculated across all models and rounds. In each round, the set of features can optionally be prefiltered using Recursive Feature Elimination (RFE) with a default RandomForestClassifier, if use_rfe is set to True. This RFE process iteratively reduces the number of features to enhance model performance, guided by the metric specified in metric. The reduction continues until reaching n_feat_min, with an upper limit of n_feat_max features considered.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).
n_rounds (int, default=5) – The number of rounds (>=1) to fit the model.
use_rfe (bool, default=False) – If True, recursive feature elimination (RFE) is used with random forest model for feature selection.
n_cv (int, default=5) – The number of cross-validation folds for RFE, must be > 1 and ≤ the smallest class’s sample count.
n_feat_min (int, default=25) – The minimum number of features to select each round for RFE. Should be 0 < n_feat_min <= n_feat_max.
n_feat_max (int, default=75) – The maximum number of features to select each round for RFE. Should be >= n_feat_min.
metric (str, default="accuracy") – The name of the scoring function to use for cross-validation for RFE. Valid metrics are: {‘accuracy’, ‘balanced_accuracy’, ‘precision’, ‘recall’, ‘f1’, ‘roc_auc’}
step (int, optional) – Number of features to remove per RFE iteration. If None, removes all features with the lowest importance scores each iteration, offering a faster but less precise approach. Should be < n_features. Ideally, step should be much smaller than n_features, with 1-5% of n_features being a recommended range.
- Returns:
The fitted TreeModel instance.
- Return type:
See also
[Breimann25a] describes the recursive feature elimination algorithm and the feature importance aggregation.
sklearn.ensemble.RandomForestClassifier for the random forest model used with default settings for recursive feature elimination.
sklearn.feature_selection.RFECV for a similar cross-validation-based recursive feature elimination algorithm, which does not provide an upper limit for the number of features to select.
sklearn.model_selection.cross_validate() for details on cross-validation.
Scikit-learn cross-validation documentation.
Scikit-learn classification metrics and scorings.
Examples
To demonstrate the TreeModel().fit() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False # Disable verbosity
    df_seq = aa.load_dataset(name="DOM_GSEC")
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(10)

    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create a TreeModel object and fit it to obtain the importance of each feature and their standard deviation using the feat_importance and feat_importance_std attributes:

    tm = aa.TreeModel()
    tm.fit(X, labels=labels)
    feat_importance = tm.feat_importance
    feat_importance_std = tm.feat_importance_std
    print("Feature importance: ", feat_importance)
    print("Their STD: ", feat_importance_std)

    Feature importance: [ 6.294 9.552 11.667 12.728 9.545 5.666 8.222 9.047 11.695 15.584]
    Their STD: [0.452 0.892 0.913 0.807 0.89 0.459 0.353 0.19 0.463 0.98 ]

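The importance values printed above sum to 100, so they can be read as percentage shares. As a small illustration of how such an array might be inspected, the following sketch ranks the features with NumPy. The values are copied from the output above; the variable names `ranking` and `top3` are introduced here for illustration only.

```python
import numpy as np

# Importance values copied from the example output above
feat_importance = np.array(
    [6.294, 9.552, 11.667, 12.728, 9.545, 5.666, 8.222, 9.047, 11.695, 15.584]
)

# Indices of the features, ordered from most to least important
ranking = np.argsort(feat_importance)[::-1]
top3 = ranking[:3]

print("Sum of importances:", round(float(feat_importance.sum()), 3))
print("Ranking (feature indices):", ranking)
print("Top 3 feature indices:", top3)
```

The indices refer to column positions in X, so they can be mapped back to the rows of df_feat in the same order.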
To obtain Monte Carlo estimates of the feature importance, the TreeModel().fit() method performs 5 rounds of model fitting and averages the feature importance across all rounds. The number of rounds can be adjusted using the n_rounds (default=5) parameter:

    tm = aa.TreeModel()
    tm.fit(X, labels=labels, n_rounds=1)
    feat_importance = tm.feat_importance
    feat_importance_std = tm.feat_importance_std
    print("Feature importance: ", feat_importance)
    print("Their STD: ", feat_importance_std)

    Feature importance: [ 7.307 8.944 11.914 12.237 10.92 6.139 7.903 8.776 10.644 15.215]
    Their STD: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

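The idea of averaging importance over several rounds can be sketched with scikit-learn alone. This is not the TreeModel implementation: it assumes a single RandomForestClassifier per round, synthetic data from make_classification, and a hypothetical helper named `average_importance`. It only illustrates why a single round yields a standard deviation of zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def average_importance(X, y, n_rounds=5):
    """Fit one random forest per round; return mean and std of importance in percent."""
    per_round = []
    for seed in range(n_rounds):
        # A different seed per round makes each fit a separate random draw
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        model.fit(X, y)
        per_round.append(model.feature_importances_ * 100)
    per_round = np.asarray(per_round)
    return per_round.mean(axis=0), per_round.std(axis=0)


# Synthetic stand-in data (not the DOM_GSEC dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)

mean_5, std_5 = average_importance(X_demo, y_demo, n_rounds=5)
mean_1, std_1 = average_importance(X_demo, y_demo, n_rounds=1)

print("Mean importance over 5 rounds:", mean_5.round(3))
print("STD over 5 rounds:", std_5.round(3))
print("STD over 1 round:", std_1)
```

With more rounds the mean becomes more stable, at the cost of proportionally more model fits.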
Moreover, a recursive feature elimination (RFE) algorithm can be applied by setting use_rfe=True. It is disabled by default, which is equivalent to setting use_rfe=False explicitly:

    tm.fit(X, labels=labels, use_rfe=False)
    feat_importance = tm.feat_importance
    print("Feature importance: ", feat_importance)

    Feature importance: [ 6.054 10.371 11.681 11.519 10.112 6.028 8.332 9.352 10.814 15.738]

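For readers unfamiliar with cross-validated RFE, the following sketch uses scikit-learn's RFECV, which the See also section names as a similar algorithm. It is an analogy rather than the procedure used inside TreeModel: RFECV offers a lower bound (min_features_to_select) but no upper limit on the number of selected features, and the data here are synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in data (not the DOM_GSEC dataset)
X_demo, y_demo = make_classification(
    n_samples=80, n_features=12, n_informative=4, random_state=0
)

# Random forest as the ranking model; n_estimators is reduced to keep the demo fast
rf = RandomForestClassifier(n_estimators=25, random_state=0)

selector = RFECV(
    estimator=rf,
    step=1,                     # remove one feature per iteration
    min_features_to_select=3,   # lower bound, comparable in spirit to n_feat_min
    cv=3,                       # number of cross-validation folds
    scoring="accuracy",         # metric guiding the selection
)
selector.fit(X_demo, y_demo)

n_selected = selector.n_features_
selected_mask = selector.support_
print("Number of selected features:", n_selected)
print("Selected feature mask:", selected_mask)
```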
When RFE is enabled, the number of features selected per round is controlled by the n_feat_min and n_feat_max parameters:

    tm.fit(X, labels=labels, n_feat_min=1, n_feat_max=3)
    feat_importance = tm.feat_importance
    print("Feature importance: ", feat_importance)

    Feature importance: [ 6.471 10.085 11.328 12.093 9.88 5.853 8.283 9.282 10.995 15.731]

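The constraints documented for n_cv, n_feat_min, and n_feat_max can be checked before fitting. The helper below is a hypothetical pre-check written for this example (`check_rfe_params` is not part of aaanalysis); it simply encodes the rules stated in the parameter descriptions.

```python
from collections import Counter


def check_rfe_params(labels, n_cv=5, n_feat_min=25, n_feat_max=75):
    """Hypothetical pre-check mirroring the documented RFE parameter constraints."""
    smallest_class = min(Counter(labels).values())
    # n_cv must be > 1 and no larger than the smallest class's sample count
    if not 1 < n_cv <= smallest_class:
        raise ValueError(
            f"'n_cv' ({n_cv}) must be > 1 and <= smallest class size ({smallest_class})"
        )
    # 0 < n_feat_min <= n_feat_max
    if not 0 < n_feat_min <= n_feat_max:
        raise ValueError(
            f"Need 0 < 'n_feat_min' ({n_feat_min}) <= 'n_feat_max' ({n_feat_max})"
        )
    return True


# Toy labels: 12 positives and 8 negatives
labels_demo = [1] * 12 + [0] * 8
print(check_rfe_params(labels_demo, n_cv=5, n_feat_min=1, n_feat_max=3))
```

Calling the helper with n_feat_min=5 and n_feat_max=3, or with n_cv=9 for these labels, raises a ValueError.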
The performance measure for the evaluation during each RFE iteration can be set by the metric parameter (default=accuracy):

    tm.fit(X, labels=labels, metric="recall")
    feat_importance = tm.feat_importance
    print("Feature importance: ", feat_importance)

    Feature importance: [ 7.18 9.461 11.579 12.39 9.407 6.258 8.293 9.634 11.201 14.599]

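The six valid metric names are also standard scikit-learn scorer names, so their behavior can be explored independently with sklearn.model_selection.cross_validate. The sketch below assumes synthetic data and a single random forest; it shows how the metrics differ on the same model, not how TreeModel evaluates them internally.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Metric names listed as valid for the 'metric' parameter
METRICS = ["accuracy", "balanced_accuracy", "precision", "recall", "f1", "roc_auc"]

# Synthetic stand-in data (not the DOM_GSEC dataset)
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=25, random_state=0)

# Evaluate all metrics in one 5-fold cross-validation run
cv_results = cross_validate(model, X_demo, y_demo, cv=5, scoring=METRICS)
mean_scores = {name: cv_results[f"test_{name}"].mean() for name in METRICS}

for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```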
The number of features eliminated in each RFE iteration is controlled by the step parameter (default=None). If set to None, all features with the lowest importance are removed in each iteration, which offers a faster but less precise approach:

    tm.fit(X, labels=labels, step=None)
    feat_importance = tm.feat_importance
    print("Feature importance: ", feat_importance)

    Feature importance: [ 6.955 9.854 11.818 12.225 9.785 5.923 7.901 9.162 11.722 14.654]

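The parameter description recommends an integer step of roughly 1-5% of n_features. The sketch below shows one way such a value could be derived and how many elimination iterations it would imply. Both helpers are hypothetical, and the iteration count assumes a fixed number of features is removed per iteration (as in scikit-learn's RFE with an integer step), which may differ from TreeModel's internal behavior.

```python
import math


def suggest_step(n_features, fraction=0.02):
    """Hypothetical helper: integer step equal to a fraction of n_features (at least 1)."""
    if not 0 < fraction < 1:
        raise ValueError("'fraction' should be between 0 and 1")
    return max(1, round(n_features * fraction))


def n_elimination_iterations(n_features, n_feat_min, step):
    """Iterations needed to go from n_features down to n_feat_min, removing 'step' each time."""
    return math.ceil(max(0, n_features - n_feat_min) / step)


step_200 = suggest_step(200, fraction=0.02)
iters_200 = n_elimination_iterations(200, n_feat_min=25, step=step_200)
step_10 = suggest_step(10, fraction=0.02)

print("Suggested step for 200 features:", step_200)
print("Iterations down to 25 features:", iters_200)
print("Suggested step for 10 features:", step_10)
```

A smaller step means more iterations and a longer run time, but a finer-grained elimination.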