aaanalysis.TreeModel.eval
- TreeModel.eval(X, labels=None, list_is_selected=None, convert_1d_to_2d=False, names_feature_selections=None, n_cv=5, list_metrics=None)
Evaluate the prediction performance for different feature selections.
- Parameters:
X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).
list_is_selected (array-like, shape (n_feature_sets, n_rounds, n_features)) – List of 2D boolean arrays with shape (n_rounds, n_features) indicating different feature selections.
convert_1d_to_2d (bool, default=False) – If True, convert all boolean arrays in list_is_selected from 1D to 2D arrays with a single row.
names_feature_selections (list of str, optional) – List of names for the feature selections given in list_is_selected.
n_cv (int, default=5) – The number of cross-validation folds for RFE, must be > 1 and ≤ the smallest class’s sample count.
list_metrics (str or list of str, default=['accuracy', 'precision', 'recall', 'f1']) – List of scoring metrics to use for evaluation. Only metrics for binary classification are allowed. Valid metrics are: {‘accuracy’, ‘balanced_accuracy’, ‘precision’, ‘recall’, ‘f1’, ‘roc_auc’}
- Returns:
df_eval – Evaluation results for the feature subsets (obtained by recursive feature selection) given with list_is_selected.
- Return type:
pd.DataFrame
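The constraint on n_cv can be checked before calling eval(). The following is a minimal sketch in plain Python; check_n_cv is a hypothetical helper, not part of aaanalysis, and it only mirrors the rule stated above (n_cv > 1 and no larger than the smallest class's sample count).

```python
from collections import Counter


def check_n_cv(labels, n_cv=5):
    """Hypothetical helper: validate n_cv against the class sizes in labels."""
    # Count how many samples each class has
    class_counts = Counter(labels)
    smallest = min(class_counts.values())
    if n_cv <= 1:
        raise ValueError(f"n_cv ({n_cv}) must be > 1")
    if n_cv > smallest:
        raise ValueError(
            f"n_cv ({n_cv}) must be <= smallest class size ({smallest})"
        )
    return n_cv


labels = [1, 1, 1, 0, 0, 0, 0, 0]
print(check_n_cv(labels, n_cv=3))  # 3 folds fit: the smallest class has 3 samples
```

How the library itself reports a violated constraint is not stated here, so the error messages above are illustrative only.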
Notes
sklearn.metrics.balanced_accuracy_score() is recommended if datasets are unbalanced.
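To illustrate why balanced accuracy is preferable on unbalanced data, here is a minimal sketch in plain Python. It assumes the usual definition of balanced accuracy as the mean of the per-class recalls; the labels and predictions are made up for the illustration, and both functions are stand-ins rather than the scikit-learn implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls."""
    recalls = []
    for cls in set(y_true):
        # Indices of all samples that truly belong to this class
        idx = [i for i, t in enumerate(y_true) if t == cls]
        hits = sum(y_pred[i] == cls for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Made-up example: 90 negatives, 10 positives, classifier always predicts 0
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

A classifier that ignores the minority class still reaches 90% accuracy, while its balanced accuracy stays at chance level.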
Warning
list_metrics: ‘precision’ and ‘f1’ metrics may trigger ‘UndefinedMetricWarning’ on imbalanced or small datasets due to division by zero if no positive predictions are made.
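The division by zero behind this warning can be reproduced with a small sketch in plain Python. The precision function below is a hypothetical stand-in, not the scikit-learn implementation; its zero_division argument is modelled on the parameter of the same name in sklearn.metrics.precision_score. Whether TreeModel.eval() exposes such an option is not stated here.

```python
def precision(y_true, y_pred, zero_division=0.0):
    """Precision = TP / (TP + FP); falls back to zero_division if undefined."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    if tp + fp == 0:
        # No positive predictions: the ratio 0/0 is undefined
        return zero_division
    return tp / (tp + fp)


y_true = [0, 0, 0, 0, 1]
print(precision(y_true, [0, 0, 0, 0, 0]))  # 0.0 (fallback, nothing predicted positive)
print(precision(y_true, [0, 1, 0, 0, 1]))  # 0.5 (1 TP, 1 FP)
```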
See also
Scikit-learn classification metrics and scorings.
Examples
To demonstrate the TreeModel().eval() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False # Disable verbosity

    df_seq = aa.load_dataset(name="DOM_GSEC")
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(100)

    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create two feature selections using the is_preselected parameter of the TreeModel class and its .fit() method:

    import numpy as np

    tm = aa.TreeModel()
    is_selected = tm.fit(X=X, labels=labels).is_selected_

    # Pre-selected from top 20
    is_preselected_top20 = np.asarray(df_feat.index < 20)
    tm = aa.TreeModel(is_preselected=is_preselected_top20)
    is_selected_top20 = tm.fit(X=X, labels=labels).is_selected_

To evaluate different feature selections, provide X, labels, and the feature selections as boolean 2D arrays using the list_is_selected parameter:

    list_is_selected = [is_selected, is_selected_top20]
    df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected)
    aa.display_df(df_eval)

   name   accuracy  precision  recall    f1
1  Set 1  0.863500  0.846300   0.895000  0.865000
2  Set 2  0.814900  0.814500   0.833300  0.822000

You can also use 1D boolean masks by setting convert_1d_to_2d=True. To demonstrate this, we create three different boolean masks based on different scale categories:

    mask_volume = np.asarray(df_feat["category"] == "ASA/Volume")
    mask_conformation = np.asarray(df_feat["category"] == "Conformation")
    mask_energy = np.asarray(df_feat["category"] == "Energy")
    list_is_selected = [mask_volume, mask_conformation, mask_energy]
    df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected,
                      convert_1d_to_2d=True)
    aa.display_df(df_eval)

   name   accuracy  precision  recall    f1
1  Set 1  0.813800  0.839500   0.817900  0.821900
2  Set 2  0.838500  0.854700   0.867300  0.840300
3  Set 3  0.825700  0.822600   0.866700  0.826100

Provide the names of the feature selections using the names_feature_selections parameter:

    names_feature_selections = ["ASA/Volume", "Conformation", "Energy"]
    df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected,
                      convert_1d_to_2d=True,
                      names_feature_selections=names_feature_selections)
    aa.display_df(df_eval)

   name          accuracy  precision  recall    f1
1  ASA/Volume    0.834000  0.821600   0.802600  0.820800
2  Conformation  0.838300  0.853400   0.906400  0.854700
3  Energy        0.829700  0.839400   0.834000  0.820900

The evaluation strategy can be adjusted by changing the number of cross-validation folds (n_cv, default=5) and the scoring metrics via the list_metrics parameter (default=["accuracy", "precision", "recall", "f1"]):

    list_metrics = ["balanced_accuracy", "roc_auc"]
    df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected,
                      convert_1d_to_2d=True, list_metrics=list_metrics)
    aa.display_df(df_eval)

   name   balanced_accuracy  roc_auc
1  Set 1  0.834900           0.886700
2  Set 2  0.840100           0.954400
3  Set 3  0.832100           0.911800
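
Once the evaluation results are available, the feature selections can be ranked by a chosen metric. The sketch below uses plain Python dictionaries holding the values copied from the table above instead of the returned DataFrame; rank_by is a hypothetical helper and not part of aaanalysis.

```python
# Values copied from the last table above (balanced_accuracy, roc_auc)
rows = [
    {"name": "Set 1", "balanced_accuracy": 0.8349, "roc_auc": 0.8867},
    {"name": "Set 2", "balanced_accuracy": 0.8401, "roc_auc": 0.9544},
    {"name": "Set 3", "balanced_accuracy": 0.8321, "roc_auc": 0.9118},
]


def rank_by(rows, metric):
    """Hypothetical helper: sort result rows by a metric, best first."""
    return sorted(rows, key=lambda row: row[metric], reverse=True)


ranked = rank_by(rows, "roc_auc")
print([row["name"] for row in ranked])  # ['Set 2', 'Set 3', 'Set 1']
```

Since df_eval is a pd.DataFrame, the same ranking should be obtainable directly with df_eval.sort_values("roc_auc", ascending=False).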