aaanalysis.TreeModel.eval

TreeModel.eval(X, labels=None, list_is_selected=None, convert_1d_to_2d=False, names_feature_selections=None, n_cv=5, list_metrics=None)

Evaluate the prediction performance for different feature selections.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.

  • labels (array-like, shape (n_samples)) – Class labels for samples in X (typically, 1=positive, 0=negative).

  • list_is_selected (array-like, shape (n_feature_sets, n_rounds, n_features)) – List of 2D boolean arrays, each with shape (n_rounds, n_features), indicating different feature selections.

  • convert_1d_to_2d (bool, default=False) – If True, convert all boolean arrays in list_is_selected from 1D to 2D arrays with a single row.

  • names_feature_selections (list of str, optional) – List of names for the feature selections given in list_is_selected.

  • n_cv (int, default=5) – The number of cross-validation folds for RFE. Must be > 1 and ≤ the number of samples in the smallest class.

  • list_metrics (str or list of str, default=['accuracy', 'precision', 'recall', 'f1']) – List of scoring metrics to use for evaluation. Only metrics for binary classification are allowed. Valid metrics are: {'accuracy', 'balanced_accuracy', 'precision', 'recall', 'f1', 'roc_auc'}.

Returns:

df_eval – Evaluation results for the feature subsets obtained by recursive feature selection and given in list_is_selected.

Return type:

pd.DataFrame

Notes

Warning

  • list_metrics: The 'precision' and 'f1' metrics may trigger an 'UndefinedMetricWarning' on imbalanced or small datasets, because they involve a division by zero when no positive predictions are made.
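The division by zero behind this warning can be reproduced by hand. The following is a minimal pure-Python sketch, not the library's implementation; the helper name precision is hypothetical and returns None where the metric is undefined (scikit-learn, by default, reports 0.0 in that case and emits the warning).

```python
def precision(y_true, y_pred):
    """Hypothetical helper: precision = TP / (TP + FP), or None if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fp == 0:
        return None  # no positive predictions: the denominator is zero
    return tp / (tp + fp)

y_true = [1, 0, 0, 0, 0]          # imbalanced: a single positive sample
y_pred_none = [0, 0, 0, 0, 0]     # classifier predicts only the negative class
y_pred_some = [1, 1, 0, 0, 0]     # one true positive, one false positive

undefined_case = precision(y_true, y_pred_none)
defined_case = precision(y_true, y_pred_some)
print(undefined_case, defined_case)
```

With few positive samples, a single cross-validation fold without positive predictions is enough to hit the undefined case.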

See also

Examples

To demonstrate the TreeModel().eval() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create two feature selections using the is_preselected parameter of the TreeModel class and its .fit() method:

import numpy as np

tm = aa.TreeModel()
is_selected = tm.fit(X=X, labels=labels).is_selected_

# Pre-selected from top 20
is_preselected_top20 = np.asarray(df_feat.index < 20)
tm = aa.TreeModel(is_preselected=is_preselected_top20)
is_selected_top20 = tm.fit(X=X, labels=labels).is_selected_

To evaluate different feature selections, provide X, labels, and the feature selections as 2D boolean arrays using the list_is_selected parameter:

list_is_selected = [is_selected, is_selected_top20]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected)
aa.display_df(df_eval)
   name   accuracy  precision  recall    f1
1  Set 1  0.863500  0.846300   0.895000  0.865000
2  Set 2  0.814900  0.814500   0.833300  0.822000
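What a 2D boolean array of shape (n_rounds, n_features) encodes can be shown with plain NumPy. This is a minimal sketch on toy data, not the library's code: it assumes that each row acts as a column mask on X, and it does not show how eval() aggregates scores across rounds, which is not described here.

```python
import numpy as np

# Toy feature matrix: 3 samples, 4 features
X = np.arange(12, dtype=float).reshape(3, 4)

# Toy selection with 2 rounds; each row is one boolean mask over the 4 features
is_selected = np.array([[True, False, True, False],    # round 1
                        [False, True, True, False]])   # round 2

# Each row selects one subset of feature columns
subsets = [X[:, row] for row in is_selected]
subset_shapes = [s.shape for s in subsets]
print(subset_shapes)
```

Every mask must therefore have exactly as many entries as X has columns.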

You can also use 1D boolean masks by setting convert_1d_to_2d=True. To demonstrate this, we create three boolean masks based on different scale categories:

mask_volume = np.asarray(df_feat["category"] == "ASA/Volume")
mask_conformation = np.asarray(df_feat["category"] == "Conformation")
mask_energy = np.asarray(df_feat["category"] == "Energy")

list_is_selected = [mask_volume, mask_conformation, mask_energy]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected, convert_1d_to_2d=True)
aa.display_df(df_eval)
   name   accuracy  precision  recall    f1
1  Set 1  0.813800  0.839500   0.817900  0.821900
2  Set 2  0.838500  0.854700   0.867300  0.840300
3  Set 3  0.825700  0.822600   0.866700  0.826100
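The effect of convert_1d_to_2d=True can be pictured as a reshape. The sketch below is an assumption based on the parameter description ("convert ... from 1D to 2D arrays with a single row"), not the library's implementation; the helper name to_2d is hypothetical.

```python
import numpy as np

def to_2d(mask):
    """Hypothetical helper: turn a 1D boolean mask into a single-row 2D array."""
    mask = np.asarray(mask, dtype=bool)
    return mask.reshape(1, -1) if mask.ndim == 1 else mask

# A 1D mask over 6 features and a 2D selection with 2 rounds
mask_1d = np.array([True, False, True, False, False, True])
mask_2d = np.array([[True, True, False, False, False, False],
                    [False, False, True, True, False, False]])

list_is_selected = [to_2d(mask_1d), to_2d(mask_2d)]
shapes = [m.shape for m in list_is_selected]
print(shapes)
```

After the conversion, every entry has the (n_rounds, n_features) layout, with n_rounds equal to 1 for former 1D masks.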

Provide the names of the feature selections using the names_feature_selections parameter:

names_feature_selections = ["ASA/Volume", "Conformation", "Energy"]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected, convert_1d_to_2d=True, names_feature_selections=names_feature_selections)
aa.display_df(df_eval)
   name          accuracy  precision  recall    f1
1  ASA/Volume    0.834000  0.821600   0.802600  0.820800
2  Conformation  0.838300  0.853400   0.906400  0.854700
3  Energy        0.829700  0.839400   0.834000  0.820900

The evaluation strategy can be adjusted by changing the number of cross-validation folds (n_cv, default=5) and the scoring metrics via the list_metrics parameter (default=["accuracy", "precision", "recall", "f1"]):

list_metrics = ["balanced_accuracy", "roc_auc"]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected, convert_1d_to_2d=True, list_metrics=list_metrics)
aa.display_df(df_eval)
   name   balanced_accuracy  roc_auc
1  Set 1  0.834900           0.886700
2  Set 2  0.840100           0.954400
3  Set 3  0.832100           0.911800
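When changing n_cv, keep in mind the documented constraint that it must be > 1 and no larger than the smallest class. A small pre-check along these lines can catch invalid values before calling eval(). This is a sketch only: check_n_cv is a hypothetical helper and not part of aaanalysis.

```python
from collections import Counter

def check_n_cv(labels, n_cv):
    """Hypothetical check: n_cv must be > 1 and <= the size of the smallest class."""
    min_class_size = min(Counter(labels).values())
    return 1 < n_cv <= min_class_size

labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]   # toy labels: 6 positives, 4 negatives
ok_4 = check_n_cv(labels, n_cv=4)          # 4 folds fit into the smallest class
ok_5 = check_n_cv(labels, n_cv=5)          # 5 folds exceed the 4 negative samples
print(ok_4, ok_5)
```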