TreeModel.eval

TreeModel.eval(X, labels, list_is_selected, convert_1d_to_2d=False, names_feature_selections=None, n_cv=5, list_metrics=None)

Evaluate the prediction performance for different feature selections.

For each boolean selection array in list_is_selected, the method applies k-fold cross-validation using the configured tree-based models and averages the resulting metric scores across all rounds and models. The output is a single DataFrame that lets you compare feature subsets side by side. Call TreeModel.fit() first to obtain is_selected_ arrays, then pass them here.
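The procedure described above can be pictured with plain scikit-learn. The following is only a rough sketch under stated assumptions, not the library's implementation: the data are synthetic, a single RandomForestClassifier stands in for the configured tree-based models, and the names mean_scores and scores are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for a feature matrix and binary labels (assumption)
X, labels = make_classification(n_samples=60, n_features=10, random_state=0)

# One feature selection as a 2D boolean array of shape (n_rounds, n_features)
is_selected = np.zeros((2, 10), dtype=bool)
is_selected[0, :5] = True   # round 1 keeps features 0 to 4
is_selected[1, 3:8] = True  # round 2 keeps features 3 to 7

list_metrics = ["accuracy", "f1"]
scores = {metric: [] for metric in list_metrics}
for mask in is_selected:
    # Single model as a stand-in for the configured tree-based models (assumption)
    model = RandomForestClassifier(n_estimators=20, random_state=0)
    cv = cross_validate(model, X[:, mask], labels, cv=5, scoring=list_metrics)
    for metric in list_metrics:
        scores[metric].append(cv[f"test_{metric}"].mean())

# Average over rounds to get one score per metric for this selection
mean_scores = {metric: float(np.mean(vals)) for metric, vals in scores.items()}
print(mean_scores)
```

Repeating this for every entry of list_is_selected and stacking the resulting dictionaries would give one row per feature selection.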

Added in version 0.1.0.

Parameters:

X (array-like, shape (n_samples, n_features)) – Feature matrix. Rows typically correspond to proteins and columns to features.
labels (array-like, shape (n_samples,)) – Class labels for samples in X (typically, 1=positive, 0=negative).
list_is_selected (list of array-like, each of shape (n_rounds, n_features)) – List of 2D boolean arrays indicating different feature selections.
convert_1d_to_2d (bool, default=False) – If True, convert all boolean arrays in list_is_selected from 1D to 2D arrays with a single row.
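names_feature_selections (list of str, optional) – List of names for the feature selections, corresponding to list_is_selected. If not given, the selections are labeled 'Set 1', 'Set 2', and so on in the examples on this page.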
n_cv (int, default=5) – The number of cross-validation folds used to evaluate each feature selection; must be > 1 and ≤ the smallest class’s sample count.
list_metrics (str or list of str, default=['accuracy', 'precision', 'recall', 'f1']) – List of scoring metrics to use for evaluation. Only metrics for binary classification are allowed. Valid metrics are: {‘accuracy’, ‘balanced_accuracy’, ‘precision’, ‘recall’, ‘f1’, ‘roc_auc’}
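The shape difference behind convert_1d_to_2d can be illustrated with NumPy alone. This is a minimal sketch of the idea; how the library performs the conversion internally is not shown on this page, and np.atleast_2d is used here only as an assumed equivalent.

```python
import numpy as np

# A 1D mask selects features once: shape (n_features,)
mask_1d = np.array([True, False, True, True])

# A 2D selection with a single round: shape (1, n_features)
mask_2d = np.atleast_2d(mask_1d)

print(mask_1d.shape, mask_2d.shape)  # (4,) (1, 4)
```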

Returns:

df_eval – Evaluation results for the feature selections given by list_is_selected. As in the example outputs below, the DataFrame contains a name column followed by one column per requested metric.

Return type:

pd.DataFrame

Notes

sklearn.metrics.balanced_accuracy_score() is recommended if datasets are unbalanced.
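A small illustration of why this matters, using scikit-learn directly on made-up labels: a classifier that always predicts the majority class looks strong under plain accuracy but not under balanced accuracy.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy labels (illustrative only): nine negatives, one positive
y_true = [0] * 9 + [1]
# Trivial predictions: everything is assigned to the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)           # 9 of 10 correct
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall: (1.0 + 0.0) / 2

print(acc, bal)  # 0.9 0.5
```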

Warning

list_metrics: ‘precision’ and ‘f1’ metrics may trigger ‘UndefinedMetricWarning’ on imbalanced or small datasets due to division by zero if no positive predictions are made.
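The warning can be reproduced with scikit-learn on a toy example. The sketch below calls precision_score directly; whether and how TreeModel.eval exposes scikit-learn's zero_division option is not documented on this page, so that part applies to scikit-learn only.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import precision_score

y_true = [0, 0, 0, 1]
y_pred = [0, 0, 0, 0]  # no positive predictions, so precision is 0/0

# Default behavior: the score is set to 0.0 and a warning is emitted
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    p_default = precision_score(y_true, y_pred)
warned = any(issubclass(w.category, UndefinedMetricWarning) for w in caught)

# With zero_division=0 the same value is returned without the warning
with warnings.catch_warnings(record=True) as caught_silent:
    warnings.simplefilter("always")
    p_silent = precision_score(y_true, y_pred, zero_division=0)
warned_silent = any(issubclass(w.category, UndefinedMetricWarning) for w in caught_silent)

print(p_default, warned, p_silent, warned_silent)
```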

See also

Scikit-learn classification metrics and scorings.

Examples

To demonstrate the TreeModel().eval() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(100)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now create two feature selections with the .fit() method: one using default settings and one restricted to the top 20 features via the is_preselected parameter of the TreeModel class:

import numpy as np

tm = aa.TreeModel()
is_selected = tm.fit(X=X, labels=labels).is_selected_

# Pre-selected from top 20
is_preselected_top20 = np.asarray(df_feat.index < 20)
tm = aa.TreeModel(is_preselected=is_preselected_top20)
is_selected_top20 = tm.fit(X=X, labels=labels).is_selected_

To evaluate different feature selections, provide X, labels, and the feature selections as 2D boolean arrays using the list_is_selected parameter:

list_is_selected = [is_selected, is_selected_top20]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected)
aa.display_df(df_eval)

	name	accuracy	precision	recall	f1
1	Set 1	0.859600	0.848500	0.895000	0.866600
2	Set 2	0.814100	0.820200	0.825300	0.816000

You can also use 1D boolean masks by setting convert_1d_to_2d=True. To demonstrate this, we create three boolean masks based on different scale categories:

mask_volume = np.asarray(df_feat["category"] == "ASA/Volume")
mask_conformation = np.asarray(df_feat["category"] == "Conformation")
mask_energy = np.asarray(df_feat["category"] == "Energy")

list_is_selected = [mask_volume, mask_conformation, mask_energy]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected, convert_1d_to_2d=True)
aa.display_df(df_eval)

	name	accuracy	precision	recall	f1
1	Set 1	0.818000	0.827100	0.817300	0.808700
2	Set 2	0.838300	0.841700	0.898700	0.854200
3	Set 3	0.829500	0.837300	0.859600	0.823900

Provide the names of the feature selections using the names_feature_selections parameter:

names_feature_selections = ["ASA/Volume", "Conformation", "Energy"]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected, convert_1d_to_2d=True, names_feature_selections=names_feature_selections)
aa.display_df(df_eval)

	name	accuracy	precision	recall	f1
1	ASA/Volume	0.805800	0.826400	0.816700	0.806800
2	Conformation	0.830300	0.858800	0.883300	0.851500
3	Energy	0.813800	0.817300	0.851300	0.829800
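Since the return type is a pd.DataFrame, the named selections can be ranked with ordinary pandas operations. The sketch below rebuilds the output shown above as a stand-in called df_eval_demo (a hypothetical name) and assumes the column names match those displayed.

```python
import pandas as pd

# Values copied from the output shown above
df_eval_demo = pd.DataFrame({
    "name": ["ASA/Volume", "Conformation", "Energy"],
    "accuracy": [0.8058, 0.8303, 0.8138],
    "precision": [0.8264, 0.8588, 0.8173],
    "recall": [0.8167, 0.8833, 0.8513],
    "f1": [0.8068, 0.8515, 0.8298],
})

# Rank feature selections by F1 score, best first
df_ranked = df_eval_demo.sort_values("f1", ascending=False).reset_index(drop=True)
print(df_ranked[["name", "f1"]])
```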

The evaluation strategy can be adjusted by changing the number of cross-validation folds (n_cv, default=5) and the scoring metrics via the list_metrics parameter (default=["accuracy", "precision", "recall", "f1"]):

list_metrics = ["balanced_accuracy", "roc_auc"]
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected, convert_1d_to_2d=True, list_metrics=list_metrics)
aa.display_df(df_eval)

	name	balanced_accuracy	roc_auc
1	Set 1	0.818900	0.876600
2	Set 2	0.843900	0.957900
3	Set 3	0.832100	0.911900

The number of cross-validation folds used to score each feature selection can be changed via the n_cv parameter (default=5):

# Use 3-fold instead of the default 5-fold cross-validation
df_eval = tm.eval(X, labels=labels, list_is_selected=list_is_selected,
                  convert_1d_to_2d=True, n_cv=3)
aa.display_df(df_eval)

	name	accuracy	precision	recall	f1
1	Set 1	0.789700	0.799700	0.817500	0.788700
2	Set 2	0.825400	0.809800	0.841300	0.849300
3	Set 3	0.785700	0.820800	0.817500	0.791800
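The parameter description requires n_cv to be greater than 1 and at most the sample count of the smallest class. A quick way to check this bound before calling eval is sketched below; max_n_cv is a hypothetical helper, not part of the library, and the labels are made up.

```python
from collections import Counter


def max_n_cv(labels):
    """Return the size of the smallest class, the upper bound for n_cv."""
    return min(Counter(labels).values())


# Toy labels (illustrative only): four positives, three negatives
labels_demo = [1, 1, 1, 1, 0, 0, 0]

upper = max_n_cv(labels_demo)
n_cv = min(5, upper)  # fall back to fewer folds if a class is too small

print(upper, n_cv)  # 3 3
```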