aaanalysis.TreeModel.add_feat_importance

TreeModel.add_feat_importance(df_feat=None, drop=False)

Add feature importance and its standard deviation to the feature DataFrame.

The feature importance is added as the feat_importance column, and its standard deviation as the feat_importance_std column.

Parameters:
  • df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.

  • drop (bool, default=False) – If True, allow dropping of already existing feat_importance and feat_importance_std columns from df_feat before inserting.

Returns:

df_feat – Feature DataFrame including feat_importance and feat_importance_std columns.

Return type:

pd.DataFrame, shape (n_features, n_feature_info+2)

See also

  • CPP.run() for details on CPP statistical measures of feature DataFrame.

Examples

To demonstrate the TreeModel().add_feat_importance() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False # Disable verbosity

df_seq = aa.load_dataset(name="DOM_GSEC")
labels = df_seq["label"].to_list()
df_feat = aa.load_features(name="DOM_GSEC").head(7)

# Create feature matrix
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)

We can now fit the TreeModel, which will internally fit 3 tree-based models over 5 training rounds by default:

tm = aa.TreeModel()
tm = tm.fit(X, labels=labels)

We can directly retrieve the feature importance using the feat_importance attribute of the TreeModel class:

feat_importance = tm.feat_importance
print("Feature importance: ", feat_importance)
Feature importance:  [ 8.753 12.994 18.623 20.081 16.802  9.643 13.105]
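The values printed above sum to roughly 100, which suggests they are expressed as percentages. The following is a minimal sketch of how a mean importance and its standard deviation could be derived from repeated training rounds. It is an illustration under stated assumptions, not the internal TreeModel implementation: the array importance_per_round and its values are hypothetical, and the rescaling to percent is assumed rather than confirmed by the documentation.

```python
import numpy as np

# Hypothetical per-round importances, shape (n_rounds, n_features).
# These numbers are made up for illustration; they are not TreeModel output.
importance_per_round = np.array([
    [0.10, 0.40, 0.30, 0.20],
    [0.20, 0.30, 0.30, 0.20],
    [0.15, 0.35, 0.30, 0.20],
])

# Assumption: each round is rescaled so that its importances sum to 100 (percent)
row_sums = importance_per_round.sum(axis=1, keepdims=True)
percent_per_round = 100 * importance_per_round / row_sums

# Aggregate across rounds: mean importance and its standard deviation per feature
feat_importance = percent_per_round.mean(axis=0).round(3)
feat_importance_std = percent_per_round.std(axis=0).round(3)

print("Feature importance: ", feat_importance)
print("Standard deviation: ", feat_importance_std)
```

A feature whose importance barely changes between rounds gets a standard deviation close to zero, which is why feat_importance_std can be read as a stability indicator alongside feat_importance.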

To add these values to the feature DataFrame (df_feat), it should not already contain the feat_importance and feat_importance_std columns:

# Remove feature importance columns
df_feat = df_feat[[x for x in list(df_feat) if x not in ["feat_importance", "feat_importance_std"]]]
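The same cleanup can be written with the pandas DataFrame.drop() method, which may read more directly than the list comprehension. The sketch below uses a small stand-in DataFrame with placeholder feature names (feat_a, feat_b) rather than the real df_feat, so that it runs without aaanalysis installed.

```python
import pandas as pd

cols_importance = ["feat_importance", "feat_importance_std"]

# Stand-in for df_feat; the feature names are placeholders
df_feat = pd.DataFrame({
    "feature": ["feat_a", "feat_b"],
    "abs_auc": [0.244, 0.243],
    "feat_importance": [8.753, 12.994],
    "feat_importance_std": [1.227, 0.960],
})

# errors="ignore" skips labels that are absent, so the call is safe
# whether or not the importance columns are present
df_feat = df_feat.drop(columns=cols_importance, errors="ignore")
print(list(df_feat))
```

Because missing labels are ignored, the call can be repeated on an already cleaned DataFrame without raising an error.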

Now the importance obtained from the fitted model can be inserted under the conventional column names using the TreeModel().add_feat_importance() method:

df_feat = tm.add_feat_importance(df_feat=df_feat)
aa.display_df(df_feat)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 8.753000 | 1.227000 |
| 2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 12.994000 | 0.960000 |
| 3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 18.623000 | 0.775000 |
| 4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 20.081000 | 0.555000 |
| 5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 16.802000 | 0.673000 |
| 6 | TMD_C_JMD_C-Seg...2,3)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.222000 | 0.058671 | 0.058671 | 0.064895 | 0.069547 | 0.000000 | 0.000001 | 27,28,29,30,31,32,33 | 9.643000 | 0.280000 |
| 7 | TMD_C_JMD_C-Seg...4,5)-FAUJ880109 | Energy | Isoelectric point | Number hydrogen bond donors | Number of hydro...e et al., 1988) | 0.215000 | 0.146661 | 0.146661 | 0.174609 | 0.188034 | 0.000000 | 0.000004 | 33,34,35,36 | 13.105000 | 0.324000 |
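Once the two columns are part of the feature DataFrame, features can be ranked with ordinary pandas operations. The sketch below rebuilds only the feat_importance and feat_importance_std columns from the output above, indexed by row number 1 to 7, and sorts them in descending order of importance. The variable name df_ranked is introduced here for illustration.

```python
import pandas as pd

# Importance columns copied from the displayed output, indexed by row number
df_importance = pd.DataFrame(
    {
        "feat_importance": [8.753, 12.994, 18.623, 20.081, 16.802, 9.643, 13.105],
        "feat_importance_std": [1.227, 0.960, 0.775, 0.555, 0.673, 0.280, 0.324],
    },
    index=range(1, 8),
)

# Rank features from most to least important
df_ranked = df_importance.sort_values(by="feat_importance", ascending=False)
print(df_ranked)
```

With the real df_feat, the same sort_values() call keeps all other feature columns aligned with the ranking.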

To overwrite existing feature importance columns, set drop=True:

# Drop existing feature importance columns and insert new ones
df_feat = tm.add_feat_importance(df_feat=df_feat, drop=True)
aa.display_df(df_feat)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 8.753000 | 1.227000 |
| 2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 12.994000 | 0.960000 |
| 3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 18.623000 | 0.775000 |
| 4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 20.081000 | 0.555000 |
| 5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 16.802000 | 0.673000 |
| 6 | TMD_C_JMD_C-Seg...2,3)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.222000 | 0.058671 | 0.058671 | 0.064895 | 0.069547 | 0.000000 | 0.000001 | 27,28,29,30,31,32,33 | 9.643000 | 0.280000 |
| 7 | TMD_C_JMD_C-Seg...4,5)-FAUJ880109 | Energy | Isoelectric point | Number hydrogen bond donors | Number of hydro...e et al., 1988) | 0.215000 | 0.146661 | 0.146661 | 0.174609 | 0.188034 | 0.000000 | 0.000004 | 33,34,35,36 | 13.105000 | 0.324000 |
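The role of the drop parameter can be illustrated with a small pandas-only sketch. The helper insert_importance below is hypothetical and is not the aaanalysis implementation. It assumes that inserting into a DataFrame that already holds the two columns is rejected unless drop=True, which matches the parameter description but may differ from the library's actual error handling.

```python
import pandas as pd

COLS_IMPORTANCE = ["feat_importance", "feat_importance_std"]


def insert_importance(df_feat, importance, importance_std, drop=False):
    """Hypothetical helper mimicking the documented drop semantics."""
    existing = [col for col in COLS_IMPORTANCE if col in df_feat.columns]
    if existing and not drop:
        # Assumed behavior: refuse to overwrite unless explicitly allowed
        raise ValueError(f"{existing} already in df_feat; set drop=True to replace them")
    # Work on a new DataFrame without the old importance columns
    df_out = df_feat.drop(columns=existing)
    df_out["feat_importance"] = importance
    df_out["feat_importance_std"] = importance_std
    return df_out


# Placeholder feature names and values for illustration
df = pd.DataFrame({"feature": ["feat_a", "feat_b"]})
df = insert_importance(df, [60.0, 40.0], [1.0, 2.0])              # first insert
df = insert_importance(df, [55.0, 45.0], [0.5, 0.5], drop=True)   # replace
print(df)
```

Making the replacement opt-in guards against silently overwriting importance values that came from a different fitted model.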