aaanalysis.TreeModel.add_feat_importance
- TreeModel.add_feat_importance(df_feat=None, drop=False)
Add feature importance and its standard deviation to the feature DataFrame.
Feature importance is added as the feat_importance column and its standard deviation as the feat_importance_std column.
- Parameters:
df_feat (pd.DataFrame, shape (n_features, n_feature_info)) – Feature DataFrame with a unique identifier, scale information, statistics, and positions for each feature.
drop (bool, default=False) – If True, allow dropping already existing feat_importance and feat_importance_std columns from df_feat before inserting.
- Returns:
df_feat – Feature DataFrame including feat_importance and feat_importance_std columns.
- Return type:
pd.DataFrame, shape (n_features, n_feature_info+2)
See also
CPP.run() for details on CPP statistical measures of the feature DataFrame.
Examples
To demonstrate the TreeModel().add_feat_importance() method, we obtain the DOM_GSEC example dataset and its respective feature set (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False  # Disable verbosity
    df_seq = aa.load_dataset(name="DOM_GSEC")
    labels = df_seq["label"].to_list()
    df_feat = aa.load_features(name="DOM_GSEC").head(7)
    # Create feature matrix
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    X = sf.feature_matrix(features=df_feat["feature"], df_parts=df_parts)
We can now fit the TreeModel, which will internally fit 3 tree-based models over 5 training rounds by default:

    tm = aa.TreeModel()
    tm = tm.fit(X, labels=labels)
We can directly retrieve the feature importance using the feat_importance attribute of the TreeModel class:

    feat_importance = tm.feat_importance
    print("Feature importance: ", feat_importance)

Output:

    Feature importance: [ 8.753 12.994 18.623 20.081 16.802 9.643 13.105]
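The printed values add up to roughly 100, which suggests they can be read as percentage shares of the total importance. The following plain-Python sketch copies the values from the output above and ranks the seven features by importance; the variable names `total` and `ranking` are introduced here for illustration only.

```python
# Values copied from the printed output above
feat_importance = [8.753, 12.994, 18.623, 20.081, 16.802, 9.643, 13.105]

# Sum of all importance values (close to 100 for this example)
total = sum(feat_importance)

# 1-based feature positions, ordered from most to least important
ranking = sorted(
    range(1, len(feat_importance) + 1),
    key=lambda i: feat_importance[i - 1],
    reverse=True,
)

print(f"Sum of importances: {total:.3f}")
print("Features ranked by importance:", ranking)
```

In this example, feature 4 (HUTJ700102) ranks first and feature 1 (KLEP840101) last.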
To add these values to the feature DataFrame (df_feat), it should not already contain the feat_importance and feat_importance_std columns:

    # Remove feature importance columns
    df_feat = df_feat[[x for x in list(df_feat) if x not in ["feat_importance", "feat_importance_std"]]]
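As an alternative to the list comprehension above, pandas offers DataFrame.drop with errors="ignore", which removes the named columns when they are present and does nothing otherwise. The sketch below uses a small toy DataFrame as a stand-in for df_feat (the feature names "feat_a" and "feat_b" are placeholders, not real features), so it runs without aaanalysis installed.

```python
import pandas as pd

# Toy stand-in for df_feat; feature names are placeholders
df_feat = pd.DataFrame({
    "feature": ["feat_a", "feat_b"],
    "abs_auc": [0.244, 0.243],
    "feat_importance": [8.753, 12.994],
    "feat_importance_std": [1.227, 0.960],
})

cols_importance = ["feat_importance", "feat_importance_std"]

# errors="ignore" skips labels that are absent instead of raising KeyError
df_clean = df_feat.drop(columns=cols_importance, errors="ignore")

# Dropping again is harmless because the columns are already gone
df_clean_again = df_clean.drop(columns=cols_importance, errors="ignore")

print(list(df_clean))
```

Because the call is safe to repeat, it can be placed in a notebook cell that is re-run several times.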
Now the importance obtained from the fitted model can be inserted with the conventional column names by using the TreeModel().add_feat_importance() method:

    df_feat = tm.add_feat_importance(df_feat=df_feat)
    aa.display_df(df_feat)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 8.753000 | 1.227000 |
| 2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 12.994000 | 0.960000 |
| 3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 18.623000 | 0.775000 |
| 4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 20.081000 | 0.555000 |
| 5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 16.802000 | 0.673000 |
| 6 | TMD_C_JMD_C-Seg...2,3)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.222000 | 0.058671 | 0.058671 | 0.064895 | 0.069547 | 0.000000 | 0.000001 | 27,28,29,30,31,32,33 | 9.643000 | 0.280000 |
| 7 | TMD_C_JMD_C-Seg...4,5)-FAUJ880109 | Energy | Isoelectric point | Number hydrogen bond donors | Number of hydro...e et al., 1988) | 0.215000 | 0.146661 | 0.146661 | 0.174609 | 0.188034 | 0.000000 | 0.000004 | 33,34,35,36 | 13.105000 | 0.324000 |

To override already existing feature importance columns, set drop=True:

    # Drop existing feature columns and insert new ones
    df_feat = tm.add_feat_importance(df_feat=df_feat, drop=True)
    aa.display_df(df_feat)
| | feature | category | subcategory | scale_name | scale_description | abs_auc | abs_mean_dif | mean_dif | std_test | std_ref | p_val_mann_whitney | p_val_fdr_bh | positions | feat_importance | feat_importance_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | TMD_C_JMD_C-Seg...3,4)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.244000 | 0.103666 | 0.103666 | 0.106692 | 0.110506 | 0.000000 | 0.000000 | 31,32,33,34,35 | 8.753000 | 1.227000 |
| 2 | TMD_C_JMD_C-Seg...3,4)-FINA910104 | Conformation | α-helix (C-cap) | α-helix termination | Helix terminati...n et al., 1991) | 0.243000 | 0.085064 | 0.085064 | 0.098774 | 0.096946 | 0.000000 | 0.000000 | 31,32,33,34,35 | 12.994000 | 0.960000 |
| 3 | TMD_C_JMD_C-Seg...6,9)-LEVM760105 | Shape | Side chain length | Side chain length | Radius of gyrat... (Levitt, 1976) | 0.233000 | 0.137044 | 0.137044 | 0.161683 | 0.176964 | 0.000000 | 0.000001 | 32,33 | 18.623000 | 0.775000 |
| 4 | TMD_C_JMD_C-Seg...3,4)-HUTJ700102 | Energy | Entropy | Entropy | Absolute entrop...Hutchens, 1970) | 0.229000 | 0.098224 | 0.098224 | 0.106865 | 0.124608 | 0.000000 | 0.000001 | 31,32,33,34,35 | 20.081000 | 0.555000 |
| 5 | TMD_C_JMD_C-Seg...6,9)-RADA880106 | ASA/Volume | Volume | Accessible surface area (ASA) | Accessible surf...olfenden, 1988) | 0.223000 | 0.095071 | 0.095071 | 0.114758 | 0.132829 | 0.000000 | 0.000002 | 32,33 | 16.802000 | 0.673000 |
| 6 | TMD_C_JMD_C-Seg...2,3)-KLEP840101 | Energy | Charge | Charge | Net charge (Kle...n et al., 1984) | 0.222000 | 0.058671 | 0.058671 | 0.064895 | 0.069547 | 0.000000 | 0.000001 | 27,28,29,30,31,32,33 | 9.643000 | 0.280000 |
| 7 | TMD_C_JMD_C-Seg...4,5)-FAUJ880109 | Energy | Isoelectric point | Number hydrogen bond donors | Number of hydro...e et al., 1988) | 0.215000 | 0.146661 | 0.146661 | 0.174609 | 0.188034 | 0.000000 | 0.000004 | 33,34,35,36 | 13.105000 | 0.324000 |
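The role of the drop parameter described above can be pictured with a small pandas sketch. This is not the aaanalysis implementation: the function add_importance_sketch, its arguments, and the choice to raise ValueError when the columns already exist and drop=False are all assumptions made for illustration.

```python
import pandas as pd

COLS = ["feat_importance", "feat_importance_std"]


def add_importance_sketch(df_feat, importance, importance_std, drop=False):
    """Hypothetical stand-in for add_feat_importance (not the library code)."""
    existing = [c for c in COLS if c in df_feat.columns]
    if existing and not drop:
        # Assumption: refuse to overwrite existing columns unless drop=True
        raise ValueError(f"{existing} already in df_feat; set drop=True to replace them")
    # Remove any existing importance columns, then append the new ones
    df_out = df_feat.drop(columns=existing).copy()
    df_out["feat_importance"] = list(importance)
    df_out["feat_importance_std"] = list(importance_std)
    return df_out


# Toy DataFrame with placeholder feature names and illustrative values
df = pd.DataFrame({"feature": ["feat_a", "feat_b"]})
df = add_importance_sketch(df, [60.0, 40.0], [1.0, 2.0])
# Second call replaces the columns because drop=True
df = add_importance_sketch(df, [55.0, 45.0], [1.5, 2.5], drop=True)
print(df)
```

The sketch returns a DataFrame with two more columns than the input when no importance columns were present, which matches the documented return shape (n_features, n_feature_info+2).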