aaanalysis.CPP.eval
- CPP.eval(list_df_feat=None, labels=None, label_test=1, label_ref=0, min_th=0.0, names_feature_sets=None, list_cat=None, list_df_parts=None, n_jobs=1)
Evaluate the quality of different sets of identified CPP features.
Feature sets are evaluated with respect to two quality criteria:
Discriminative Power: The capability of features to distinguish between test and reference datasets.
Redundancy: Assessed by the optimized number of clusters, based on Pearson correlation among features.
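As a rough illustration of discriminative power, the sketch below computes an absolute adjusted AUC for a single feature in plain Python. It assumes the adjusted AUC is the ROC AUC shifted by 0.5, so that 0 means no separation between test and reference groups. The helper name abs_auc_adjusted is hypothetical and is not the library's comp_auc_adjusted().

```python
def abs_auc_adjusted(values, labels, label_test=1, label_ref=0):
    """Return |AUC - 0.5| for one feature (hypothetical helper, not the aaanalysis API)."""
    test = [v for v, y in zip(values, labels) if y == label_test]
    ref = [v for v, y in zip(values, labels) if y == label_ref]
    if not test or not ref:
        raise ValueError("Both groups need at least one sample.")
    # AUC as the probability that a test value exceeds a reference value (ties count 0.5)
    wins = sum(1.0 if t > r else 0.5 if t == r else 0.0 for t in test for r in ref)
    auc = wins / (len(test) * len(ref))
    # Assumption: 'adjusted' means centering the AUC at 0 by subtracting 0.5
    return abs(auc - 0.5)


# Toy feature values: the test group (label 1) is perfectly separated from the reference group
values = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(abs_auc_adjusted(values, labels))
```

Under this assumption the value ranges from 0 (no separation) to 0.5 (perfect separation in either direction).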
- Parameters:
list_df_feat (list of pd.DataFrames) – List of feature DataFrames each of shape (n_features, n_feature_info)
labels (array-like, shape (n_samples,)) – Class labels for samples in sequence DataFrame (typically, test=1, reference=0).
label_test (int, default=1) – Class label of test group in labels.
label_ref (int, default=0) – Class label of reference group in labels.
min_th (float, default=0.0) – Pearson correlation threshold for clustering optimization (between -1 and 1).
names_feature_sets (list of str, optional) – List of names for feature sets corresponding to list_df_feat.
list_cat (list of str, optional) – List of scale categories to retrieve number of features from. Default: ['ASA/Volume', 'Composition', 'Conformation', 'Energy', 'Others', 'Polarity', 'Shape', 'Structure-Activity']
list_df_parts (list of pd.DataFrames, optional) – List of part DataFrames, each of shape (n_samples, n_parts). Must match with list_df_feat.
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
- Returns:
df_eval – Evaluation results for each set of identified features. For each set, statistical measures are averaged across all features.
- Return type:
pd.DataFrame
Notes
df_eval includes the following columns (upper-case indicates direct reference to df_feat columns):
‘name’: Name of the feature set, typically based on CPP run settings, if names_feature_sets is provided.
‘n_features’: Tuple with total number of features and list of number of features per scale category from list_cat.
‘avg_ABS_AUC’: Absolute AUC averaged across all features.
‘range_ABS_AUC’: Five-number summary of absolute AUC among all features (min, 25%, median, 75%, max).
‘avg_MEAN_DIF’: Tuple of mean differences averaged across all features, separately for features with positive and negative ‘mean_dif’.
‘n_clusters’: Optimal number of clusters, in the range [2, 100].
‘avg_n_feat_per_clust’: Average number of features per cluster.
‘std_n_feat_per_clust’: Standard deviation of the number of features per cluster.
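The sketch below shows how the discriminative-power columns could be derived from per-feature values. It is a minimal illustration, not the library's implementation: the function name summarize_features, the toy values, the rounding to three decimals, and the quartile interpolation method are all assumptions.

```python
from statistics import mean, quantiles


def summarize_features(abs_auc, mean_dif):
    """Summarize per-feature 'abs_auc' and 'mean_dif' values (hypothetical helper)."""
    # Five-number summary: min, 25%, median, 75%, max
    # Assumption: linear interpolation between data points ('inclusive' method)
    q25, q50, q75 = quantiles(abs_auc, n=4, method="inclusive")
    five_numbers = [min(abs_auc), q25, q50, q75, max(abs_auc)]
    # Average mean differences separately for positive and negative features
    pos = [d for d in mean_dif if d > 0]
    neg = [d for d in mean_dif if d < 0]
    return {
        "avg_ABS_AUC": round(mean(abs_auc), 3),
        "range_ABS_AUC": [round(v, 3) for v in five_numbers],
        "avg_MEAN_DIF": (
            round(mean(pos), 3) if pos else None,
            round(mean(neg), 3) if neg else None,
        ),
    }


# Toy per-feature values (illustrative only, not taken from a real CPP run)
abs_auc = [0.10, 0.20, 0.30, 0.40, 0.50]
mean_dif = [0.05, -0.10, 0.15, -0.20, 0.10]
print(summarize_features(abs_auc, mean_dif))
```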
‘n_clusters’ is optimized for a KMeans clustering model based on the minimum Pearson correlation between the cluster center and all cluster members across all clusters (min_cor_center in AAclust), which has to exceed the minimum correlation threshold min_th.
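The criterion described above can be sketched in plain Python. This is an illustration under stated assumptions, not AAclust's implementation: the helper names are hypothetical, the cluster assignments are given by hand (in practice they would come from a clustering model such as sklearn.cluster.KMeans), and the smallest qualifying number of clusters is chosen.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def min_cor_center(X, cluster_ids):
    """Minimum correlation between each cluster center and its members, across all clusters."""
    clusters = {}
    for row, cid in zip(X, cluster_ids):
        clusters.setdefault(cid, []).append(row)
    min_cor = 1.0
    for members in clusters.values():
        # Cluster center as the element-wise mean of all members
        center = [sum(col) / len(members) for col in zip(*members)]
        for row in members:
            min_cor = min(min_cor, pearson(center, row))
    return min_cor


def pick_n_clusters(X, candidates, min_th=0.0):
    """Return the smallest n_clusters whose assignment exceeds min_th, or None."""
    for n_clusters in sorted(candidates):
        if min_cor_center(X, candidates[n_clusters]) > min_th:
            return n_clusters
    return None


# Toy data: two increasing and two decreasing feature profiles
X = [[1, 2, 3, 4], [2, 4, 6, 8.5], [4, 3, 2, 1], [8, 6, 4.5, 2]]
# Hand-written assignments per candidate number of clusters
candidates = {1: [0, 0, 0, 0], 2: [0, 0, 1, 1]}
print(pick_n_clusters(X, candidates, min_th=0.5))
```

With a single cluster, the decreasing profiles correlate negatively with the shared center, so the threshold is only met once the two groups are separated.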
See also
CPPPlot.eval(): the respective plotting method.
AAontology: Classification of Amino Acid Scales for details on scale categories.
CPP.run() for details on CPP statistical measures.
comp_auc_adjusted() for details on ‘abs_auc’.
sklearn.cluster.KMeans for employed clustering model.
AAclust ([Breimann24a]) for details on cluster optimization using Pearson correlation.
Examples
To demonstrate the CPP().eval() method, we load the DOM_GSEC_PU example dataset and its respective feature set (see [Breimann25a]):

    import aaanalysis as aa
    aa.options["verbose"] = False
    df_seq = aa.load_dataset(name="DOM_GSEC_PU", n=50)
    labels = df_seq["label"].to_list()
    sf = aa.SequenceFeature()
    df_parts = sf.get_df_parts(df_seq=df_seq)
    df_cat = aa.load_scales(name="scales_cat")
    df_scales = aa.load_scales()
    df_feat_best = aa.load_features()
We can now create feature sets using the CPP().run() method:

    # Use all scales
    cpp = aa.CPP(df_parts=df_parts)
    df_feat_all_scales = cpp.run(labels=labels, label_ref=2)

    # Use Conformation scales
    scales_conformation = df_cat[df_cat["category"] == "Conformation"]["scale_id"].to_list()
    cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_conformation])
    df_feat_conformation = cpp.run(labels=labels, label_ref=2)

    # Use Energy scales
    scales_energy = df_cat[df_cat["category"] == "Energy"]["scale_id"].to_list()
    cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_energy])
    df_feat_energy = cpp.run(labels=labels, label_ref=2)

    # Use Polarity scales
    scales_polarity = df_cat[df_cat["category"] == "Polarity"]["scale_id"].to_list()
    cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_polarity])
    df_feat_polarity = cpp.run(labels=labels, label_ref=2)
These sets can be evaluated using the CPP().eval() method, which needs the list of feature DataFrames (list_df_feat) and labels as input:

    # Create new CPP object with all scales
    list_df_feat = [df_feat_best, df_feat_all_scales, df_feat_conformation, df_feat_energy, df_feat_polarity]
    cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales)
    df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2)
    aa.display_df(df_eval)
       name   n_features                             avg_ABS_AUC  range_ABS_AUC                        avg_MEAN_DIF                             n_clusters  avg_n_feat_per_clust  std_n_feat_per_clust
    1  Set 1  (150, [18, 0, 56, 27, 0, 16, 17, 16])  0.164000     [0.126, 0.142, 0.162, 0.181, 0.244]  (np.float64(0.083), np.float64(-0.08))   21          7.140000              5.100000
    2  Set 2  (100, [11, 9, 28, 14, 12, 14, 7, 5])   0.251000     [0.224, 0.238, 0.248, 0.264, 0.32]   (np.float64(0.114), np.float64(-0.105))  13          7.690000              5.780000
    3  Set 3  (100, [0, 0, 100, 0, 0, 0, 0, 0])      0.209000     [0.17, 0.183, 0.206, 0.229, 0.293]   (np.float64(0.104), np.float64(-0.095))  10          10.000000             5.370000
    4  Set 4  (53, [0, 0, 0, 53, 0, 0, 0, 0])        0.188000     [0.082, 0.153, 0.186, 0.225, 0.32]   (np.float64(0.096), np.float64(-0.089))  3           17.670000             3.860000
    5  Set 5  (60, [0, 0, 0, 0, 0, 60, 0, 0])        0.182000     [0.044, 0.142, 0.178, 0.222, 0.305]  (np.float64(0.098), np.float64(-0.094))  8           7.500000              3.500000

The feature sets can be named using the names_feature_sets parameter:

    names_feature_sets = ["Best features", "All scales", "Conformation", "Energy", "Polarity"]
    df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2,
                       names_feature_sets=names_feature_sets)
    aa.display_df(df_eval)
       name           n_features                             avg_ABS_AUC  range_ABS_AUC                        avg_MEAN_DIF                             n_clusters  avg_n_feat_per_clust  std_n_feat_per_clust
    1  Best features  (150, [18, 0, 56, 27, 0, 16, 17, 16])  0.164000     [0.126, 0.142, 0.162, 0.181, 0.244]  (np.float64(0.083), np.float64(-0.08))   24          6.250000              4.580000
    2  All scales     (100, [11, 9, 28, 14, 12, 14, 7, 5])   0.251000     [0.224, 0.238, 0.248, 0.264, 0.32]   (np.float64(0.114), np.float64(-0.105))  13          7.690000              4.190000
    3  Conformation   (100, [0, 0, 100, 0, 0, 0, 0, 0])      0.209000     [0.17, 0.183, 0.206, 0.229, 0.293]   (np.float64(0.104), np.float64(-0.095))  8           12.500000             5.810000
    4  Energy         (53, [0, 0, 0, 53, 0, 0, 0, 0])        0.188000     [0.082, 0.153, 0.186, 0.225, 0.32]   (np.float64(0.096), np.float64(-0.089))  3           17.670000             9.030000
    5  Polarity       (60, [0, 0, 0, 0, 0, 60, 0, 0])        0.182000     [0.044, 0.142, 0.178, 0.222, 0.305]  (np.float64(0.098), np.float64(-0.094))  11          5.450000              2.930000

The evaluation can be focused on specific scale categories using the list_cat parameter:

    df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2,
                       list_cat=["Conformation", "Energy", "Polarity"])
    aa.display_df(df_eval)
       name   n_features          avg_ABS_AUC  range_ABS_AUC                        avg_MEAN_DIF                             n_clusters  avg_n_feat_per_clust  std_n_feat_per_clust
    1  Set 1  (99, [56, 27, 16])  0.165000     [0.126, 0.142, 0.165, 0.181, 0.244]  (np.float64(0.083), np.float64(-0.079))  17          5.820000              4.120000
    2  Set 2  (56, [28, 14, 14])  0.252000     [0.224, 0.234, 0.248, 0.266, 0.32]   (np.float64(0.114), np.float64(-0.106))  9           6.220000              3.080000
    3  Set 3  (100, [100, 0, 0])  0.209000     [0.17, 0.183, 0.206, 0.229, 0.293]   (np.float64(0.104), np.float64(-0.095))  12          8.330000              5.530000
    4  Set 4  (53, [0, 53, 0])    0.188000     [0.082, 0.153, 0.186, 0.225, 0.32]   (np.float64(0.096), np.float64(-0.089))  5           10.600000             5.710000
    5  Set 5  (60, [0, 0, 60])    0.182000     [0.044, 0.142, 0.178, 0.222, 0.305]  (np.float64(0.098), np.float64(-0.094))  8           7.500000              3.430000

To compare feature sets with different sets of parts, provide a list of part DataFrames (list_df_parts) matching the list of feature DataFrames:

    # Load one of the provided top scale datasets
    split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=5)
    df_scales = aa.load_scales(top60_n=38)
    list_parts = ["tmd", "tmd_jmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]
    list_df_feat1 = []
    list_df_parts = []
    for part in list_parts:
        df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=part)
        cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales)
        df_feat = cpp.run(labels=labels, label_ref=2, max_overlap=1, max_cor=1)
        list_df_feat1.append(df_feat)
        list_df_parts.append(df_parts)

    # Create evaluation for unfiltered features
    df_eval = cpp.eval(list_df_feat=list_df_feat1, labels=labels, label_ref=2,
                       names_feature_sets=list_parts, list_df_parts=list_df_parts)
    aa.display_df(df_eval)
       name         n_features                             avg_ABS_AUC  range_ABS_AUC                       avg_MEAN_DIF                             n_clusters  avg_n_feat_per_clust  std_n_feat_per_clust
    1  tmd          (100, [9, 16, 28, 2, 10, 12, 10, 13])  0.139000     [0.067, 0.115, 0.142, 0.162, 0.21]  (np.float64(0.055), np.float64(-0.057))  11          9.090000              3.340000
    2  tmd_jmd      (100, [11, 13, 18, 14, 5, 23, 5, 11])  0.165000     [0.092, 0.135, 0.161, 0.19, 0.275]  (np.float64(0.056), np.float64(-0.053))  23          4.350000              2.080000
    3  jmd_n_tmd_n  (100, [14, 10, 25, 5, 10, 17, 9, 10])  0.148000     [0.077, 0.122, 0.143, 0.17, 0.246]  (np.float64(0.054), np.float64(-0.061))  10          10.000000             7.710000
    4  tmd_c_jmd_c  (100, [13, 17, 29, 18, 1, 17, 0, 5])   0.165000     [0.077, 0.134, 0.162, 0.193, 0.32]  (np.float64(0.074), np.float64(-0.07))   14          7.140000              3.660000
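When several evaluated sets are close, a simple ranking can help read the table. The sketch below copies the ‘avg_ABS_AUC’ and ‘n_clusters’ values from the output above into plain Python and sorts by higher average absolute AUC first, then by more clusters (read here as lower redundancy, since all four sets contain 100 features). The ranking rule is an assumption for illustration and is not part of CPP.eval().

```python
# Values copied from the evaluation output for the four part combinations
rows = [
    {"name": "tmd", "avg_ABS_AUC": 0.139, "n_clusters": 11},
    {"name": "tmd_jmd", "avg_ABS_AUC": 0.165, "n_clusters": 23},
    {"name": "jmd_n_tmd_n", "avg_ABS_AUC": 0.148, "n_clusters": 10},
    {"name": "tmd_c_jmd_c", "avg_ABS_AUC": 0.165, "n_clusters": 14},
]

# Assumed ranking rule: higher discriminative power first, ties broken by more clusters
ranked = sorted(rows, key=lambda row: (-row["avg_ABS_AUC"], -row["n_clusters"]))

for rank, row in enumerate(ranked, start=1):
    print(rank, row["name"], row["avg_ABS_AUC"], row["n_clusters"])
```

Because ‘n_clusters’ depends on the clustering run, as the differing values across the earlier outputs suggest, such a ranking is best treated as a rough guide.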