aaanalysis.CPP.eval

CPP.eval(list_df_feat=None, labels=None, label_test=1, label_ref=0, min_th=0.0, names_feature_sets=None, list_cat=None, list_df_parts=None, n_jobs=1)

Evaluate the quality of different sets of identified CPP features.

Feature sets are evaluated regarding two quality groups:

  • Discriminative Power: The capability of features to distinguish between test and reference datasets.

  • Redundancy: Assessed by the optimized number of clusters, based on Pearson correlation among features.
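As a rough illustration of the two criteria, the sketch below computes an absolute adjusted AUC for a single feature and a Pearson correlation between two features in plain Python. It assumes that the adjusted AUC is the AUC shifted by 0.5 (so 0 means no separation). The helper names abs_adjusted_auc and pearson and the toy values are made up for this example and are not part of the aaanalysis API.

```python
from statistics import mean


def abs_adjusted_auc(values_test, values_ref):
    """Absolute adjusted AUC (|AUC - 0.5|) of one feature via pairwise comparison."""
    wins = 0.0
    for x in values_test:
        for y in values_ref:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5  # ties count as half a win
    auc = wins / (len(values_test) * len(values_ref))
    return abs(auc - 0.5)


def pearson(a, b):
    """Pearson correlation between two equally long, non-constant vectors."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


# Toy feature values per sample (made up for illustration)
feat_1_test, feat_1_ref = [0.8, 0.7, 0.9], [0.2, 0.4, 0.3]
feat_2_test, feat_2_ref = [0.5, 0.4, 0.6], [0.5, 0.6, 0.4]

# Discriminative power: feature 1 separates the groups, feature 2 does not
auc_1 = abs_adjusted_auc(feat_1_test, feat_1_ref)
auc_2 = abs_adjusted_auc(feat_2_test, feat_2_ref)

# Redundancy: correlation of the two features across all samples
r = pearson(feat_1_test + feat_1_ref, feat_2_test + feat_2_ref)
print(auc_1, auc_2, round(r, 3))
```

A feature set with many highly correlated features would be expected to collapse into few clusters, which is what the redundancy criterion captures.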

Parameters:
  • list_df_feat (list of pd.DataFrames) – List of feature DataFrames each of shape (n_features, n_feature_info)

  • labels (array-like, shape (n_samples,)) – Class labels for samples in sequence DataFrame (typically, test=1, reference=0).

  • label_test (int, default=1) – Class label of test group in labels.

  • label_ref (int, default=0) – Class label of reference group in labels.

  • min_th (float, default=0.0) – Pearson correlation threshold for clustering optimization (between -1 and 1).

  • names_feature_sets (list of str, optional) – List of names for feature sets corresponding to list_df_feat.

  • list_cat (list of str, optional) – List of scale categories to retrieve number of features from. Default: [‘ASA/Volume’, ‘Composition’, ‘Conformation’, ‘Energy’, ‘Others’, ‘Polarity’, ‘Shape’, ‘Structure-Activity’]

  • list_df_parts (list of pd.DataFrames, optional) – List of part DataFrames each of shape (n_samples, n_parts). Must match with list_df_feat.

  • n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
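The constraints stated in the parameter descriptions can be summarized as a small pre-flight check. This is a minimal sketch, not the library's own validation: check_eval_inputs is a hypothetical helper, and plain strings stand in for the feature and part DataFrames.

```python
def check_eval_inputs(list_df_feat, labels, label_test=1, label_ref=0,
                      min_th=0.0, names_feature_sets=None, list_df_parts=None):
    """Hypothetical checks mirroring the parameter descriptions above."""
    n_sets = len(list_df_feat)
    if n_sets == 0:
        raise ValueError("'list_df_feat' should contain at least one feature set")
    if label_test not in labels or label_ref not in labels:
        raise ValueError("'labels' should contain 'label_test' and 'label_ref'")
    if not -1 <= min_th <= 1:
        raise ValueError("'min_th' should be between -1 and 1")
    if names_feature_sets is not None and len(names_feature_sets) != n_sets:
        raise ValueError("'names_feature_sets' should match 'list_df_feat'")
    if list_df_parts is not None and len(list_df_parts) != n_sets:
        raise ValueError("'list_df_parts' should match 'list_df_feat'")
    return n_sets


# Placeholders stand in for DataFrames; labels use 2 as reference class
n_sets = check_eval_inputs(
    list_df_feat=["df_feat_a", "df_feat_b"],
    labels=[1, 1, 2, 2],
    label_ref=2,
    names_feature_sets=["A", "B"],
    list_df_parts=["df_parts_a", "df_parts_b"],
)
print(n_sets)
```

Note that label_ref must be set explicitly whenever the reference class in labels is not 0.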

Returns:

df_eval – Evaluation results for each set of identified features. For each set, statistical measures are averaged across all features.

Return type:

pd.DataFrame
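To make the per-set averaging concrete, the following sketch derives values analogous to the avg_ABS_AUC, range_ABS_AUC, and avg_MEAN_DIF columns described in the Notes from per-feature lists. It is an approximation under stated assumptions: summarize_feature_set is a hypothetical helper, quartiles use linear interpolation, values are rounded to three decimals, and the input numbers are invented.

```python
from statistics import mean, quantiles


def summarize_feature_set(abs_auc, mean_dif):
    """Summarize per-feature statistics of one feature set (illustrative only)."""
    sorted_auc = sorted(abs_auc)
    # Quartiles with linear interpolation between data points
    q1, median, q3 = quantiles(sorted_auc, n=4, method="inclusive")
    five_numbers = (sorted_auc[0], q1, median, q3, sorted_auc[-1])
    # Average mean differences separately for positive and negative values
    pos = [x for x in mean_dif if x > 0]
    neg = [x for x in mean_dif if x < 0]
    return {
        "avg_ABS_AUC": round(mean(abs_auc), 3),
        "range_ABS_AUC": [round(v, 3) for v in five_numbers],
        "avg_MEAN_DIF": (
            round(mean(pos), 3) if pos else None,
            round(mean(neg), 3) if neg else None,
        ),
    }


# Invented per-feature values for a set of five features
summary = summarize_feature_set(
    abs_auc=[0.1, 0.2, 0.3, 0.4, 0.5],
    mean_dif=[0.1, -0.2, 0.3, -0.4, 0.2],
)
print(summary)
```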

Notes

  • df_eval includes the following columns (upper-case indicates direct reference to df_feat columns):

    • ‘name’: Name of the feature set, taken from names_feature_sets if provided (typically based on CPP run settings); otherwise ‘Set 1’, ‘Set 2’, and so on.

    • ‘n_features’: Tuple with total number of features and list of number of features per scale category from list_cat.

    • ‘avg_ABS_AUC’: Absolute AUC averaged across all features.

    • ‘range_ABS_AUC’: Five-number summary of absolute AUC among all features (min, 25%, median, 75%, max).

    • ‘avg_MEAN_DIF’: Tuple of mean differences averaged across all features separately for features with positive and negative ‘mean_dif’.

    • ‘n_clusters’: Optimal number of clusters (between 2 and 100).

    • ‘avg_n_feat_per_clust’: Average number of features per cluster.

    • ‘std_n_feat_per_clust’: Standard deviation of feature number per cluster.

  • ‘n_clusters’ is optimized for a KMeans clustering model. The criterion is the minimum Pearson correlation between each cluster center and its members, taken across all clusters (min_cor_center in AAclust), which has to exceed the minimum correlation threshold min_th.
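The threshold criterion can be sketched without any clustering library. The code below assumes that candidate partitions are already given (in practice they would come from KMeans) and that "optimized" means the smallest number of clusters whose min_cor_center exceeds min_th. The functions and the toy vectors are illustrative and do not reproduce the AAclust implementation.

```python
from statistics import mean


def pearson(a, b):
    """Pearson correlation between two equally long, non-constant vectors."""
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def min_cor_center(clusters):
    """Minimum correlation between each cluster center and its members."""
    min_cor = 1.0
    for members in clusters:
        center = [mean(col) for col in zip(*members)]  # element-wise mean
        for member in members:
            min_cor = min(min_cor, pearson(center, member))
    return min_cor


def select_n_clusters(partitions, min_th=0.0):
    """Smallest n_clusters whose partition exceeds min_th (None if none does)."""
    for n_clusters in sorted(partitions):
        if min_cor_center(partitions[n_clusters]) > min_th:
            return n_clusters
    return None


# Toy feature vectors: a/b increase, c/d decrease across samples
a, b = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 9.0]
c, d = [9.0, 6.0, 4.0, 2.0], [4.0, 3.0, 2.0, 1.0]
partitions = {2: [[a, b], [c, d]], 4: [[a], [b], [c], [d]]}

print(select_n_clusters(partitions, min_th=0.0))
print(select_n_clusters(partitions, min_th=0.999))
```

Under these assumptions, a stricter min_th yields more clusters, since each cluster must be more homogeneous to pass the threshold.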

Examples

To demonstrate the CPP().eval() method, we load the DOM_GSEC_PU example dataset and its respective feature set (see [Breimann25a]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC_PU", n=50)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_cat = aa.load_scales(name="scales_cat")
df_scales = aa.load_scales()
df_feat_best = aa.load_features()

We can now create feature sets using the CPP().run() method:

# Use all scales
cpp = aa.CPP(df_parts=df_parts)
df_feat_all_scales = cpp.run(labels=labels, label_ref=2)
# Use Conformation scales
scales_conformation = df_cat[df_cat["category"] == "Conformation"]["scale_id"].to_list()
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_conformation])
df_feat_conformation = cpp.run(labels=labels, label_ref=2)
# Use Energy scales
scales_energy = df_cat[df_cat["category"] == "Energy"]["scale_id"].to_list()
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_energy])
df_feat_energy = cpp.run(labels=labels, label_ref=2)
# Use Polarity scales
scales_polarity = df_cat[df_cat["category"] == "Polarity"]["scale_id"].to_list()
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_polarity])
df_feat_polarity = cpp.run(labels=labels, label_ref=2)

These sets can be evaluated using the CPP().eval() method, which needs the list of feature DataFrames (list_df_feat) and labels as input:

list_df_feat = [df_feat_best, df_feat_all_scales, df_feat_conformation, df_feat_energy, df_feat_polarity]
# Create new CPP object with all scales
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales)
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2)
aa.display_df(df_eval)
| | name | n_features | avg_ABS_AUC | range_ABS_AUC | avg_MEAN_DIF | n_clusters | avg_n_feat_per_clust | std_n_feat_per_clust |
| 1 | Set 1 | (150, [18, 0, 56, 27, 0, 16, 17, 16]) | 0.164000 | [0.126, 0.142, 0.162, 0.181, 0.244] | (np.float64(0.083), np.float64(-0.08)) | 21 | 7.140000 | 5.100000 |
| 2 | Set 2 | (100, [11, 9, 28, 14, 12, 14, 7, 5]) | 0.251000 | [0.224, 0.238, 0.248, 0.264, 0.32] | (np.float64(0.114), np.float64(-0.105)) | 13 | 7.690000 | 5.780000 |
| 3 | Set 3 | (100, [0, 0, 100, 0, 0, 0, 0, 0]) | 0.209000 | [0.17, 0.183, 0.206, 0.229, 0.293] | (np.float64(0.104), np.float64(-0.095)) | 10 | 10.000000 | 5.370000 |
| 4 | Set 4 | (53, [0, 0, 0, 53, 0, 0, 0, 0]) | 0.188000 | [0.082, 0.153, 0.186, 0.225, 0.32] | (np.float64(0.096), np.float64(-0.089)) | 3 | 17.670000 | 3.860000 |
| 5 | Set 5 | (60, [0, 0, 0, 0, 0, 60, 0, 0]) | 0.182000 | [0.044, 0.142, 0.178, 0.222, 0.305] | (np.float64(0.098), np.float64(-0.094)) | 8 | 7.500000 | 3.500000 |
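The cluster columns in this output are related by simple arithmetic: avg_n_feat_per_clust matches the total number of features divided by n_clusters. The short check below reproduces this with the values copied from the table above; rounding to two decimals is an assumption based on the displayed values.

```python
# (total n_features, n_clusters, reported avg_n_feat_per_clust) from the table above
rows = {
    "Set 1": (150, 21, 7.14),
    "Set 2": (100, 13, 7.69),
    "Set 3": (100, 10, 10.00),
    "Set 4": (53, 3, 17.67),
    "Set 5": (60, 8, 7.50),
}

checks = {}
for name, (n_features, n_clusters, avg_reported) in rows.items():
    avg_computed = round(n_features / n_clusters, 2)
    checks[name] = avg_computed == avg_reported
    print(name, avg_computed, checks[name])
```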

The feature sets can be named using the names_feature_sets parameter:

names_feature_sets = ["Best features", "All scales", "Conformation", "Energy", "Polarity"]
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2, names_feature_sets=names_feature_sets)
aa.display_df(df_eval)
| | name | n_features | avg_ABS_AUC | range_ABS_AUC | avg_MEAN_DIF | n_clusters | avg_n_feat_per_clust | std_n_feat_per_clust |
| 1 | Best features | (150, [18, 0, 56, 27, 0, 16, 17, 16]) | 0.164000 | [0.126, 0.142, 0.162, 0.181, 0.244] | (np.float64(0.083), np.float64(-0.08)) | 24 | 6.250000 | 4.580000 |
| 2 | All scales | (100, [11, 9, 28, 14, 12, 14, 7, 5]) | 0.251000 | [0.224, 0.238, 0.248, 0.264, 0.32] | (np.float64(0.114), np.float64(-0.105)) | 13 | 7.690000 | 4.190000 |
| 3 | Conformation | (100, [0, 0, 100, 0, 0, 0, 0, 0]) | 0.209000 | [0.17, 0.183, 0.206, 0.229, 0.293] | (np.float64(0.104), np.float64(-0.095)) | 8 | 12.500000 | 5.810000 |
| 4 | Energy | (53, [0, 0, 0, 53, 0, 0, 0, 0]) | 0.188000 | [0.082, 0.153, 0.186, 0.225, 0.32] | (np.float64(0.096), np.float64(-0.089)) | 3 | 17.670000 | 9.030000 |
| 5 | Polarity | (60, [0, 0, 0, 0, 0, 60, 0, 0]) | 0.182000 | [0.044, 0.142, 0.178, 0.222, 0.305] | (np.float64(0.098), np.float64(-0.094)) | 11 | 5.450000 | 2.930000 |

The evaluation can be focused on specific scale categories using the list_cat parameter:

df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2, list_cat=["Conformation", "Energy", "Polarity"])
aa.display_df(df_eval)
| | name | n_features | avg_ABS_AUC | range_ABS_AUC | avg_MEAN_DIF | n_clusters | avg_n_feat_per_clust | std_n_feat_per_clust |
| 1 | Set 1 | (99, [56, 27, 16]) | 0.165000 | [0.126, 0.142, 0.165, 0.181, 0.244] | (np.float64(0.083), np.float64(-0.079)) | 17 | 5.820000 | 4.120000 |
| 2 | Set 2 | (56, [28, 14, 14]) | 0.252000 | [0.224, 0.234, 0.248, 0.266, 0.32] | (np.float64(0.114), np.float64(-0.106)) | 9 | 6.220000 | 3.080000 |
| 3 | Set 3 | (100, [100, 0, 0]) | 0.209000 | [0.17, 0.183, 0.206, 0.229, 0.293] | (np.float64(0.104), np.float64(-0.095)) | 12 | 8.330000 | 5.530000 |
| 4 | Set 4 | (53, [0, 53, 0]) | 0.188000 | [0.082, 0.153, 0.186, 0.225, 0.32] | (np.float64(0.096), np.float64(-0.089)) | 5 | 10.600000 | 5.710000 |
| 5 | Set 5 | (60, [0, 0, 60]) | 0.182000 | [0.044, 0.142, 0.178, 0.222, 0.305] | (np.float64(0.098), np.float64(-0.094)) | 8 | 7.500000 | 3.430000 |
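The n_features tuple can be unpacked into per-category counts. The sketch below uses the Set 1 entries from the outputs above and assumes that the per-category list follows the order of list_cat, which is consistent with the numbers shown for both the default and the restricted category list.

```python
# Default categories (see the list_cat parameter) and Set 1 from the first evaluation
list_cat_default = ["ASA/Volume", "Composition", "Conformation", "Energy",
                    "Others", "Polarity", "Shape", "Structure-Activity"]
n_total_all, n_per_cat_all = (150, [18, 0, 56, 27, 0, 16, 17, 16])
counts_all = dict(zip(list_cat_default, n_per_cat_all))

# Restricted categories and Set 1 from the evaluation with list_cat
list_cat = ["Conformation", "Energy", "Polarity"]
n_total, n_per_cat = (99, [56, 27, 16])
counts = dict(zip(list_cat, n_per_cat))

# The total only counts features belonging to the selected categories
print(counts)
print(sum(n_per_cat) == n_total)
print(all(counts[cat] == counts_all[cat] for cat in list_cat))
```

For Set 1, restricting list_cat lowers the total from 150 to 99, so the remaining statistics in that row refer to the 99 features of the selected categories.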

To compare feature sets obtained from different sets of parts, provide a list of part DataFrames (list_df_parts) matching the list of feature DataFrames:

split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=5)
# Load one of the provided top scale datasets
df_scales = aa.load_scales(top60_n=38)
list_parts = ["tmd", "tmd_jmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]
list_df_feat1 = []
list_df_parts = []
for part in list_parts:
    df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=part)
    cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales)
    df_feat = cpp.run(labels=labels, label_ref=2, max_overlap=1, max_cor=1)
    list_df_feat1.append(df_feat)
    list_df_parts.append(df_parts)
# Create evaluation for unfiltered features
df_eval = cpp.eval(list_df_feat=list_df_feat1, labels=labels, label_ref=2, names_feature_sets=list_parts, list_df_parts=list_df_parts)
aa.display_df(df_eval)
| | name | n_features | avg_ABS_AUC | range_ABS_AUC | avg_MEAN_DIF | n_clusters | avg_n_feat_per_clust | std_n_feat_per_clust |
| 1 | tmd | (100, [9, 16, 28, 2, 10, 12, 10, 13]) | 0.139000 | [0.067, 0.115, 0.142, 0.162, 0.21] | (np.float64(0.055), np.float64(-0.057)) | 11 | 9.090000 | 3.340000 |
| 2 | tmd_jmd | (100, [11, 13, 18, 14, 5, 23, 5, 11]) | 0.165000 | [0.092, 0.135, 0.161, 0.19, 0.275] | (np.float64(0.056), np.float64(-0.053)) | 23 | 4.350000 | 2.080000 |
| 3 | jmd_n_tmd_n | (100, [14, 10, 25, 5, 10, 17, 9, 10]) | 0.148000 | [0.077, 0.122, 0.143, 0.17, 0.246] | (np.float64(0.054), np.float64(-0.061)) | 10 | 10.000000 | 7.710000 |
| 4 | tmd_c_jmd_c | (100, [13, 17, 29, 18, 1, 17, 0, 5]) | 0.165000 | [0.077, 0.134, 0.162, 0.193, 0.32] | (np.float64(0.074), np.float64(-0.07)) | 14 | 7.140000 | 3.660000 |
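One possible way to read such a comparison is to rank the part combinations by discriminative power and, for ties, by the number of clusters. The ranking rule is an assumption of this sketch (with equal feature counts, more clusters is taken to indicate less redundancy) and is not prescribed by the method. The values are copied from the table above.

```python
# (name, avg_ABS_AUC, n_clusters) copied from the table above
rows = [
    ("tmd", 0.139, 11),
    ("tmd_jmd", 0.165, 23),
    ("jmd_n_tmd_n", 0.148, 10),
    ("tmd_c_jmd_c", 0.165, 14),
]

# Sort by avg_ABS_AUC first, then by n_clusters (both descending)
ranked = sorted(rows, key=lambda row: (row[1], row[2]), reverse=True)
for name, avg_abs_auc, n_clusters in ranked:
    print(f"{name:<12} avg_ABS_AUC={avg_abs_auc:.3f} n_clusters={n_clusters}")
```

Since n_clusters differed between two calls on the same feature sets in the earlier outputs (21 and 24 for the first set), small differences in that column should be interpreted with care.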