CPP.eval

CPP.eval(list_df_feat, labels, label_test=1, label_ref=0, min_th=0.0, names_feature_sets=None, list_cat=None, list_df_parts=None, n_jobs=1)

Evaluate the quality of different sets of identified Comparative Physicochemical Profiling (CPP) features.

Feature sets are evaluated with respect to two quality criteria:

Discriminative Power: The capability of features to distinguish between test and reference datasets.
Redundancy: Assessed by the optimized number of clusters, based on Pearson correlation among features.
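
As a rough illustration of discriminative power, the sketch below computes a rank-based AUC for a single feature and shifts it by 0.5, so that 0 means no separation between the test and reference group. The helper name `auc_adjusted` and the toy values are hypothetical, and the shift by 0.5 is an assumption about how ‘abs_auc’ is defined; comp_auc_adjusted() remains the authoritative reference.

```python
def auc_adjusted(values, labels, label_test=1, label_ref=0):
    """Rank-based AUC of one feature, shifted to the range [-0.5, 0.5]."""
    test = [v for v, y in zip(values, labels) if y == label_test]
    ref = [v for v, y in zip(values, labels) if y == label_ref]
    if not test or not ref:
        raise ValueError("Both groups need at least one sample")
    # Count test/reference pairs where the test value is larger (ties count half)
    wins = sum((t > r) + 0.5 * (t == r) for t in test for r in ref)
    return wins / (len(test) * len(ref)) - 0.5


# Toy feature values for six samples (hypothetical)
values = [0.9, 0.8, 0.5, 0.4, 0.3, 0.6]
labels = [1, 1, 1, 0, 0, 0]
auc = auc_adjusted(values, labels)
abs_auc = abs(auc)  # direction-independent strength of separation
```

Swapping label_test and label_ref flips the sign of the adjusted AUC but leaves its absolute value unchanged, which is why the absolute value is convenient for comparing feature sets.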

Added in version 0.1.0.

Parameters:

list_df_feat (list of pd.DataFrames) – List of feature DataFrames, each of shape (n_features, n_feature_info).
labels (array-like, shape (n_samples,)) – Class labels for samples in sequence DataFrame (typically, test=1, reference=0).
label_test (int, default=1) – Class label of test group in labels.
label_ref (int, default=0) – Class label of reference group in labels.
min_th (float, default=0.0) – Pearson correlation threshold for clustering optimization (between -1 and 1).
names_feature_sets (list of str, optional) – List of names for feature sets corresponding to list_df_feat.
list_cat (list of str, optional) – Scale categories for which the number of features is reported. Default: [‘ASA/Volume’, ‘Composition’, ‘Conformation’, ‘Energy’, ‘Others’, ‘Polarity’, ‘Shape’, ‘Structure-Activity’].
list_df_parts (list of pd.DataFrames, optional) – List of part DataFrames, each of shape (n_samples, n_parts). Must correspond one-to-one to list_df_feat.
n_jobs (int, None, or -1, default=1) – Number of CPU cores (>=1) used for multiprocessing. If None, the number is optimized automatically. If -1, the number is set to all available cores. Overridden by options['n_jobs'] when set.
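
The constraints stated in the parameter list can be sketched as a small pre-check. `check_eval_inputs` is a hypothetical helper and not part of the library; it only restates the documented expectations, namely that both class labels occur in labels and that the optional lists match list_df_feat in length.

```python
def check_eval_inputs(list_df_feat, labels, label_test=1, label_ref=0,
                      names_feature_sets=None, list_df_parts=None):
    """Hypothetical pre-check mirroring the documented parameter constraints."""
    if label_test not in labels or label_ref not in labels:
        raise ValueError("'label_test' and 'label_ref' must both occur in 'labels'")
    n_sets = len(list_df_feat)
    for name, arg in [("names_feature_sets", names_feature_sets),
                      ("list_df_parts", list_df_parts)]:
        if arg is not None and len(arg) != n_sets:
            raise ValueError(f"'{name}' must have {n_sets} entries, got {len(arg)}")
    return n_sets


# Placeholders stand in for feature DataFrames; labels use 1 (test) and 2 (reference)
n_sets = check_eval_inputs(["df_a", "df_b"], labels=[1, 1, 2, 2],
                           label_ref=2, names_feature_sets=["A", "B"])
```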

Returns:

df_eval – Evaluation results for each set of identified features. For each set, statistical measures are averaged across all features.

Return type:

pd.DataFrame

Notes

df_eval includes the following columns (upper-case indicates direct reference to df_feat columns):
- ‘name’: Name of the feature set, taken from names_feature_sets if provided (typically based on CPP run settings); otherwise ‘Set 1’, ‘Set 2’, and so on.
- ‘n_features’: Tuple with total number of features and list of number of features per scale category from list_cat.
- ‘avg_ABS_AUC’: Absolute Area Under the Curve (AUC) averaged across all features.
- ‘range_ABS_AUC’: Five-number summary of absolute AUC among all features (min, 25%, median, 75%, max).
- ‘avg_MEAN_DIF’: Tuple of mean differences averaged across all features separately for features with positive and negative ‘mean_dif’.
- ‘n_clusters’: Optimal number of clusters, in the range [2, 100].
- ‘avg_n_feat_per_clust’: Average number of features per cluster.
- ‘std_n_feat_per_clust’: Standard deviation of feature number per cluster.
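
The summary columns can be pictured with a small sketch. It assumes that ‘range_ABS_AUC’ lists the minimum, quartiles, and maximum of the per-feature absolute AUC values, and that ‘avg_MEAN_DIF’ averages positive and negative mean differences separately, as described in the list above. The toy values are hypothetical, and the percentile interpolation of statistics.quantiles may differ from the one used by the library.

```python
from statistics import mean, quantiles

# Toy per-feature values (hypothetical, not taken from a real df_feat)
abs_auc = [0.12, 0.15, 0.18, 0.21, 0.24]
mean_dif = [0.08, -0.06, 0.10, -0.10, 0.09]

avg_abs_auc = round(mean(abs_auc), 3)

# Five-number summary: min, 25%, median, 75%, max
q25, q50, q75 = quantiles(abs_auc, n=4, method="inclusive")
range_abs_auc = [round(x, 3) for x in (min(abs_auc), q25, q50, q75, max(abs_auc))]

# Average positive and negative mean differences separately
pos = [d for d in mean_dif if d > 0]
neg = [d for d in mean_dif if d < 0]
avg_mean_dif = (round(mean(pos), 3), round(mean(neg), 3))
```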

‘n_clusters’ is optimized for a KMeans clustering model. The criterion is the minimum Pearson correlation between each cluster center and its members, taken across all clusters (min_cor_center in AAclust), which must exceed the minimum correlation threshold min_th.
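
The optimization criterion can be sketched for a fixed clustering. The code below is a simplified illustration, not the AAclust implementation: it takes a given assignment of toy feature vectors to clusters, uses the element-wise mean of the members as cluster center (an assumption of this sketch), and checks whether the minimum center-to-member Pearson correlation exceeds min_th. The helper names `pearson` and `min_cor_center` are local to this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def min_cor_center(vectors, cluster_labels):
    """Minimum center-to-member correlation across all clusters."""
    min_cor = 1.0
    for c in set(cluster_labels):
        members = [v for v, lab in zip(vectors, cluster_labels) if lab == c]
        # Cluster center: element-wise mean of the members (assumption of this sketch)
        center = [sum(col) / len(col) for col in zip(*members)]
        min_cor = min(min_cor, min(pearson(center, m) for m in members))
    return min_cor


# Four toy feature vectors over five samples (hypothetical values)
vectors = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [1.0, 2.0, 3.0, 5.0, 4.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [5.0, 3.0, 4.0, 2.0, 1.0],
]
min_th = 0.5
one_cluster_ok = min_cor_center(vectors, [0, 0, 0, 0]) > min_th
two_clusters_ok = min_cor_center(vectors, [0, 0, 1, 1]) > min_th
```

With all four vectors in one cluster the criterion fails, while splitting them into two clusters of similar vectors satisfies it. A higher min_th therefore tends to require more clusters.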

See also

CPPPlot.eval() for the respective plotting method.
AAontology: Classification of Amino Acid Scales for details on scale categories.
CPP.run() for details on CPP statistical measures.
comp_auc_adjusted() for details on ‘abs_auc’.
sklearn.cluster.KMeans for employed clustering model.
AAclust ([Breimann24a]) for details on cluster optimization using Pearson correlation.

Examples

To demonstrate the CPP().eval() method, we load the DOM_GSEC_PU example dataset and its respective feature set (see [Breimann25]):

import aaanalysis as aa
aa.options["verbose"] = False
df_seq = aa.load_dataset(name="DOM_GSEC_PU", n=50)
labels = df_seq["label"].to_list()
sf = aa.SequenceFeature()
df_parts = sf.get_df_parts(df_seq=df_seq)
df_cat = aa.load_scales(name="scales_cat")
df_scales = aa.load_scales()
df_feat_best = aa.load_features()

We can now create feature sets using the CPP().run() method:

# Use all scales
cpp = aa.CPP(df_parts=df_parts)
df_feat_all_scales = cpp.run(labels=labels, label_ref=2)

# Use Conformation scales
scales_conformation = df_cat[df_cat["category"] == "Conformation"]["scale_id"].to_list()
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_conformation])
df_feat_conformation = cpp.run(labels=labels, label_ref=2)

# Use Energy scales
scales_energy = df_cat[df_cat["category"] == "Energy"]["scale_id"].to_list()
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_energy])
df_feat_energy = cpp.run(labels=labels, label_ref=2)

# Use Polarity scales
scales_polarity = df_cat[df_cat["category"] == "Polarity"]["scale_id"].to_list()
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales[scales_polarity])
df_feat_polarity = cpp.run(labels=labels, label_ref=2)

These sets can be evaluated using the CPP().eval() method, which needs the list of feature DataFrames (list_df_feat) and labels as input:

# Collect the feature sets and create a new CPP object with all scales
list_df_feat = [df_feat_best, df_feat_all_scales, df_feat_conformation, df_feat_energy, df_feat_polarity]
cpp = aa.CPP(df_parts=df_parts, df_scales=df_scales)
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2)
aa.display_df(df_eval)

	name	n_features	avg_ABS_AUC	range_ABS_AUC	avg_MEAN_DIF	n_clusters	avg_n_feat_per_clust	std_n_feat_per_clust
1	Set 1	(150, [18, 0, 56, 27, 0, 16, 17, 16, 0, 0, 0, 0])	0.164000	[0.126, 0.142, 0.162, 0.181, 0.244]	(np.float64(0.083), np.float64(-0.08))	24	6.250000	4.470000
2	Set 2	(100, [11, 9, 28, 14, 12, 14, 7, 5, 0, 0, 0, 0])	0.251000	[0.224, 0.238, 0.248, 0.264, 0.32]	(np.float64(0.114), np.float64(-0.105))	14	7.140000	3.640000
3	Set 3	(100, [0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0])	0.209000	[0.17, 0.183, 0.206, 0.229, 0.293]	(np.float64(0.104), np.float64(-0.095))	10	10.000000	4.690000
4	Set 4	(53, [0, 0, 0, 53, 0, 0, 0, 0, 0, 0, 0, 0])	0.188000	[0.082, 0.153, 0.186, 0.225, 0.32]	(np.float64(0.096), np.float64(-0.089))	3	17.670000	4.780000
5	Set 5	(60, [0, 0, 0, 0, 0, 60, 0, 0, 0, 0, 0, 0])	0.182000	[0.044, 0.142, 0.178, 0.222, 0.305]	(np.float64(0.098), np.float64(-0.094))	9	6.670000	4.240000

The feature sets can be named using the names_feature_sets parameter:

names_feature_sets = ["Best features", "All scales", "Conformation", "Energy", "Polarity"]
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2, names_feature_sets=names_feature_sets)
aa.display_df(df_eval)

	name	n_features	avg_ABS_AUC	range_ABS_AUC	avg_MEAN_DIF	n_clusters	avg_n_feat_per_clust	std_n_feat_per_clust
1	Best features	(150, [18, 0, 56, 27, 0, 16, 17, 16, 0, 0, 0, 0])	0.164000	[0.126, 0.142, 0.162, 0.181, 0.244]	(np.float64(0.083), np.float64(-0.08))	23	6.520000	4.220000
2	All scales	(100, [11, 9, 28, 14, 12, 14, 7, 5, 0, 0, 0, 0])	0.251000	[0.224, 0.238, 0.248, 0.264, 0.32]	(np.float64(0.114), np.float64(-0.105))	20	5.000000	3.070000
3	Conformation	(100, [0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0])	0.209000	[0.17, 0.183, 0.206, 0.229, 0.293]	(np.float64(0.104), np.float64(-0.095))	12	8.330000	5.170000
4	Energy	(53, [0, 0, 0, 53, 0, 0, 0, 0, 0, 0, 0, 0])	0.188000	[0.082, 0.153, 0.186, 0.225, 0.32]	(np.float64(0.096), np.float64(-0.089))	5	10.600000	6.220000
5	Polarity	(60, [0, 0, 0, 0, 0, 60, 0, 0, 0, 0, 0, 0])	0.182000	[0.044, 0.142, 0.178, 0.222, 0.305]	(np.float64(0.098), np.float64(-0.094))	7	8.570000	4.440000
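
The redundancy columns are linked: in the output above, ‘avg_n_feat_per_clust’ matches the total number of features divided by ‘n_clusters’. The sketch below re-checks this with values copied from the table; the list-of-tuples layout is only for illustration and does not reflect how df_eval is stored.

```python
# (name, total n_features, n_clusters, avg_n_feat_per_clust) copied from the table above
rows = [
    ("Best features", 150, 23, 6.52),
    ("All scales", 100, 20, 5.00),
    ("Conformation", 100, 12, 8.33),
    ("Energy", 53, 5, 10.60),
    ("Polarity", 60, 7, 8.57),
]

for name, n_features, n_clusters, avg_reported in rows:
    avg_computed = round(n_features / n_clusters, 2)
    print(f"{name}: {avg_computed} (reported: {avg_reported})")

# True if every reported average equals n_features / n_clusters after rounding
all_consistent = all(round(n / k, 2) == avg for _, n, k, avg in rows)
```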

The evaluation can be focused on specific scale categories using the list_cat parameter:

df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels, label_ref=2, list_cat=["Conformation", "Energy", "Polarity"])
aa.display_df(df_eval)

	name	n_features	avg_ABS_AUC	range_ABS_AUC	avg_MEAN_DIF	n_clusters	avg_n_feat_per_clust	std_n_feat_per_clust
1	Set 1	(99, [56, 27, 16])	0.165000	[0.126, 0.142, 0.165, 0.181, 0.244]	(np.float64(0.083), np.float64(-0.079))	18	5.500000	4.070000
2	Set 2	(56, [28, 14, 14])	0.252000	[0.224, 0.234, 0.248, 0.266, 0.32]	(np.float64(0.114), np.float64(-0.106))	9	6.220000	2.820000
3	Set 3	(100, [100, 0, 0])	0.209000	[0.17, 0.183, 0.206, 0.229, 0.293]	(np.float64(0.104), np.float64(-0.095))	8	12.500000	9.770000
4	Set 4	(53, [0, 53, 0])	0.188000	[0.082, 0.153, 0.186, 0.225, 0.32]	(np.float64(0.096), np.float64(-0.089))	3	17.670000	4.780000
5	Set 5	(60, [0, 0, 60])	0.182000	[0.044, 0.142, 0.178, 0.222, 0.305]	(np.float64(0.098), np.float64(-0.094))	15	4.000000	3.100000
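
How list_cat narrows the ‘n_features’ tuple can be sketched with the per-category counts of Set 1. Pairing the first eight counts of the earlier output with the default category order is an assumption based on the default list_cat given under Parameters, and `n_features_tuple` is a hypothetical helper.

```python
# Per-category counts of Set 1 from the first evaluation, paired with the default
# category order (assumed mapping; trailing zero entries of the output are omitted)
counts_set1 = {
    "ASA/Volume": 18, "Composition": 0, "Conformation": 56, "Energy": 27,
    "Others": 0, "Polarity": 16, "Shape": 17, "Structure-Activity": 16,
}


def n_features_tuple(counts, list_cat=None):
    """Return (total, counts per category) restricted to list_cat."""
    if list_cat is None:
        list_cat = list(counts)
    per_cat = [counts.get(cat, 0) for cat in list_cat]
    return sum(per_cat), per_cat


all_cats = n_features_tuple(counts_set1)
selected = n_features_tuple(counts_set1, list_cat=["Conformation", "Energy", "Polarity"])
```

The result for the three selected categories, (99, [56, 27, 16]), agrees with the first row of the table above.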

To compare feature sets obtained from different sets of parts, provide a list of part DataFrames (list_df_parts) matching the list of feature DataFrames:

# Load one of the provided top scale datasets
split_kws = sf.get_split_kws(split_types=["Segment"], n_split_max=5)
df_scales = aa.load_scales(top60_n=38)
list_parts = ["tmd", "tmd_jmd", "jmd_n_tmd_n", "tmd_c_jmd_c"]
list_df_feat1 = []
list_df_parts = []
for part in list_parts:
    df_parts = sf.get_df_parts(df_seq=df_seq, list_parts=part)
    cpp = aa.CPP(df_parts=df_parts, split_kws=split_kws, df_scales=df_scales)
    df_feat = cpp.run(labels=labels, label_ref=2, max_overlap=1, max_cor=1)
    list_df_feat1.append(df_feat)
    list_df_parts.append(df_parts)

# Create evaluation for unfiltered features
df_eval = cpp.eval(list_df_feat=list_df_feat1, labels=labels, label_ref=2, names_feature_sets=list_parts, list_df_parts=list_df_parts)
aa.display_df(df_eval)

	name	n_features	avg_ABS_AUC	range_ABS_AUC	avg_MEAN_DIF	n_clusters	avg_n_feat_per_clust	std_n_feat_per_clust
1	tmd	(100, [9, 16, 28, 2, 10, 12, 10, 13, 0, 0, 0, 0])	0.139000	[0.067, 0.115, 0.142, 0.162, 0.21]	(np.float64(0.055), np.float64(-0.057))	5	20.000000	6.070000
2	tmd_jmd	(100, [11, 13, 18, 14, 5, 23, 5, 11, 0, 0, 0, 0])	0.165000	[0.092, 0.135, 0.161, 0.19, 0.275]	(np.float64(0.056), np.float64(-0.053))	12	8.330000	4.210000
3	jmd_n_tmd_n	(100, [14, 10, 25, 5, 10, 17, 9, 10, 0, 0, 0, 0])	0.148000	[0.077, 0.122, 0.143, 0.17, 0.246]	(np.float64(0.054), np.float64(-0.061))	9	11.110000	5.190000
4	tmd_c_jmd_c	(100, [13, 17, 29, 18, 1, 17, 0, 5, 0, 0, 0, 0])	0.165000	[0.077, 0.134, 0.162, 0.193, 0.32]	(np.float64(0.074), np.float64(-0.07))	15	6.670000	3.280000

Further parameters can be set: label_test (class label of the test group in labels), min_th (Pearson correlation threshold for clustering optimization, between -1 and 1), and n_jobs (number of CPU cores, >=1, used for multiprocessing):

# Set the test group label (label_test), the correlation threshold for
# redundancy clustering (min_th), and the number of CPU cores (n_jobs).
# cpp, df_parts, and df_scales were reassigned above, so rebuild the CPP object
# with all parts and all scales to match the feature sets in list_df_feat.
df_parts_full = sf.get_df_parts(df_seq=df_seq)
cpp = aa.CPP(df_parts=df_parts_full, df_scales=aa.load_scales())
df_eval = cpp.eval(list_df_feat=list_df_feat, labels=labels,
                   label_test=1, label_ref=2, min_th=0.5, n_jobs=1)
aa.display_df(df_eval)

	name	n_features	avg_ABS_AUC	range_ABS_AUC	avg_MEAN_DIF	n_clusters	avg_n_feat_per_clust	std_n_feat_per_clust
1	Set 1	(150, [18, 0, 56, 27, 0, 16, 17, 16, 0, 0, 0, 0])	0.164000	[0.126, 0.142, 0.162, 0.181, 0.244]	(np.float64(0.083), np.float64(-0.08))	90	1.670000	0.980000
2	Set 2	(100, [11, 9, 28, 14, 12, 14, 7, 5, 0, 0, 0, 0])	0.251000	[0.224, 0.238, 0.248, 0.264, 0.32]	(np.float64(0.114), np.float64(-0.105))	52	1.920000	1.190000
3	Set 3	(100, [0, 0, 100, 0, 0, 0, 0, 0, 0, 0, 0, 0])	0.209000	[0.17, 0.183, 0.206, 0.229, 0.293]	(np.float64(0.104), np.float64(-0.095))	74	1.350000	0.600000
4	Set 4	(53, [0, 0, 0, 53, 0, 0, 0, 0, 0, 0, 0, 0])	0.188000	[0.082, 0.153, 0.186, 0.225, 0.32]	(np.float64(0.096), np.float64(-0.089))	32	1.660000	0.990000
5	Set 5	(60, [0, 0, 0, 0, 0, 60, 0, 0, 0, 0, 0, 0])	0.182000	[0.044, 0.142, 0.178, 0.222, 0.305]	(np.float64(0.098), np.float64(-0.094))	47	1.280000	0.790000