comp_detection_metrics

comp_detection_metrics(list_scores, list_positions, threshold=0.5, tolerance=0)

Compute pooled detection metrics at a fixed score threshold.

Answers “is the true site actually called?” (as distinct from ranking). Residues scoring >= threshold are positive calls. Calls are pooled across proteins into true positives, false positives, false negatives, and true negatives (TP/FP/FN/TN), then reduced to recall, precision, F1, and the Matthews correlation coefficient (MCC). With tolerance > 0, a call within tolerance residues of a true site counts as a true positive, and each true site is credited at most once.

Added in version 1.1.0.

Parameters:
  • list_scores (list of array-like) – Per-protein per-residue score vectors. NaN scores are ignored.

  • list_positions (list of array-like) – Per-protein 0-based indices of positive sites.

  • threshold (float, default=0.5) – Score threshold for a positive call.

  • tolerance (int, default=0) – Positional tolerance (in residues) for counting a TP.

Returns:

metrics – Keys recall, precision, f1, mcc (floats) and tp, fp, fn, tn (ints).

Return type:

dict

Examples

comp_detection_metrics pools per-residue predictions at a fixed score threshold and returns recall / precision / F1 / MCC (and the TP/FP/FN/TN counts) as a dict.

import numpy as np
import aaanalysis as aa

list_scores = [np.array([0.9, 0.1, 0.8, 0.2]), np.array([0.1, 0.9, 0.2, 0.7])]
list_positions = [[0, 2], [1, 3]]
aa.comp_detection_metrics(list_scores=list_scores, list_positions=list_positions,
                          threshold=0.5)
{'recall': 1.0,
 'precision': 1.0,
 'f1': 1.0,
 'mcc': np.float64(1.0),
 'tp': 4,
 'fp': 0,
 'fn': 0,
 'tn': 4}
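
The reduction from pooled counts to the four scores can be sketched with the standard textbook formulas. This is a minimal illustration, not the aaanalysis implementation: the helper name metrics_from_counts is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the library may handle differently.

```python
import math


def metrics_from_counts(tp, fp, fn, tn):
    """Reduce pooled confusion counts to recall, precision, F1, and MCC.

    Hypothetical helper for illustration only. Undefined ratios
    (zero denominator) are reported as 0.0, which is an assumption.
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and recall
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # MCC uses all four cells of the confusion matrix
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f1": f1, "mcc": mcc}


# Counts from the example above: tp=4, fp=0, fn=0, tn=4
print(metrics_from_counts(tp=4, fp=0, fn=0, tn=4))
# {'recall': 1.0, 'precision': 1.0, 'f1': 1.0, 'mcc': 1.0}
```

Because the counts are pooled before the scores are computed, proteins with many residues contribute more than short ones; this is a micro-average rather than a per-protein mean.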

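tolerance widens what counts as a hit: a predicted residue is a true positive when it lies within tolerance positions of a true site (default 0 = exact match). Relaxing it can recover near-miss predictions. In the example data every call already falls exactly on a true site, so both settings give the same result: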

import pandas as pd
df_tol = pd.DataFrame(
    [aa.comp_detection_metrics(list_scores=list_scores, list_positions=list_positions,
                               threshold=0.5, tolerance=t) for t in [0, 1]],
    index=["tolerance=0", "tolerance=1"])
aa.display_df(df_tol)
               recall  precision        f1       mcc  tp  fp  fn  tn
tolerance=0  1.000000   1.000000  1.000000  1.000000   4   0   0   4
tolerance=1  1.000000   1.000000  1.000000  1.000000   4   0   0   4
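
To see tolerance change the counts, a near miss is needed. The sketch below is one possible reading of the matching rule described above and is not taken from the aaanalysis source: the helper match_calls is hypothetical, it greedily pairs each call with the nearest still-unmatched true site, and it reports only TP/FP/FN, since how true negatives are counted under a tolerance is not specified here.

```python
import numpy as np


def match_calls(scores, positions, threshold=0.5, tolerance=0):
    """Count TP/FP/FN for one protein (hypothetical helper, assumed rule).

    A call is a residue with a non-NaN score >= threshold. Each call is
    paired with the nearest unmatched true site within `tolerance`
    residues, so every true site is credited at most once.
    """
    scores = np.asarray(scores, dtype=float)
    valid = ~np.isnan(scores)  # NaN scores are ignored
    called = np.flatnonzero(valid)[scores[valid] >= threshold]
    unmatched = {int(p) for p in positions}
    tp = fp = 0
    for idx in called:
        near = [p for p in unmatched if abs(p - idx) <= tolerance]
        if near:
            unmatched.remove(min(near, key=lambda p: abs(p - idx)))
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched)  # unmatched true sites are false negatives


# The call lands on residue 1, but the true site is residue 2
scores = [0.1, 0.9, 0.2, 0.1]
print(match_calls(scores, [2], tolerance=0))  # (0, 1, 1)
print(match_calls(scores, [2], tolerance=1))  # (1, 0, 0)

# Pooling across proteins sums the per-protein counts
list_scores = [np.array([0.9, 0.1, 0.8, 0.2]), np.array([0.1, 0.9, 0.2, 0.7])]
list_positions = [[0, 2], [1, 3]]
per_protein = (match_calls(s, p) for s, p in zip(list_scores, list_positions))
print(tuple(map(sum, zip(*per_protein))))  # (4, 0, 0)
```

Under this reading, a single off-by-one call costs both a false positive and a false negative at tolerance=0, and becomes one true positive at tolerance=1.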