Binary Classification Performance Measure
A Binary Classification Performance Measure is a categorical prediction task performance measure that evaluates the quality of predictions on a binary classification task (i.e., a task with exactly two classes).
- AKA: Boolean Classification Performance Measure, Two-Class Classification Metric, Binary Classifier Performance Metric, Predictive Relation Performance Metric.
- Context:
- It can (typically) be derived from a 2x2 Confusion Matrix containing True Positives, True Negatives, False Positives, and False Negatives (see the sketch following this context list).
- It can (typically) evaluate Type I Errors (false positives) and Type II Errors (false negatives) separately.
- It can (typically) be affected by Class Imbalance between positive and negative classes.
- It can (often) require Threshold Selection for probabilistic binary classifiers.
- It can (often) be optimized for specific Cost-Sensitive Learning scenarios where error types have different impacts.
- It can (often) guide Binary Classifier Selection based on domain-specific requirements.
- It can (often) be visualized using ROC Curves, Precision-Recall Curves, or Lift Charts.
- It can range from being a Threshold-Dependent Binary Measure to being a Threshold-Independent Binary Measure, depending on its decision boundary requirement.
- It can range from being a Class-Specific Binary Measure to being a Combined Binary Measure, depending on its class focus.
- It can range from being a Point Estimate Binary Measure to being a Probabilistic Binary Measure, depending on its prediction type.
- It can integrate with Binary Classification Systems for model evaluation.
- It can integrate with Hyperparameter Optimization Frameworks for model tuning.
- ...
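Most of the measures below are functions of the 2x2 confusion matrix mentioned in the context above. The following is a minimal illustrative sketch, not a reference implementation; the function name `confusion_counts` and the example data are chosen for this page and are not drawn from any particular library.

```python
# Minimal sketch: tally the four cells of a 2x2 confusion matrix
# from parallel sequences of actual and predicted binary labels (1 = positive).
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # Type I errors
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # Type II errors
    return tp, tn, fp, fn


if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
    print(confusion_counts(y_true, y_pred))  # (3, 3, 1, 1)
```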
- Example(s):
- Basic Rate Measures (computed in the first sketch following this examples list), such as:
- True Positive Rate (Sensitivity/Recall): TP / (TP + FN).
- True Negative Rate (Specificity): TN / (TN + FP).
- False Positive Rate: FP / (TN + FP).
- False Negative Rate: FN / (TP + FN).
- Precision-Based Measures, such as:
- Precision (Positive Predictive Value): TP / (TP + FP).
- Negative Predictive Value: TN / (TN + FN).
- False Discovery Rate: FP / (TP + FP).
- Combined Measures, such as:
- F1-Score: Harmonic mean of precision and recall.
- F-Beta Score: Weighted harmonic mean of precision and recall, with the beta parameter controlling the relative weight given to recall.
- G-Mean: Geometric mean of sensitivity and specificity.
- Balanced Accuracy: Average of sensitivity and specificity.
- Correlation Measures, such as:
- Matthews Correlation Coefficient: Correlation between predicted and actual binary labels.
- Phi Coefficient: Pearson correlation for binary variables.
- Cohen's Kappa: Agreement corrected for chance.
- Youden's J Statistic: Sensitivity + Specificity - 1.
- Probabilistic Measures (see the second sketch following this examples list), such as:
- Log Loss (Binary Cross-Entropy): Logarithmic penalty for probabilistic predictions.
- Brier Score: Mean squared difference between predicted probabilities and outcomes.
- Expected Calibration Error: Average gap between predicted confidence and observed accuracy, taken over probability bins.
- Threshold-Independent Measures, such as:
- Area Under ROC Curve (AUC-ROC): Performance across all thresholds.
- Area Under Precision-Recall Curve (AUC-PR): Performance for imbalanced datasets.
- Average Precision: Mean of the precision values at each threshold, weighted by the corresponding increase in recall.
- Information-Theoretic Measures, such as:
- Mutual Information: Information shared between predictions and labels.
- Information Gain: Reduction in entropy from predictions.
- Kullback-Leibler Divergence: Divergence between predicted and actual distributions.
- Cost-Sensitive Measures, such as:
- Expected Cost: Weighted sum of error costs.
- Profit Score: Economic value of predictions.
- Utility Score: Domain-specific utility function.
- Statistical Test Measures, such as:
- McNemar's Test: Statistical significance of classifier differences.
- Binomial Test: Significance of classification accuracy.
- DeLong's Test: Significance of AUC differences.
- ...
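The count-based measures listed above follow directly from the four confusion-matrix cells. The sketch below is illustrative only and assumes the `confusion_counts` helper from the earlier sketch; `count_based_measures` is a hypothetical name, and degenerate cases (e.g., an empty predicted-positive column) are not handled.

```python
import math

# Illustrative sketch only: count-based binary classification measures
# computed from the four confusion-matrix cells. Zero denominators
# (degenerate rows or columns) are not guarded against here.
def count_based_measures(tp, tn, fp, fn):
    tpr = tp / (tp + fn)                          # True Positive Rate (Sensitivity / Recall)
    tnr = tn / (tn + fp)                          # True Negative Rate (Specificity)
    fpr = fp / (tn + fp)                          # False Positive Rate
    fnr = fn / (tp + fn)                          # False Negative Rate
    precision = tp / (tp + fp)                    # Positive Predictive Value
    f1 = 2 * precision * tpr / (precision + tpr)  # Harmonic mean of precision and recall
    g_mean = math.sqrt(tpr * tnr)                 # Geometric mean of sensitivity and specificity
    balanced_accuracy = (tpr + tnr) / 2           # Average of sensitivity and specificity
    youden_j = tpr + tnr - 1                      # Youden's J statistic
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )                                             # Matthews Correlation Coefficient
    return {
        "tpr": tpr, "tnr": tnr, "fpr": fpr, "fnr": fnr,
        "precision": precision, "f1": f1, "g_mean": g_mean,
        "balanced_accuracy": balanced_accuracy,
        "youden_j": youden_j, "mcc": mcc,
    }
```

With the example counts from the earlier sketch (tp=3, tn=3, fp=1, fn=1), this gives a sensitivity, specificity, precision, F1, and balanced accuracy of 0.75 each, and an MCC of 0.5.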
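A similarly minimal sketch for some of the probabilistic and threshold-independent measures follows; `log_loss`, `brier_score`, and `auc_roc` are hypothetical helper names chosen for this page. The AUC-ROC is computed via its equivalence to the Mann-Whitney U rank statistic: the probability that a randomly chosen positive instance is scored above a randomly chosen negative one.

```python
import math

# Illustrative sketch only: probabilistic and threshold-independent measures
# computed from true binary labels and predicted positive-class probabilities.
def log_loss(y_true, y_prob, eps=1e-15):
    """Binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probabilities and outcomes."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def auc_roc(y_true, y_prob):
    """AUC via the Mann-Whitney U statistic: the fraction of positive-negative
    pairs in which the positive instance is scored higher (ties count 0.5)."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos > p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

For a classifier that ranks every positive above every negative, `auc_roc` returns 1.0; for uninformative scores it approaches 0.5.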
- Counter-Example(s):
- Multi-Class Classification Performance Measure, which evaluates more than two classes.
- Regression Performance Measure, which evaluates continuous predictions.
- Ranking Performance Measure, which evaluates ordering quality.
- Clustering Performance Measure, which evaluates unsupervised grouping.
- Multi-Label Classification Measure, which allows multiple labels per instance.
- See: Binary Classification Task, Confusion Matrix, ROC Analysis, Precision-Recall Analysis, Type I and Type II Errors, Class Imbalance Problem, Threshold Selection, Cost-Sensitive Learning, Statistical Hypothesis Testing, Classifier Comparison, Model Calibration, Binary Classifier.
References
2009
- Eric W. Weisstein. "Statistical Test." From MathWorld -- A Wolfram Web Resource. http://mathworld.wolfram.com/StatisticalTest.html
- QUOTE: A test used to determine the statistical significance of an observation. Two main types of error can occur:
- 1. A type I error occurs when a false negative result is obtained in terms of the null hypothesis by obtaining a false positive measurement.
- 2. A type II error occurs when a false positive result is obtained in terms of the null hypothesis by obtaining a false negative measurement.
- The probability that a statistical test will be positive for a true statistic is sometimes called the test's sensitivity, and the probability that a test will be negative for a negative statistic is sometimes called the specificity. The following table summarizes the names given to the various combinations of the actual state of affairs and observed test results.
| result | name |
|---|---|
| true positive result | sensitivity |
| false negative result | 1 - sensitivity |
| true negative result | specificity |
| false positive result | 1 - specificity |