Evidence-Based Classification Performance Measure
An Evidence-Based Classification Performance Measure is a classification performance measure and explainability metric that evaluates both classification accuracy and evidence quality in evidence-based text classification systems.
- AKA: Joint Classification-Evidence Metric, Evidence-Aware Performance Measure.
- Context:
- It can typically combine Label Accuracy Metrics with evidence relevance metrics.
- It can typically penalize Correct Classifications that are supported by incorrect evidence.
- It can typically reward Faithful Explanations through evidence alignment scores.
- It can typically measure Evidence Completeness via coverage metrics.
- It can typically assess Evidence Minimality through precision metrics.
- ...
- It can often incorporate Human Evaluations for evidence plausibility.
- It can often utilize Automatic Scoring Methods for scalable evaluation.
- It can often include Partial Credit Mechanisms for partially overlapping evidence spans.
- It can often employ Weighted Scoring Schemes based on evidence importance.
- ...
- It can range from being a Binary Evidence-Based Measure to being a Graded Evidence-Based Measure, depending on its scoring granularity.
- It can range from being a Token-Level Evidence Measure to being a Sentence-Level Evidence Measure, depending on its evaluation granularity.
- ...
- It can evaluate Evidence-Based Text Classification Task performance.
- It can be computed from Predicted Labels and extracted evidence spans (a joint-scoring sketch follows this list).
- It can be normalized between zero and one for cross-task comparison.
- It can guide Model Selection in evidence-based NLP applications.
- ...
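The joint-scoring idea from the context items above can be made concrete. Below is a minimal Python sketch, assuming gold and predicted evidence are given as sets of token indices; the function names (evidence_f1, joint_score) and the default 50/50 weighting scheme are illustrative assumptions, not a standard definition.

```python
from typing import Iterable, List, Set, Tuple


def evidence_f1(gold: Set[int], predicted: Set[int]) -> float:
    """Token-level F1 between gold and predicted evidence token indices."""
    if not gold and not predicted:
        return 1.0  # vacuously perfect: no evidence expected, none given
    if not gold or not predicted:
        return 0.0
    overlap = len(gold & predicted)
    precision = overlap / len(predicted)
    recall = overlap / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def joint_score(
    examples: Iterable[Tuple[str, str, Set[int], Set[int]]],
    evidence_weight: float = 0.5,
) -> float:
    """Average joint score over (gold_label, pred_label, gold_evidence, pred_evidence) tuples.

    A correct label earns (1 - evidence_weight) outright; the remaining
    evidence_weight share is earned in proportion to evidence F1, so correct
    classifications with incorrect evidence are penalized. Wrong labels score
    zero regardless of evidence. The result is normalized to [0, 1].
    """
    scores: List[float] = []
    for gold_label, pred_label, gold_ev, pred_ev in examples:
        label_ok = 1.0 if gold_label == pred_label else 0.0
        ev_quality = evidence_f1(gold_ev, pred_ev)
        scores.append(label_ok * ((1 - evidence_weight) + evidence_weight * ev_quality))
    return sum(scores) / len(scores) if scores else 0.0
```

With evidence_weight = 0.5, a correct label with perfect evidence scores 1.0, a correct label with disjoint evidence scores 0.5, and an incorrect label scores 0.0 regardless of evidence quality.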
- Example(s):
- FEVER Score, which requires both a correct label and a complete evidence set.
- Evidence F1 Score, measuring token-level overlap with gold evidence spans.
- Comprehensiveness Score, quantifying how much the prediction changes when the evidence is removed.
- Sufficiency Score, measuring prediction confidence when only the evidence is retained (both sketched after this list).
- Evidence IOU Score, calculating the intersection-over-union of predicted and gold evidence spans.
- ...
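The Comprehensiveness Score and Sufficiency Score can be sketched directly from their ERASER-style definitions. The sketch below assumes the model under evaluation is exposed as a predict_proba callable mapping a token sequence to a list of class probabilities; that interface is an assumption for illustration, not a specific library's API.

```python
from typing import Callable, List, Sequence, Set

# Assumed model interface: maps a token sequence to class probabilities.
PredictProba = Callable[[Sequence[str]], List[float]]


def comprehensiveness(
    predict_proba: PredictProba, tokens: Sequence[str], evidence: Set[int]
) -> float:
    """Drop in predicted-class probability when the evidence tokens are removed.

    Higher is better: removing faithful evidence should hurt the prediction.
    """
    full_probs = predict_proba(tokens)
    pred_class = max(range(len(full_probs)), key=full_probs.__getitem__)
    reduced = [tok for i, tok in enumerate(tokens) if i not in evidence]
    return full_probs[pred_class] - predict_proba(reduced)[pred_class]


def sufficiency(
    predict_proba: PredictProba, tokens: Sequence[str], evidence: Set[int]
) -> float:
    """Drop in predicted-class probability when only the evidence tokens are kept.

    Lower is better: sufficient evidence alone should reproduce the prediction.
    """
    full_probs = predict_proba(tokens)
    pred_class = max(range(len(full_probs)), key=full_probs.__getitem__)
    kept = [tok for i, tok in enumerate(tokens) if i in evidence]
    return full_probs[pred_class] - predict_proba(kept)[pred_class]
```

Under these definitions, a faithful evidence extractor yields high comprehensiveness and low (near-zero) sufficiency.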
- Counter-Example(s):
- Pure Accuracy Metrics, which ignore evidence quality.
- BLEU Scores, which measure text generation quality, not evidence grounding.
- Perplexity Metrics, which evaluate language modeling, not classification evidence.
- See: Explainability Metric, Classification Performance Measure, Evidence Quality Metric, Interpretability Evaluation.