Evidence-Grounded QA Performance Measure
An Evidence-Grounded QA Performance Measure is a question answering metric and explainability metric that evaluates both answer correctness and evidence quality in evidence-grounded QA systems.
- AKA: Evidence-Aware QA Metric, Interpretable QA Score.
- Context:
- It can typically combine Answer Accuracy with evidence relevance.
- It can typically penalize Correct Answers with wrong evidence.
- It can typically reward Well-Supported Answers through evidence alignment.
- It can typically measure Evidence Sufficiency for answer justification.
- It can typically assess Evidence Minimality by penalizing redundant passages.
- ...
- It can often incorporate Partial Credit for overlapping evidence.
- It can often employ Hierarchical Scoring for passage, sentence, and span levels.
- It can often include Human Judgments for evidence appropriateness.
- It can often utilize Automatic Alignment Scores between answer and evidence.
- ...
- It can range from being a Joint QA-Evidence Measure to being a Separate Component Measure, depending on its evaluation approach.
- It can range from being a Binary Evidence Measure to being a Graded Evidence Measure, depending on its scoring granularity.
- ...
- It can evaluate Evidence-Grounded Question Answering Task performance.
- It can diagnose System Weaknesses in retrieval or extraction.
- It can guide Model Development for interpretable QA.
- It can support Benchmark Creation for explainable QA.
- ...
- Example(s):
- Natural Questions Score, combining short answer and long answer (evidence passage) evaluation.
- HotpotQA Joint Score requiring answer EM and supporting fact F1.
- Evidence-Conditioned F1 measuring answer quality given correct evidence.
- Passage Relevance Score for retrieved evidence quality.
- Answer Attribution Score linking answer tokens to evidence spans.
- ...
- Counter-Example(s):
- Pure Answer EM, which ignores evidence quality.
- Passage Retrieval MRR, which measures ranking quality, not QA performance.
- Reading Speed Metric, which evaluates efficiency, not answer or evidence quality.
- See: QA Evaluation Metric, Explainability Measure, Joint Performance Metric, Evidence Quality Score.
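
The following is a minimal sketch of a joint QA-evidence measure in the spirit of the HotpotQA Joint Score example above. It is not the official evaluation script of any benchmark. The function names (`evidence_prf`, `answer_exact_match`, `joint_score`), the simple answer normalization, and the choice to multiply answer EM by evidence F1 are assumptions made for illustration. The sketch shows how a correct answer with wrong evidence is penalized and how overlapping evidence earns partial credit.

```python
from typing import Dict, Hashable, Iterable, Tuple


def evidence_prf(
    predicted: Iterable[Hashable], gold: Iterable[Hashable]
) -> Tuple[float, float, float]:
    """Set-level precision, recall, and F1 over evidence identifiers."""
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    precision = overlap / len(pred_set) if pred_set else 0.0
    recall = overlap / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


def answer_exact_match(predicted: str, gold: str) -> float:
    """1.0 if the answers match after lowercasing and collapsing whitespace."""
    def normalize(text: str) -> str:
        return " ".join(text.lower().split())
    return float(normalize(predicted) == normalize(gold))


def joint_score(
    pred_answer: str,
    gold_answer: str,
    pred_evidence: Iterable[Hashable],
    gold_evidence: Iterable[Hashable],
) -> Dict[str, float]:
    """Report component scores plus their product as a joint score.

    Multiplying the components is an illustrative assumption: the joint
    score is zero whenever either the answer or the evidence is fully wrong.
    """
    em = answer_exact_match(pred_answer, gold_answer)
    _, _, ev_f1 = evidence_prf(pred_evidence, gold_evidence)
    return {"answer_em": em, "evidence_f1": ev_f1, "joint": em * ev_f1}


if __name__ == "__main__":
    # Evidence items are hypothetical (document title, sentence index) pairs.
    gold_ev = [("Doc A", 0), ("Doc B", 2)]
    # Correct answer, one redundant evidence item: partial credit.
    print(joint_score("Paris", "paris", gold_ev + [("Doc C", 1)], gold_ev))
    # Correct answer, unrelated evidence: joint score is zero.
    print(joint_score("Paris", "paris", [("Doc C", 1)], gold_ev))
```

Reporting the components alongside the joint value keeps the measure usable as either a Joint QA-Evidence Measure or a Separate Component Measure. Replacing the set-overlap F1 with a 0/1 exact-set match would turn this graded evidence score into a binary one.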