Text Summarization Performance Measure
A Text Summarization Performance Measure is an NLP performance metric that quantifies a text summarization system's ability to solve a text summarization task.
- Context:
- It can involve assessing Summary Coherence, Summary Relevance, and Summary Fluency.
- It can inform a Summarization Application's usability.
- …
- Example(s):
- ROUGE, which evaluates the quality of a summary by measuring its n-gram overlap with reference summaries.
- BLEU, typically used for machine translation but also applicable to summarization, where it assesses the n-gram precision of generated summaries.
- METEOR, which aligns words between the generated summary and reference texts, crediting exact, stemmed, and synonym matches.
- BERTScore, which leverages BERT embeddings to evaluate semantic similarity between generated and reference summaries.
- Text Summarization Precision measures the proportion of content in the generated summary that is relevant or important, i.e., how much of the summary's content is actually grounded in the original text.
- Text Summarization Recall assesses how much of the important content from the original text is captured in the summary; a sketch of both measures appears after this list.
- Reference and Document Aware Semantic Score (RDASS).
- ...
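To make the precision/recall distinction concrete, here is a minimal sketch of unigram-overlap scoring in the style of ROUGE-1. It assumes lowercased whitespace tokenization; a real evaluation would use a maintained implementation (such as the rouge-score package) with proper tokenization and stemming.

```python
from collections import Counter

def unigram_prf(candidate: str, reference: str) -> dict:
    """ROUGE-1-style precision, recall, and F1 from unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each shared unigram counts up to its frequency in both texts.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(unigram_prf(candidate, reference))  # precision = recall = 5/6
```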
- Counter-Example(s):
- See: Text Summarization Faithfulness Evaluation, Automatic Text Summarization, Natural Language Processing.
References
2023
- (Yun et al., 2023) ⇒ Jiseon Yun, Jae Eui Sohn, and Sunghyon Kyeong. (2023). “Fine-Tuning Pretrained Language Models to Enhance Dialogue Summarization in Customer Service Centers.” In: Proceedings of the Fourth ACM International Conference on AI in Finance. doi:10.1145/3604237.3626838
- QUOTE: ... The results demonstrated that the fine-tuned model based on KakaoBank’s internal datasets outperformed the reference model, showing a 199% and 12% improvement in ROUGE-L and RDASS, respectively. ...
- QUOTE: ... RDASS is a comprehensive evaluation metric that considers the relationships among the original document, reference summary, and model-generated summary. Compared to ROUGE, RDASS performed better in terms of relevance, consistency, and fluency of sentences in Korean. Therefore, we employed both ROUGE and RDASS as evaluation metrics, considering their respective strengths and weaknesses of each metric. ...
- QUOTE: ... RDASS measures the similarity between the vectors of the original document and reference summary. Moreover, it measures the similarity between the vectors of the original document and generated summary. Finally, RDASS can be obtained by computing their average. ...
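Following the quoted description, an RDASS-style score can be sketched as the average of two embedding similarities. The library (sentence-transformers) and model name below are illustrative assumptions for demonstration, not the Korean-language setup used by Yun et al.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; Yun et al. work on Korean text and do not
# necessarily use this library or model.
model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two dense vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rdass_style(document: str, reference: str, generated: str) -> float:
    """Average of the document-reference and document-generated summary
    similarities, per the description quoted above."""
    d, r, g = model.encode([document, reference, generated])
    return (cosine(d, r) + cosine(d, g)) / 2
```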
2023
- (Foysal & Böck, 2023) ⇒ Abdullah Al Foysal, and Ronald Böck. (2023). “Who Needs External References?âText Summarization Evaluation Using Original Documents.” In: AI, 4(4). doi:10.3390/ai4040049
- NOTEs:
- It introduces a new metric, SUSWIR (Summary Score without Reference), which evaluates automatic text summarization quality by considering Semantic Similarity, Relevance, Redundancy, and Bias Avoidance, without requiring human-generated reference summaries (a toy reference-free sketch follows these notes).
- It emphasizes the limitations of traditional text summarization evaluation methods like ROUGE, BLEU, and METEOR, particularly in situations where no reference summaries are available, motivating the need for a more flexible and unbiased approach.
- It demonstrates SUSWIR's effectiveness through extensive testing on various datasets, including CNN/Daily Mail and BBC Articles, showing that this new metric provides reliable and consistent assessments compared to traditional methods.
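The paper's SUSWIR formulas are not reproduced in these notes, so the sketch below illustrates only the general reference-free idea: scoring a summary against its source document alone. TF-IDF cosine similarity is a crude stand-in here, not SUSWIR's actual Semantic Similarity component.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def reference_free_similarity(document: str, summary: str) -> float:
    """Cosine similarity between TF-IDF vectors of a source document and
    its candidate summary -- one reference-free signal of content overlap."""
    vectors = TfidfVectorizer().fit_transform([document, summary])
    return float(cosine_similarity(vectors[0], vectors[1])[0, 0])

document = ("The central bank raised interest rates by half a point on "
            "Tuesday, citing persistent inflation in housing and energy.")
summary = "The central bank raised rates, citing persistent inflation."
print(reference_free_similarity(document, summary))  # higher = more lexical overlap with the source
```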
2023
- (Liu et al., 2023) ⇒ Yu Lu Liu, Meng Cao, Su Lin Blodgett, Jackie Chi Kit Cheung, Alexandra Olteanu, and Adam Trischler. (2023). “Responsible AI Considerations in Text Summarization Research: A Review of Current Practices.” arXiv preprint arXiv:2311.11103.
- NOTEs:
- It emphasizes the growing need for reflection on Ethical Considerations, adverse impacts, and other Responsible AI (RAI) issues in AI and NLP Tasks, with a specific focus on Text Summarization.
- It explores how bias and Ethical Considerations are addressed, providing context for their own investigation in Text Summarization.
- It discusses the importance and challenges of Text Summarization as a crucial NLP Task and the associated risks, such as producing incorrect, biased, or harmful summaries.
- It examines the types of work prioritized in the community, common Text Summarization Evaluation Practices, and how Ethical Issues and limitations of work are addressed.
- It details the Text Summarization Evaluation Practices, such as ROUGE Metrics, and their limitations, including potential biases and discrepancies with Human Judgment.
- It reviews existing work on RAI in automated text summarization, exploring issues like Fairness, representation of Demographic Groups, and biases in Language Varieties.
- It draws on previous NLP Meta-Analyses.
- It analyzes 333 Summarization Research Papers from the ACL Anthology published between 2020 and 2022.
- It includes an Annotation Scheme that covers aspects related to paper goals, authors, Text Summarization Evaluation Practices, Stakeholders, limitations, and Ethical Considerations, providing a structured framework for analysis.
- It reveals key findings about the community's focus on developing new systems, discrepancies in Text Summarization Evaluation Practices, and a lack of engagement with Ethical Considerations and limitations in most papers.