ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Performance Metric

A ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Performance Metric is an intrinsic NLG performance measure that compares system-generated text against a Gold Standard Text.

Context:
- output: ROUGE Score.
- It can (typically) measure the overlap of content between a System-Generated Summary and a set of Reference Summaries.
- ...
Example(s):
- the one proposed in (Lin, 2004).
- ROUGE-N: N-gram-based Co-Occurrence Statistics [1], which measure the overlap of n-word sequences between the system-generated and reference summaries.
- ROUGE-L: Longest Common Subsequence (LCS)-based statistics [2], which focus on the longest sequence of words that appears, in order but not necessarily contiguously, in both the system-generated and reference summaries.
- ROUGE-W: Weighted LCS-based statistics that favor longer runs of consecutive matches within the LCS.
- ROUGE-S: Skip-bigram based co-occurrence statistics [3]. A skip-bigram is any pair of words in their sentence order, allowing arbitrary gaps between them.
  - ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
- …
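
The ROUGE-N variant listed above can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercasing, and a single reference summary; the function names `ngrams` and `rouge_n` are hypothetical and are not part of the ROUGE package, which also supports options such as stemming and multiple references.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (recall, precision, f1) for one candidate/reference pair.

    Assumption: plain whitespace tokenization and lowercasing only.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: an n-gram counts at most as often as it occurs in both texts.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1
```

For the candidate "the cat sat on the mat" and the reference "the cat is on the mat", this sketch yields a ROUGE-1 recall of 5/6 and a ROUGE-2 recall of 3/5.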
Counter-Example(s):
- BLEU Metric,
- NIST Metric,
- METEOR Metric,
- MAUVE Metric (MAUVE score).
See: Automatic Summarization, Machine Translation.

References

2023

(Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric) Retrieved:2023-9-11.
- ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, ^[1] is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.

2023

(Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric)#Metrics Retrieved:2023-10-11.
- The following five evaluation metrics are available.
  - ROUGE-N: Overlap of n-grams between the system and reference summaries (Lin, Chin-Yew and E.H. Hovy. 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003).
    - ROUGE-1: refers to the overlap of unigrams (each word) between the system and reference summaries.
    - ROUGE-2: refers to the overlap of bigrams between the system and reference summaries.
  - ROUGE-L: Longest Common Subsequence (LCS) based statistics (Lin, Chin-Yew and Franz Josef Och. 2004. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21 - 26, 2004). The longest common subsequence problem takes into account sentence-level structure similarity naturally and identifies the longest co-occurring in-sequence n-grams automatically.
  - ROUGE-W: Weighted LCS-based statistics that favors consecutive LCSes.
  - ROUGE-S: Skip-bigram based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
  - ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
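
The LCS-based statistic behind ROUGE-L can be sketched as follows. This is an illustrative sentence-level version under stated assumptions (whitespace tokenization, lowercasing, one reference); `lcs_length`, `rouge_l`, and the `beta` parameter name are hypothetical choices for this sketch, not the ROUGE package's interface.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the match diagonally, or carry over the best result so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference, beta=1.0):
    """Return (recall, precision, f) based on the LCS of two sentences.

    Assumption: beta weights recall against precision; beta=1 gives the
    balanced F-measure.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    recall = lcs / len(ref)
    precision = lcs / len(cand)
    f = (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
    return recall, precision, f
```

With the candidate "police kill the gunman" and the reference "police killed the gunman", the LCS is "police the gunman" (length 3), so recall and precision are both 3/4. The matched words need not be adjacent, which is what distinguishes this from n-gram overlap.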

↑ Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.

2022

(Jain et al., 2022) ⇒ Raghav Jain, Vaibhav Mavi, Anubhav Jangra, and Sriparna Saha. (2022). “WIDAR - Weighted Input Document Augmented ROUGE.” In: European Conference on Information Retrieval, pp. 304-321. Cham: Springer International Publishing.
- ABSTRACT: The task of automatic text summarization has gained a lot of traction due to the recent advancements in machine learning techniques. However, evaluating the quality of a generated summary remains to be an open problem. The literature has widely adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the standard evaluation metric for summarization. However, ROUGE has some long-established limitations; a major one being its dependence on the availability of good quality reference summary. In this work, we propose the metric WIDAR which in addition to utilizing the reference summary uses also the input document in order to evaluate the quality of the generated summary. The proposed metric is versatile, since it is designed to adapt the evaluation score according to the quality of the reference summary. The proposed metric correlates better than ROUGE by 26%, 76%, 82%, and 15%, respectively, in coherence, consistency, fluency, and relevance on human judgement scores provided in the SummEval dataset. The proposed metric is able to obtain comparable results with other state-of-the-art metrics while requiring a relatively short computational time (Implementation for WIDAR can be found at - https://github.com/Raghav10j/WIDAR).

2017

(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric) Retrieved:2017-5-14.
- ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, ^[1] is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.

↑ Slides of talk by Chin-Yew Lin

2017

(Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric)#Metrics Retrieved:2017-5-30.
- The following five evaluation metrics ^[1] are available.
- ROUGE-N: N-gram ^[2] based co-occurrence statistics.
- ROUGE-L: Longest Common Subsequence (LCS) ^[3] based statistics. The longest common subsequence problem takes into account sentence-level structure similarity naturally and identifies the longest co-occurring in-sequence n-grams automatically.
- ROUGE-W: Weighted LCS-based statistics that favors consecutive LCSes.
- ROUGE-S: Skip-bigram ^[4] based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
  - ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
- ROUGE can be downloaded from berouge download link.
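
The skip-bigram statistic described in the list above can be sketched in a few lines. This version assumes whitespace tokenization, lowercasing, one reference sentence, and no limit on the gap between the two words; `skip_bigrams` and `rouge_s` are hypothetical names used only for illustration.

```python
from collections import Counter
from itertools import combinations


def skip_bigrams(tokens):
    """Multiset of all ordered word pairs (i < j) in a token list."""
    # combinations() preserves input order, so each pair keeps sentence order.
    return Counter(combinations(tokens, 2))


def rouge_s(candidate, reference):
    """Return (recall, precision, f1) over skip-bigrams of two sentences.

    Assumption: unlimited skip distance and a single reference.
    """
    cand = skip_bigrams(candidate.lower().split())
    ref = skip_bigrams(reference.lower().split())
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1
```

For the candidate "police kill the gunman" against the reference "police killed the gunman", three of the six skip-bigrams match ("police the", "police gunman", "the gunman"), giving a recall of 0.5.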

2004a

(Lin, 2004) ⇒ Chin-Yew Lin. (2004). "ROUGE: A Package for Automatic Evaluation of Summaries". In: Text summarization branches out: Proceedings of the ACL-04 workshop.

2004b

(Lin, 2004) ⇒ Chin-Yew Lin. (2004). "Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?". In: Proceedings of the Fourth NTCIR Workshop on Research in Information Access Technologies Information Retrieval, Question Answering and Summarization (NTCIR 2004).

[1] Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.

[2] Slides of talk by Chin-Yew Lin

[3] Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.

[4] Lin, Chin-Yew and E.H. Hovy 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003.

[5] Lin, Chin-Yew and Franz Josef Och. 2004a. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21 - 26, 2004.

[6] Lin, Chin-Yew and Franz Josef Och. 2004a. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21 - 26, 2004.