ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Performance Metric
A ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Performance Metric is an intrinsic recall-based NLG performance measure that evaluates system-generated summarys against reference summarys.
- AKA: ROUGE Metric, ROUGE Evaluation Metric, Recall-Oriented Understudy for Gisting Evaluation.
- Context:
- Metric Input: System-Generated Summary, Reference Summary Set, ROUGE Configuration Parameters
- Metric Output: ROUGE Score, ROUGE Precision Value, ROUGE Recall Value, ROUGE F-Measure
- Metric Performance Measure: ROUGE Correlation Coefficients with human judgment scores
- ...
- It can typically compute ROUGE N-gram Overlap through ROUGE n-gram matching (see the ROUGE-N sketch after this Context block).
- It can typically calculate ROUGE Recall Scores through ROUGE reference comparison.
- It can typically measure ROUGE Precision Scores through ROUGE system output analysis.
- It can typically generate ROUGE F-Measures through ROUGE harmonic mean calculation.
- ...
- It can often evaluate ROUGE Multi-Document Summarys through ROUGE jackknifing procedures.
- It can often support ROUGE Language-Independent Evaluation through ROUGE character-based matching.
- It can often enable ROUGE Statistical Significance Testing through ROUGE bootstrap resampling (a bootstrap sketch also follows this Context block).
- ...
- It can range from being a Simple ROUGE Metric to being a Complex ROUGE Metric, depending on its ROUGE computational complexity.
- It can range from being a Strict ROUGE Metric to being a Flexible ROUGE Metric, depending on its ROUGE matching criteria.
- It can range from being a Word-Level ROUGE Metric to being a Character-Level ROUGE Metric, depending on its ROUGE granularity level.
- ...
- It can be implemented by a ROUGE Evaluation System using ROUGE scoring algorithms.
- It can be configured through ROUGE Parameter Settings for ROUGE task-specific optimization.
- It can be validated against ROUGE Human Correlation Studys using ROUGE benchmark datasets.
- ...
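The n-gram overlap, recall, precision, and F-measure calculations described in this Context block can be illustrated with a minimal Python sketch. This is not the official ROUGE-1.5.5 package: the whitespace tokenization, lowercasing, clipped-count matching, and the choice to report the best score over the reference set are simplifying assumptions.

```python
# Minimal ROUGE-N sketch: n-gram overlap, recall, precision, and F-measure
# for one system summary against a set of reference summaries.
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(system, references, n=1):
    """ROUGE-N recall, precision, and F1; reports the best score over the references."""
    sys_counts = ngrams(system.lower().split(), n)
    best = {"recall": 0.0, "precision": 0.0, "f1": 0.0}
    for reference in references:
        ref_counts = ngrams(reference.lower().split(), n)
        # Clipped overlap: an n-gram matches at most as often as it occurs
        # in the reference summary.
        overlap = sum((sys_counts & ref_counts).values())
        recall = overlap / max(sum(ref_counts.values()), 1)
        precision = overlap / max(sum(sys_counts.values()), 1)
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        if f1 > best["f1"]:
            best = {"recall": recall, "precision": precision, "f1": f1}
    return best

if __name__ == "__main__":
    system_summary = "the cat sat on the mat"
    reference_summaries = ["the cat was sitting on the mat", "a cat sat on a mat"]
    print("ROUGE-1:", rouge_n(system_summary, reference_summaries, n=1))
    print("ROUGE-2:", rouge_n(system_summary, reference_summaries, n=2))
```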
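The bootstrap-resampling route to ROUGE Statistical Significance Testing can likewise be sketched as a percentile confidence interval over per-document scores. The example scores, the 1,000-sample setting, and the 95% interval below are illustrative assumptions, not the procedure of any particular ROUGE release.

```python
# Sketch of bootstrap resampling over per-document ROUGE scores to attach a
# confidence interval to the system-level mean.
import random

def bootstrap_ci(scores, num_samples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean ROUGE score."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        # Resample documents with replacement and record the resampled mean.
        resample = [rng.choice(scores) for _ in scores]
        means.append(sum(resample) / len(resample))
    means.sort()
    lower = means[int((alpha / 2) * num_samples)]
    upper = means[int((1 - alpha / 2) * num_samples) - 1]
    return sum(scores) / len(scores), (lower, upper)

if __name__ == "__main__":
    per_doc_rouge1 = [0.42, 0.38, 0.51, 0.47, 0.33, 0.40, 0.45, 0.36]  # hypothetical scores
    mean, (low, high) = bootstrap_ci(per_doc_rouge1)
    print(f"ROUGE-1 mean={mean:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```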
- Example(s):
- ROUGE N-gram-Based Variants, such as:
- ROUGE-1, which measures ROUGE unigram overlap between ROUGE system summarys and ROUGE reference summarys.
- ROUGE-2, which evaluates ROUGE bigram overlap for ROUGE phrase-level matching.
- ROUGE-N, which computes ROUGE n-gram co-occurrence statistics for ROUGE flexible-length matching.
- ROUGE Sequence-Based Variants, such as:
- ROUGE-L, which uses ROUGE longest common subsequence for ROUGE sentence-level structure similarity (see the LCS sketch after these examples).
- ROUGE-W, which applies ROUGE weighted LCS favoring ROUGE consecutive matches.
- ROUGE Skip-Based Variants, such as:
- ROUGE-S, which calculates ROUGE skip-bigram statistics for ROUGE word-pair ordering (see the skip-bigram sketch after these examples).
- ROUGE-SU, which combines ROUGE skip-bigrams with ROUGE unigram statistics.
- the original ROUGE Metric package proposed in (Lin, 2004).
- ...
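The ROUGE-L behavior listed under the sequence-based variants can be sketched with a longest common subsequence (LCS) computation. The beta-weighted F-measure form follows (Lin, 2004) in spirit, but the tokenization and the sentence-level-only scoring are simplifying assumptions rather than the reference implementation.

```python
# Minimal ROUGE-L sketch: LCS-based recall, precision, and F-measure.

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, 1):
        for j, tok_b in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if tok_a == tok_b else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(system, reference, beta=1.0):
    """Sentence-level ROUGE-L recall, precision, and beta-weighted F-measure."""
    sys_toks = system.lower().split()
    ref_toks = reference.lower().split()
    lcs = lcs_length(sys_toks, ref_toks)
    recall = lcs / max(len(ref_toks), 1)
    precision = lcs / max(len(sys_toks), 1)
    denom = recall + (beta ** 2) * precision
    f_measure = (1 + beta ** 2) * recall * precision / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "f": f_measure}

if __name__ == "__main__":
    print(rouge_l("the cat sat on the mat", "the cat was sitting on the mat"))
```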
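The skip-based variants can be sketched by counting in-order word pairs. The maximum-skip-distance handling (gap of at most `max_skip` intervening words) and the way unigram counts are folded in for ROUGE-SU are simplifying assumptions about how those statistics are combined.

```python
# Minimal ROUGE-S / ROUGE-SU sketch: skip-bigram co-occurrence statistics.
from collections import Counter
from itertools import combinations

def skip_bigrams(tokens, max_skip=None):
    """All in-order word pairs; max_skip limits the gap between them (None = unlimited)."""
    pairs = Counter()
    for i, j in combinations(range(len(tokens)), 2):
        if max_skip is None or (j - i - 1) <= max_skip:
            pairs[(tokens[i], tokens[j])] += 1
    return pairs

def rouge_s(system, reference, max_skip=4, with_unigrams=False):
    """ROUGE-S (or ROUGE-SU when with_unigrams=True) recall, precision, and F1."""
    sys_toks, ref_toks = system.lower().split(), reference.lower().split()
    sys_units = skip_bigrams(sys_toks, max_skip)
    ref_units = skip_bigrams(ref_toks, max_skip)
    if with_unigrams:
        # ROUGE-SU adds unigram counts to the skip-bigram counts.
        sys_units += Counter((t,) for t in sys_toks)
        ref_units += Counter((t,) for t in ref_toks)
    overlap = sum((sys_units & ref_units).values())
    recall = overlap / max(sum(ref_units.values()), 1)
    precision = overlap / max(sum(sys_units.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

if __name__ == "__main__":
    print(rouge_s("the cat sat on the mat", "the cat was sitting on the mat"))
    print(rouge_s("the cat sat on the mat", "the cat was sitting on the mat", with_unigrams=True))
```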
- Counter-Example(s):
- BLEU Metric, which emphasizes precision-based evaluation rather than ROUGE recall-based evaluation.
- METEOR Metric, which incorporates synonym matching unlike ROUGE exact matching.
- BERTScore, which uses contextual embeddings rather than ROUGE surface-level matching.
- MAUVE Score, which measures distribution similarity rather than ROUGE n-gram overlap.
- See: Automatic Summarization, Text Similarity Metric, NLG Evaluation, Recall Metric, Precision Metric, F-Measure.
References
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric) Retrieved:2023-9-11.
- ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, [1] is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
2023
- (Wikipedia, 2023) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric)#Metrics Retrieved:2023-10-11.
- The following five evaluation metrics are available.
- ROUGE-N: Overlap of n-grams between the system and reference summaries. (Lin, Chin-Yew and E.H. Hovy. 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of the 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003.)
- ROUGE-L: Longest Common Subsequence (LCS) based statistics. Longest common subsequence problem takes into account sentence-level structure similarity naturally and identifies longest co-occurring in-sequence n-grams automatically. (Lin, Chin-Yew and Franz Josef Och. 2004. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21 - 26, 2004.)
- ROUGE-W: Weighted LCS-based statistics that favors consecutive LCSes.
- ROUGE-S: Skip-bigram based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
- ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
2022
- (Jain et al., 2022) ⇒ Raghav Jain, Vaibhav Mavi, Anubhav Jangra, and Sriparna Saha. (2022). “WIDAR - Weighted Input Document Augmented ROUGE.” In: Proceedings of the European Conference on Information Retrieval (ECIR 2022), pp. 304-321. Cham: Springer International Publishing.
- ABSTRACT: The task of automatic text summarization has gained a lot of traction due to the recent advancements in machine learning techniques. However, evaluating the quality of a generated summary remains to be an open problem. The literature has widely adopted Recall-Oriented Understudy for Gisting Evaluation (ROUGE) as the standard evaluation metric for summarization. However, ROUGE has some long-established limitations; a major one being its dependence on the availability of good quality reference summary. In this work, we propose the metric WIDAR which in addition to utilizing the reference summary uses also the input document in order to evaluate the quality of the generated summary. The proposed metric is versatile, since it is designed to adapt the evaluation score according to the quality of the reference summary. The proposed metric correlates better than ROUGE by 26%, 76%, 82%, and 15%, respectively, in coherence, consistency, fluency, and relevance on human judgement scores provided in the SummEval dataset. The proposed metric is able to obtain comparable results with other state-of-the-art metrics while requiring a relatively short computational time (Implementation for WIDAR can be found at - https://github.com/Raghav10j/WIDAR).
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric) Retrieved:2017-5-14.
- ROUGE, or Recall-Oriented Understudy for Gisting Evaluation, [1] is a set of metrics and a software package used for evaluating automatic summarization and machine translation software in natural language processing. The metrics compare an automatically produced summary or translation against a reference or a set of references (human-produced) summary or translation.
2017
- (Wikipedia, 2017) ⇒ https://en.wikipedia.org/wiki/ROUGE_(metric)#Metrics Retrieved:2017-5-30.
- The following five evaluation metrics [1] are available.
- ROUGE-N: N-gram [2] based co-occurrence statistics.
- ROUGE-L: Longest Common Subsequence (LCS) [3] based statistics. Longest common subsequence problem takes into account sentence level structure similarity naturally and identifies longest co-occurring in sequence n-grams automatically.
- ROUGE-W: Weighted LCS-based statistics that favors consecutive LCSes.
- ROUGE-S: Skip-bigram [4] based co-occurrence statistics. Skip-bigram is any pair of words in their sentence order.
- ROUGE-SU: Skip-bigram plus unigram-based co-occurrence statistics.
- ROUGE can be downloaded from the berouge site.
- ↑ Lin, Chin-Yew. 2004. ROUGE: a Package for Automatic Evaluation of Summaries. In: Proceedings of the Workshop on Text Summarization Branches Out (WAS 2004), Barcelona, Spain, July 25 - 26, 2004.
- ↑ Lin, Chin-Yew and E.H. Hovy 2003. Automatic Evaluation of Summaries Using N-gram Co-occurrence Statistics. In: Proceedings of 2003 Language Technology Conference (HLT-NAACL 2003), Edmonton, Canada, May 27 - June 1, 2003.
- ↑ Lin, Chin-Yew and Franz Josef Och. 2004a. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21 - 26, 2004.
- ↑ Lin, Chin-Yew and Franz Josef Och. 2004a. Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics. In: Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), Barcelona, Spain, July 21 - 26, 2004.
2004a
- (Lin, 2004) ⇒ Chin-Yew Lin. (2004). “ROUGE: A Package for Automatic Evaluation of Summaries.” In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop.
2004b
- (Lin, 2004) ⇒ Chin-Yew Lin. (2004). “Looking for a Few Good Metrics: Automatic Summarization Evaluation - How Many Samples Are Enough?” In: Proceedings of the Fourth NTCIR Workshop on Research in Information Access Technologies: Information Retrieval, Question Answering and Summarization (NTCIR 2004).