2022 News Summarization and Evaluation in the Era of GPT-3

From GM-RKB

Subject Headings:

Notes

  • It conducts a systematic study comparing GPT-3, a large prompt-based language model, to state-of-the-art fine-tuned models like BRIO on news summarization. It uses both automatic metrics and human evaluations to compare summary quality.
  • It finds through a robust human study that people strongly prefer GPT-3 summaries for news articles over those from fine-tuned models like BRIO, even though GPT-3 scores much lower on standard automatic metrics like ROUGE. For example, GPT-3 was voted the best summary for 58% of CNN articles, versus 36% for BRIO.
  • It shows convincingly that current automatic evaluation metrics, including both reference-based metrics like ROUGE and reference-free metrics like factuality classifiers, fail to reliably assess the quality of GPT-3 summaries. GPT-3 scored 7 ROUGE points lower but was still overwhelmingly preferred by humans.
  • It demonstrates that GPT-3 shows promising results on specialized summarization tasks like keyword-based summarization through prompting, outperforming fine-tuned models like CTRLSum. However, it struggles with aspect-based summarization using simple prompts.
  • It argues compellingly that large prompted models like GPT-3 represent a fundamental paradigm shift in summarization, changing the data and methods needed. For example, commonly used datasets like CNN/DM may not produce the style of summaries users prefer.
  • It suggests more focus should be on developing summarization systems for real-world use cases rather than incremental improvements on benchmark datasets. Prompting provides more flexibility here.
  • It releases a valuable dataset of model summaries from different systems and human preference judgments to facilitate further research in this direction.

Cited By

Quotes

Abstract

The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant fine-tuning approaches compare to prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and prompt-based models across 4 standard summarization benchmarks, (b) 1K human preference judgments comparing different systems for generic- and keyword-based summarization. [1]

Introduction

Fine-tuning pre-trained models on domain-specific datasets has been the leading paradigm in text summarization research in recent years (Lewis et al., 2020; Zhang et al., 2020; Raffel et al., 2020). These models generate high-quality summaries on standard benchmarks, but still require sizeable training datasets to adapt to new settings, e.g., summarizing data from a new source domain or producing a summary in a different style. The success of prompting large language models (GPT-3 (Brown et al., 2020), T0 (Sanh et al., 2022), PaLM (Chowdhery et al., 2022), etc.) provides an alternative approach, namely learning from natural language task instructions and/or a few demonstrative examples in the context without updating model parameters. While recent work (Zhao et al., 2021; Min et al., 2022; Ye and Durrett, 2022) has evaluated this paradigm across a number of tasks, it has only been studied for text summarization with unreliable automatic metrics (He et al., 2022b; Chowdhery et al., 2022; Ouyang et al., 2022) or in non-standard settings (Saunders et al., 2022).

In this paper, we conduct the first systematic study of the impact of prompt-based models on the text summarization research space, using an Instruct-tuned 175B GPT-3 model (text-davinci-002) (Brown et al., 2020; Ouyang et al., 2022) as a case study. Figure 1 shows that GPT-3 summaries are extremely high-quality and adaptable to different summarization settings. Starting from these observations, we aim to answer three main questions. First, how do prompt-based GPT-3 summaries compare to those obtained from state-of-the-art fine-tuned summarization models (Zhang et al., 2020; Liu et al., 2022)? We compare these approaches using A/B testing on a new corpus of recent news articles, and find that our study participants overwhelmingly prefer GPT-3 summaries across two different "styles" with different prompts (three-sentence and single-sentence). Moreover, these summaries do not suffer from limitations due to low-quality training data that plague fine-tuned generic summarization models (Maynez et al., 2020; Goyal et al., 2022).

CNN article: https://www.cnn.com/2022/09/09/politics/judge-throws-out-trumps-rico-lawsuit-against-hillary-clinton-and-democrats/index.html

Length-constrained prompt

Summarize the above article in 2 sentences.

GPT-3: Donald Trump's lawsuit against Hillary Clinton, the Democratic National Committee, and several other individuals and entities has been dismissed by a federal judge. The judge ruled that Trump's claims are unsupported by any legal authority and that many of his specific factual assertions are implausible or unsupported.

Keyword-constrained prompt

Summarize the above article briefly focusing on Alina Habba.

GPT-3: Alina Habba is an attorney for Donald Trump. She stated that Trump will "immediately move to appeal this decision." Habba also said that the judge's opinion is "rife with erroneous applications of the law" and that it "disregards the numerous independent governmental investigations which substantiate our claim that the defendants conspired to falsely implicate our client and undermine the 2016 Presidential election."

Figure 1: Examples of GPT-3 summaries. We can generate summaries following style constraints or queries included in the prompts, allowing us to emulate a range of existing fine-tuned systems.

Second, are existing automatic metrics well-suited to evaluating prompt-based summaries? Recent work has shown that classic reference-based metrics such as ROUGE (Lin, 2004) and BERTScore (Zhang* et al., 2020) are unreliable when small improvements are reported (Peyrard, 2019; Fabbri et al., 2021); however, large differences, on the order of say 5 ROUGE points or greater, are considered to be correlated with human preferences (Bhandari et al., 2020; Deutsch et al., 2022). We find that the same is no longer true when evaluating GPT-3 summaries. These summaries score much lower on automatic metrics (7 ROUGE-L points on average) than all prior state-of-the-art models while comfortably outperforming them on human evaluation. Furthermore, we show that recent reference-free metrics, e.g. QA-based metrics (Fabbri et al., 2022; Durmus et al., 2020) and trained factuality models (Kryscinski et al., 2020; Goyal and Durrett, 2020), similarly fail to adapt to this shift from fine-tuning to prompting, and need to be revisited.

Finally, how can prompting be used beyond generic summarization? We focus on keyword-based and aspect-based summarization. For keyword-based summarization, we find that GPT-3 consistently generates more coherent and keyword-relevant summaries compared to current fine-tuned alternatives: crowd annotators prefer GPT-3 summaries over a baseline model (He et al., 2022a) 70% of the time. We observe mixed results for the aspect-based setting, where GPT-3 summaries show frequent failure cases with simple prompts.

Taken together, this evidence suggests that GPT-3 represents a fundamental paradigm shift in summarization, changing what data we need (or don't need) and what approaches we can now explore. Evaluating these systems will require a new framework distinct from the automatic metrics that have dominated the last decade of summarization research.

| Dataset | Avg. Words (Article) | Avg. Words (Summ.) | % novel n-grams (n=1) | % novel n-grams (n=2) |
|---|---|---|---|---|
| CNN | 760.5 | 45.7 | 16.7 | 54.3 |
| DailyMail | 653.3 | 54.6 | 17.0 | 53.8 |
| XSum (BBC) | 431.1 | 23.2 | 35.7 | 82.4 |
| Newsroom | 658.6 | 26.7 | 18.9 | 47.5 |

Table 1: Basic statistics of standard summarization datasets: CNN/DM (Hermann et al., 2015; Nallapati et al., 2016), XSum (Narayan et al., 2018), Newsroom (Grusky et al., 2018). These show large variance in their summary properties and fundamentally differ in their definition of the "gold" standard.

2 Models and Setup

2.1 Current Paradigms for Summarization

[Figure 2 sketch: (i) models fine-tuned on summarization datasets, i.e. task-specific models trained for each dataset (BART, T5, PEGASUS, CTRLSum, BRIO); (ii) prompted models instruction-tuned on multiple tasks, either with summarization datasets used during training (T0) or not trained on standard summarization datasets (FLAN, Instruct-GPT text-davinci-002); (iii) zero-shot prompted pre-trained LMs (GPT-3, PaLM, Turing-NLG), which are not available or are less effective than their instruction-tuned counterparts.]

Recent zero- and few-shot prompting-based models (Brown et al., 2020; Sanh et al., 2022) have shown impressive generalization capabilities on unseen tasks specified using prompts alone, without performing any gradient updates (Mishra et al., 2022). In this work, we want to compare their text summarization performance against the current state-of-the-art models.

Figure 2: Broad categorization of available summarization systems; those compared in this work are highlighted in red.

Figure 2 shows the broad categories of all available summarization approaches, including current SOTA models and prompting-based models. The former set consists of fine-tuned language models, trained on a large number of article-summary pairs (e.g. BART (Lewis et al., 2020), PEGASUS (Zhang et al., 2020), BRIO (Liu et al., 2022)) to obtain dataset-specific systems. This category also includes models aimed at tasks beyond generic summarization, such as keyword- or query-based summarization, that still rely on standard datasets for training (He et al., 2022a).

At the other extreme are zero- or few-shot models (e.g. GPT-3 (Brown et al., 2020), PaLM (Chowdhery et al., 2022)) that are not explicitly trained for any particular task, as discussed above.

Recent work (Ouyang et al., 2022; Wei et al., 2022; Sanh et al., 2022) has improved on these models by introducing instruction-tuned models. Here, pre-trained language models are fine-tuned on multiple tasks (which may include summarization) using instruction templates in order to align their training with inference-time usage.

In this work, we compare the summarization performance of three models that are representative of this space of options:

OpenAI's text-davinci-002, a GPT-3 model (Brown et al., 2020) from the Instruct series (Ouyang et al., 2022). While we do not know the exact training details for this release of the model, the previous one in the series (text-davinci-001) was fine-tuned on a combination of prompts submitted to their API and labeler prompts spanning multiple tasks. These tasks include summarization but not (to our knowledge) standard summarization datasets like CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) or XSum (Narayan et al., 2018).

We choose the text-davinci-002 version for our experiments in order to benchmark the best available prompt-based model.2 We refer to this approach as GPT3-D2.

BRIO (Liu et al., 2022), a fine-tuned summarization model that reports state-of-the-art results on both CNN/DM and XSum. We use versions of this model fine-tuned on each of these two datasets.

T0 (Sanh et al., 2022), a prompt-based model fine-tuned on multiple tasks including standard summarization datasets. This provides a useful point of comparison between task-specific fine-tuned (BRIO) and bigger instruction-tuned models (GPT3-D2).

2.2 Using GPT3-D2 for summarization

Fine-tuned models largely follow the "style" of reference summaries in their training data, and hence, generated summaries show large variance between datasets (see Table 1 for basic summary statistics of standard summarization datasets). To ensure fair comparison between these and GPT3-D2, we adapt the latter's prompt to align with dataset-specific styles.

2We did not observe obvious quality differences in generated summaries between text-davinci-001 and text-davinci-002. Examples are included in Appendix C.

Article: https://www.cnn.com/2022/03/01/africa/africa-condemns-racism-ukraine-intl/index.html

Prompt: Summarize the article in N sentences.

N = 1: The three African nations on the UN Security Council condemned reports of discrimination against African citizens at the Ukrainian border during a meeting at the UN HQ in New York City Monday.

N = 2: The United Nations Security Council condemned the reports of discrimination against African citizens at the Ukrainian border. The African Union has said it is "disturbed" by the reports of segregation against Africans in Ukraine, which it described as "shockingly racist."

N = 3: The article discusses the reports of discrimination against African citizens at the Ukrainian border. The representatives from the three African nations on the UN Security Council condemned the reports and called for the mistreatment of African peoples on Europe's borders to cease immediately. Foreign students attempting to flee Ukraine after Russia invaded the country told CNN that they experienced racial discrimination at the Ukrainian border.

Figure 3: Illustration of length control using the task description / prompt for GPT3-D2. We found that the generated summaries followed the given sentence length constraint 98% of the time, allowing us to generate different-length summaries emulating different datasets.

Specifically, we follow prior work (Sanh et al., 2022) and use sentence-count length prompts to adapt to each dataset. Although these datasets also differ along other attributes, e.g. CNN/DM is lead-biased whereas XSum requires drawing inferences from a whole article, we do not attempt to control any other attribute of the summary. Figure 3 shows an example of different-length GPT3-D2 summaries for the same news article, using the following prompt format:

Article: {article text}
Summarize the above article in N sentences.

We found that GPT3-D2 summaries faithfully follow the given length constraint in 98% of the test instances used in our human study data in Section 3.
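Concretely, under this setup, generating a dataset-style summary reduces to filling the article text and a sentence count into the prompt template above and querying the model. The sketch below illustrates this with the legacy OpenAI completions client; the decoding parameters (temperature, max_tokens) are our assumptions, not settings reported in the paper.

```python
# Minimal sketch of the sentence-count prompting setup described above, using the
# legacy OpenAI completions client. Decoding parameters are assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

def gpt3_summarize(article: str, n_sentences: int) -> str:
    """Prompt text-davinci-002 with the paper's prompt template."""
    prompt = f"Article: {article}\nSummarize the above article in {n_sentences} sentences."
    response = openai.Completion.create(
        model="text-davinci-002",
        prompt=prompt,
        max_tokens=256,      # assumed budget for a short summary
        temperature=0.0,     # deterministic decoding (assumption)
    )
    return response["choices"][0]["text"].strip()

# Example: emulate CNN/DM-style (3-sentence) vs. XSum-style (1-sentence) summaries.
# summary_cnn_style = gpt3_summarize(article_text, n_sentences=3)
# summary_xsum_style = gpt3_summarize(article_text, n_sentences=1)
```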

Given this setup, we first compare the summary quality of the three summarization models through a human annotation study (Section 3). Then, we evaluate the current suite of summarization metrics for prompt-based summarization (Section 4). Finally, in Section 5, we briefly discuss GPT3-D2 performance on summarization tasks beyond generic summarization and new challenges.

3 Human evaluation of GPT3-D2 summaries

Generated summaries of fine-tuned models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) emulate gold-standard summaries in their training datasets. In contrast, prompt-based GPT3-D2 models generate summaries based on how the given task description surfaces behavior learned during pre-training or instruction-tuning.

BBC Article: The full opening of a landmark shared education campus in Omagh is facing another delay. Education Minister Michelle McIlveen has now said the Strule Shared Education Campus is planned to open in September 2026. The minister clarified the new date in response to an assembly question from the SDLP MLA Daniel McCrossan. The campus had originally been due to open in 2020, but has been delayed a number of times. […]

BRIO: The full opening of the Strule shared education campus in Omagh, County Tyrone, has been delayed to September 2026.

T0: Strule, Northern Ireland's biggest ever school-building project, is now scheduled to open in 2026, the Education Minister has said.

GPT3-D2: The Strule Shared Education Campus is facing another delay, and is now planned to open in September 2026.

CNN Article: (CNN) Mortgage rates fell slightly this week, marking the third consecutive week of declines. But with rates hovering above 5% and home prices well above where they were this time last year, prospective buyers are finding it increasingly difficult to afford a home. The 30-year, fixed-rate mortgage averaged 5.09% in the week ending June 2, down from 5.10% the week before, according to Freddie Mac. It is still well above the 2.99% average from this time last year. […] Rising prices had already been pushing many prospective buyers to the sidelines. […] The Fed has been seeking to tame inflation by raising interest rates over the past couple of months. And the central bank has signaled there are more rate hikes to come. […]

BRIO: The 30-year, fixed-rate mortgage averaged 5.09% in the week ending June 2. It's the third consecutive week of declines. But rates are still above 5% and home prices are well above where they were this time last year. Rising prices have pushed some prospective buyers to the sidelines.

T0: 30-year, fixed-rate mortgage averaged 5.09% in the week ending June 2. It is still well above the 2.99% average from this time last year. Rising prices had already been pushing many prospective buyers to the sidelines.

GPT3-D2: Mortgage rates have fallen slightly in the past few weeks, but they are still well above last year's levels. This is making it difficult for many prospective buyers to afford a home. The Fed has been raising interest rates in an effort to control inflation, and more rate hikes are expected.

Figure 4: Examples of CNN-style and BBC/XSum-style summaries for the three systems. For CNN, we observe that models fine-tuned on the CNN/DM training set reflect its dataset biases; summaries are highly extractive, specific and lead-biased. On the other hand, GPT3-D2 summaries contain fewer specific details but cover more content.

In this section, we ask: how do these paradigms compare? Does learning from gold summaries lead to a better summarization model? To answer this, we conduct a human study to compare outputs of our three representative models and collect human preferences of quality.

3.1 Experimental Setup

Datasets for fine-tuning We choose two standard fine-tuning datasets whose summaries differ along multiple dimensions such as length and abstractiveness:

CNN/DM (Hermann et al., 2015; Nallapati et al., 2016) contains reference summaries that are approximately 3-4 sentences long. Summaries in this dataset are highly extractive and lead-biased.

XSum (Narayan et al., 2018) contains 1-sentence summaries of BBC news articles. In this dataset, reference summaries, and consequently generated summaries from fine-tuned models, are highly abstractive.

Datasets for evaluation Because GPT3-D2's pre-training and instruction-tuning datasets are unknown, it may have been trained on existing articles and summaries in the test splits of these standard benchmarks. We therefore run our human study on 100 recent articles from CNN3 and BBC, collected between March 1, 2022 and June 31, 2022. We call these CNN-2022 and BBC-2022 respectively.

Model details We use the publicly released BRIO-XSum and BRIO-CNN/DM models to generate summaries.4 For T0, we use a prompt we selected from its prompt repository for the CNN/DM and XSum datasets.5 Finally, to generate GPT3-D2 summaries, we set N = 3 for CNN and N = 1 for BBC in our standard sentence-count prompt template from Section 2.
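For reference, the fine-tuned baselines follow the usual encoder-decoder generation recipe. The sketch below shows this pattern with the Hugging Face transformers API; the checkpoint identifier is a placeholder (the actual BRIO releases are linked in footnote 4), and the decoding settings are assumptions apart from the beam size of 10 mentioned below for BRIO-XSum.

```python
# Sketch of generating summaries from a fine-tuned seq2seq checkpoint.
# The checkpoint name is a placeholder, not a specific released model.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "path-or-hub-id-of-finetuned-summarizer"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def summarize(article: str, num_beams: int = 10, max_length: int = 128) -> str:
    # Truncate long articles to the encoder's input budget (assumption: 1024 tokens).
    inputs = tokenizer(article, truncation=True, max_length=1024, return_tensors="pt")
    ids = model.generate(**inputs, num_beams=num_beams, max_length=max_length)
    return tokenizer.decode(ids[0], skip_special_tokens=True)
```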

For a maximally fair comparison in this "realistic" setting, we take some additional steps to improve the output of BRIO-XSum. In order to automate dataset creation, XSum removes the first sentence from news articles to use as the gold summary for training, then treats the rest of the sentences as the article to summarize. This setup differs from the real-world usage of summarization systems, where the complete article is summarized. Due to this mismatch, BRIO-XSum often generates very low-quality outputs,

3Although BRIO's CNN/DM model also includes DailyMail data in its training, we do not use this news source in our study as it is now widely considered to be unreliable. E.g. according to the Media Bias / Fact Check site, DM's factual reporting is rated 'low': https://mediabiasfactcheck.com/daily-mail/.

4Models at: https://github.com/yixinL7/BRIO

5Repository with T0 prompts: https://github.com/bigscience-workshop/promptsource

e.g. "All images: Strule Shared Education Campus" in Figure 4, for around 30% of the articles. We manually identify these examples and first attempt to fix them by selecting a summary without such obvious failures from further down the beam (we use beam size = 10). However, if we cannot find a "better" summary, we remove the first sentence of the article and re-sample a new summary to align with its noisy training. This latter strategy often results in factually incorrect summary generations, as is well documented in prior research (Maynez et al., 2020; Goyal and Durrett, 2021).

Design of the human study We design an A/B test to collect preference annotations. For each given article, annotators are shown summaries from all three summarization systems (BRIO, T0 and GPT3-D2). They are then asked to select their most and least preferred summary or summaries. In addition to these multiple-choice questions, we also ask for a free-text justification of both choices.

We make two design decisions for our human study: first, we do not provide annotators with specific definitions of summary quality to avoid introducing our own biases. It is also quite challenging to produce a unified definition of quality for the very different "styles" of summaries evaluated in this study. Instead, we ask them to rely on their own preferences based on summaries they would like to see if they were browsing the web, which we believe to be a representative scenario for non-expert consumers of news summaries. Detailed task instructions are included in Appendix F.

Second, we allow multiple selections for both the best and worst summary questions to cater to scenarios in which different summarization systems output similar-quality summaries without meaningful differences.

We hire crowd annotators through Prolific. For both CNN and BBC, we recruit 60 unique participants to annotate the 100 summaries in each dataset. Each annotator was asked to annotate 5 articles and each article was annotated by 3 annotators. Additionally, we use Prolific's demographic filters to restrict participation to USA (or UK) residents for CNN (or BBC). We anticipate that residents from these respective countries are better positioned to understand country-specific news events and evaluate their summaries. Participants were paid approximately $11/hr for their work.

| | Model | # sents | # words/sent | % novel n-grams (n=1) | % novel n-grams (n=2) | # NEs per 100 words |
|---|---|---|---|---|---|---|
| CNN | BRIO | 3.7 | 15.8 | 12.1 | 36.2 | 12.9 |
| | T0 | 2.7 | 14.9 | 16.4 | 45.2 | 12.8 |
| | GPT3-D2 | 2.9 | 23.4 | 16.3 | 40.7 | 10.5 |
| BBC | BRIO | 1.0 | 20.2 | 24.6 | 61.2 | 9.1 |
| | T0 | 1.0 | 20.0 | 26.3 | 66.7 | 9.8 |
| | GPT3-D2 | 1.0 | 27.7 | 16.4 | 42.3 | 8.5 |

Table 2: Statistics for generated summaries evaluated in the human study across all datasets and summarization systems. We observe that GPT3-D2 generated summaries nearly always follow the sentence length constraints in their prompts.

3.2 Results

Differences between summarization systems

Figure 4 shows examples of generated summaries from all three summarization systems for both CNN and BBC articles. For CNN, we observe that fine-tuned BRIO summaries tend to be highly extractive and generally include a high number of named entities (dates, percentages, names), reflecting the data it was trained on. In contrast, GPT3-D2 summaries are more abstractive and less specific, but provide a more exhaustive overview of the article content. Table 2 provides quantitative evidence of this; we use the percentage of novel n-grams to measure abstractiveness, and the number of named entities per 100 words to measure specificity.
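As a rough sketch of how these two statistics can be computed (the paper does not specify its exact tokenization or entity tagger, so the spaCy-based choices below are assumptions):

```python
# Sketch of the two summary statistics in Table 2: % novel n-grams (abstractiveness)
# and named entities per 100 words (specificity). Tokenization details are assumptions.
import spacy

nlp = spacy.load("en_core_web_sm")

def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pct_novel_ngrams(article: str, summary: str, n: int) -> float:
    art = [t.text.lower() for t in nlp(article) if not t.is_space]
    summ = [t.text.lower() for t in nlp(summary) if not t.is_space]
    summ_ngrams = ngrams(summ, n)
    novel = summ_ngrams - ngrams(art, n)
    return 100 * len(novel) / max(len(summ_ngrams), 1)

def entities_per_100_words(summary: str) -> float:
    doc = nlp(summary)
    n_words = sum(1 for t in doc if not t.is_punct and not t.is_space)
    return 100 * len(doc.ents) / max(n_words, 1)
```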

For BBC, we observe inverse trends, where BRIO and T0 are more abstractive compared to GPT3-D2. Again, this can be attributed to the XSum training data used to train both these prior models. For GPT3-D2 summaries, on the other hand, the level of abstractiveness does not differ between datasets. Finally, Table 2 shows that GPT3-D2 summaries tend to have longer sentences, and therefore a similar number of summary sentences often results in a longer summary for both datasets. We study the effect of this length difference on human preference judgments in Appendix B.

Which systems do humans prefer?

Results of our human study are summarized in Table 3. We report the percentage of times a particular system is the most/least preferred model according to majority vote combining all three annotators' choices.6

6As we allow multiple system selections, note that more than one system could be the majority. However, this is rare after majority vote: only 2% of the articles in CNN and 7% in BBC have multiple best summaries.

| Dataset | BRIO Best ↑ | BRIO Worst ↓ | T0 Best ↑ | T0 Worst ↓ | GPT3-D2 Best ↑ | GPT3-D2 Worst ↓ |
|---|---|---|---|---|---|---|
| CNN | 36 | 24 | 8 | 67 | 58 | 9 |
| BBC | 20 | 56 | 30 | 29 | 57 | 15 |

Table 3: Percentage of times a summarization system is selected as the best or worst according to majority vote (may be tied). Human annotators have a clear preference for GPT3-D2 for both CNN and BBC style summaries.

Across both datasets and styles, we observe a clear preference for GPT3-D2 summaries compared to the other two models. In fact, in both scenarios, GPT3-D2 outperforms the next best model by at least 20 percentage points. This improvement is statistically significant according to a paired bootstrap test (CNN p-value = 2 × 10^-3, BBC p-value = 6 × 10^-4).

[Figure 5 plots, for each system and dataset, how many articles received 0, 1, 2, or 3 annotator votes for "best summary" and for "worst summary"; the inter-annotator agreement values for the four panels are 0.05, 0.11, 0.18, and 0.15.]
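The paired bootstrap test mentioned above can be sketched as follows; the per-article win indicators and the one-sided formulation are our assumptions about the setup, not the paper's exact procedure.

```python
# Sketch of a paired bootstrap test over per-article preference outcomes: resample
# articles with replacement and check how often the GPT3-D2 win margin over the
# next-best system disappears. Details may differ from the paper's test.
import random

def paired_bootstrap(gpt3_best: list, other_best: list, n_boot: int = 10_000, seed: int = 0) -> float:
    """gpt3_best[i] / other_best[i] are 1 if that system was voted best on article i, else 0."""
    assert len(gpt3_best) == len(other_best)
    rng = random.Random(seed)
    n, losses = len(gpt3_best), 0
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        margin = sum(gpt3_best[i] - other_best[i] for i in idx)
        if margin <= 0:
            losses += 1
    return losses / n_boot  # one-sided bootstrap p-value
```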

Note that the next best model differs between the two datasets. For BBC, annotators prefer T0 summaries over BRIO. Annotator rationales often mentioned misleading or incorrect information as the primary reason for selecting BRIO as the worst summary, confirming the issues that have been observed with XSum-trained models (Maynez et al., 2020; Pagnoni et al., 2021; Goyal and Durrett, 2021). Although T0 also includes XSum training data, we hypothesize that its multi-task framework helps offset the noisy signal from XSum.

In contrast, annotators rate T0 as the worst summarization system for CNN. The most common rationales for this were shorter length and inclusion of irrelevant details, e.g. long quotes, while missing key points. Some annotators also commented that these T0 summaries were less coherent compared to the other models. Interestingly, we did not observe similar complaints for the single-sentence T0 summaries for BBC.

Do annotators agree with each other? To study this, we plot the distribution of annotator votes for each summarization system and dataset in Figure 5. Additionally, we report the inter-annotator agreement, measured using Krippendorff's alpha with MASI distance (Passonneau, 2006), to account for the multiple selections of best or worst summary allowed in our study design.
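A sketch of this agreement computation using NLTK's implementation of Krippendorff's alpha with MASI distance is shown below; the data layout (one frozenset of selected systems per annotator-article pair) is an assumption about how the judgments are encoded.

```python
# Sketch of Krippendorff's alpha with MASI distance, treating each annotator's
# (possibly multiple) "best summary" selections as a set. Uses NLTK's agreement
# utilities; the exact data handling in the paper may differ.
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics.distance import masi_distance

# (coder, item, label) triples; labels are frozensets of selected systems.
data = [
    ("annotator1", "article_01", frozenset({"GPT3-D2"})),
    ("annotator2", "article_01", frozenset({"GPT3-D2", "BRIO"})),
    ("annotator3", "article_01", frozenset({"T0"})),
    # ... one triple per (annotator, article) pair
]

task = AnnotationTask(data=data, distance=masi_distance)
print("Krippendorff's alpha:", task.alpha())
```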

The vote distribution shows that although more annotators prefer GPT3-D2 summaries, this choice is only unanimous, i.e. supported by all three annotators, for less than 30% of the annotated articles.


Figure 5: Annotator vote distribution for best and worst summaries across all datasets and models. Although GPT3-D2 is the clear winner according to majority vote, this choice is unanimous for less than 30% of the articles. This demonstrates the inherent variance in different annotators' definitions of "best summary", especially when comparing high-quality summaries from strong models.

Conversely, although BRIO (or T0) summaries are less preferred than GPT3-D2 for the CNN (or BBC) dataset on aggregate, they were voted as the best summary by at least one annotator for more than 60% of the articles. This demonstrates two things: first, when comparing summaries from two strong models, the choice is inherently ambiguous (similar observations in Clark et al. (2021)). Second, these results, and the diversity in the written rationales, show that there does not exist a universal definition of a "good" summary and that different summary properties appeal to different annotators. Regardless, the aggregate preference for GPT3-D2 is high enough across the board to give us confidence in its strength.

How do these results impact the field? Progress in text summarization research in the last five years has been enabled by the construction of large-scale text summarization datasets that involved scraping news articles and pairing them with any available summary-like data (Hermann et al., 2015; Narayan et al., 2018; Grusky et al., 2018). The CNN/DM dataset considers bullet points accompanying news articles as its summary. These "gold" standard summaries provided useful training signal to train impressive supervised models (Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) and hence, their quality or alignment with human preferences was largely ignored.

We found that, despite its popularity, XSum is largely unsuitable for fine-tuning models like BRIO for realistic summarization settings.

Overlap-based metrics: ROUGE (1/2/L), METEOR, BLEU; similarity-based metrics: BERTScore, MoverScore; QA-based metric: QAEval (EM, F1).

| Dataset | Model | ROUGE (1/2/L) | METEOR | BLEU | BERTScore | MoverScore | QAEval EM | QAEval F1 |
|---|---|---|---|---|---|---|---|---|
| CNN | PEGASUS | 34.85/14.62/28.23 | .24 | 7.1 | .858 | .229 | .105 | .160 |
| | BRIO | 38.49/17.08/31.44 | .31 | 6.6 | .864 | .261 | .137 | .211 |
| | T0 | 35.06/13.84/28.46 | .25 | 5.9 | .859 | .238 | .099 | .163 |
| | GPT3-D2 | 31.86/11.31/24.71 | .25 | 3.8 | .858 | .216 | .098 | .159 |
| DailyMail | PEGASUS | 45.77/23.00/36.65 | .33 | 12.2 | .865 | .308 | .159 | .229 |
| | BRIO | 49.27/24.76/39.21 | .37 | 11.7 | .871 | .331 | .175 | .259 |
| | T0 | 42.97/19.04/33.95 | .28 | 8.9 | .863 | .290 | .121 | .184 |
| | GPT3-D2 | 38.68/14.24/28.08 | .26 | 6.6 | .859 | .248 | .101 | .159 |
| XSum | PEGASUS | 47.97/24.82/39.63 | .36 | 9.8 | .901 | .362 | .145 | .221 |
| | BRIO | 49.66/25.97/41.04 | .39 | 10.6 | .901 | .372 | .139 | .224 |
| | T0 | 44.20/20.72/35.84 | .34 | 8.0 | .896 | .340 | .125 | .208 |
| | GPT3-D2 | 28.78/7.64/20.60 | .19 | 2.2 | .869 | .197 | .066 | .119 |
| Newsroom | PEGASUS | 39.21/27.73/35.68 | .39 | .14 | .873 | .272 | .182 | .253 |
| | BRIO | - | - | - | - | - | - | - |
| | T0 | 25.64/9.49/21.41 | .20 | .04 | .849 | .145 | .080 | .125 |
| | GPT3-D2 | 27.44/10.67/22.18 | .22 | .05 | .859 | .159 | .089 | .142 |

Table 4: Performance of different summarization systems measured using reference-based automatic metrics. Across all datasets, we observe that automatic metrics report substantially worse results for GPT3-D2 summaries compared to fine-tuned models. This directly contradicts the human preference results from Section 3, demonstrating that these reference-based metrics cannot reliably compare the quality of prompt-based summaries against fine-tuned summaries.

Even though a CNN/DM-trained BRIO model performed better, the results of our human study question the continued utility of hill-climbing on this dataset, as it seems users may simply prefer a different style of summary altogether. In fact, this preference for GPT3-D2 is much larger than incremental improvements reported in other human evaluation settings, e.g. improvements on XSum on the GENIE leaderboard (Khashabi et al., 2022). Furthermore, as we will see in Section 5, the greater flexibility of GPT3-D2 compared to these systems makes it more suitable for news summarization tasks beyond generic summarization.

If a system designer collects a large-scale dataset of high-quality summaries that they wish to emulate, we believe a fine-tuned system may outperform GPT3-D2. However, better-trained models on datasets collected via "incidental" supervision are less likely to help.

4 Can current automatic metrics evaluate GPT3-D2 summaries?

Automatic metrics proposed for summarization evaluation can be broadly divided into two categories: (1) reference-based metrics, which compare generated summaries against available gold summaries, and (2) reference-free metrics, which only rely on the input document. Here, we compare their performance at evaluating prompt-based GPT3-D2 summaries.

Experimental Setup

We evaluate automatic metrics using summaries from 4 different summarization datasets, listed in Table 1. For each dataset, we construct our evaluation sets by randomly sampling 500 [2] articles from the standard test split.8 We compare the same 3 summarization systems from Section 3 in our analysis. Additionally, we also report results using the fine-tuned PEGASUS model (Zhang et al., 2020), as BRIO fine-tuned models are not available for all datasets.

We publicly release this corpus of summarization outputs to standardize the test sets and support future research into GPT3-D2 based summarization. Link: https://tagoyal.github.io/zeroshot-news-annotations.html.

4.1 Reference-based metrics

Here, we study if the gold summaries of the standard datasets are useful for evaluation, especially when evaluating prompt-based summaries that are not trained to emulate them.

8Note that these standard datasets were released before 2020. Therefore, it is possible that some article-summary pairs in our test set overlap with GPT3-D2's training data. However, we do not observe a qualitative difference in GPT3-D2's performance on these older articles.

Overall quality metrics: SUPERT, BLANC; QA-based factuality: QuestEval, QAFactEval; NLI-based factuality: FactCC, DAE, SummaC.

| Dataset | Model | SUPERT | BLANC | QuestEval | QAFactEval | FactCC | DAE | SummaC |
|---|---|---|---|---|---|---|---|---|
| CNN | PEGASUS | .5466 | .0605 | .7373 | 4.4071 | .3743 | .8223 | .1138 |
| | BRIO | .5586 | .0802 | .7334 | 3.8332 | .1817 | .7577 | -.0532 |
| | T0 | .5330 | .0558 | .7799 | 3.7517 | .2012 | .7556 | -.0605 |
| | GPT3-D2 | .5560 | .0749 | .7249 | 3.6399 | .2428 | .6671 | -.0729 |
| DailyMail | PEGASUS | .6433 | .1137 | .7536 | 4.4677 | .5152 | .8497 | .2402 |
| | BRIO | .6360 | .1217 | .7415 | 4.1362 | .3699 | .8118 | .0153 |
| | T0 | .5995 | .0889 | .7803 | 3.9827 | .2431 | .8043 | .0478 |
| | GPT3-D2 | .6118 | .0983 | .7461 | 3.8279 | .2697 | .6990 | .0365 |
| XSum | PEGASUS | .4439 | .0249 | .8233 | 2.0089 | .2465 | .3598 | -.2993 |
| | BRIO | .4459 | .0230 | .8305 | 1.8626 | .2031 | .3040 | -.3292 |
| | T0 | .4538 | .0238 | .7957 | 2.0330 | .2219 | .3392 | -.3037 |
| | GPT3-D2 | .5060 | .0594 | .8064 | 2.9492 | .3977 | .6372 | -.2626 |
| Newsroom | PEGASUS | .6286 | .1131 | .7118 | 4.2120 | .7218 | .7956 | .2418 |
| | BRIO | - | - | - | - | - | - | - |
| | T0 | .5433 | .0640 | .7511 | 3.5799 | .2828 | .7376 | .0261 |
| | GPT3-D2 | .5408 | .0599 | .7160 | 3.2336 | .3988 | .6564 | -.0729 |

Table 5: Performance of different summarization systems, as scored by automatic reference-free evaluation metrics from the summarization literature. Similar to reference-based metrics, these also generally fail to reliably produce the same system rankings as human preferences across datasets.

We benchmark the performance of three different families of summarization metrics: (1) overlap-based metrics, specifically ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005) and BLEU (Papineni et al., 2002); (2) similarity-based metrics, which compute similarity between embedding representations of generated and reference summaries, specifically BERTScore (Zhang* et al., 2020) and MoverScore (Zhao et al., 2019); and (3) a QA-based metric, specifically QAEval (Deutsch et al., 2021). Although most QA metrics are reference-free (discussed in Section 4.2), QAEval uses the reference summaries to indicate saliency. We report both exact match (EM) and F1 components of QAEval.
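As an illustration of the reference-based metric families in Table 4, the sketch below scores a generated summary against a gold reference using the rouge-score and bert-score packages; these are common implementations and may differ from the exact packages and configurations used in the paper.

```python
# Sketch of reference-based scoring with two common packages. Illustrative only;
# not necessarily the implementations or configurations used in the paper.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

def rouge_l(reference: str, generated: str) -> float:
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    return scorer.score(reference, generated)["rougeL"].fmeasure

def bert_f1(references: list, generated: list) -> float:
    # bert_score expects candidates first, then references.
    _, _, f1 = bertscore(generated, references, lang="en")
    return f1.mean().item()
```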

Results Table 4 outlines the results. It shows that BRIO and PEGASUS models, fine-tuned to emulate the reference summaries, outperform GPT3-D2 summaries according to all reference-based automatic metrics. The difference in their assigned scores is very high, e.g. >7 ROUGE-L points between GPT3-D2 and BRIO. For comparison, these reported scores for GPT3-D2 are even lower than the trivial Lead-3 baseline reported in prior work (Fabbri et al., 2021; Grusky et al., 2018). This clearly demonstrates that current automatic reference-based metrics cannot be used to reliably measure summary quality under the prompting paradigm.

Amongst prompting-based models, we observe that T0 summaries report better metric scores than GPT3-D2 for all datasets except Newsroom. Interestingly, out of the four datasets evaluated here, Newsroom is the only one not used to train the T0 model. This further shows that access to dataset-specific reference summaries during training improves performance according to these metrics, rendering them unsuitable for evaluating prompt-based models.

4.2 Reference-free metrics

Next, we investigate whether current reference-free evaluation metrics reflect the human preference rankings between summarization systems, as observed in Section 3. Here, we study two categories of metrics: (1) quality metrics, specifically SUPERT (Gao et al., 2020), which evaluates generated summaries against automatically identified salient sentences in the input, and BLANC (Vasilyev et al., 2020), which evaluates summaries on language understanding tasks. We refer readers to the original papers for a detailed explanation of these. (2) factuality metrics, which evaluate whether generated summaries contain incorrect information with respect to the source article. We report the performance of summarization systems using two QA-based metrics: QuestEval (Scialom et al., 2021) and QAFactEval (Fabbri et al., 2022). Additionally, we also benchmark entailment-based metrics: FactCC (Kryscinski et al., 2020), DAE (Goyal and Durrett, 2020, 2021) and SummaC (Laban et al., 2022).[3] These entailment-based models are designed for classification into factual or non-factual; therefore, we use P(factual | article, summary) to score generated summaries.
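To make the scoring scheme concrete, the sketch below computes an entailment-style score P(entailment | article, summary) with an off-the-shelf MNLI model; this is only illustrative of the idea behind the entailment-based metrics and is not the actual FactCC, DAE, or SummaC implementation.

```python
# Illustrative entailment-based factuality proxy using an off-the-shelf NLI model
# (roberta-large-mnli). Not the metrics' own checkpoints or scoring procedures.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def entailment_prob(article: str, summary: str) -> float:
    """P(entailment) of the summary given the (truncated) article as premise."""
    inputs = tokenizer(article, summary, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)[0]
    entail_idx = model.config.label2id.get("ENTAILMENT", 2)
    return probs[entail_idx].item()
```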

Results Table 5 outlines the scores for each summarization system according to the above reference-free metrics. Ideally, we want the relative rankings of different systems according to these metrics to correspond to human preferences, i.e. GPT3-D2 > BRIO > T0 for CNN/DM10 and GPT3-D2 > T0 > BRIO for XSum.11

Overall, we observe that none of the reference-free metrics we evaluate follow these trends for both CNN/DM and XSum datasets. In particular, we observe that GPT3-D2 summaries report low factuality scores (except XSum) even though we rarely found any factual errors in our qualitative analysis of its generated summaries.

Interestingly, we noticed a roughly inverse relation to abstractiveness; summarization systems that generated more abstractive summaries (see Table 2) were generally scored lower by these automatic reference-free metrics. For instance, GPT3-D2 is scored lower than BRIO by both quality metrics for all datasets except XSum; the latter is the only dataset for which GPT3-D2 summaries are less abstractive. Such shortcomings of reference-free evaluation metrics due to spurious correlations have also been studied in prior work (Durmus et al., 2022). These issues become more pronounced when the summarization systems being compared exhibit very different properties.

Discussion On the surface, the failure of reference-free metrics at evaluating GPT3-D2 summaries is more surprising than that of reference-based metrics, as the latter explicitly compare generated summaries with references that GPT3-D2 is not trained to imitate. Therefore, GPT3-D2 understandably scores lower than fine-tuned systems.

However, we note two different issues with reference-free metrics: (1) Some of these, e.g. FactCC and DAE, use reference summaries as positive examples to train the metric.

10Although the human study in Section 3 is only run on CNN articles, the underlying fine-tuned model is the same for both CNN and DM. Therefore, we can reasonably expect it to display similar quality differences with respect to GPT3-D2.

11Note that while annotators were not explicitly asked to rate factuality, we instructed them to carefully check factuality and appropriately downvote non-factual summaries.

Therefore, although "reference-free" at test time, they are still trained to reward the summary properties seen in the standard summarization benchmarks. (2) Even completely reference-free metrics, e.g. QuestEval and QAFactEval, have only been evaluated on reference-based benchmarks and fine-tuned models. Therefore, the choice of different components, such as which question answering or question generation models to use, has been dictated by the error space of prior fine-tuned models (Tang et al., 2023). These decisions now need to be re-visited to incorporate GPT3-D2 evaluation; we leave this for future work.

5 Beyond Generic Summarization

Previously, we observed that GPT3-D2 models faithfully follow simple "style" instructions in the given prompts. This provides a promising direction to tackle other use cases in news summarization beyond the generic summarization task from Section 3.

Different users can have very different information needs from the same article, all of which cannot be satisfied with a single generic summary. Prior work has introduced several task formulations to address this gap, including keyword-focused (He et al., 2022a), query-focused (Baumel et al., 2014; He et al., 2022a), or aspect-focused summarization (Krishna and Srinivasan, 2018; Ahuja et al., 2022), amongst others. Here, we evaluate GPT3-D2 performance at two of these use cases.

In keyword-based summarization, the output summaries must succinctly summarize the input document focusing on a given keyword; these generally correspond to specific entities or events directly mentioned in the document. In contrast, the control units in aspect-based summarization are high-level topics that can be common across multiple similar types of documents. For example, for the input article in Figure 1, Donald Trump or Russian interference in the 2016 elections are keyword controls, whereas charges against the defendants is a higher-level aspect that can serve as the query for any news article discussing a lawsuit or investigation.

5.1 Qualitative Analysis

Baseline Model for comparison We use the recently proposed CTRLSum (He et al., 2022a), a fine-tuned BART model, as our baseline. It can be flexibly adapted for both keyword- and aspect-based settings by including a prompt as additional input to the encoder.

Article: Republican defenders of Donald Trump won't or can't answer the questions that are at the root of the intrigue over why classified documents were at Mar-a-Lago and the troubling question of whether national security was put at risk. […] At one end of the scale is Missouri Sen. Josh Hawley, a firm Trump backer who is often mentioned as a future presidential candidate once the ex-President finally exits the political stage. Hawley promoted Trump's line that the key issue was the FBI search of an ex-President's home, not what he might have kept there. "I don't know what he has or doesn't have, but what I have a lot of concerns about is the FBI raiding the home of the likely presidential nominee in 2024, the former President of the United States, and particularly as we learn more and more about what's going on with the FBI," Hawley told CNN's Manu Raju on Wednesday. […] Ex-Attorney General William Barr has repeatedly questioned publicly what his former boss was doing with classified material. There has been speculation that Trump might want such material as keepsakes or as some kind of leverage in a future business deal. But none of this is informed given the secrecy surrounding the DOJ's investigation. […]
Link: https://www.cnn.com/2022/09/08/politics/republicans-trump-national-security-implications-classified-docs

Keyword-based prompt: Summarize the above article in 2-3 sentences focusing on William Barr.

CTRLSum: Sen. Marco Rubio downplays the seriousness of the Department of Justice's investigation. Ex-Attorney General William Barr has repeatedly questioned what his former boss was doing with classified material. Trump and his allies have filled the vacuum for answers with a smokescreen of unproven claims, including that Trump declassified the material. [Missing context; poor discourse.]

GPT3-D2: The article discusses how various Republicans have been defending Donald Trump in light of the recent revelations that he had classified documents at his Mar-a-Lago resort. While some have downplayed the significance of the situation, others have outright denied that there is any cause for concern. However, former Attorney General William Barr has repeatedly questioned Trump's actions, speculating that he may have been trying to use the classified information for personal gain. [Includes article context before keyword-specific information.]

Aspect-based prompt 1: In the above article, who is a defendant or under investigation?

CTRLSum: Republican defenders of Donald Trump won't or can't answer the questions that are at the root of the intrigue over why classified documents were at Mar-a-Lago and the troubling question of whether national security was put at risk. In the article, who is a defendant or under investigation? [Does not directly address the aspect; repeats the prompt at the end.]

GPT3-D2: Donald Trump. The FBI. The Department of Justice. Sandy Berger. [Incorrectly answers the query.]

Aspect-based prompt 2: In the above article, what is the defendant's reaction to charges, including admission of guilt, denial of charges, or explanations?

CTRLSum: Republican defenders of Donald Trump won't or can't answer the questions that are at the root of the intrigue over why classified documents were at Mar-a-Lago, says Julian Zelizer. Zelizer: Trump and his allies have filled the vacuum for answers with a smokescreen of unproven claims, including that Trump declassified the material or it might have been planted by the FBI. [Does not directly answer the query.]

GPT3-D2: The defendant's reaction to charges in the above article is denial of charges. Trump and his allies have filled the vacuum for answers with a smokescreen of unproven claims, including that Trump declassified the material or that it might have been planted by the FBI in order to persecute him. [Summary captures aspect-relevant content.]

Figure 6: Comparison of keyword- and aspect-based summaries using GPT3-D2 and CTRLSum models. The GPT3-D2 prompt is shown with the corresponding keyword or aspect bolded. For keyword-based summarization, the GPT3-D2 summary presents appropriate context before the keyword-specific information. However, for aspect-based summarization, it does not always generate factually correct summaries, as shown in the first aspect example. We observe that CTRLSum performs poorly for both these settings.

We use the prompt template recommended in the original paper.12

Control Units

For the keyword-focused setting, we use named entities extracted from the input article as the control units. For aspect-focused summarization, we directly use the aspects introduced in the guided summarization task from TAC 2011.13 It defined 5 broad categories of newswire articles, such as accidents and natural disasters, investigations and trials, etc., and multiple aspects for each category. For example, the "investigations and trials" category includes aspects such as "who is the defendant or under trial?", "who is investigating, prosecuting, judging?", and so on.
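A minimal sketch of extracting such keyword control units with spaCy is shown below; the entity-type filter is an assumption, and the human study samples entities randomly rather than taking the first few.

```python
# Sketch of extracting keyword control units (distinct named entities) per article.
# Entity-type filtering is an assumption; the paper does not specify one.
import spacy

nlp = spacy.load("en_core_web_sm")

def keyword_controls(article: str, k: int = 2) -> list:
    doc = nlp(article)
    seen, keywords = set(), []
    for ent in doc.ents:
        if ent.label_ in {"PERSON", "ORG", "GPE", "EVENT"} and ent.text not in seen:
            seen.add(ent.text)
            keywords.append(ent.text)
        if len(keywords) == k:
            break
    return keywords
```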

Qualitative Analysis

Figure 6 shows examples of keyword- and aspect-focused summaries using GPT3-D2 and the baseline CTRLSum model. The keywords or aspects are highlighted in bold within the GPT3-D2 prompt displayed on the left.

12Trained model publicly released at: https://github.com/salesforce/ctrl-sum.

13https://tac.nist.gov/2011/Summarization/Guided-Summ.2011.guidelines.html

In this example, representative of average GPT3-D2 quality, the keyword-focused GPT3-D2 summary first gives a brief overview of the article setting before providing keyword-relevant information. In contrast, the CTRLSum summary exhibits poor discourse structure and reads like a list of facts stapled together.

The figure also shows aspect-focused summaries for two aspects associated with the "investigations and trial" category most appropriate for the chosen article. We see mixed results here for GPT3-D2; it generates a factually incorrect summary for the first aspect, listing multiple people from the input article as defendants instead of only "Donald Trump". For the second aspect, it correctly maps the high-level concept "defendant" to "Donald Trump" in the input article and generates the correct answer to the input query: "The defendant's reaction to charges in the above article is denial of charges".

On the other hand, CTRLSum fails to generate aspect-focused summaries for both cases. We believe that it struggles to align high-level concepts and explicit entities in the article due to a lack of such aspect-specific examples in its training data. Instead, it generates summaries focusing on lexically similar words, i.e. "defenders" for both cases.

Based on GPT3-D2's promising keyword-focused summarization capabilities observed above, we next conduct a human study to systematically compare it against the CTRLSum baseline. We leave further explorations of aspect-based summarization to future work, given the mixed to poor results for both models at this task.

5.2 Human Study: Keyword-focused summarization

Task Setup

Similar to Section 3, we design an A/B test to compare the two models. We use the same set of 100 CNN14 articles as Section 3. We randomly extract 2 distinct named entities from each article. In the study interface, the annotator is shown the article-keyword pair and the GPT3-D2 and CTRLSum summaries corresponding to it. They are asked to select the summary that best summarizes the input article while focusing on the given keyword. Exact task instructions are included in Appendix F.

Again, we run this study using the Prolific platform. We recruit 60 participants to annotate the 100 articles; each article is annotated by 3 annotators, which includes annotations for 2 separate keywords. Each annotator evaluates 5 articles.

Results Figure 7 shows the distribution of annotator votes between the GPT3-D2 and CTRLSum models. Annotators show a clear preference for GPT3-D2. In fact, for nearly 70% of all article-keyword pairs, GPT3-D2 is preferred over CTRLSum by a majority of the annotators (win rate of 69.8% vs. 30.2% according to majority vote). The main rationales given for this choice were better contextualization of keyword-related information and better coherence in GPT3-D2 summaries.

Figure 7: Distribution of annotator votes for the keyword-focused summarization task. Annotators prefer GPT3-D2 summaries over CTRLSum for approximately 70% of all article-keyword pairs, showing unanimous preference more than half the time.

14We run this study using only CNN articles because the baseline CTRLSum model is trained on CNN/DM.

Impact These results show that prompting GPT-3 models presents a promising alternative to fine-tuned models for such specialized summarization tasks that can be easily described using textual prompts. One of the major drawbacks of fine-tuned models is that they are constrained by what data is available and how it can be transformed to create new task-specific training data. CTRLSum relied on the SQuAD question answering dataset (Rajpurkar et al., 2016) because the required "queries" or "questions" were unavailable at scale for summaries in standard summarization datasets. In contrast, prompt-based models are not constrained by the availability of task-specific data and can flexibly adapt to new tasks. Future research should focus on further exploring these capabilities and possible improvements on currently "unsolved" tasks such as aspect-based or plan-based summarization.

6 Discussion and Related Work

In recent years, research in text summarization (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017; Lewis et al., 2020; Zhang et al., 2020; Liu et al., 2022) has typically relied on comparisons with gold test sets for evaluation, possibly augmented with reference-free metrics for dimensions like factuality. This paper shows that all these metrics are completely ineffective at evaluating GPT-3 summaries. Although issues with these metrics, particularly low correlation with human judgments, have also been studied earlier (Fabbri et al., 2021; Deutsch and Roth, 2021), they are considered reliable when comparing systems in different score ranges (Peyrard, 2019; Deutsch et al., 2022). However, GPT-3 challenges these established practices and evaluation protocols, and poses an urgent need for better evaluation.

This brings us to manual evaluation, generally considered to be the gold standard for generation evaluation. The majority of summarization research now reports results from a human study in addition to automatic metrics, but there is a general lack of consensus on what dimensions to evaluate, task design, and other factors (Hardy et al., 2019). This presents difficulties in conducting reliable and reproducible comparisons between systems (Karpinska et al., 2021), another factor contributing to the popularity of automatic metrics.

Although recent efforts like GENIE (Khashabi et al., 2022) have taken steps to standardize manual evaluation protocols across systems, its annotation is not universally affordable and the quality is not strictly monitored. We hope that future work addresses these challenges and democratizes human evaluations.

The ultimate test of summarization systems is with actual users using the systems in practice. Jones (2007) discusses the need to align task formulations with actual application scenarios ("purpose factors"). However, research in text summarization until now has been constrained to certain problems or domains by the heavy dependence on large-scale training data: for example, producing a bullet-point summary of a news article has emerged as standard due to the availability of data from CNN, not because it is shown to be the best way to present information.

Now, the success of prompt-based models can allow realistic use cases to drive research in a more top-down way. We already show that GPT3-D2 improves upon prior keyword-focused summarization systems that were trained on artificially adapted training data. In future research, we are interested in tackling other real-world use cases, such as update summarization and plan- or aspect-based summarization. Additionally, adapting GPT3-D2 to documents longer than the allowed context, or structured inputs such as tables, presents research challenges beyond the current capabilities of GPT-3 and would be interesting to study.[4]

7 Conclusion

In this work, we performed the first systematic study comparing prompt-based GPT-3 and fine-tuned models at the news summarization task. We analyzed the impact of prompting on the summarization field, including training paradigms and evaluation practices. Finally, to support further research in this direction, we release a large corpus of generated summaries for multiple prompt-based and fine-tuned models, as well as human preference judgments comparing these systems.

8 Limitations

In the text generation evaluation literature, there does not exist a standardized task design for comparing different system generations. In our work, we chose a human evaluation workflow that directly asks annotators to compare systems, while other prior work has opted for Likert-scale judgments and/or evaluation along multiple quality dimensions (Gehrmann et al., 2022). The latter strategy of evaluating different dimensions could surface more insights into which "style" properties of GPT-3 summaries provide them an edge over fine-tuned models; however, such analysis is outside the scope of this paper. Our experiments comparing overall quality reveal that current summarization datasets are not well-aligned with user preferences. We leave more fine-grained analysis into these preference judgments for future work.

The experiments in this paper are run on English-language news summarization datasets, as these serve as common benchmarks in the summarization literature. However, user rankings of system outputs might be different when evaluating other domains, e.g., summaries of scientific text. While we believe that automatic metrics would fail to evaluate GPT-3 summaries on these domains also (generated summaries would still look different from the reference summaries), users may prefer models that are specifically fine-tuned on domain-specific data for niche domains.

Finally, we do not know the exact datasets or tasks used to train GPT3-D2. It is possible that its RLHF training (Ouyang et al., 2022) included summarization examples, and therefore preference judgments from human annotators on its different outputs. However, our arguments in this paper do not rely on the specifics of the GPT3-D2 system, merely that such a system exists. If anything, the existence of potentially better training data underscores that further work should collect new data for summarization model tuning, and our claims about metrics hold regardless of how the GPT3-D2 summaries were produced.

References

Ojas Ahuja, Jiacheng Xu, Akshay Gupta, Kevin Horecka, and Greg Durrett. 2022. ASPECTNEWS: Aspect-oriented summarization of news documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6494–6506.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Tal Baumel, Raphael Cohen, and Michael Elhadad. 2014. Query-chain focused summarization. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 913–922.

Manik Bhandari, Pranav Narayan Gour, Atabak Ashfaq, and Pengfei Liu. 2020. Metrics also disagree in the low scoring range: Revisiting summarization evaluation metrics. In Proceedings of the 28th International Conference on Computational Linguistics, pages 5702–5711.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Dallas Card, Peter Henderson, Urvashi Khandelwal, Robin Jia, Kyle Mahowald, and Dan Jurafsky. 2020. With little power comes great responsibility. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9263–9274.

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A Smith. 2021. All that’s ‘human’ is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296.

Arman Cohan, Franck Dernoncourt, Doo Soon Kim, Trung Bui, Seokhwan Kim, Walter Chang, and Nazli Goharian. 2018. A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana. Association for Computational Linguistics.

Daniel Deutsch, Tania Bedrax-Weiss, and Dan Roth. 2021. Towards question-answering as an automatic metric for evaluating the content quality of a summary. Transactions of the Association for Computational Linguistics, 9:774–789.

Daniel Deutsch, Rotem Dror, and Dan Roth. 2022. Re-examining system-level correlations of automatic summarization evaluation metrics. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 6038–6052, Seattle, United States. Association for Computational Linguistics.

Daniel Deutsch and Dan Roth. 2021. Understanding the extent to which content quality metrics measure the information quality of summaries. In Proceedings of the 25th Conference on Computational Natural Language Learning, pages 300–309.

Esin Durmus, He He, and Mona Diab. 2020. FEQA: A question answering evaluation framework for faithfulness assessment in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5055–5070.

Esin Durmus, Faisal Ladhak, and Tatsunori B Hashimoto. 2022. Spurious correlations in reference-free evaluation of text generation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1443–1454.

Alexander Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. 2022. QAFactEval: Improved QA-based factual consistency evaluation for summarization. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2587–2601, Seattle, United States. Association for Computational Linguistics.

Alexander R Fabbri, Wojciech Kryscinski, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.

Yang Gao, Wei Zhao, and Steffen Eger. 2020. SUPERT: Towards new frontiers in unsupervised evaluation metrics for multi-document summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1347–1354.

Sebastian Gehrmann, Elizabeth Clark, and Thibault Sellam. 2022. Repairing the cracked foundation: A survey of obstacles in evaluation practices for generated text. arXiv preprint arXiv:2202.06935.

Tanya Goyal and Greg Durrett. 2020. Evaluating factuality in generation with dependency-level entailment. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3592–3603.

Tanya Goyal and Greg Durrett. 2021. Annotating and modeling fine-grained factuality in summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1449–1462.

Tanya Goyal, Jiacheng Xu, Junyi Jessy Li, and Greg Durrett. 2022. Training dynamics for text summarization models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2061–2073.

Max Grusky, Mor Naaman, and Yoav Artzi. 2018. Newsroom: A dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 708–719.

Hardy Hardy, Shashi Narayan, and Andreas Vlachos. 2019. HighRES: Highlight-based reference-less evaluation of summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3381–3392.

Junxian He, Wojciech Kryscinski, Bryan McCann, Nazneen Rajani, and Caiming Xiong. 2022a. CTRLsum: Towards generic controllable text summarization. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5879–5915, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu, Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, et al. 2022b. Z-Code++: A pre-trained language model optimized for abstractive summarization. arXiv preprint arXiv:2208.09770.

Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in Neural Information Processing Systems, 28.

Karen Spärck Jones. 2007. Automatic summarising: The state of the art. Information Processing & Management, 43(6):1449–1481.

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using mechanical turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285.

Daniel Khashabi, Gabriel Stanovsky, Jonathan Bragg, Nicholas Lourie, Jungo Kasai, Yejin Choi, Noah A. Smith, and Daniel Weld. 2022. GENIE: Toward reproducible and standardized human evaluation for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11444–11458, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Kundan Krishna and Balaji Vasan Srinivasan. 2018. Generating topic-oriented summaries using neural attention. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1697–1705.

Wojciech Kryscinski, Bryan McCann, Caiming Xiong, and Richard Socher. 2020. Evaluating the factual consistency of abstractive text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9332–9346.

Wojciech Kryscinski, Nazneen Fatema Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir R Radev. 2021. BookSum: A collection of datasets for long-form narrative summarization.

Philippe Laban, Tobias Schnabel, Paul N. Bennett, and Marti A. Hearst. 2022. SummaC: Re-visiting NLI-based models for inconsistency detection in summarization. Transactions of the Association for Computational Linguistics, 10.

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.

Yixin Liu, Pengfei Liu, Dragomir Radev, and Graham Neubig. 2022. BRIO: Bringing order to abstractive summarization. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2890–2903.

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. 2020. On Faithfulness and Factuality in Abstractive Summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919.

Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11048–11064, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3470–3487.

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Caglar Gulcehre, and Bing Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, pages 280–290.

Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Artidoro Pagnoni, Vidhisha Balachandran, and Yulia Tsvetkov. 2021. Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.

Rebecca J Passonneau. 2006. Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06).

Maxime Peyrard. 2019. Studying summarization evaluation metrics in the appropriate scoring range. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 5093–5100.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(140):1–67.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379–389.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2022. Multitask prompted training enables zero-shot task generalization. In The Tenth International Conference on Learning Representations.

William Saunders, Catherine Yeh, Jeff Wu, Steven Bills, Long Ouyang, Jonathan Ward, and Jan Leike. 2022. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802.

Thomas Scialom, Paul-Alexis Dray, Sylvain Lamprier, Benjamin Piwowarski, Jacopo Staiano, Alex Wang, and Patrick Gallinari. 2021. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604.

Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083.

Liyan Tang, Tanya Goyal, Alexander R Fabbri, Philippe Laban, Jiacheng Xu, Semih Yavuz, Wojciech Kryscinski, Justin F Rousseau, and Greg Durrett. 2023. Understanding factual errors in summarization: Errors, summarizers, datasets, error detectors. Association for Computational Linguistics.

Oleg Vasilyev, Vedant Dharnidharka, and John Bohannon. 2020. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Xi Ye and Greg Durrett. 2022. The unreliability of explanations in few-shot prompting for textual reasoning. In Advances in Neural Information Processing Systems.

Jingqing Zhang, Yao Zhao, Mohammad Saleh, and Peter Liu. 2020. PEGASUS: Pre-training with extracted gap-sentences for abstractive summarization. In International Conference on Machine Learning, pages 11328–11339. PMLR.

Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating Text Generation with BERT. In International Conference on Learning Representations.

Yusen Zhang, Ansong Ni, Ziming Mao, Chen Henry Wu, Chenguang Zhu, Budhaditya Deb, Ahmed Awadallah, Dragomir Radev, and Rui Zhang. 2022. SummN: A multi-stage summarization framework for long input dialogues and documents. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1592–1604.

Yusen Zhang, Ansong Ni, Tao Yu, Rui Zhang, Chenguang Zhu, Budhaditya Deb, Asli Celikyilmaz, Ahmed Hassan, and Dragomir Radev. 2021. An exploratory study on long dialogue summarization: What works and what's next. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 4426–4433.

Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In Proceedings of the International Conference on Machine Learning (ICML).

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578.

Yao Zhao, Mohammad Saleh, and Peter J Liu. 2020. SEAL: Segment-wise extractive-abstractive long-form text summarization. arXiv preprint arXiv:2006.10213.



Greg Durrett, Tanya Goyal, and Junyi Jessy Li (2022). "News Summarization and Evaluation in the Era of GPT-3." doi:10.48550/arXiv.2209.12356.
  1. https://tagoyal.github.io/zeroshot-news-annotations.html
  2. This size is chosen to give sufficient statistical power (Card et al., 2020) while keeping costs for GPT3-D2 evaluation low to enable others to compare on this subset. We outline costs in Appendix D.
  3. Exact model versions and configurations used for these are outlined in Appendix A.
  4. We very briefly discuss long document summarization with GPT-3 in Appendix E.