Automated Text Summarization Performance Evaluation Task


An Automated Text Summarization Performance Evaluation Task is an NLG evaluation task for an automated text summarization task.
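
A minimal sketch of what such an evaluation task computes, assuming the open-source rouge_score package and made-up example texts: a model-generated summary is compared against a human-authored reference summary with an automated overlap metric such as ROUGE-2.

```python
# Minimal scoring sketch; the texts below are illustrative only.
from rouge_score import rouge_scorer

reference = "The city council approved the new transit budget on Tuesday."
generated = "On Tuesday the council approved a new budget for transit."

# ROUGE-2 measures bigram overlap between the generated and reference summaries;
# other quality metrics (e.g. BERTScore) would be applied to the same pair analogously.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
score = scorer.score(reference, generated)["rouge2"]
print(score.precision, score.recall, score.fmeasure)
```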



References

2022

  • (Liang, Bommasani et al., 2022) ⇒ Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue Wang, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. (2022). “Holistic Evaluation of Language Models.” doi:10.48550/arXiv.2211.09110
    • QUOTE: Problem setting. We formulate text summarization as an unstructured sequence-to-sequence problem, where a document (e.g. a CNN news article) is the input and the LM is tasked with generating a summary that resembles the reference summary (e.g. the bullet point summary provided by CNN with their article). Figure 13 provides an example. This evaluation tests the abstractive summarization capabilities of the model, where the model is directly required to generate the summary rather than being explicitly constrained to copying words or larger extracts from the input document.

      To evaluate model performance, the model-generated summary is compared against a human-authored reference summary using automated metrics for overall quality (ROUGE-2; BERTScore; Lin, 2004; Zhang et al., 2020b), faithfulness (Laban et al., 2022; Fabbri et al., 2022), and extractiveness (Grusky et al., 2018). Faithfulness refers to whether all the information in the model summary is supported by the article (Cao et al., 2018; Durmus et al., 2020; Maynez et al., 2020). Extractiveness refers to the extent to which model summaries involve copying from the input document: the distinction between extractive and abstractive approaches has been widely discussed in the summarization literature (see Nenkova and McKeown, 2012). We compute extractiveness since prior work has shown that current summarization systems tend to be less faithful, on average, whenever they extract less (Durmus et al., 2020; Mrini et al., 2021; Ladhak et al., 2022).

      We pay special attention to faithfulness as neural models in particular often hallucinate content that diverges from what appears in the document being summarized. Consequently, it is important to measure and improve the faithfulness of these systems since unfaithful systems may be harmful by potentially spreading misinformation, including dangerous, yet hard to detect errors, when deployed in real-world settings. We first evaluate the LMs using recently proposed reference-free evaluation metrics that have been shown to get high correlations with human scores for faithfulness (Laban et al., 2022; Fabbri et al., 2022). Recent work has shown that some reference-free evaluation metrics may be mostly relying on spurious correlations (Durmus et al., 2022). Given this, we further conducted a human user study to validate and supplement the automated evaluation.

      Datasets. There is a growing collection of summarization datasets, including datasets that capture finer-grained and more specific summarization functions (e.g. summarizing multiple documents or conditional on a user query). Bommasani and Cardie (2020) show that there is significant diversity in summarization datasets along several axes, which makes selecting a few datasets to represent summarization rather challenging. Since we are especially interested in model faithfulness in this work (as this is a known failure mode of other neural approaches to summarization), we select the CNN/DailyMail (Hermann et al., 2015a) and XSUM (Narayan et al., 2018) datasets, which are the most well-studied datasets in the literature on summarization faithfulness. This also ensures domain coverage of news-type data. Importantly, these datasets differ along a central axis studied in summarization: XSUM is a dataset with largely abstractive reference summaries (meaning the string overlap between the document and its summary in the dataset is relatively small on average), whereas CNN/DailyMail is a dataset with largely extractive reference summaries. However, these datasets do not suffice in representing the full diversity of summarization, and we encourage future work to expand on our benchmark along this axis (e.g. add datasets from domains beyond news), particularly towards domains where there is greater demand for summaries (see Reiter, 2022). And we especially highlight that these two datasets have been the subject of critique, and that broader change is required for dataset and evaluation design in summarization and natural language generation (Gehrmann et al., 2022b; Reiter, 2022).
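
As an illustration of the few-shot problem setting quoted above, the sketch below builds a prompt from a handful of (article, reference summary) demonstrations followed by the test article. The instruction wording is an assumption rather than the exact HELM prompt template, and lm_generate is a hypothetical placeholder for whatever language model completion call is being evaluated.

```python
# Illustrative few-shot prompting setup; `lm_generate` is a hypothetical
# stand-in for the LM completion API under evaluation, and the instruction
# wording below is an assumption, not the exact HELM prompt template.

def build_prompt(demonstrations, test_article, max_sentences=3):
    """Format (article, reference summary) demonstrations, then the test article."""
    parts = []
    for article, summary in demonstrations:
        parts.append(
            f"Article: {article}\n"
            f"Summarize the above article in {max_sentences} sentences.\n"
            f"Summary: {summary}\n"
        )
    parts.append(
        f"Article: {test_article}\n"
        f"Summarize the above article in {max_sentences} sentences.\n"
        f"Summary:"
    )
    return "\n".join(parts)

# Usage sketch: the generated text would then be scored against the dataset's
# reference summary with ROUGE-2 / BERTScore (quality), reference-free
# faithfulness metrics, and the extractiveness statistics sketched further below.
# generated = lm_generate(build_prompt(demonstrations, test_article))
```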

    • Summarization. To further explore the results for this task, see https://crfm.stanford.edu/helm/v1.0/?group=summarization. For summarization on CNN/DailyMail and XSUM, we begin by observing that the ROUGE scores tend to be much lower than our qualitative judgments of summary quality, consistent with Goyal et al. (2022). For this reason, we conducted a further human evaluation to better understand properties of the summaries (§8.5.1: human-evaluation-summarization). With that said, broadly, we did find the ROUGE-2 scores did correlate with more accurate models across-the-board (e.g. the top models on both datasets based on ROUGE-2 largely overlapped with those at the top of the accuracy subfigure of Figure 26). In general, we saw a strong correlation with model size, as the largest models tended to be those with high ROUGE-2 scores for both scenarios, notably with TNLG v2 (530B) having the highest score for both scenarios with a reasonable margin for XSUM at 17.9 points when compared with second-place OPT (175B) at 15.6 points and no other model above 14 points.

      Beyond model accuracy, we found that getting models to produce summaries of appropriate length (so as to match the distribution of reference summaries in the dataset) was a key challenge, especially given models only observe a few in-context examples and, therefore, may be poorly specialized to the distribution (when compared to models explicitly trained/fine-tuned on the distribution, for which automated ROUGE scores may be more useful). In particular, comparing the compression scores across models, we see considerable variation, and the trends were not consistent across the two datasets. As a very striking example, GPT-3 ada v1 (350M) was one of the most compressive on CNN/DailyMail but the least compressive on XSUM by a wide margin (1.65 vs the next least compressive in GPT-3 babbage v1 (1.3B) at 6.12, where higher scores mean more compression), suggesting the GPT-3 ada v1 (350M) model especially did not "understand" the specification of length requirements in the instructions in the prompt. In terms of the degree of abstraction in model generations, we found that the relationship between model quality and abstraction (measured in terms of both coverage and density from Grusky et al. (2018)) was very variable, with few consistent trends.

      Since the summarization scenarios were the scenarios that required the longest-form generation of all core scenarios, we paid special attention to the presence of generative harms. For stereotypical associations (for both race and gender) on both datasets, we found that all models exhibited very similar biases, especially on CNN/DailyMail. In part, we think alternative forms of measurement that propose means for controlling for bias in the source documents that are being summarized (i.e. attributing bias to the dataset vs. the model’s specific tendencies in generation) could be helpful in providing more acuity. In contrast, we saw more variation for demographic representation, but the trends across datasets and across race and gender were inconsistent. Interestingly, with respect to demographic representation, YaLM (100B) demonstrated the greatest racial bias on both datasets and the greatest gender bias on CNN/DailyMail (e.g. racial bias of 0.651 on CNN/DailyMail for race compared to the next highest of 0.399 from GPT-J (6B)), but was one of the least gender biased models on XSUM. And for toxicity, we found the incidence of toxicity to be very low for both datasets, suggesting the risk of toxicity for such innocuous use cases to largely be marginal. With that said, we emphasize this is summarization of news documents, where models may be inherently less likely to generate toxic content given the domain of the documents. And we do note that, while a very low rate of 0.6%, TNLG v2 (530B) achieves the highest toxicity rate in addition to the highest ROUGE-2 accuracy on XSUM.
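
The extractiveness statistics discussed above, coverage and density from Grusky et al. (2018), together with the compression ratio, can be sketched in simplified form as below. This is an approximation of the published fragment-based definitions (operating on whitespace tokens with greedy longest-match), not the authors' reference code.

```python
# Simplified sketch of the extractiveness statistics from Grusky et al. (2018).

def extractive_fragments(article_tokens, summary_tokens):
    """Greedily match each summary position to its longest verbatim span in the article."""
    fragments, i = [], 0
    while i < len(summary_tokens):
        best = 0
        for j in range(len(article_tokens)):
            k = 0
            while (i + k < len(summary_tokens) and j + k < len(article_tokens)
                   and summary_tokens[i + k] == article_tokens[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])
            i += best
        else:
            i += 1  # token never appears in the article; skip it
    return fragments

def coverage(article_tokens, summary_tokens):
    """Fraction of summary tokens that lie inside some extractive fragment."""
    frags = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) for f in frags) / len(summary_tokens)

def density(article_tokens, summary_tokens):
    """Average squared fragment length; higher means longer verbatim copies."""
    frags = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) ** 2 for f in frags) / len(summary_tokens)

def compression(article_tokens, summary_tokens):
    """Word-count ratio of article to summary; higher means more compressive."""
    return len(article_tokens) / len(summary_tokens)

article = "the city council approved the transit budget on tuesday after a long debate".split()
summary = "council approved the transit budget".split()
print(coverage(article, summary), density(article, summary), compression(article, summary))
```

On these statistics, largely abstractive reference summaries (as in XSUM) yield low coverage and density, while largely extractive ones (as in CNN/DailyMail) score high on both, matching the dataset contrast described in the quoted passage.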