2020 Evaluation of Text Generation: A Survey

From GM-RKB

Subject Headings: Text Generation Performance Measure, Text Generation Benchmark.

Notes

  • Human-Centric Evaluation Methods.
    • Amazon Mechanical Turk-based Text Generation Evaluation.
    • Intrinsic Text Generation Evaluation.
      • Likert-scale question.
      • Best-worst ranking (best-worst scaling).
      • Adequacy: “how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation.”
      • Fluency: measures the quality of the generated text only (e.g., grammar, spelling, word choice), without taking the source into account.
      • Factuality: whether the generated text accurately reflects the facts described in the context (including logic and commonsense).
    • Extrinsic Evaluation
      • An extrinsic evaluation has people evaluate a system’s performance on the task for which it was designed.
    • The Evaluators
      • For many NLG evaluation tasks, no specific expertise is required of the evaluators other than a proficiency in the language of the generated text.
      • Specialized groups of evaluators can be useful when testing a system for a particular set of users, as in extrinsic evaluation settings.
    • Inter-Evaluator Agreement
      • Percent agreement
      • Cohen’s κ.
      • Fleiss’ κ.
      • Krippendorff’s α.
  • Untrained Automatic Evaluation Metrics

    These metrics “assume that the generated text has significant word (or n-gram) overlap with the ground-truth text.”
    • n-gram Overlap Metrics for Content Selection: BLEU, NIST, F-SCORE, WER, ROUGE, METEOR, HLEPOR, RIBES, CIDEr.
    • Distance-Based Evaluation Metrics for Content Selection:
      • Edit Distance-Based Metrics.
      • Vector Similarity-Based Evaluation Metrics: Word Mover’s Distance (WMD), Sentence Mover’s Distance (SMD).
    • n-gram-Based Diversity Metrics: Type-Token Ratio (TTR), SELF-BLEU.
    • Explicit Semantic Content Match Metrics.
    • Syntactic Similarity-Based Metrics.

  • Machine-Learned Evaluation Metrics

    • Sentence Semantic Similarity-Based Evaluation: word embeddings (e.g., ELMo, BERT) to represent sentences.
    • Evaluating Factual Correctness.
    • Regression-Based Evaluation.
    • Evaluation Models with Human Judgments.
    • BERT-Based Evaluation.
    • Composite Metric Scores.
  • Two Case Studies of Task-Specific NLG Evaluation.
  • Conclusions and Future Directions.
    • Making evaluation explainable.
    • Standardizing evaluation methods.

Cited By

Quotes

Abstract

The paper surveys evaluation methods of natural language generation (NLG) systems that have been developed in the last few years. We group NLG evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made and the challenges still being faced, with a focus on the evaluation of recently proposed NLG tasks and neural NLG models. We then present two case studies of automatic text summarization and long text generation, and conclude the paper by proposing future research directions.

Chapter 1

Introduction

Natural language generation (NLG), a sub-field of natural language processing (NLP), deals with building software systems that can produce coherent and readable text. NLG can be applied to a broad range of NLP tasks such as generating responses to user questions in a chatbot, translating a sentence or a document from one language into another, offering suggestions to help write a story, or generating summaries of time-intensive data analysis. NLG evaluation is challenging mainly because many NLG tasks are open-ended. For example, a dialog system can generate multiple plausible responses for the same user input. A document can be summarized in different ways. Therefore, human evaluation remains the gold standard for almost all NLG tasks. However, human evaluation is expensive, and researchers often resort to automatic metrics for quantifying day-to-day progress and for performing automatic system optimization. Recent advancements in deep learning have yielded tremendous improvements in many NLP tasks. This, in turn, presents a need for evaluating these deep neural network (DNN) models for NLG.

In this paper we provide a comprehensive survey of NLG evaluation methods with a focus on evaluating neural NLG systems. We group evaluation methods into three categories: (1) human-centric evaluation metrics, (2) automatic metrics that require no training, and (3) machine-learned metrics. For each category, we discuss the progress that has been made, the challenges still being faced, and proposals for new directions in NLG evaluation.

1.1 Evolution of Natural Language Generation

NLG is defined as the task of building software systems that can write (i.e., producing explanations, summaries, narratives, etc.) in English and other human languages. Just as people communicate ideas through writing or speech, NLG systems are designed to produce natural language text or speech that conveys ideas to its readers in a clear and useful way. NLG systems have been used to generate text for many real-world applications such as generating weather forecasts, carrying interactive conversations with humans in spoken dialog systems (chatbots), captioning images or visual scenes, translating text from one language to another, and generating stories and news articles.

NLG techniques range from simple template-based systems that generate natural language text using rules and templates to machine-learned systems that have a complex understanding of human grammar. The first generation of automatic NLG systems uses rule-based or data-driven pipeline methods. In their seminal paper, Reiter & Dale (2000) present a classical three-stage NLG architecture, as shown in Figure 1.1. The first stage is document planning, in which the content and its order are determined and a text plan that outlines the structure of messages is generated. The second is the micro-planning stage, in which referring expressions that identify objects like entities or places are generated, along with the choice of words to be used and how they are aggregated. Collating similar sentences to improve readability with a natural flow also occurs in this stage. The last stage is realization, in which the actual text is generated, using linguistic knowledge about morphology, syntax, semantics, etc. Earlier work has focused on modeling discourse structures and learning representations of relations between text units for text generation (McKeown, 1985; Marcu, 1997; Ono et al., 1994; Stede & Umbach, 1998), for example using Rhetorical Structure Theory (Mann & Thompson, 1987) or Discourse Representation Theory (Lascarides & Asher, 1991). There is also a large body of work that builds on template-based models and uses statistical methods to improve generation, introducing techniques such as sentence compression, reordering, lexical paraphrasing, and syntactic transformation, to name a few (Sporleder, 2005; Steinberger, 2006; Knight, 2000; Clarke & Lapata, 2008; Quirk et al., 2004).

Figure 1.1: The three stages of the traditional NLG process (Reiter & Dale, 2000).

These earlier text generation approaches and their extensions play an important role in the evolution of NLG research. The same is true for the NLG research in the last decade, in which we witness a paradigm shift towards learning representations from large textual corpora in an unsupervised manner using deep neural network (DNN) models. Recent NLG models are built by training DNN models, typically on very large corpora of human-written texts. The paradigm shift starts with the use of recurrent neural networks (Graves, 2013) (e.g., long short-term memory (LSTM) networks (Hochreiter & Schmidhuber, 1997), gated recurrent units (GRUs) (Cho et al., 2014), etc.) for learning language representations, such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014), and later sequence-to-sequence learning (Sutskever et al., 2014), which opens up a new chapter characterised by the wide application of the encoder-decoder architecture. Although sequence-to-sequence models were originally developed for machine translation, they were soon shown to improve performance across many NLG tasks. These models’ weakness in capturing long-span dependencies in long word sequences motivates the development of attention networks (Bahdanau et al., 2015) and pointer networks (Vinyals et al., 2015). The Transformer architecture (Vaswani et al., 2017), which incorporates an encoder and a decoder, both implemented using the self-attention mechanism, is being adopted by new state-of-the-art NLG systems. There has been a large body of research in recent years that focuses on improving the performance of NLG using large-scale pre-trained language models for contextual word embeddings (Peters et al., 2018; Devlin et al., 2018; Sun et al., 2019; Dong et al., 2019), using better sampling methods to reduce degeneration in decoding (Zellers et al., 2019; Holtzman et al., 2020), and learning to generate text with better discourse structures and narrative flow (Yao et al., 2018; Fan et al., 2019b; Dathathri et al., 2020; Rashkin et al., 2020).

Neural models have been applied to many NLG tasks which we will discuss in this paper, including:

  • summarization: common tasks include single- or multi-document summarization, query-focused or generic summarization, and summarization of news, meetings, screenplays, social blogs, etc.
  • machine translation: sentence- or document-level.
  • dialog response generation: goal-oriented or chit-chat dialogs.
  • paraphrasing.
  • question generation.
  • long text generation: the most common tasks are story, news, or poem generation.
  • data-to-text generation: e.g., table summarization.
  • caption generation from non-text input: the input can be tables, images, or sequences of video frames (e.g., in visual storytelling).

1.2 Why a Survey on Evaluation of Natural Language Generation

The question this paper addresses is how to measure the quality of the text generated by NLG models.

Text generation is a key component of language translation, chatbots, question answering, summarization, and several other applications that people interact with every day. Building language models using traditional approaches is a complicated task that needs to take into account multiple aspects of language, including linguistic structure, grammar, word usage, and perception, and thus requires non-trivial data labeling efforts. Recently, Transformer-based neural language models have proven very effective in leveraging large amounts of raw text corpora from online sources (such as Wikipedia, search results, blogs, Reddit posts, etc.). For example, one of the most advanced neural language models, GPT-2 (Radford et al., 2019), can generate long texts that are almost indistinguishable from human-generated texts (Zellers et al., 2019). Empathetic social chatbots, such as XiaoIce (Zhou et al., 2020), seem to understand human dialog well and can generate interpersonal responses to establish long-term emotional connections with users.

Nevertheless, training a powerful language model relies on evaluation metrics that can measure the model quality from different perspectives. For instance, it is imperative to build evaluation methods that can determine whether a text is generated by a human or a machine to prevent any potential harm. Similarly, evaluating the generated text based on factual consistency has recently drawn attention in the NLG field. It is concerning that neural language models can generate open-ended texts that are fluent but not grounded in real-world knowledge or facts, such as fake news. The situation is particularly alarming if the generated reports or news are related to the well-being of humankind, such as summaries of health reports (Zhang et al., 2019b). Thus, in addition to mainstream NLG evaluation methods, our survey also discusses recently proposed metrics to address human-facing issues, such as the metrics that evaluate the factual consistency of a generated summary or the empathy level of a chatbot’s response.

Many NLG surveys have been published in the last few years (Gatt & Krahmer, 2017; Zhu et al., 2018; Zhang et al., 2019a). Others survey specific NLG tasks or NLG models, such as image captioning (Kilickaya et al., 2017; Hossain et al., 2018; Li et al., 2019; Bai & An, 2018), machine translation (Dabre et al., 2020; Han & Wong, 2016; Wong & Kit, 2019), summarization (Deriu et al., 2009; Shi et al., 2018), question generation (Pan et al., 2019), extractive key-phrase generation (Çano & Bojar, 2019), deep generative models (Pelsmaeker & Aziz, 2019; Kim et al., 2018), text-to-image synthesis (Agnese et al., 2020), and dialog response generation (Liu et al., 2016; Novikova et al., 2017; Deriu et al., 2019; Dusek et al., 2019; Gao et al., 2019), to name a few.

There are only a few published papers that review evaluation methods for specific NLG tasks, such as image captioning (Kilickaya et al., 2017), machine translation (Goutte, 2006), online review generation (Garbacea et al., 2019), interactive systems (Hastie & Belz, 2014a), and conversational dialog systems (Deriu et al., 2019), and for human-centric evaluations (Lee et al., 2019; Amidei et al., 2019b). The closest to our paper is the NLG survey paper of Gkatzia & Mahamood (2015), which includes a chapter on NLG evaluation metrics.

Different from this work, our survey is dedicated to NLG evaluation, with a focus on the evaluation metrics developed recently for neural text generation systems, and provides an in-depth analysis of existing metrics to-date. To the best of our knowledge, our paper is the most extensive and up-to-date survey on NLG evaluation.

1.3 Outline of The Survey

We review NLG evaluation methods in three categories in Chapters 2-4: human-centric evaluation methods (Chapter 2), untrained automatic metrics (Chapter 3), and machine-learned evaluation metrics (Chapter 4).

In Chapter 5, we present two case studies of evaluation methods developed for two tasks, automatic document summarization and long-text generation (e.g., story or review generation), respectively. We choose these tasks because they have attracted a lot of attention in the NLG research community and the task-specific evaluation metrics they used can be adopted for other NLG tasks. We then provide general guidelines in building evaluation metrics that correlate well with human judgements. Lastly, we conclude the paper with future research directions for NLG evaluation.

Chapter 2

Human-Centric Evaluation Methods

Whether a system is generating an answer to a user’s query, a justification for a classification model’s decision, or a short story, the ultimate goal in NLG is to generate text that is valuable to people. For this reason, human evaluations are typically viewed as the most important form of evaluation for NLG systems and are held as the gold standard when developing new automatic metrics. Since automatic metrics still fall short of replicating human decisions (Reiter & Belz, 2009b; Krahmer & Theune, 2010; Reiter, 2018), many NLG papers include some form of human evaluation. For example, Hashimoto et al. (2019) report that 20 out of 26 generation papers published at ACL2018 present human evaluation results.

While human evaluations give the best insight into how well a model performs in a task, it is worth noting that human evaluations also pose several challenges. First, human evaluations can be expensive and time-consuming to run, especially for the tasks that require extensive domain expertise. While online crowd-sourcing platforms such as Amazon Mechanical Turk have enabled researchers to run experiments on a larger scale at a lower cost, they come with their own problems, such as maintaining quality control (Ipeirotis et al., 2010; Mitra et al., 2015). Furthermore, even with a large group of annotators, there are some dimensions of generated text that are not well-suited to human evaluations, such as diversity (Hashimoto et al., 2019). There is also a lack of consistency in how human evaluations are run, which prevents researchers from reproducing experiments and comparing results across systems. This inconsistency in evaluation methods is made worse by inconsistent reporting on methods; details on how the human evaluations were run are often incomplete or vague. For example, van der Lee et al. (2019) find that in a sample of NLG papers from ACL and INLG, only 55% of papers report the number of participants in their human evaluations.

In this chapter, we describe common approaches researchers take when evaluating generated text using only human judgments, grouped into intrinsic (§2.1) and extrinsic (§2.2) evaluations (Belz & Reiter, 2006). However, there are other ways to incorporate human subjects into the evaluation process, such as training models on human judgments, which will be discussed in Chapter 4.

2.1 Intrinsic Evaluation

An intrinsic evaluation asks people to evaluate the quality of generated text, either overall or along some specific dimension (e.g., fluency, coherence, correctness, etc.). This is typically done by generating several samples of text from a model and asking human evaluators to score their quality.

The simplest way to get this type of evaluation is to show the evaluators the generated texts one at a time and have them judge their quality individually. They are asked to vote whether the text is good or bad, or to make more fine-grained decisions by marking the quality along a Likert or sliding scale (see Figure 2.1(a)). However, judgments in this format can be inconsistent and comparing these results is not straightforward; Amidei et al. (2019b) find that analysis on NLG evaluations in this format is often done incorrectly or with little justification for the chosen methods.

To more directly compare a model’s output against baselines, model variants, or human-generated text, intrinsic evaluations can also be performed by having people choose which of two generated texts they prefer, or more generally, rank a set of generated texts. This comparative approach has been found to produce higher inter-annotator agreement (Callison-Burch et al., 2007) in some cases. However, while it captures models’ relative quality, it does not give a sense of the absolute quality of the generated text. One way to address this is to use a method like RankME (Novikova et al., 2018), which adds magnitude estimation (Bard et al., 1996) to the ranking task, asking evaluators to indicate how much better their chosen text is over the alternative(s) (see Figure 2.1(b)). Comparison-based approaches can become prohibitively costly (by requiring lots of head-to-head comparisons) or complex (by requiring participants to rank long lists of output) when there are many models to compare, though there are methods to help in these cases. For example, best-worst scaling (Louviere et al., 2015) has been used in NLG tasks (Kiritchenko & Mohammad, 2016; Koncel-Kedziorski et al., 2019) to simplify comparative evaluations; best-worst scaling asks participants to choose the best and worst elements from a set of candidates, a simpler task than fully ranking the set that still provides reliable results.

Figure 2.1: Two different methods for obtaining intrinsic evaluations of text generated from a meaning representation: (a) a Likert-scale question; (b) a RankME-style question. Image source: Novikova et al. (2018), https://github.com/jeknov/RankME
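To make the best-worst scaling discussion above concrete, a commonly used way to aggregate such judgments into system-level scores is the proportion of times an output is chosen as best minus the proportion of times it is chosen as worst. The sketch below illustrates this scoring; the function name and data layout are hypothetical and not taken from the survey.

```python
from collections import defaultdict

# A minimal sketch (not from the survey) of aggregating best-worst scaling
# judgments into per-system scores, assuming each judgment records which
# systems' outputs were shown and which were picked as best and worst.
def best_worst_scores(judgments):
    """judgments: iterable of (shown_systems, best_system, worst_system)."""
    best, worst, shown = defaultdict(int), defaultdict(int), defaultdict(int)
    for systems, best_sys, worst_sys in judgments:
        for s in systems:
            shown[s] += 1
        best[best_sys] += 1
        worst[worst_sys] += 1
    # Score in [-1, 1]: +1 if always judged best, -1 if always judged worst.
    return {s: (best[s] - worst[s]) / shown[s] for s in shown}

print(best_worst_scores([
    (["A", "B", "C"], "A", "C"),
    (["A", "B", "C"], "B", "C"),
]))  # {'A': 0.5, 'B': 0.5, 'C': -1.0}
```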

Almost all the text generation tasks today are evaluated with intrinsic human evaluations. Machine translation is one of the text generation tasks in which intrinsic human evaluations have made a huge impact on the development of more reliable and accurate translation systems, as automatic metrics are validated through correlation with human judgments. The metric most commonly used by humans to judge translated output is adequacy, which is defined by the Linguistic Data Consortium as “how much of the meaning expressed in the gold-standard translation or source is also expressed in the target translation.” The annotators must be bilingual in both the source and target languages in order to judge whether the information is preserved across translation. Another dimension of text quality commonly considered in machine translation is fluency, which measures the quality of the generated text only (e.g., the target translated sentence), without taking the source into account. It accounts for criteria such as grammar, spelling, choice of words, and style. A typical scale used to measure fluency is based on the question “Is the language in the output fluent?”. Fluency is also adopted in several text generation tasks including document summarization (Celikyilmaz et al., 2018; Narayan et al., 2018), recipe generation (Bosselut et al., 2018), image captioning (Lan et al., 2017), video description generation (Park et al., 2018), and question generation (Du et al., 2017), to name a few.


While fluency and adequacy have become standard dimensions of human evaluation for machine translation, not all text generation tasks have an established set of dimensions that researchers use. Nevertheless, there are several dimensions that are common in human evaluations for generated text. As with adequacy, many of these dimensions focus on the contents of the generated text. Factuality is important in tasks that require the generated text to accurately reflect facts described in the context. For example, in tasks like data-to-text generation or summarization, the information in the output should not contradict the information in the input data table or news article. This is a challenge for many neural NLG models, which are known to “hallucinate” information (Holtzman et al., 2020; Welleck et al., 2019); Maynez et al. (2020) find that over 70% of generated single-sentence summaries contained hallucinations, a finding that held across several different modeling approaches. Even if there is no explicit set of facts to adhere to, researchers may want to know how well the generated text follows rules of commonsense or how logical it is. For generation tasks that involve extending a text, researchers may ask evaluators to gauge the coherence or consistency of a text—how well it fits the provided context. For example, in story generation, do the same characters appear throughout the generated text, and does the sequence of actions make sense given the plot so far?
Other dimensions focus not on what the generated text is saying, but on how it is said. As with fluency, these dimensions can often be evaluated without showing evaluators any context. This can be something as basic as checking for simple language errors by asking evaluators to rate how grammatical the generated text is. It can also involve asking about the overall style, formality, or tone of the generated text, which is particularly important in style-transfer tasks or in multi-task settings. Hashimoto et al. (2019) ask evaluators about the typicality of generated text; in other words, how often do you expect to see text that looks like this? These dimensions may also focus on how efficiently the generated text communicates its point by asking evaluators how repetitive or redundant it is.
Note that while these dimensions are common, they may be referred to by other names, explained to evaluators in different terms, or measured in different ways (van der Lee et al., 2019). More consistency in how user evaluations are run, especially for well-defined generation tasks, would be useful for producing comparable results and for focusing efforts on improving performance in a given generation task. One way to enforce this consistency is by handing over the task of human evaluation from the individual researchers to an evaluation platform, usually run by people hosting a shared task or leaderboard. In this setting, researchers submit their models or model outputs to the evaluation platform, which organizes and runs all the human evaluations. For example, ChatEval is an evaluation platform for open-domain chatbots based on both human and automatic metrics (Sedoc et al., 2019), and TuringAdvice (Zellers et al., 2020) tests models’ language understanding capabilities by having people read and rate the models’ ability to generate advice. Of course, as with all leaderboards and evaluation platforms, with uniformity and consistency come rigidity and the possibility of overfitting to the wrong objectives. Thus, efforts to standardize human evaluations should take this into account. A person’s goal when producing text can be nuanced and diverse, and the ways of evaluating text should reflect that.

2.2 Extrinsic Evaluation

An extrinsic evaluation has people evaluate a system’s performance on the task for which it was designed. Extrinsic evaluations are the most meaningful form of evaluation, as they show how a system actually performs in a downstream task, but they can also be expensive and difficult to run (Reiter & Belz, 2009a). For this reason, intrinsic evaluations are more common than extrinsic evaluations (Gkatzia & Mahamood, 2015; van der Lee et al., 2019) and have become increasingly so, which van der Lee et al. (2019) attribute to a recent shift in focus on NLG subtasks rather than full systems.
Extrinsic methods measure how successful the system is in a downstream task. This success can be measured from two different perspectives: a user’s success in a task and the system’s success in fulfilling its purpose (Hastie & Belz, 2014b). Extrinsic methods that measure a user’s success at a task look at what the user is able to take away from the system, e.g., improved decision making, higher comprehension accuracy, etc. (Gkatzia & Mahamood, 2015). For example, Young (1999), which Reiter & Belz (2009a) point to as one of the first examples of extrinsic evaluation of generated text, evaluates automatically generated instructions by the number of mistakes subjects made when they followed them. System success extrinsic evaluations, on the other hand, measure an NLG system’s ability to complete the task for which it has been designed. For example, Reiter et al. (2003) generate personalized smoking cessation letters and report how many recipients actually gave up smoking.
Extrinsic human evaluations are commonly used in evaluating the performance of dialog systems (Deriu et al., 2019) and have made an impact on the development of dialog modeling systems. Various approaches have been used to measure the system’s performance when talking to people, such as measuring the conversation length or asking people to rate the system. The feedback is collected by real users of the dialog system (Black et al., 2011; Lamel et al., 2000; Zhou et al., 2020) at the end of the conversation. The Alexa Prize follows a similar strategy by letting real users interact with operational systems and gathering user feedback over a span of several months. However, the most commonly used form of human evaluation of dialog systems is still via crowd-sourcing platforms such as Amazon Mechanical Turk (AMT) (Serban et al., 2016a; Peng et al., 2020; Li et al., 2020; Zhou et al., 2020). Jurčíček et al. (2011) suggest that using enough crowd-sourced users can yield a good quality metric, which is also comparable to human evaluations in which subjects interact with the system and evaluate afterwards.

2.3 The Evaluators

For many NLG evaluation tasks, no specific expertise is required of the evaluators other than a proficiency in the language of the generated text. This is especially true when fluency-related aspects of the generated text are the focus of the evaluation. Often, the target audience of an NLG system is broad, e.g., a summarization system may want to generate text for anyone who is interested in reading news articles or a chatbot needs to carry a conversation with anyone who could access it. In these cases, human evaluations benefit from being performed on as wide a population as possible.
Typically, evaluations in these settings are performed either in person or online. An in-person evaluation could simply be performed by the authors or a group of evaluators recruited by the researchers to come to the lab and participate in the study. The benefits of in-person evaluation are that it is easier to train and interact with participants, and that it is easier to get detailed feedback about the study and adapt it as needed. Researchers also have more certainty and control over who is participating in their study, which is especially important when trying to work with a more targeted set of evaluators. However, in-person studies can also be expensive and time-consuming to run. For these reasons, in-person evaluations tend to include fewer participants, and the set of people in proximity to the research group may not accurately reflect the full set of potential users of the system. In-person evaluations may also be more susceptible to response biases, with participants adjusting their decisions to match what they believe to be the researchers’ preferences or expectations (Nichols & Maner, 2008; Orne, 1962).
To mitigate some of the drawbacks of in-person studies, online evaluations of generated texts have become increasingly popular. While researchers could independently recruit participants online to work on their tasks, it is common to use crowdsourcing platforms that have their own users whom researchers can recruit to participate in their task, either by paying them a fee (e.g., Amazon Mechanical Turk) or rewarding them by some other means (e.g., LabintheWild, which provides participants with personalized feedback or information based on their task results). These platforms allow researchers to perform large-scale evaluations in a time-efficient manner, and they are usually less expensive (or even free) to run. They also allow researchers to reach a wider range of evaluators than they would be able to recruit in person (e.g., more geographical diversity). However, maintaining quality control online can be an issue (Ipeirotis et al., 2010; Oppenheimer et al., 2009), and the demographics of the evaluators may be heavily skewed depending on the user base of the platform (Difallah et al., 2018; Reinecke & Gajos, 2015). Furthermore, there may be a disconnect between what online evaluators being paid to complete a task would want out of an NLG system and what the people who would actually use the end product would want.
Not all NLG evaluation tasks can be performed by any subset of speakers of a given language. Some tasks may not transfer well to platforms like Amazon Mechanical Turk where workers are more accustomed to dealing with large batches of microtasks. Specialized groups of evaluators can be useful when testing a system for a particular set of users, as in extrinsic evaluation settings. Researchers can recruit people who would be potential users of the system, e.g., students for educational tools or doctors for bioNLP systems. Other cases that may require more specialized human evaluation are projects where evaluator expertise is important for the task or when the source texts or the generated texts consist of long documents or a collection of documents. Consider the task of citation generation (Luu et al., 2020): given two scientific documents A and B, the task is to generate a sentence in document A that appropriately cites document B. To rate the generated citations, the evaluator must be able to read and understand two different scientific documents and have general expert knowledge about the style and conventions of academic writing. For these reasons, Luu et al. (2020) choose to run human evaluations with expert annotators (in this case, NLP researchers) rather than regular crowdworkers.

2.4 Inter-Evaluator Agreement

While evaluators often undergo training to standardize their evaluations, evaluating generated natural language will always include some degree of subjectivity. Evaluators may disagree in their ratings, and the level of disagreement can be a useful measure to researchers. High levels of inter-evaluator agreement generally mean that the task is well-defined and the differences in the generated text are consistently noticeable to evaluators, while low agreement can indicate a poorly defined task or that there are not reliable differences in the generated text.
Nevertheless, measures of inter-evaluator agreement are not frequently included in NLG papers. Only 18% of the 135 generation papers reviewed in Amidei et al. (2019a) include agreement analysis (though on a positive note, it was more common in the most recent papers they studied). When agreement measures are included, agreement is usually low in generated text evaluation tasks, lower than what is typically considered “acceptable” on most agreement scales (Amidei et al., 2018, 2019a). However, as Amidei et al. (2018) point out, given the richness and variety of natural language, pushing for the highest possible inter-annotator agreement may not be the right choice when it comes to NLG evaluation.

While there are many ways to capture the agreement between annotators (Banerjee et al., 1999), we highlight the most common approaches used in NLG evaluation. For an in-depth look at annotator agreement measures in natural language processing, refer to Artstein & Poesio (2008).

2.4.1 Percent agreement

A simple way to measure agreement is to report the percent of cases in which the evaluators agree with each other. If you are evaluating a set of generated texts $X$ by having people assign a score to each text $x_{i},$ then let $a_{i}$ be the agreement in the scores for $x_{i}$ (where $a_{i}=1$ if the evaluators agree and $a_{i}=0$ if they don't). Then the percent agreement for the task is:

$$ P_{a}=\frac{\sum_{i=0}^{|X|} a_{i}}{|X|} $$ (2.1)

So $P_a = 0$ means the evaluators did not agree on their scores for any generated text, while $P_a = 1$ means they agreed on all of them.

However, while this is a common way people evaluate agreement in NLG evaluations (Amidei et al., 2019a), it does not take into account the fact that the evaluators may agree purely by chance, particularly in cases where the number of scoring categories is low or some scoring categories are much more likely than others (Artstein & Poesio, 2008). We need a more complex agreement measure to capture this.
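As an illustration, a minimal Python sketch of Equation 2.1 for the two-evaluator case might look as follows (the list names are hypothetical):

```python
# A minimal sketch of percent agreement (Equation 2.1) for two evaluators;
# scores_1[i] and scores_2[i] are the scores each evaluator gave to text x_i.
def percent_agreement(scores_1, scores_2):
    agreements = sum(1 for s1, s2 in zip(scores_1, scores_2) if s1 == s2)
    return agreements / len(scores_1)

# Evaluators agree on 3 of 4 texts -> P_a = 0.75
print(percent_agreement([1, 0, 1, 1], [1, 0, 0, 1]))
```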

2.4.2 Cohen’s $κ$

Cohen’s $κ$ (Cohen, 1960) is an agreement measure that accounts for the possibility that evaluators agree purely by chance. In addition to $P_a$, we now consider $P_c$, the probability that the evaluators agree by chance. So, for example, if two evaluators ($e_1$ and $e_2$) are scoring texts $X$ with a score from the set $S$, then $P_c$ is the probability that they both give a text the same score:

$$ P_c = \sum_{s\in S} P(s|e_1) \cdot P(s|e_2) $$ (2.2)

For Cohen’s $κ$, $P(s|e_i)$ is estimated using the frequency with which evaluator $e_i$ assigned each of the scores across the task. So, for example, if there are two scores, $0$ and $1$, and $e_1$ assigns $6$ scores as $0$s and $4$ scores as $1$s, and $e_2$ assigns $5$ $0$s and $5$ $1$s, then $P_c = 0.6 \cdot 0.5 + 0.4 \cdot 0.5 = 0.5$.

Once we have both $P_a$ and $P_c$, Cohen’s $κ$ can then be calculated as:

$$ \kappa = \frac{P_a - P_c}{1 - P_c} $$ (2.3)
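Putting the pieces together, a minimal sketch of Cohen’s κ for two evaluators might look as follows (the input lists are hypothetical, one score per text):

```python
from collections import Counter

# A minimal sketch of Cohen's kappa for two evaluators, following
# Equations 2.1-2.3; scores_1 and scores_2 are hypothetical score lists.
def cohens_kappa(scores_1, scores_2):
    n = len(scores_1)
    # Observed agreement P_a (Equation 2.1).
    p_a = sum(1 for s1, s2 in zip(scores_1, scores_2) if s1 == s2) / n
    # Chance agreement P_c from each evaluator's score frequencies (Equation 2.2).
    freq_1, freq_2 = Counter(scores_1), Counter(scores_2)
    p_c = sum((freq_1[s] / n) * (freq_2[s] / n)
              for s in set(scores_1) | set(scores_2))
    return (p_a - p_c) / (1 - p_c)  # Equation 2.3

# Example with the score frequencies from the text (6/4 vs. 5/5 zeros/ones),
# so P_c = 0.5; here the evaluators happen to agree on 9 of 10 texts.
print(cohens_kappa([0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
                   [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]))  # 0.8
```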

2.4.3 Fleiss’ $κ$

As seen in Equation 2.2, Cohen’s $κ$ measures the agreement between two annotators, but often many evaluators have scored the generated texts, particularly in tasks that are run on crowdsourcing platforms. Fleiss’ κ (Fleiss, 1971) can measure agreement between multiple evaluators. This is done by still looking at how often pairs of evaluators agree, but now considering all possible pairs of evaluators. So now $a_i$, which we defined earlier to be the agreement in the scores for a generated text $x_i$, is calculated across all evaluator pairs:

$$ a_i = \frac{\sum_{s\in S} \#\text{ of evaluator pairs who score } x_i \text{ as } s}{\text{total } \#\text{ of evaluator pairs}} $$

Then we can once again define $P_a$, the overall agreement probability, as it is defined in Equation 2.1—the average agreement across all the texts. To calculate $P_c$, we estimate the probability of a judgment $P(s|e_i)$ by the frequency of the score across all annotators, assuming each annotator is equally likely to draw randomly from this distribution. So if $r_s$ is the proportion of judgments that assigned a score $s$, then the likelihood of two annotators assigning score $s$ by chance is $r_s \cdot r_s = r_s^2$. Then our overall probability of chance agreement is:

$$P_c = \sum _{s \in S} {r_s^2}$$

With these values for $P_a$ and $P_c$, we can use Equation 2.3 to calculate Fleiss’ $κ$.
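A minimal sketch of this computation, assuming every text is rated by the same number of evaluators and the input is a hypothetical count matrix (rows are texts, columns are scores):

```python
import numpy as np

# A minimal sketch of Fleiss' kappa; ratings[i][j] is the number of
# evaluators who assigned the j-th score to text x_i (the same number of
# evaluators per text is assumed).
def fleiss_kappa(ratings):
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings[0].sum()
    # Per-text agreement a_i: proportion of evaluator pairs that agree.
    a = (ratings * (ratings - 1)).sum(axis=1) / (n_raters * (n_raters - 1))
    p_a = a.mean()
    # Chance agreement P_c from the overall score proportions r_s.
    r = ratings.sum(axis=0) / ratings.sum()
    p_c = (r ** 2).sum()
    return (p_a - p_c) / (1 - p_c)

# Three texts, four evaluators, two possible scores (columns: score 0, score 1).
print(fleiss_kappa([[4, 0], [2, 2], [0, 4]]))  # ~0.556
```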

2.4.4 Krippendorff’s $\alpha$

Each of the above measures treats all evaluator disagreements as equally bad, but in some cases, we may wish to penalize some disagreements more harshly than others. Krippendorff’s $\alpha$ (Krippendorff, 1970), which is technically a measure of evaluator disagreement rather than agreement, allows different levels of disagreement to be taken into account.

Like the $κ$ measures above, we again use the frequency of evaluator agreements and the odds of them agreeing by chance. However, we will now state everything in terms of disagreement. First, we find the probability of disagreement across all the different possible score pairs $(s_m, s_n)$, which are weighted by whatever value $w_{m,n}$ we assign the pair. So:

$$ P_d=\sum_{m=0}^{|S|}\sum_{n=0}^{|S|}w_{m,n}\sum_{i=0}^{|X|}\frac{\#\text{ of evaluator pairs that assign } x_i \text{ as } (s_m, s_n)}{\text{total }\#\text{ of evaluator pairs}} $$

(Note that when $m = n$, i.e., the pair of annotators agree, $w_{m,n}$ should be 0.)

Next, to calculate the expected disagreement, we make a similar assumption as in Fleiss’ $κ$: the random likelihood of an evaluator assigning a score $s_i$ can be estimated from the overall frequency of $s_i$. If $r_{m,n}$ is the proportion of all evaluation pairs that assign scores $s_m$ and $s_n$, then we can treat it as the probability of two evaluators assigning scores $s_m$ and $s_n$ to a generated text at random. So $P_c$ is now:

$$P_c= \sum_{m=0}^{|S|}\sum_{n=0}^{|S|}w_{m,n}r_{m,n} $$

Finally, we can calculate Krippendorff’s $\alpha$ as: $$ \alpha= 1 - \frac{P_d}{P_c} $$
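For the simplest nominal-data case (all disagreements weighted equally, i.e., $w_{m,n} = 1$ whenever $m \neq n$), a minimal sketch of this pairwise formulation might look as follows. It assumes every evaluator rates every text and, for brevity, omits the small-sample correction used in Krippendorff’s exact formulation.

```python
import itertools
from collections import Counter

# A minimal sketch of Krippendorff's alpha for nominal data, using the
# pairwise-disagreement form described above; item_scores[i] holds all
# evaluators' scores for text x_i.
def krippendorff_alpha_nominal(item_scores):
    # Observed disagreement P_d: fraction of within-text evaluator pairs
    # that assign different scores.
    pairs = [p for scores in item_scores
             for p in itertools.combinations(scores, 2)]
    p_d = sum(1 for a, b in pairs if a != b) / len(pairs)
    # Expected disagreement P_c: chance that two scores drawn from the
    # pooled score distribution differ.
    pooled = [s for scores in item_scores for s in scores]
    freq = Counter(pooled)
    p_c = 1.0 - sum((c / len(pooled)) ** 2 for c in freq.values())
    return 1.0 - p_d / p_c

# Three evaluators scoring four texts on a 1-3 scale.
print(krippendorff_alpha_nominal([[1, 1, 1], [2, 2, 3], [3, 3, 3], [1, 2, 1]]))
```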

Chapter 3

Untrained Automatic Evaluation Metrics

With the growing number of NLG applications and their benchmark datasets, the evaluation of NLG systems has become increasingly important. Today, the best evaluation for automatic NLG system output is human-based evaluation. However, human evaluation is costly and time-consuming to design and run, and more importantly, the results are not always repeatable (Belz & Reiter, 2006). Thus, automatic evaluation metrics are employed as an alternative, both in developing new models and in comparing them against the state of the art. In this survey, we group automatic metrics into two categories: untrained automatic metrics that do not require training (this chapter), and machine-learned evaluation metrics that are based on machine-learned models (Chapter 4).

In this chapter we review untrained automatic metrics used in different NLG applications and discuss their advantages and drawbacks in comparison with other approaches. Untrained automatic metrics for NLG evaluation are used to measure the effectiveness of the models that generate text, such as in machine translation, image captioning, or question generation. These metrics compute a score that indicates the similarity (or dissimilarity) between an automatically generated text and a human-written reference (gold standard) text. Untrained automatic evaluation metrics are fast and efficient and are widely used to quantify day-to-day progress of model development, e.g., comparing model training with different hyperparameters. We group the untrained automatic evaluation methods, as in Table 3.1, into five categories:

  • n-gram overlap metrics
  • distance-based metrics
  • diversity metrics
  • content overlap metrics
  • grammatical feature based metrics

3.1 n-gram Overlap Metrics for Content Selection

n-gram overlap metrics are commonly used for evaluating NLG systems and measure the degree of “matching” between machine-generated and human-authored (ground-truth) texts. In this section we present several n-gram match features and the NLG tasks they are used to evaluate.

3.1.1 F-SCORE ($F_1$)

The F-SCORE, also called the F1-score or F-measure, is a measure of accuracy that balances the generated text’s precision and recall by taking their harmonic mean. It is defined as:

$$ F_1 = 2 * \frac{{Precision} \cdot {Recall}}{{Precision} + {Recall}}$$

Precision, also called the positive predictive value, is the fraction of n-grams in the model-generated (hypothesis) text that are present in the reference (human or gold-standard) text. Recall, also called sensitivity, is the fraction of n-grams in the reference text that are present in the hypothesis text. The F-SCORE reaches its best value, 1, with perfect precision and recall, and its worst value, 0, when the hypothesis and reference share no n-grams.
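As an illustration of how such an n-gram F-score can be computed against a single reference, here is a minimal sketch (the helper names are hypothetical; it assumes whitespace tokenization and unigrams by default):

```python
from collections import Counter

# A minimal sketch of n-gram precision, recall, and F1 between a hypothesis
# and a single reference text.
def ngrams(tokens, n=1):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_f1(hypothesis, reference, n=1):
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clipped overlap: each n-gram matches at most as many times as it
    # appears in the other text.
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(ngram_f1("the cat sat on the mat", "the cat is on the mat"))  # ~0.833
```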

References

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao (2020). “Evaluation of Text Generation: A Survey.” arXiv:2006.14799.