2022 WhyisConstrainedNeuralLanguageG

From GM-RKB

Subject Headings: Constrained Natural Language Generation.

Notes

Cited By

Quotes

Abstract

Recent advances in deep neural language models combined with the capacity of large scale datasets have accelerated the development of natural language generation systems that produce fluent and coherent texts (to various degrees of success) in a multitude of tasks and application contexts. However, controlling the output of these models for desired user and task needs is still an open challenge. This is crucial not only to customizing the content and style of the generated language, but also to their safe and reliable deployment in the real world. We present an extensive survey on the emerging topic of constrained neural language generation in which we formally define and categorize the problems of natural language generation by distinguishing between conditions and constraints (the latter being testable conditions on the output text instead of the input), present constrained text generation tasks, and review existing methods and evaluation metrics for constrained text generation. Our aim is to highlight recent progress and trends in this emerging field, informing on the most promising directions and limitations towards advancing the state-of-the-art of constrained neural language generation research.

I. INTRODUCTION

Recent advances in the field of natural language generation (NLG) [36] have resulted in models able to produce realistic, coherent, and fluent texts in a multitude of natural language processing tasks. Powerful large scale language models can be readily used to perform unconditional language generation; however, these models provide little control over attributes of the generated texts. Unlike conventional methods, which were able to provide fine-grained control over many aspects of the system output, including incorporating domain-specific dictionaries, terminology or certain words in the generated output, neural end-to-end approaches remove many of these knobs and switches [120]. However, imposing constraints on the output generated by these models is crucial for achieving useful and safe language generation in a multitude of real world application scenarios. For example, it can help avoid generic and meaningless responses in dialogue systems [131], personalize dialogue agents based on user features that lead to more engaging and meaningful conversations [164], ensure non-offensive sentence completion and friendly communication [92], intervene on the system output in interactive scenarios where domain-specific terminology must be included in the generated texts [21], or aid in art creation in applications such as poetry generation or assisted story writing [118]. Moreover, controlling a generic pretrained language model in order to satisfy certain desiderata helps avoid generating toxic content, prevents demographic biases, can steer generations towards a desired topic or style [66], and helps communicate intentions in suitable manners for different situations, target audiences and environments [77], [85].
Incorporating prior knowledge and target side constraints in text generative models has numerous applications in many natural language processing areas, including dialogue systems, machine translation, question answering, text summarization, text simplification, image captioning, etc. Unquestionably, constrained text generation is important in many real-world applications, but compared to other instances of natural language generation, constrained text generation using neural networks remains an open challenge. We identify the following reasons that explain why constrained neural text generation represents a much harder problem compared to other instances of neural text generation:

i) lack of model expressiveness: current models are not expressive enough to incorporate arbitrary constraints, defined as testable conditions on the output text, into the objective function at training time; ii) lack of suitable evaluation metrics: while one can verify whether an output satisfies a constraint or not, it is usually hard to measure to what extent an output satisfies a constraint, and it is even harder to jointly evaluate this with other properties of the generated text (such as relevance or coherence); iii) difficulty in constrained optimization: even if constraints can be expressed and added to the objective function, they are usually non-differentiable, especially at the token level. This is problematic because most methods model and generate text as a sequence of tokens; iv) lack of constrained text generation datasets that are diverse and representative enough of the variety of practical constraints.

For example, commonly used sequential text generation methods and architectures assume a rigid modeling of the output sequence based on an ordering of words, in which tokens are generated progressively one at a time in a standard left-to-right manner [15]. Such autoregressive models cannot easily express constraints at arbitrary positions in the generated sequence or satisfy constraints involving multiple input objects. In addition to these issues, it is generally more challenging to incorporate multiple and heterogeneous constraints, which conform to given rules, topics, sentiments, lexical constraints, or pre-defined stylistic and content attributes.

Our work focuses on the emerging problem of neural natural language generation with constraints. We first define the problem and differentiate between the ambiguous use of conditions and constraints in natural language generation, including examples that represent instantiations of the constrained neural text generation problem. We then survey approaches, learning methodologies and model architectures employed for generating texts with desirable attributes, and corresponding evaluation metrics. We conclude with open research problems and limitations of current models. The scope of our work is to draw clear boundaries between the confusing terminology used in the neural language generation literature, highlight the main approaches and discuss how they suffer from the general challenges of constrained text generation, and serve as an informative guide and an advocate for solving these general challenges and advancing meaningful, useful, and safe constrained NLG research.

II. PROBLEM DEFINITIONS

We formally define the problem of natural language generation, accounting for context, conditions, and constraints placed on text generative models. First, we aim to articulate the key difference between condition and constraint since the distinction between these concepts is rather blurred in the natural language processing literature. Given a text generation task defined as g(X) → X′, we define condition as a testable statement of the input X, and constraint as a testable statement of the output X′.
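The distinction above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: `toy_generator`, `is_question`, and `at_most_five_words` are hypothetical names that do not come from the survey, and the generator is a stub standing in for g. The point is that a condition is a predicate evaluated on the input X before generation, whereas a constraint is a predicate evaluated on the output X′ after generation.

```python
from typing import Callable, Tuple

Predicate = Callable[[str], bool]


def is_question(x: str) -> bool:
    """Condition: a testable statement about the input X."""
    return x.strip().endswith("?")


def at_most_five_words(x_prime: str) -> bool:
    """Constraint: a testable statement about the output X'."""
    return len(x_prime.split()) <= 5


def toy_generator(x: str) -> str:
    """Hypothetical stand-in for g; a real system would be a neural model."""
    return "Paris is the capital"


def generate_and_check(
    g: Callable[[str], str], x: str, condition: Predicate, constraint: Predicate
) -> Tuple[str, bool]:
    """Apply g to x, testing the condition on X and the constraint on X'."""
    if not condition(x):
        raise ValueError("input does not satisfy the condition")
    x_prime = g(x)
    return x_prime, constraint(x_prime)


result = generate_and_check(
    toy_generator, "What is the capital of France?", is_question, at_most_five_words
)
```

The condition can be checked before the model is ever run, while the constraint can only be tested once an output exists, which is the asymmetry the rest of the section builds on.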

Accounting for the distinction above, we divide the text generation problem into three categories: i) generic or free-text generation which we present in Section II-A, ii) conditional text generation which we introduce in Section II-B, and iii) constrained text generation which we outline in Section II-C. The focus of our work is on the particular problem of constrained text-to-text generation, leaving aside text generation tasks from other types of inputs such as data-to-text generation or image-to-text generation which are conditional in nature according to our definitions.

A. Generic / Free-Text Generation

The problem of generic text generation considers the intrinsic history of words generated until the current timestep in the sequence as context, and does not place any external user-defined conditions or constraints on the model output.

Given a discrete sequence of text tokens $x = (x_1, x_2, \ldots, x_n)$ as input, where each $x_i$ is drawn from a fixed set of symbols, generic text generation aims to learn the unconditional probability distribution $p(x)$ of sequence $x$. This distribution can be auto-regressively factorized using the chain rule of probability [7] into a product of conditional probabilities $p(x) = \prod_{i=1}^{n} p(x_i \mid x_{<i})$ to perform density estimation and generation of language data. When $p(x)$ is modeled by a neural network with parameters $\theta$, the neural network is trained to minimize the negative log-likelihood $L(D) = -\sum_{k=1}^{|D|} \log p_\theta(x_i^k \mid x_{<i}^k)$ over a collection of samples $D = \{x^1, \ldots, x^{|D|}\}$. To generate new samples, each token $x_i$ is iteratively sampled from $p_\theta(x_i \mid x_{<i})$ and is fed back into the model as the input for the next timestep.

Large scale models for generic text generation show promising abilities to imitate the distribution of natural language and generate long-term realistic and coherent texts; however, such free-text generation models place a lot of burden on the generative model to capture complex semantic and structural features underlying the data distribution; this can often result in repetitive, contradictory, and largely randomized generated texts [52]. Notably, the content generated by free-text generative models cannot be controlled with respect to particular attributes and modes of the data distribution. This inability to control which regions of the data distribution are generated is particularly problematic considering there is significant toxicity, hate, bias, and negativity present in the large-scale web crawled datasets text generation models are commonly trained on. Imposing conditions or constraints on the generation process results in safer and more useful generated texts for downstream application tasks [73].

B. Conditional Text Generation

Conditional text generation manipulates attributes of the generated content depending on specific contexts or user needs, and allows the data generation process to focus on specific modes of the data. Conditioning the generative model on additional information makes it possible to generate texts which satisfy given input conditions and meet desired attributes. In the literature conditional text generation is sometimes referred to as context-dependent text generation. While the word context may carry different semantics for different readers, in this survey we consider as context only attributes which are inherently external to the model itself; model-intrinsic attributes, such as the history of past generated words, are already included in the formulation of generic text generation. For example, context attributes used for conditioning generated texts are the source sentence in machine translation, the conversational history in dialogue systems, the input document in text summarization and text simplification, the input question in question answering systems, or contextual information such as product, time, and location in review generation.

Conditional text generation models add a contextual variable or attribute code $c$ to the probabilistic model $p(x)$, transforming it into a conditional probability model $p(x \mid c)$, which can be auto-regressively decomposed using the chain rule of probability $p(x \mid c) = \prod_{i=1}^{n} p(x_i \mid x_{<i}, c)$. When $p(x \mid c)$ is modeled by a neural network with parameters $\theta$, the model minimizes the negative log-likelihood loss function accounting for the attribute code $c$: $L(D) = -\sum_{k=1}^{|D|} \log p_\theta(x_i^k \mid x_{<i}^k, c^k)$. Besides generation, conditional models can also be used as generative classifiers to compute $p(c \mid x_{<i})$ by applying Bayes rule.

C. Constrained Text Generation

The problem of constrained text generation focuses on generating coherent and logical texts that do (not) cover lexical concepts (e.g., pre-defined nouns, verbs, entities, phrases or sentence fragments) desired to be (not) present in the output, as well as generating outputs that abide by specific format, semantic, syntactic or utility rules to reflect the particular interests of the system user. Constraints impose restrictions on the generative model that must be satisfied by any solution to the optimization problem and their fulfillment can be tested accordingly. In the literature the distinction between conditional, controlled, and constrained text generation is not clearly defined, and these terms are often used interchangeably. In fact, the first work that proposed generating constrained text actually refers to the task as “controlled” generation [55].

In what follows we formally define the problem of constrained text generation. Let us consider we are (optionally) given an unordered or ordered set of $n$ concepts $x = \{c_1, c_2, \ldots, c_n\} \in \mathcal{X}$, where $\mathcal{X}$ denotes the space of all concepts, and $c_i \in C$ is a concept belonging to the concept vocabulary $C$. In addition, let us assume we are also (optionally) given a set of $m$ rules $y = \{y_1, y_2, \ldots, y_m\} \in \mathcal{Y}$, with $y_i \in R$, where $R$ denotes the space of all rules, and each $y_i$ is a text generation constraint expressed in logical form. We formulate constrained text generation as learning the structured predictive function $f : \mathcal{X} \cup \mathcal{Y} \to \mathcal{Z}$, where $\mathcal{X} \cup \mathcal{Y} \neq \emptyset$, which maps a set of concepts and/or constraint rules to a generated sentence. Therefore, constrained text generation methods impose constraints on the generated sentences and produce output in the form of a grammatical sentence $z \in \mathcal{Z}$ which contains all concepts present in $x$ and satisfies all constraint rules specified in $y$. The probability $p(z \mid f)$ can still be modeled autoregressively, $p(z \mid f) = \prod_{i=1}^{n} p(z_i \mid z_{<i}, f)$; when $p(z \mid f)$ is modeled by a neural network with parameters $\theta$, the negative log-likelihood function can be minimized while leveraging $f$ for constraint satisfaction: $L(D) = -\sum_{k=1}^{|D|} \log p_\theta(z_i^k \mid z_{<i}^k, f)$.

The matching function f manipulates the probability distribution and indicates to which extent the constraints are satisfied. In the literature, constrained text generation methods can be either i) Soft-constrained (priming), when the matching function f is a soft measure of semantic similarity and only requires the generated sentences to be semantically related to the given constraints, or ii) Hard-constrained, when the matching function f is a binary indicator which rules out the possibility of generating infeasible sentences that do not meet the given constraints. Hard-constrained text generation is notably a more challenging task compared to soft-constrained text generation, and it requires designing specialized approaches and architectures to ensure the constraints are satisfied in the output sentence. In contrast, soft-constrained text generation models are usually easier to design, e.g., with the use of existing copy and attention mechanisms for soft enforcing constraints and annotated keyword-text pairs; nevertheless, some of these soft constraints are likely to be lost during generation, especially if multiple weakly correlated (lexical) constraints must be included [168].

Compared to generic text generation which assumes no conditions on input or output other than existing context, and compared to conditional text generation which places conditions on the input which can be considered at training time, constrained text generation places conditions on the output which is a considerably more difficult and challenging problem to solve. Unlike input conditions, output conditions cannot be considered at training time and their satisfaction is assessed after training has completed by sampling and inspecting the generated outputs. In addition, standard sequence generation architectures are not designed to easily accommodate or incorporate output constraints. Given the model structure itself cannot express output conditions, it becomes challenging to evaluate the extent to which constraints are satisfied by a model, objectively compare and contrast the performance of different models, and measure overall success to inform on progress in constrained natural language generation.

Due to these limitations, current methods proposed to address constrained text generation are neither satisfactory nor sufficient. The main machine learning challenge is that it is hard to evaluate the objective function for constrained text generation, and very few works have approached the problem from the prism of editing the objective function to incorporate constraints at training time. Even if constraints were to be added to the objective function itself, constrained optimization would be another challenge. In general, reinforcement learning approaches are used in the context of text generation to optimize non-differentiable reward functions computed at the token level, e.g., BLEU in machine translation or ROUGE in text summarization. However, optimizing such automatic measures that focus on local n-gram patterns often results in deteriorated textual outputs despite increased automatic scores [9], [116]. Moreover, applying reinforcement learning to text generation at the word level leads to difficulty in proper temporal credit assignment for long-term textual rewards [128]. Given that the environment provides only delayed rewards as the agent executes a sequence of actions, it is impossible to know whether the agent succeeds in achieving a task until the end of the episode, at which point the agent needs to determine which of the actions in the sequence are to be credited with producing the resulting reward [34]. Adding constraints on top of existing reinforcement learning issues would be detrimental to the learning process, if not make learning close to impossible: the objective function would be even harder to optimize, rewards would be delayed, sparse and non-informative. Despite these open problems and limitations, we argue neural constrained text generation is an important research area which deserves a lot more attention.
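The delayed and sparse reward problem described above can be illustrated with a short Python sketch. It is a hedged toy example, not a method from the survey: `terminal_reward` and `per_step_rewards` are hypothetical names, and the reward is assumed to be a binary hard lexical constraint that can only be evaluated once the whole sequence has been generated.

```python
from typing import List, Sequence


def terminal_reward(tokens: Sequence[str], required: Sequence[str]) -> float:
    """Binary reward: 1.0 only if every required word occurs in the finished text."""
    present = set(tokens)
    return 1.0 if all(word in present for word in required) else 0.0


def per_step_rewards(tokens: Sequence[str], required: Sequence[str]) -> List[float]:
    """Reward observed after each action (token) when the constraint
    is only checked at the end of the episode (an assumption of this sketch)."""
    rewards = [0.0] * len(tokens)
    if tokens:
        rewards[-1] = terminal_reward(tokens, required)
    return rewards


episode = ["the", "cat", "sat", "on", "the", "mat"]
rewards = per_step_rewards(episode, ["cat", "mat"])
failed = per_step_rewards(episode, ["cat", "dog"])
```

In this toy episode every intermediate step receives zero reward, so the learner has no direct signal about which earlier token choices made the constraint satisfiable; a failed episode is all zeros and carries even less information.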

Constrained text generation is useful in many scenarios, such as incorporating in-domain terminology in machine translation [120], improving semantic correctness [4], avoiding generic and meaningless responses in dialogue systems using grounding facts [106], paraphrase generation in monolingual text rewriting [54], [63], incorporating ground-truth text fragments (such as semantic attributes, object annotations) in image caption generation [1], creating a story [25] or poem [39] using a pre-defined set of keywords, or re-writing a user search query as a fluent sentence. Typical attributes used to generate constrained natural language are the tense and the length of the summaries in text summarization [24], the sentiment of the generated content in review generation [107], language complexity in text simplification or the style in text style transfer applications. In addition, constrained text generation is used to overcome limitations of neural text generation models for dialogue such as genericness and repetitiveness of responses [131], [134].

Nevertheless, generating text under specific lexical constraints is challenging. Common models and architectures employed for natural language generation are autoregressive in nature, generating tokens one by one in a sequential manner from left to right; by design, these models lack fine control over the generated sequence and cannot easily support constraints at arbitrary positions in the output or constraints involving multiple input objects [168], [53]. While for humans it is straightforward to generate sentences that cover a given set of concepts or abide by pre-defined rules by making use of their commonsense reasoning ability, generative commonsense reasoning with a constrained text generation task is more challenging for machine learning models [86].
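The limitation of left-to-right generation described above can be sketched in Python. This is a toy illustration under stated assumptions, not a method from the survey: the `BIGRAM` table is an invented model in which the previous token stands in for the full history, and rejection sampling is used only as a naive baseline for a hard lexical constraint. The sampler itself never sees the constraint; it is tested only after a full sequence exists.

```python
import math
import random
from typing import Dict, List, Optional

# Hypothetical toy model p(next | previous); "<s>" starts and "</s>" ends a sequence.
BIGRAM: Dict[str, Dict[str, float]] = {
    "<s>": {"the": 1.0},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 1.0},
    "dog": {"sleeps": 1.0},
    "sleeps": {"</s>": 1.0},
}


def negative_log_likelihood(tokens: List[str]) -> float:
    """-log p(x) via the chain rule, with a one-token history as a simplification."""
    nll, prev = 0.0, "<s>"
    for tok in tokens + ["</s>"]:
        nll -= math.log(BIGRAM[prev][tok])
        prev = tok
    return nll


def sample_sequence(rng: random.Random) -> List[str]:
    """Sample one token at a time, feeding each token back in as context."""
    tokens, prev = [], "<s>"
    while True:
        words = list(BIGRAM[prev].keys())
        weights = list(BIGRAM[prev].values())
        nxt = rng.choices(words, weights=weights, k=1)[0]
        if nxt == "</s>":
            return tokens
        tokens.append(nxt)
        prev = nxt


def rejection_sample(required: str, rng: random.Random, max_tries: int = 100) -> Optional[List[str]]:
    """Naive hard constraint: resample until the required word appears, or give up."""
    for _ in range(max_tries):
        candidate = sample_sequence(rng)
        if required in candidate:
            return candidate
    return None


nll = negative_log_likelihood(["the", "cat", "sleeps"])
with_dog = rejection_sample("dog", random.Random(0))
with_bird = rejection_sample("bird", random.Random(0))
```

Rejection sampling wastes every discarded sample and fails outright when the constraint has low or zero probability under the model (as with "bird" here), which is one way to see why constraints at arbitrary output positions call for more specialized approaches.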

TABLE I OVERVIEW OF CONSTRAINED NLG TASKS, DIFFERENTIATING BETWEEN CONDITIONS AND CONSTRAINTS.

| Task | Condition | Lexical constraints | Format constraints | Semantic constraints | Syntactic constraints | Utility constraints |
| --- | --- | --- | --- | --- | --- | --- |
| Machine Translation | source input | words, phrases, entities | - | topic, sentiment | paraphrase, tense, gender pronouns | target language, politeness, factuality/faithfulness |
| Dialogue Generation | past utterance(s) | words, phrases, entities | length, verbosity | topic, sentiment, toxicity | paraphrase, gender pronouns | politeness, personality traits, factuality/faithfulness |
| Text Summarization | input document(s) | words, phrases, entities | length | topic | paraphrase | factuality/faithfulness |
| Text Simplification | input text | words, phrases, entities | length | topic | paraphrase | simpler vocabulary, readability, factuality/faithfulness |
| Text Style Transfer | source text | words, phrases, entities | length | topic, sentiment | paraphrase, tense, gender pronouns | style, factuality/faithfulness |
| Question Answering | input question | words, phrases, entities | length | topic | paraphrase, tense, gender pronouns | factuality/faithfulness, politeness |
| Narrative Generation / Story telling | - | words, phrases, entities | length | topic, sentiment | paraphrase, tense, gender pronouns | readability, factuality/faithfulness, style |
| Poetry Generation | - | words, phrases, entities | length, rhyme, rhythm | topic, sentiment | paraphrase, tense, gender pronouns | readability, factuality/faithfulness, style |

III. NLG CONSTRAINTS

Natural language generation models place restrictions on the generated output to produce texts that reflect certain user preferences. In Table I we present NLG tasks distinguishing between conditions and constraints. We broadly group existing constraints into the following categories:

a) Lexical constraints: Lexical constraints enforce the inclusion of specific keywords, phrases or entities at arbitrary positions in the output, and can be specified as a word (a single token) or phrasal constraint (a multi-word phrase). They are useful in tasks such as dialogue generation, machine translation, story telling or poetry generation.
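As a rough illustration of how word and phrasal constraints can be tested on an output, consider the following Python sketch. The function names are hypothetical, and tokenization is assumed to be lowercased whitespace splitting, which ignores punctuation and morphology; real systems would need something more careful. The `lexical_coverage` helper additionally reports the fraction of constraints met, a crude proxy for the degree of satisfaction.

```python
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Simplifying assumption: lowercase and split on whitespace."""
    return text.lower().split()


def contains_phrase(tokens: Sequence[str], phrase: Sequence[str]) -> bool:
    """True if `phrase` occurs as a contiguous run anywhere in `tokens`."""
    n = len(phrase)
    if n == 0:
        return True
    return any(list(tokens[i:i + n]) == list(phrase) for i in range(len(tokens) - n + 1))


def satisfies_lexical_constraints(output: str, constraints: Sequence[str]) -> bool:
    """Hard check: every word or multi-word phrase must appear in the output."""
    tokens = tokenize(output)
    return all(contains_phrase(tokens, tokenize(c)) for c in constraints)


def lexical_coverage(output: str, constraints: Sequence[str]) -> float:
    """Fraction of constraints present in the output (1.0 if there are none)."""
    if not constraints:
        return 1.0
    tokens = tokenize(output)
    hits = sum(contains_phrase(tokens, tokenize(c)) for c in constraints)
    return hits / len(constraints)


ok = satisfies_lexical_constraints(
    "The neural network translates text", ["neural network", "text"]
)
wrong_order = satisfies_lexical_constraints(
    "The neural network translates text", ["network neural"]
)
coverage = lexical_coverage("the cat sat", ["cat", "dog"])
```

A single-token constraint is simply a phrase of length one, so the same contiguous-match test handles both cases.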

b) Format constraints: Format constraints such as number of sentences, length of sentences, order of words, number of syllables, etc. serve to denote preferences on the form and appearance of the generated output. Format constraints are particularly useful in tasks such as poetry generation to specify the form of the generated poem, e.g., quatrain or regulated verse, length of the poem, rhyme and rhythm. In text summarization or text simplification, length constraints define the length of the generated output to be strictly less than the length of the input document, while in dialogue generation they help define the level of verbosity of the dialogue agent.
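Two of the format constraints mentioned above, a bound on sentence count and length and the requirement that an output be strictly shorter than its input, can be sketched as simple output tests in Python. The helper names are hypothetical, and the sentence splitter is a deliberately naive assumption (split on `.`, `!`, `?`) that would mishandle abbreviations and decimals.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break on runs of '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def satisfies_format_constraints(
    output: str, max_sentences: int, max_words_per_sentence: int
) -> bool:
    """Check the number of sentences and the word length of each sentence."""
    sentences = split_sentences(output)
    if len(sentences) > max_sentences:
        return False
    return all(len(s.split()) <= max_words_per_sentence for s in sentences)


def is_strictly_shorter(output: str, source: str) -> bool:
    """Length constraint for summarization or simplification, measured in words."""
    return len(output.split()) < len(source.split())


poem = "Roses are red. Violets are blue."
fits = satisfies_format_constraints(poem, max_sentences=2, max_words_per_sentence=4)
too_many = satisfies_format_constraints(poem, max_sentences=1, max_words_per_sentence=4)
shorter = is_strictly_shorter("a short summary", "a much longer input document here")
```

Constraints such as rhyme, rhythm, or syllable counts would need pronunciation resources and are left out of this sketch.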

c) Semantic constraints: Semantic constraints are used to define the topic and sentiment of the generated content, or control fine-grained aspects such as removing toxicity. Topic constraints are particularly useful in dialogue generation, where the goal is to generate on-topic responses that are safe, non-harmful, unbiased, relevant to the dialogue context and particular user needs; in story telling or poetry generation, topic constraints help define the theme. Generating language that conveys particular positive, neutral or negative sentiment aims to endow artificial agents with human-like traits such as compassion and empathy, and enables agents to react with appropriate emotion in diverse social situations; constraining on a specific sentiment is important in many tasks such as dialogue generation, review generation, story telling, poetry generation or text style transfer. Furthermore, increasing politeness of a dialogue system or reducing toxicity of generated language are important aspects with respect to human-centered metrics of conversation quality.

d) Syntactic constraints: Syntactically constrained text generation produces sentences with desired syntax by incorporating syntactic templates and rules in the training of the text generative model. Syntactic constraints are useful in paraphrase generation, where given a sentence and a target syntactic form (e.g., a constituency parse), a system must produce a paraphrase of the sentence whose syntax conforms to the target [58]. Generating texts that convey the same meaning but with different expressions has numerous applications in many natural language generation tasks, including monolingual transduction tasks such as text simplification, text compression, or text style transfer, as well as in tasks like text summarization, machine translation or question answering where alternative ways of expressing the same information help capture the inherent language variations.

e) Utility constraints: Utility constraints capture holistic properties of the generated output, e.g., stylistic, readability, faithfulness and politeness aspects. Preserving the information content of texts while manipulating attributes such as style, readability level, personality traits of the user or specific gender pronouns makes it possible to customize generated texts to different audiences and make them relevant in a wide variety of end-user applications. Stylistic constraints are immediately relevant to the task of text style transfer, which has direct applicability in numerous other tasks, including dialogue generation, machine translation, text simplification, story telling, poetry generation, and review generation.

Constraining text generation on attributes such as readability and level of text complexity serves to adapt the generated output to users of different ages, backgrounds and educational levels. Reducing complexity of texts while preserving the information content is the main goal of text simplification; in addition, in tasks such as dialogue generation, text summarization, story telling, poetry generation, and question answering it is important to customize texts for various literacy levels.

In many languages the degree of politeness is an important aspect of inter-personal communication, and honorifics are used to express courtesy, social distance, or the relative social status between the speaker and their addressee(s) [133]. Politeness constraints on the output are used in machine translation, dialogue generation, story telling, and text style transfer.

Faithfulness constraints enforce similarity between a generated text sequence and its corresponding input, requiring models to generate texts that are faithful, factual and preserve the original information content. Such constraints are important in many tasks, including text summarization, machine translation, text simplification or dialogue generation, where models are vulnerable to producing hallucinated content.

Finally, language constraints are useful when translating texts between different languages such as in machine translation, or from complex language into simple language such as in text simplification.

IV. CONSTRAINED NATURAL LANGUAGE TASKS

In what follows we briefly describe natural language generation tasks, differentiating between conditions and constraints.

a) Machine Translation: Machine translation focuses on the automatic translation of textual content from one language into another language, and is a typical example of both conditional and constrained text generation, as it conditions on the input text in the source language and constrains the model to generate fluent and faithful output in the target language. Additional constraints can be placed on the degree of formality and politeness, the use of gender-specific pronouns, and the inclusion in the target sentence of named entities or specific concepts from the source sentence.

b) Dialogue Systems: A dialogue system, also known as a conversational agent, is a computer system designed to converse with humans using natural language. Dialogue generation is an instance of conditional text generation where the system response is conditioned on the previous user utterance and frequently on the overall conversational context. Dialogue generation can also be an instance of constrained text generation: it is desirable that generated dialogues incorporate explicit personality traits [172], and control the sentiment [71], topic, degree of formality and politeness of the generated response to resemble human-to-human conversations. In addition, dialogue responses may need to incorporate text excerpts from past dialogue history or entities such as locations, persons, institutions, etc. From an application point of view, dialogue systems can be categorized into: i) task-oriented dialogue agents, designed to help users complete a particular task, or ii) non-task oriented dialogue agents (chat-bots) designed to carry entertaining conversations with their users on a wide range of open domains. A common problem in dialogue generation systems is that they tend to generate safe, universally relevant responses that carry little meaning [134], [81], [106]. Moreover, they can fail to take turns asking questions and balance specificity with genericness of the output [131].

c) Text Summarization: Text summarization facilitates a quick grasp of the essence of a document and produces a condensed version of its content, by copy-pasting the relevant portions from the input as in extractive summarization [108], or by generating novel content as in abstractive summarization [126], [109], [130], or via hybrid approaches [91] that combine both techniques. Text summarization is a conditional text generation task where the condition is represented by the given document(s); additional conditions are used in remainder summarization to flexibly define which parts of the document(s) are of interest, e.g., remaining paragraphs the user has not read yet, or in source-specific summarization to condition summaries on the specific input source and style of writing, e.g., newspapers, books or news articles. Text summarization is also a constrained text generation task considering that the length of the summary is fixed, pre-determined, and strictly less than the original document; this makes it possible to digest information at different levels of granularity and detail according to user needs and time budgets. Moreover, constraints can be placed on specific concepts to include in the summary, such as named entities, or on explicitly picking sentences from the original document as in extractive summarization.

d) Text Simplification: Text simplification is designed to reduce the text complexity, while preserving its original meaning. In the literature, simplification has been addressed at multiple levels: i) lexical simplification replaces complex words or phrases with simpler alternatives; ii) syntactic simplification alters the syntactic structure of the sentence; iii) semantic simplification paraphrases portions of the text into simpler and clearer variants. End-to-end models attempt to combine all these steps. Text simplification is both conditional and constrained text generation; we are conditioning on the input complex text to generate a simpler version, accounting for constraints such as higher readability, simpler vocabulary, and shorter sentence length than the complex input.

e) Text Style Transfer: Style transfer has its origins in computer vision applications for image-to-image translation and more recently has been used in natural language processing applications for machine translation, sentiment modification to change the sentiment of a sentence from positive to negative and vice versa, word substitution decipherment and word order recovery [55]. Text style transfer is designed to preserve the information content of a source sentence while altering the way it is delivered to meet desired presentation constraints. Textual content is disentangled from the style in which it is presented, and manipulating stylistic attributes can be done without parallel aligned data between source and target styles. Text style transfer is an instance of both conditional and constrained text generation given that we condition on the given source text and constrain the transferred sentences to stylistically match target examples.

f) Question Answering: Question answering systems are designed to find and integrate information from various sources to provide responses to user questions [31]. While traditionally candidate answers consist of words, phrases or sentence snippets retrieved and ranked appropriately from knowledge bases and textual documents [72], answer generation aims to produce more natural answers by using neural models to generate the answer sentence. Question answering is both a conditional and a constrained text generation task; the system conditions on the user question, and simultaneously ensures that concepts needed to answer the question are present in the generated output. Diverse question answering systems are proposed in the literature, addressing, e.g., medical information needs [156], mathematical questions [129], quiz bowl questions [57], cross-lingual and multi-lingual questions [94]. Notably, in practical applications users are not only interested in learning the exact answer word or phrase, but also in how it relates to background information and to previously asked questions and answers [31].

g) Narrative Generation / Story Telling: Neural narrative generation is an important step towards computational creativity [37] and represents a long-form open-ended text generation task which simultaneously addresses the selection of appropriate content (“what to say”) and the surface realization of the generation (“how to say it”) [157]. Narrative generation is a constrained text generation task that places explicit constraints on concepts to steer the narrative in particular topic directions and expands the few keywords specified as the story title. While existing models can generate stories with good local coherence, generating long stories is challenging. Difficulties in coalescing individual phrases into coherent plots and in maintaining character consistency throughout the story lead to a rapid decrease in coherence as the output length increases [145]. Hierarchical models for story generation break down the generation process into multiple steps: first modelling the action sequence, then the story narrative, and finally entities such as story characters [26]. Neural narrative generation combining story-writing with human collaboration in an interactive way improves both story quality and human engagement [43].

h) Poetry Generation: The poem generator operates in an interactive context where the user supplies the model with a set of ordered concepts that reflect her writing intent, as well as the format of the poem, e.g. quatrain or regulated verse. Poetry generation is a constrained text generation problem since user defined concepts need to be included in the generated poem, and a conditional text generation problem given the explicit conditioning on stylistic attributes. For a detailed overview of poetry generation please see [114].

V. CONSTRAINED NLG METHODS

Accounting for the different types of constraints introduced in Section III, we distinguish five methodologies commonly employed in the constrained text generation literature: i) decoding approaches, ii) fine-tuning approaches, iii) discriminative approaches, iv) edit-based approaches, and v) adapting existing models and architectures to accommodate constraints on the generated output. In what follows we present each approach in detail, outlining the main associated challenges.

A. Decoding approaches
a) Lexical constraints

Lexically constrained (guided) decoding aims to restrict the search space at decoding time to sequences which contain pre-defined lexical constraints only. These lexical constraints can be specified in the form of a word constraint (a single token) or a phrasal constraint (a multi-word phrase, i.e. a sequence of two or more contiguous tokens). To this end, the beam search decoding algorithm is modified to enforce the inclusion of pre-specified words and phrases in the generated output by allowing the model distribution to not only account for the given lexical constraints, but also to generate parts of the output sequence not covered by the constraints. In general, the decoder can more easily place multiple sequential tokens in a phrasal constraint (where the permutation order is fixed) on the generated output as opposed to placing multiple separate, independent constraints. In addition, the lexically constrained decoding approach assumes lexical constraints are pre-determined, which may not always be the case; if so, the open question is where to get lexical constraints from.

Early work on constrained decoding in machine translation relies on the placeholder approach designed to recognize identifiable elements (numbers and named entities) in the source sentence, temporarily replace these with corresponding placeholders during preprocessing, and then substitute the assigned placeholders with the original source-language strings during beam search decoding [21]. Nevertheless, such an approach is limited and unable to model the source tokens in target language specific terminology or the vocabulary from a new out-of-distribution domain. Prefix decoding is a modification of beam search that first ensures a user defined target prefix is generated, and only afterwards builds hypotheses for the suffix that maximize the coverage of the remaining source-side tokens. As decoding progresses from left to right, the decoder transitions from a constrained prefix decoding mode to unconstrained beam search. For example, the start of the sentence symbol <s> can be easily included as the first word of a constraint [70], [159]. In the context of text summarization, an essential property of a summarization system is the ability to generate a summary with desired length. Grid beam search [50] extends beam search decoding to allow for the inclusion of arbitrary target side hard lexical constraints at any position in the generated sequence. Given C input constraints, the algorithm maintains C + 1 separate beams B_0, B_1, ..., B_C that group together hypotheses which satisfy the same number of constraints. Decoding runs similar to beam search, with an additional dimension added to keep track of how many constraints are met by each hypothesis at every timestep; the highest scoring hypothesis in beam B_C is ultimately generated. However, grid beam search is impractical as decoding complexity is linear in the number of constraints, i.e. beam size increases proportionally to the number of constraints and changes for every sentence.
Constrained beam search [1] guarantees the inclusion of input constraints in the generated sentences by extending beam search with a finite state machine whose states mark completed subsets of the input set of constraints; however, decoding complexity has an exponential cost in the number of constraints, making it infeasible in many applications. Dynamic beam allocation [120] improves upon the runtime complexity of grid beam search and constrained beam search by decoding with constant complexity O(1) in the number of constraints. The algorithm still groups together hypotheses that have met the same number of constraints by using a single fixed-size beam which is dynamically divided at each time-step according to how many constraints have been met. Despite being more efficient, dynamic beam allocation does not necessarily outperform conventional beam search [86]. In addition, the generation of hypotheses that only partially satisfy a phrasal constraint needs to be aborted to unwind to the tokens in the constraint. Neurologic decoding [95] modifies beam search to enforce the satisfaction of lexical constraints expressed under predicate logic in conjunctive normal form (CNF). Given the intractability of exhaustive beam search to optimize CNF constraints, the algorithm searches for approximately-optimal output sequences in which all clauses are satisfied, including both positive and negative constraints (i.e. words that must be generated or omitted, respectively, in the output sequence). The method is applied to cooking recipe generation, where the task is to generate cooking instructions given a dish name and a list of ingredients, and to data-grounded dialogue response generation where a response is generated given a query and a list of facts to convey.
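The idea shared by grid beam search and dynamic beam allocation, grouping hypotheses by the number of constraints they already satisfy, can be illustrated with a minimal sketch. This is not the algorithm of [50] or [120]: it assumes single-token constraints only, and the names banked_beam_search, count_met, toy_step and the bigram table TOY are hypothetical stand-ins for a real decoder and language model.

```python
import math

def count_met(tokens, constraints):
    """Count how many single-token constraints already appear in a hypothesis."""
    return sum(1 for c in constraints if c in tokens)

def banked_beam_search(step_fn, constraints, beam_size=2, max_len=6, eos="</s>"):
    """Beam search keeping one beam ("bank") per number of satisfied constraints."""
    n = len(constraints)
    banks = [[] for _ in range(n + 1)]
    banks[0] = [(0.0, ())]  # each hypothesis is (log-probability, tokens)
    finished = []
    for _ in range(max_len):
        candidates = [[] for _ in range(n + 1)]
        for bank in banks:
            for score, tokens in bank:
                for tok, logp in step_fn(tokens).items():
                    hyp = (score + logp, tokens + (tok,))
                    met = count_met(hyp[1], constraints)
                    if tok != eos:
                        candidates[met].append(hyp)
                    elif met == n:
                        # A hypothesis may only end once every constraint is met.
                        finished.append(hyp)
        # Prune each bank separately so constrained hypotheses are not crowded out.
        banks = [sorted(c, reverse=True)[:beam_size] for c in candidates]
    return max(finished) if finished else None

# Hypothetical bigram probabilities standing in for a neural language model.
TOY = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.4, "</s>": 0.1},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 0.6, "</s>": 0.4},
    "dog": {"sat": 0.3, "ran": 0.3, "</s>": 0.4},
    "sat": {"</s>": 1.0},
    "ran": {"</s>": 1.0},
}

def toy_step(tokens):
    """Return next-token log-probabilities given the tokens generated so far."""
    prev = tokens[-1] if tokens else "<s>"
    return {tok: math.log(p) for tok, p in TOY[prev].items()}

best = banked_beam_search(toy_step, ["dog", "ran"])
print(best[1])
```

Because every bank is pruned independently, the total number of live hypotheses grows with the number of constraints, which mirrors the linear cost noted for grid beam search above; dynamic beam allocation would instead split one fixed-size beam across the banks.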

In general, lexically constrained decoding methods have high computational complexity and force the inclusion of specific words in the generated sentence at every timestep of the generation process with no prior examination of these specific words before generation begins [78]; this unnatural way of generating sentences can impact the quality and naturalness of the generated output [90], [120]. In the absence of suitable evaluation metrics, there are no commonly agreed criteria for objectively assessing the quality of the generated sentences and conducting comparisons across text generation models.

b) Format constraints

Fixed length decoding [67] constrains the length of generated summaries in two ways: i) by preventing the decoder from generating the end-of-sentence tag until the length of the generated sequence exceeds the desired length, and ii) by defining the minimum and maximum length range of the sequence and discarding out-of-range sequences. Non-monotonic decoding approaches allow tokens to be inserted at any position in the generated sequence during decoding, therefore accommodating flexible orderings of the output. Unlike left-to-right autoregressive generation that produces a single word at a time, non-monotonic decoding can satisfy lexical constraints at multiple locations in the output sequence allowing for highly parallel generation and faster decoding times. Nevertheless, such approaches assume the generated sequence length is known a priori, preventing it from being dynamically adjusted as generation proceeds. Moreover, such models assume conditional independence between output tokens, i.e. tokens are generated independently, and may be inconsistent and agnostic to each other. Consequently, this approach may hurt the expressiveness of the model and lead to potential performance degradation, impacting the fluency and naturalness of the generated output. In addition, non-monotonic sequence decoding approaches can terminate prematurely before constraints are satisfied in the output sequence [168], [53]. The main limitation of this approach is the lack of model expressiveness in accommodating constraints.
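The two length-control mechanisms described for fixed length decoding can be sketched as follows. This is a simplified illustration using greedy decoding rather than beam search; the function names and the toy model toy_step are hypothetical, and the hard stop at max_len is an assumption of this sketch.

```python
import math

def greedy_with_length_bounds(step_fn, min_len, max_len, eos="</s>"):
    """Greedy decoding that cannot emit EOS before min_len tokens are generated."""
    tokens = []
    while len(tokens) < max_len:
        logp = dict(step_fn(tokens))
        if len(tokens) < min_len:
            logp[eos] = -math.inf  # mechanism i): block the end-of-sentence token
        tok = max(logp, key=logp.get)
        if tok == eos:
            break
        tokens.append(tok)
    return tokens

def keep_in_range(candidates, min_len, max_len):
    """Mechanism ii): discard finished candidates whose length is out of range."""
    return [c for c in candidates if min_len <= len(c) <= max_len]

def toy_step(tokens):
    """Hypothetical model that prefers to stop right after the first word."""
    if not tokens:
        return {"rain": math.log(0.9), "</s>": math.log(0.1)}
    return {"</s>": math.log(0.6), "again": math.log(0.4)}

print(greedy_with_length_bounds(toy_step, min_len=0, max_len=5))
print(greedy_with_length_bounds(toy_step, min_len=3, max_len=5))
```

Masking the end-of-sentence score only changes which token wins at a given step; it does not make the extra tokens informative, which is one reason forced length can hurt fluency.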

Insertion Transformer [146] proposes a flexible sequence generation framework based on repeated insertion operations into an initially empty output sequence until a termination condition is met. The model adopts a progressive masking approach based on token importance in the original text and is trained to generate a missing token between every two tokens in the input. To this end, the original Transformer [150] decoder is modified to allow insertions not just at the end but anywhere in the output sequence. The model can decode sequences serially one token at a time, or it can decode sequences in parallel with simultaneous insertions at multiple locations. A similar approach is considered in InDIGO [45] which extends Transformer for insertion-based decoding with inferred generation order. Token generation order for the output sequence is modeled as a latent variable, and at each decoding step the model predicts both the generated word and its position in the output sequence; nevertheless, strong conditional independence is assumed between the output tokens which hurts output quality. An iterative refinement step based on latent variables is added to the Transformer decoder to refine a target sequence gradually over multiple steps until a predefined stopping criterion is met [80]. Progressive Insertion Transformer [168] uses non-autoregressive modeling based on a top-down progressive structure for lexical hard-constrained text generation. Given lexical constraints as input, the model inserts tokens progressively according to word importance to generate the target sequence, as follows: first it generates high-level words in a sentence such as nouns, adjectives and verbs, then uses these as pivoting points to insert details of finer granularity and finally completes the sentence by adding connecting words which carry less information, such as pronouns and prepositions.
Entity Constrained Insertion Transformer [53] builds upon previous models considering hard lexical constraints in the form of entities in the output sequence. Similar approaches train the Transformer decoder to insert missing tokens in a partially complete sequence without relying on a pre-specified factorization of tokens [15], [46]; based on the information available in the sequence, the insertion-based generative model is able to dynamically infer the remaining parts irrespective of their arbitrary order.

c) Syntactic and Semantic constraints

Distributional constraints [3] on topic and semantic similarity are used to incorporate source side-information at decoding time in neural conversational systems and encourage the generation of more diverse responses. Moreover, constraints over topics and syntax are used to generate matching or semantically similar statements in response to the user input [113]. Lexically constrained decoding from pre-trained language models aims to steer language models in useful and safe directions so as to minimize the risks associated with these models generating biased, offensive and toxic content [140], [52].

B. Fine-tuning approaches
a) Semantic and Utility constraints

Controlling the output of pre-trained language models is crucial in a wide range of safety-critical applications, including mental health support chatbots, sentiment controlled text generation, language detoxification, etc. To this end, fine-tuning approaches are used for fine-grained control over individual stylistic aspects (e.g. length, professional and descriptive style, tense, personal voice, gender) and content aspects (e.g. sentiment and topic) of the generated texts [28], [77]. Typically, the pre-trained model is fine-tuned separately for each attribute of interest, which poses the challenge of how to learn disentangled latent representations of style and content in neural language models [62] and isolate the desired attribute from the distribution shift between the generative model and the fine-tuned dataset. The lack of datasets that are diverse and representative of constrained criteria encountered in practice represents an open challenge for fine-tuning pre-trained models.

CTRL [65] uses control codes to trigger the generation of texts that meet user-defined constraints on domain, style, topics, dates, entities, relationships between entities, plot points, and task-related behavior. The pre-defined codes are appended to the beginning of raw text sequences to define task-specific data at training time and create controllable task-specific behaviour at sampling time. Decoding Experts (DExperts) [88] is a decoding-time method for constrained text generation which combines a pretrained language model with both an “expert” and “anti-expert” language model in a product of experts. The “expert” models desirable aspects of the generated text (e.g. positive sentiment), while the “anti-expert” plays the antagonistic role of modeling undesirable attributes to be avoided (e.g. toxicity); each one of the three language models is conditioned on the same user prompt. While the method highlights the promise of customizing decoding from pretrained language models in safe and efficient ways, gathering large amounts of toxic data to model undesirable attributes may be challenging. In general, adding negativity to a positive prompt is a much easier task than adding a positive turn to a negative prompt [99]. Fine-tuning approaches in a reinforcement learning setting based on human preferences are used to generate texts with desired attributes from pre-trained language models [176].
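The product of experts used by DExperts can be sketched at the level of next-token logits. To our understanding, the combined distribution is a softmax over the base logits plus a scaled difference between expert and anti-expert logits; the strength parameter alpha, the function names and the toy logits below are illustrative assumptions, not values from [88].

```python
import math

def softmax(logits):
    """Numerically stable softmax over a token -> logit dictionary."""
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

def dexperts_next_token(base, expert, anti_expert, alpha=1.0):
    """Combine three models' logits: base + alpha * (expert - anti_expert)."""
    combined = {t: base[t] + alpha * (expert[t] - anti_expert[t]) for t in base}
    return softmax(combined)

# Hypothetical logits for one decoding step, all conditioned on the same prompt.
base = {"great": 1.0, "awful": 1.0, "okay": 0.5}
expert = {"great": 2.0, "awful": -1.0, "okay": 0.0}       # models desired text
anti_expert = {"great": -1.0, "awful": 2.0, "okay": 0.0}  # models text to avoid

plain = softmax(base)
steered = dexperts_next_token(base, expert, anti_expert, alpha=1.0)
print(plain["great"], steered["great"])
```

Setting alpha to zero recovers the base model, so alpha acts as a dial between unconstrained generation and strong steering.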

Human preference learning is considered crucial for safely deploying artificial systems in real-world tasks. The reward model is derived from human preferences on text continuations with positive sentiment or vividly descriptive language. Importantly, a KL constraint is used to prevent the fine-tuned model from drifting too far from the pre-trained model and encourage the new policy to remain close to the prior policy. Similar KL control has been used in dialogue systems to retain prior information and penalize divergence from the pre-trained model during RL fine-tuning [59]. Controlled text generation from pre-trained language models is formalized as a constraint satisfaction problem, where pointwise constraints focus on the quality of each individual output while distributional constraints enforce collective statistical properties desirable over the set of all generations [66]. Similar to prior work, a KL penalty term is used to discourage large deviations from the pre-trained language model as a proxy for sample quality. The lack of suitable evaluation metrics is an outstanding challenge in generating high quality outputs.
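A common way to express the KL constraint is to subtract a scaled log-ratio between the fine-tuned policy and the pre-trained model from the learned reward. The sketch below assumes this penalized-reward form; the coefficient beta, the function name and the example numbers are hypothetical.

```python
def kl_penalized_reward(reward, policy_logps, prior_logps, beta=0.1):
    """Reward minus beta times the sequence-level log-ratio policy / prior.

    policy_logps and prior_logps hold per-token log-probabilities of the same
    sampled continuation under the fine-tuned and pre-trained models.
    """
    log_ratio = sum(policy_logps) - sum(prior_logps)
    return reward - beta * log_ratio

# If the policy has not moved away from the prior, the reward is unchanged.
print(kl_penalized_reward(1.0, [-1.0, -2.0], [-1.0, -2.0]))
# A continuation the policy now strongly prefers over the prior is penalized.
print(kl_penalized_reward(1.0, [-0.5, -0.5], [-2.0, -2.0], beta=0.1))
```

In expectation over samples from the policy, the log-ratio term is the KL divergence from the pre-trained model, so a larger beta keeps generations closer to the original distribution.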

The pre-trained OpenAI GPT-2 [123] model is used to re-write a story through counterfactual reasoning and generate a narrative consistent with the imposed constraints [121]. In abstractive summarization, OpenAI GPT-2 is used in a reinforcement learning setting which trains the summarization agent to maximize coverage and fluency of the generated content constrained on a pre-defined length [76]. RecipeGPT [48] fine-tunes the GPT-2 pre-trained language model for generating cooking instructions when hard constraints are placed on the recipe title and ingredients; the model can also generate the list of ingredients for a recipe when constrained on the recipe title and specific cooking instructions. While fine-tuning models on task specific datasets has become the dominant paradigm for constrained text generation from large pre-trained language models, these models generally fail to reliably incorporate the underlying constraints in the generated texts even when supervised with large amounts of task-specific examples [95]. Notably, fine-grained constrained text generation is limited even with large scale pre-trained neural networks. The main challenges are the lack of model expressiveness to incorporate constraints and the lack of constrained text generation datasets for fine-tuning these models.

C. Discriminative approaches

a) Utility constraints: One of the early works to propose constrained generation and manipulation of the generated text learns disentangled latent representations by combining variational auto-encoders with attribute discriminators [55]. Semantic structure is imposed on the latent codes by using global discriminators, one for each attribute, to guide the learning of the discrete text generator and force it to allocate one latent dimension per attribute code. The model is used to generate sentences with constrained sentiment and tense.

Weighted decoding [51] relies on a mixture of discriminative models to guide a recurrent generator towards incorporating attributes that enhance the overall coherence, style, and information content of the generated text. The discriminators complement each other and their weighted contributions form the final decoding objective from the generator. Similarly, stylistic configurations are revised and polished for generated poems by adding additional weights during decoding to control the style of the generated poem, including the repetition, alliteration, word length, cursing, sentiment, and concreteness [39]. Nevertheless, modifying the scoring function used for generation as in weighted decoding often leads to sacrificing fluency and coherence of the generated text [131]. Selective sampling [153] relies on a sample selector (multilayer perceptron for binary classification) which outputs whether the current sample should be accepted or rejected based on the presence of desired target words that define the output style and topic in the generated sequence. The robustness of evaluation metrics is directly correlated with model performance, therefore it is crucial to focus on developing metrics that capture diverse aspects of text quality during training and sampling time.

Generating texts with desirable attributes from a pre-trained unconditional language model P(X) is a non-trivial task. Most approaches resort to either training from scratch a new conditional model P(X|a) for desired attribute a, or fine-tuning P(X) on additional data representative for the attribute a. Theoretically, rejection sampling could also be used to sample P(X|a) from P(X), but this approach is highly inefficient in practice. Fudge [161] generates text conditioned on a desired attribute a (e.g. topic control in language generation, degree of formality in machine translation, poetry couplet completion) while only accessing the output probabilities P(X) of generative model G. Given an incomplete sequence prefix, the model trains binary discriminative models for one or multiple desired attributes to predict whether the attribute(s) will be fulfilled in the future complete sequence; evaluation is therefore an important challenge. The output probabilities of the discriminator(s) are then multiplied with the output probabilities of the generator G to adjust the original probabilities of G accounting for desired attribute(s) a and model P(X|a) via a Bayesian decomposition.
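The Bayesian decomposition behind Fudge can be written as P(x_{i+1} | x_{1:i}, a) ∝ P(a | x_{1:i+1}) · P(x_{i+1} | x_{1:i}): each candidate next token is reweighted by how likely the attribute is to hold if that token is appended. A minimal sketch of one decoding step follows, with hypothetical probabilities in place of a real generator and attribute predictor.

```python
def fudge_reweight(lm_probs, attr_probs):
    """Reweight next-token probabilities by an attribute predictor and renormalize.

    lm_probs[tok]   ~ P(x_{i+1} = tok | x_{1:i}) from the generator G
    attr_probs[tok] ~ P(a | x_{1:i}, tok) from the binary discriminator
    """
    unnormalized = {tok: lm_probs[tok] * attr_probs[tok] for tok in lm_probs}
    z = sum(unnormalized.values())
    return {tok: p / z for tok, p in unnormalized.items()}

# Hypothetical step: the attribute a is "formal register".
lm_probs = {"hey": 0.5, "greetings": 0.3, "yo": 0.2}
attr_probs = {"hey": 0.2, "greetings": 0.9, "yo": 0.1}

adjusted = fudge_reweight(lm_probs, attr_probs)
print(max(adjusted, key=adjusted.get))
```

The generator is treated as a black box here: only its output probabilities are needed, which matches the setting described above.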

PPLM [22] combines a pre-trained language model with attribute classifiers that guide generation towards specific topics and sentiment styles. These classifiers are trained on top of the last hidden layer of the pre-trained language model, and gradients from the classifiers are backpropagated to update the hidden representations of the language model and steer generation in desirable directions. While PPLM achieves fine-grained control of content and style attributes via a simple gradient-based sampling mechanism, the approach is computationally intensive and inefficient as it requires multiple forward and backward passes for each generation step. Plug-and-play methods have been used to control large pre-trained conversational models such as GPT-2 [123] using a variety of styles (positive and negative sentiment) and topics (Question, Sport, Business, Finance) [99]. Undoubtedly, more effort needs to be focused on collecting datasets for constrained text generation that capture many possible real-world constraints. GeDi [73] guides language generation from large language models towards desired attributes by using generative discriminators to compute classification likelihoods for all candidate next tokens on the fly at generation time. Given a class-conditional language model conditioned both on a desired attribute c+ and an undesired attribute c−, GeDi-guided contrastive generation uses the two instances of the model as discriminative classifiers to contrast and filter out common attributes between the two classes c+ and c−; then aspects of the desired attribute c+ are transferred across domains via weighted decoding and filtering. The contrast between a positive and a negative class conditional distribution is employed both at training and inference time to control the bias, toxicity and negativity of GPT-2 [123] and GPT-3 [11].
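The way a class-conditional language model can act as a classifier over all candidate next tokens may be sketched with Bayes' rule: the posterior of the desired class c+ is computed from the sequence likelihoods under c+ and c−, then used to reweight the base model. This is a simplified illustration that omits any refinements of the original method; the exponent omega, the uniform class prior and all numbers are assumptions.

```python
import math

def gedi_weighted(lm_logp, pos_logp, neg_logp, omega=2.0, prior_pos=0.5):
    """Weighted decoding: P_LM(tok) * P(c+ | prefix, tok) ** omega, renormalized.

    pos_logp[tok] and neg_logp[tok] are log-likelihoods of the prefix extended
    by tok under the class-conditional model for c+ and c- respectively.
    """
    scores = {}
    for tok in lm_logp:
        a = math.log(prior_pos) + pos_logp[tok]
        b = math.log(1.0 - prior_pos) + neg_logp[tok]
        m = max(a, b)
        # Bayes' rule in log space: log P(c+ | prefix, tok).
        log_posterior = a - (m + math.log(math.exp(a - m) + math.exp(b - m)))
        scores[tok] = lm_logp[tok] + omega * log_posterior
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {tok: math.exp(s - m) / z for tok, s in scores.items()}

def logs(probs):
    return {tok: math.log(p) for tok, p in probs.items()}

# Hypothetical values for one decoding step.
lm = logs({"kind": 0.3, "rude": 0.5, "fine": 0.2})
pos = logs({"kind": 0.6, "rude": 0.05, "fine": 0.35})  # desired class c+
neg = logs({"kind": 0.05, "rude": 0.8, "fine": 0.15})  # undesired class c-

guided = gedi_weighted(lm, pos, neg)
print(max(guided, key=guided.get))
```

Two passes of the class-conditional model score every candidate token at once, which is what makes this cheaper than running a separate classifier on each candidate continuation.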

D. Edit based approaches
a) Utility constraints

Edit based approaches rely on the key idea that changing only a few words or phrases which are indicative of a particular attribute is sufficient to alter the style of a given piece of text. For example, the sentiment of a sentence can be altered from negative to positive by first identifying negative attribute markers (“bad”, “worst”, “disappointed”), deleting these negative attributes while keeping other content words fixed, and then generating the final output via a recurrent decoder which conditions on the extracted content words and the target attribute [85]. Starting from the observation that humans write text in incremental passes with multiple revisions, a prototype-then-edit model first samples a prototype sentence from the training corpus and then edits it conditioned on an edit vector [47]. Notably, text generation based on editing a prototype is much easier compared to generating text from scratch. Also building upon the “Delete Retrieve Generate” framework, the Generative Style Transformer [147] incorporates a neural mechanism to delete style attributes from the source sentence based on the attention weights of a Transformer model (Delete Transformer), and then generates sentences in the desired target style by decoding with a pre-trained GPT-2 [122] model.

b) Lexical constraints

Constrained sentence generation by Metropolis-Hastings sampling [103] first inserts all constraint keywords in a template in random order, then samples local edit operations (word replacement, deletion or insertion) to perform at specific positions for improving sentence fluency. The probability of each edit operation being accepted or rejected is determined by a language model, however individually sampling each token results in slow convergence. Instead of randomly sampling edit operations, the gradient of a differentiable objective function is used to determine where and how to edit [137].
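The accept/reject loop of Metropolis-Hastings editing can be sketched in a heavily simplified form. The sketch below uses word replacement only (with a symmetric proposal, so the acceptance probability reduces to a ratio of language model scores), assumes a fixed-length template with placeholder slots around the keywords, and substitutes a hypothetical bigram table for the language model; it does not implement the insertion and deletion moves of [103].

```python
import random

# Hypothetical bigram probabilities standing in for a real language model.
BIGRAM = {("<s>", "the"): 0.5, ("the", "dog"): 0.4, ("dog", "chased"): 0.3,
          ("chased", "the"): 0.5, ("the", "ball"): 0.3, ("ball", "</s>"): 0.6}

def lm_score(tokens):
    """Sentence probability under the toy bigram model, with a small floor."""
    score = 1.0
    for a, b in zip(["<s>"] + tokens, tokens + ["</s>"]):
        score *= BIGRAM.get((a, b), 1e-3)
    return score

def mh_constrained(template, keywords, vocab, steps=500, seed=0):
    """Sample replacements at non-keyword positions; keywords are never edited."""
    rng = random.Random(seed)
    current = list(template)
    free = [i for i, tok in enumerate(current) if tok not in keywords]
    for _ in range(steps):
        proposal = list(current)
        proposal[rng.choice(free)] = rng.choice(vocab)  # local replacement edit
        accept = min(1.0, lm_score(proposal) / lm_score(current))
        if rng.random() < accept:
            current = proposal
    return current

template = ["_", "dog", "_", "_", "ball"]  # "_" marks an editable slot
sentence = mh_constrained(template, {"dog", "ball"},
                          vocab=["the", "a", "chased", "saw", "red"])
print(" ".join(sentence))
```

Since only one token changes per step and many proposals are rejected, the sampler needs many iterations to reach fluent sentences, which illustrates the slow convergence mentioned above.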

E. Adapting existing models and architectures to accommodate constraints

It is non-trivial to impose constraints on existing deep learning models while maintaining high generation quality since their model architecture is designed to generate sentences sequentially from left to right. While current deep learning models lack the expressiveness to incorporate constraints at training time and at arbitrary positions in the generated sequence, well known models and architectures are adapted to accommodate constraints through a set of custom engineered approaches. We present these methods below.

a) Lexical constraints

Current architectures used for language generation produce texts sequentially from the first word to the last word, and it is non-trivial to impose lexical constraints on left-to-right generation while maintaining high output quality for natural and fluent texts. Current workarounds for hard lexically constrained text generation address this limitation by generating texts in a non-monotonic fashion when employing forward-backward language models. The backward language model takes a lexical constraint as input, considers it as the starting point and generates the first half of the sentence backwards conditioned on the topic word, while the forward language model takes as input the sequence generated by the backward generator and produces its sentence completion in normal order conditioned on the backward generated sequence. While the topic word can occur at any position in the sentence, this approach can only generate output constrained on at most one lexical constraint; generating sequences with multiple lexical constraints is an open research problem. Moreover, these approaches adapt existing frameworks for constrained text generation by splitting a sentence into two parts, which is unnatural and also hurts fluency when generating half of the sequence in reverse order. Given a topic word at an arbitrary position in a scientific paper title, a recurrent language model is tasked with generating both past and future words in the title conditioned on the given topic [105]. Similarly, on-topic dialogue responses that satisfy hard lexical constraints are generated with a “sequence to backward and forward sequences” (seq2bf) model [106] which first predicts a keyword noun that reflects the gist of the response, then decodes the response backward and forward starting from the given word. BFGAN [90] employs GANs for lexically constrained text generation.
The model incorporates three modules, namely a backward generator and a forward generator which collaborate on generating lexically constrained sentences, and a discriminator which guides the joint training with policy gradient of the two generators. BFGAN is used to generate Amazon product reviews and conversational responses with lexical constraints.
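The backward-forward generation scheme described above can be illustrated with a minimal greedy sketch: a backward model extends the sentence leftwards from the keyword, then a forward model completes it to the right conditioned on the backward output. The lookup tables below are hypothetical stand-ins for the two language models and only look at the boundary token, which is a simplification.

```python
# Hypothetical greedy "models": most likely previous / next token.
BACKWARD = {"coffee": "drinks", "drinks": "she", "she": "<s>"}
FORWARD = {"coffee": "every", "every": "morning", "morning": "</s>"}

def backward_step(suffix):
    """Predict the token preceding the current (partial) sentence."""
    return BACKWARD.get(suffix[0], "<s>")

def forward_step(prefix):
    """Predict the token following the current (partial) sentence."""
    return FORWARD.get(prefix[-1], "</s>")

def backward_forward(keyword, max_len=10):
    """Generate a sentence that is guaranteed to contain the keyword."""
    tokens = [keyword]
    # Backward pass: grow the first half right-to-left, starting at the keyword.
    while len(tokens) < max_len:
        prev = backward_step(tokens)
        if prev == "<s>":
            break
        tokens.insert(0, prev)
    # Forward pass: complete the sentence conditioned on the backward output.
    while len(tokens) < max_len:
        nxt = forward_step(tokens)
        if nxt == "</s>":
            break
        tokens.append(nxt)
    return tokens

print(" ".join(backward_forward("coffee")))
```

The keyword is included by construction, but only one such anchor is supported, which reflects the single-constraint limitation discussed above.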

Generating a fluent sequence which simultaneously satisfies multiple lexical constraints employs a backward-forward LSTM language model to first generate the sequence from a user-defined verb constraint and then satisfy other lexical constraints by word embedding substitution based on cosine similarity between generated tokens and desired constraints [78]. Nevertheless, the approach assumes a verb constraint is always specified in the set of lexical constraints.

b) Semantic and Utility constraints

Steering neural models in specific directions is achieved by: i) adding special tokens at the beginning or end of the source text, ii) incorporating additional conditions into the decoder hidden states and iii) connecting the conditions directly to the decoder output layer. A topic aware sequence-to-sequence model is used to generate on-topic conversational responses by conditioning the decoder on specific topic words [160]. Imposing conversational goals on dialogue agents aims to guide the conversation towards a designated target subject by combining coarse-grained topic constraints with discourse-level rules [148]. Generating emotional responses in neural conversational systems is achieved by feeding the emotion category embedding to a sequence-to-sequence decoder [174]. Personalized chit-chat dialogue agents that display consistent personalities, viewpoints and are configurable depending on attributes of the system user are used to produce more personal, specific and engaging dialogue responses [153], [9], [164]. Nevertheless, finding the proper balance between fluency, engagement, consistency and a persistent personality remains an open challenge for current dialogue models due to lack of a measurable objective function and correspondingly suitable evaluation metrics. While we can easily judge whether or not an output satisfies one constraint, it is hard to judge the extent to which (or “how much”) it actually satisfies the constraint, and it is even harder to jointly model/measure multiple constraints. Moreover, accounting for repetition and diversity is important as these models often get stuck in an infinite loop of redundant, dull, generic and universally relevant responses that carry little meaning [84], [131], [106].

For integrating factual knowledge into open-ended conversational systems, factoid and entity-rich web documents are encoded together with the conversation history into the same representation which is passed to an attentional neural decoder that generates the response tokens. Similarly, speaker-level representations are integrated into seq2seq conversational models for generating personalized conversation responses [82]. Fact-guided sentence modification for dynamically rewriting, updating or correcting articles according to changing information is an instance of constrained text generation which presents the particular challenge that the rewritten sentence needs to be consistent with an input claim while at the same time preserving non-contradicting content [138]. Given the claim and an old sentence, an updated sentence is produced by first identifying contradictory components in the input sentence, masking these, then using the residual sentence and the claim as input into a two encoder sequence-to-sequence model with copy attention to produce the updated sentence consistent with the claim. Syntactically controlled paraphrase generation produces paraphrases of an input sentence by constraining the system on the target syntactic form [58], however few syntactically constrained datasets are available to learn from. Controllable story generation based on RNNs is used to influence the story ending valence (whether happy or sad) and the storyline (specified as a sequence of words) [118]. Story-telling methods commonly use a hierarchical approach to thematically consistent story generation, by first generating a prompt describing the topic for the story, and then constraining on the prompt for generating the story content [25]; additionally, constraints on the presence of entities are included as well [20].
Open-domain story generation requires composing coherent natural language texts that describe plausible sequences of events and is more challenging compared to generating stories in a narrow domain given an existing plot. Unsupervised machine translation methods are adapted for the task of text-style transfer by incorporating stylistic constraints in a neural seq2seq model with attention and using a style classifier to guarantee the accuracy of style transfer [169], or for control over multiple style attributes, including gender, sentiment or product type [77]. In machine translation, honorifics constraints are important for producing socially appropriate forms of address and controlling the level of courtesy [133]; the system user defines the desired level of politeness of the translation, however these user-defined constraints are only soft constraints and can be overridden by the attentional encoder-decoder machine translation system whenever the source text provides strong politeness clues.

For effective imposition of semantic structure in constrained text generation, latent space representations need to be disentangled [62], such that varying an individual latent code will only change a single desired attribute. VAEs can achieve meaningful latent representations with designated semantics when combined with attribute discriminators and optimized end-to-end with differentiable softmax approximation [55]; this allows generating sentences with constraints on sentiment and tense. Given an input sequence and a set of labels, sequence transduction with multi-space variational autoencoders [173] generates an output sequence that alters the content of the input sequence according to the constraints specified by the labels; the method is used for morphological inflection in multiple languages. In general, constrained text generation approaches assume that constraints need to be known a priori; however, this is not always possible, e.g., when suggesting alternative phrases for search queries in real-time, or when generating responses in dialogue systems according to the dynamics of the conversational context. Recent constrained text generation approaches control attributes of a generated sequence based on another sentence example: given two sentences X and Y, the goal is to generate a new sentence Z that follows the semantics of X and the syntax of Y. To this end, a VAE model with two latent variables is used to achieve disentanglement in the continuous latent space between syntax and semantics [16], [6]. Topic guided VAEs [154] use a Gaussian mixture model prior where each mixture component corresponds to a latent topic extracted from data, as opposed to using pre-defined parameter settings which do not incorporate semantic meaning into the latent codes; the model is used for text summarization with designated topic guidance.
Abstractive and extractive sentence compression with VAEs assumes the existence of a background language model from which a latent summary sentence is drawn first, and then the observed sentence is generated conditioned on the latent summary [104]; the model is able to balance copying a word from the source sentence with generating it from the background distribution. Iterative refinement of a sequence to transform it into another sequence with desired attributes exploits the geometry of the latent space to produce incremental higher-quality revisions with theoretical guarantees in the combinatorial space of sequence elements [107], [139]. Such latent variable manipulations make it possible to rewrite modern text in the language of Shakespeare, improve sentence positivity, and address word substitution and word order recovery tasks without the need for any revision examples. Constraints on the use of metaphor and personification in poems are incorporated in a conditional VAE with a rhetorically controlled decoder trained to emit meaningful and diverse rhetoric and overcome generic sentences [92]. Variational neural machine translation [163] incorporates a continuous latent variable to model the underlying semantics of sentence pairs. Nevertheless, efficiently performing posterior inference and large-scale training during the incorporation of latent variables remains an open challenge for constrained VAEs.

Modifying textual attributes of sentences including sentiment, style, tense, voice, mood and negation is achieved by incorporating conditioning information into a neural encoder-decoder model, and optimizing a reconstruction loss which interpolates between auto-encoding and back-translation components to encourage content compatibility, as well as an adversarial loss which encourages sentence-level stylistic attribute compatibility [93]. The model allows simultaneous conditioning on multiple textual attributes; however, measuring the extent to which the generated sentences match the conditioning information requires new objective evaluation metrics for attribute accuracy and content compatibility/preservation. Style transfer between scientific papers and newspapers is performed with separate style decoders, or by generating both content and style from the same decoder [32]. In poetry generation, it is common to impose hard constraints on rhyme, rhythm, and topic [38], [39]. Given a user-supplied topic, the poetry generation algorithm first generates a large set of on-topic words and phrases, assigns rhyming words and phrases to specific lines, and then combines finite-state machinery with an RNN language model to score plausible poems that meet the desired constraints. Even when an RNN is augmented with a working memory that explicitly maintains a limited history of generated topics and context, coherence in meaning and topics across the overall poem remains an important challenge [166]. Constrained recurrent models are also used to generate online product reviews of a certain topic, sentiment, style and length [28], affective dialogue responses [40], or for modeling participant roles and topics in conversational systems [102].

b) Format and Utility constraints: Text simplification models parameterized on constraints such as length, amount of paraphrasing, degree of lexical and syntactic complexity are used for generating texts easier to read and understand with simpler grammar and structure [100]. Towards a similar goal of controlling the degree of lexical complexity, the training loss function is changed to assign weights to words based on their complexity level [112]. In text summarization, constraints on the output sequence length for neural encoder-decoder models are specified as length embeddings and are passed as additional input to the decoder [67].

Faithfulness in abstractive text summarization is enforced in a seq2seq model by conditioning on both the source text and extracted factual descriptions [13]; this helps avoid generating false facts in the output summary. Hybrid text summarization approaches combine an unsupervised sentence extractor which selects salient sentences from the input document with a sentence abstractor that paraphrases each extracted sentence to overcome limitations of parallel aligned datasets [111]. Reinforcement learning is used in the context of constrained natural language generation to directly optimize non-differentiable reward functions and evaluation metrics. While any user-defined reward function can be employed for training, the metrics most frequently optimized with RL are BLEU for machine translation [124], ROUGE for text summarization [124], [117], [158], [35], or human-defined conversation metrics focused on coherence, informativeness, sentiment, politeness, toxicity, question, repetition or semantic similarity [84], [128], [158]. However, manually defined reward functions based on heuristics cannot cover all crucial aspects of a natural realistic conversation [9], [35]. In addition, rewards are commonly modeled at the word level, accounting for the probability of generating each word in a sentence [124], [59]; such low-level control makes credit assignment challenging since the number of actions available to the RL agent is equivalent to the number of words in the vocabulary. Defining a global score that measures complex aspects of text quality beyond local n-gram patterns and which can reliably approximate human judgments of text quality remains an open challenge [9].

In the RL framework the generative model is seen as an agent with parameters that define a policy and which interacts with an external environment by taking actions, receives a reward once it reaches the end of a sequence and updates its internal state accordingly. To this end, policy gradient methods are used to train text generative models and alleviate issues such as exposure bias and loss functions which do not operate at the sequence level. However, policy gradient algorithms present large variance and generally struggle in settings with large action spaces such as natural language generation. In addition, they take a very long time to converge [18] and the improvement in the optimized metrics is not always reflected in human evaluations of text quality. Training RL models to optimize n-gram evaluation measures based on local patterns provides only a limited and myopic perspective of overall text quality and does not necessarily lead to better text quality, overall coherence or discourse structure [9]. Moreover, fine-tuning on such measures may yield deteriorated outputs despite increased automatic scores, while difficulty in constrained optimization with RL often leads to sparse, non-informative and delayed reward signals. Learning RL rewards from human preferences aims to incorporate human feedback in text generation. Neural reward learning schemes train neural teachers that learn to score an ordered sequence of sentences and formulate rewards that guide coherent long text generation [9]; the approach is used for generating cooking recipes given the dish title and the set of ingredients as constraints. Learning-to-rank algorithms are used to approximate ground-truth oracle rewards in extractive multi-document summarization to indicate the quality of a summary or preferences over summary pairs [35].
Machine learnability of human rewards in neural machine translation models is approached by first training reward estimators on rewards collected from offline logs, then integrating these reward estimators in an off-policy RL setting [74]. Similarly, implicit human reactions such as sentiment or length of a conversation are used to learn rewards for fine-tuning off-policy RL models for dialog [59]. Nevertheless, human feedback is noisy, not well-defined, complex and inconsistent. Using RL to improve system outputs with respect to human-centered metrics of conversation quality is highly dependent on developing robust metrics tailored to the particular application domain, e.g., increasing politeness of a technical-support system or reducing toxicity of generated language.
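
For reference, the policy-gradient estimator discussed above is commonly written in the REINFORCE form below. The notation is ours rather than the survey's: x denotes the input or constraint specification, y = (y_1, ..., y_T) a sampled output sequence, p_θ the generator acting as the policy, R(y) the sequence-level reward (for example BLEU or ROUGE), and b an optional baseline.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}
    \left[ \big( R(y) - b \big) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(y_t \mid y_{<t}, x) \right]
```

Because the reward arrives only once per sequence and each step chooses among all words in the vocabulary, sample estimates of this expectation are noisy, which is consistent with the high variance and slow convergence noted above; subtracting a baseline b is a common variance-reduction device.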

Hard-constrained text generation in a non-monotonic order relies on a tree-based text generation scheme, where a word is generated at an arbitrary position in the sentence, then binary trees of words to its left and right are recursively generated [155]. Learning proceeds in an incremental fashion in an imitation learning framework, where the policy gradually moves from imitating the oracle to reinforcing its own preferences and generating texts without a pre-specified word order. Nevertheless, the time complexity of the approach is O(n), the same as for autoregressive models, and the constructed tree does not reflect a high-level to low-level hierarchy of concepts.
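
The tree-based scheme can be pictured with a small data structure. The sketch below is only an illustration under our own assumptions: the `Node` class, the `linearize` helper and the example tree are hypothetical and do not reproduce the model in [155]; they show how a word with recursively generated left and right subtrees maps back to a sentence via in-order traversal, in a single pass over the n words.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One generated word plus the subtrees generated to its left and right."""
    word: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def linearize(node: Optional[Node]) -> List[str]:
    """In-order traversal: left subtree, the word itself, then right subtree."""
    if node is None:
        return []
    return linearize(node.left) + [node.word] + linearize(node.right)


# Hypothetical generation order: "sat" is produced first, then the words
# to its left and to its right are filled in recursively.
tree = Node(
    "sat",
    left=Node("cat", left=Node("the")),
    right=Node("on", right=Node("mat", left=Node("the"))),
)
sentence = " ".join(linearize(tree))
print(sentence)  # the cat sat on the mat
```

A hard lexical constraint could be imposed in such a scheme by fixing the constrained word as a node and generating only its subtrees, although how constraints are actually injected depends on the specific model.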

VI. CONSTRAINED NLG EVALUATION

Evaluation of constrained text generation is performed using the same evaluation approaches and methodologies available in the natural language generation literature. In general, evaluation of the generated text is largely an unsolved and notoriously difficult problem [8]. Currently, there is no well-established consensus on how NLG systems should be evaluated [79], [42], and the lack of meaningful quantitative evaluation metrics to accurately assess the quality of trained models is detrimental to the progress of the field. In the absence of well established evaluation measures, natural language evaluations are carried out in a rather ad-hoc manner with a lot of variability across the proposed models and tasks on inconsistent benchmarks, resulting in misleading performance measures. Subjective evaluations based on visual inspection of the generated samples often lack scientific rigour and make it difficult to quantify and judge precisely the quality of a generative model [49]. In what follows we review the main methods for constrained text generation evaluation.

a) Lexical constraints: Measuring how many of the given lexical constraints are included in the generated outputs is done using concept coverage [86], [95]; the metric is computed as the average percentage of input concepts that are present in the lemmatized outputs.
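
A minimal sketch of concept coverage follows, assuming whitespace tokenization and a deliberately crude suffix-stripping function in place of a real lemmatizer. The names `toy_lemmatize` and `concept_coverage` and the example data are hypothetical, not taken from [86] or [95].

```python
from typing import Callable, List, Sequence


def toy_lemmatize(token: str) -> str:
    """Hypothetical stand-in for a real lemmatizer: lowercase, strip a few suffixes."""
    token = token.lower().strip(".,!?;:")
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def concept_coverage(
    concepts_per_example: Sequence[List[str]],
    outputs: Sequence[str],
    lemmatize: Callable[[str], str] = toy_lemmatize,
) -> float:
    """Average percentage of input concepts found among the lemmatized output tokens."""
    scores = []
    for concepts, output in zip(concepts_per_example, outputs):
        lemmas = {lemmatize(tok) for tok in output.split()}
        hits = sum(1 for concept in concepts if lemmatize(concept) in lemmas)
        scores.append(100.0 * hits / len(concepts))
    return sum(scores) / len(scores)


concepts = [["dog", "frisbee", "catch", "throw"], ["boy", "kick", "ball"]]
outputs = [
    "A man throws a frisbee and the dog catches it.",  # covers 4 of 4 concepts
    "The boy holds a ball.",                           # covers 2 of 3 concepts
]
coverage = concept_coverage(concepts, outputs)
print(f"concept coverage: {coverage:.1f}%")
```

In practice the suffix stripper would be replaced by a proper lemmatizer, since the measured coverage is only as reliable as the lemmatization step.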

b) Semantic and syntactic constraints: Surface similarity metrics based on n-gram overlap, such as BLEU [115], ROUGE [87] and METEOR [5], measure to what extent the generative model can preserve content by retaining words commonly shared between the generated output and ground-truth references. Such metrics are commonly used to measure response relevance in dialogue systems [33], [82], translation quality in neural machine translation [133], and summary quality in text summarization [130]. In general, the correlation between word overlap metrics and true text quality is a widely debated topic [84]. Evaluation metrics based on local n-gram patterns only provide a limited and myopic perspective of overall text quality and are notoriously poor at evaluating dialogue systems [89], [131], [9].
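
To make the notion of n-gram overlap concrete, the sketch below computes clipped n-gram precision (the quantity that BLEU aggregates over several n and combines with a brevity penalty) and n-gram recall (the quantity underlying ROUGE-N) against a single reference. It is a simplified illustration with whitespace tokenization, not a full implementation of either metric; the function names are ours.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of the n-grams occurring in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate: str, reference: str, n: int) -> Tuple[float, float]:
    """Return (clipped precision, recall) of candidate n-grams against one reference."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Counter intersection keeps the minimum count of each n-gram, so a word
    # repeated in the candidate is credited at most as often as it occurs
    # in the reference.
    matched = sum((cand & ref).values())
    precision = matched / max(sum(cand.values()), 1)
    recall = matched / max(sum(ref.values()), 1)
    return precision, recall


p, r = overlap("the the the cat", "the cat sat", n=1)
print(f"unigram precision={p:.3f} recall={r:.3f}")
```

The example also hints at why such metrics are debated: a degenerate candidate that repeats a frequent word still earns partial credit, and nothing in the score reflects coherence or meaning.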

Perplexity [60] based evaluation metrics are used to evaluate and compare language models, and measure the fluency and diversity of the generated samples [99], [9], [82]. Reverse Perplexity [170] and Forward Perplexity [68] scores are calculated by training language models on synthetic samples and real samples respectively, and then using these trained models to measure perplexity on real samples and generated samples respectively. Nevertheless, perplexity is a model dependent metric, and “how likely a sentence is generated by a given model” is not directly comparable across different models. Moreover, numerous studies find perplexity to be an inadequate measure of text quality [149], [27], since models with high likelihood can generate low-quality samples, while samples of good quality can present low likelihood. In addition, an otherwise perfect model is assigned infinite perplexity once its ability to generate test sentences is removed [49].
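
As a reminder of what is being computed, perplexity is the exponential of the negative mean log-probability that a language model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from some model; the function name and the toy inputs are ours.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that spreads its mass uniformly over 4 choices at every step
# has a perplexity of (approximately, up to floating point) 4.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

Under this sketch, Reverse Perplexity would correspond to scoring real samples with a model trained on generated text, and Forward Perplexity to scoring generated samples with a model trained on real text, matching the description above.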

Precision, recall and F1 (P, R, F1) are used to measure the distance of the generated samples to the real data manifold [96]. When precision is high, the generated samples are close to the data manifold, and when recall is high, the generator outputs samples that cover the manifold well. Metrics that aggregate precision and recall such as Fβ, a generalization of the F1 score, are used to quantify the relative importance of precision and recall [127]. Nevertheless, the data manifold of non-synthetic data is unknown, and therefore these quantities are impossible to compute in practice.

Content diversity measures how different the generated sentences are from each other, by either considering word choice, topic and meaning [151], [41], [56], or by looking at the level of sentence interestingness or unlikeliness [49]. Perplexity on a reference set, n-gram diversity [81] and Self-BLEU [175] are commonly used measures of the diversity of the generated samples. In addition, Backward-BLEU [142] evaluates test data using the generated samples as reference; the higher the score the more diverse the generator output. Lexical diversity [2] calculates the ratio of unique tokens to the total number of generated tokens. Similarly, Distinct-k or Dist-k [81] measures the total number of unique k-grams normalized by the total number of generated k-gram tokens to avoid favoring long sentences. Nevertheless, the Dist-k metric ignores the fact that infrequent k-grams contribute more to diversity than frequent ones and assigns the same weight to all k-grams that appear at least once. Distinct-1 and Distinct-2 are used to measure the diversity of constrained conversational responses [3], [167] and rhetorically constrained generated poems [92]. Entropy based metrics such as Ent-k [167] reflect the frequency difference of k-grams and are used to analyze the information content of the generated responses in dialogue systems [135], [106].
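
The diversity measures above are simple to compute over a pool of generated outputs. The sketch below shows Dist-k as unique k-grams over total k-grams, and an entropy-based score in the spirit of Ent-k, computed as the entropy of the empirical k-gram frequency distribution. Whitespace tokenization, the function names and the example responses are our assumptions.

```python
import math
from collections import Counter
from typing import List


def kgram_counts(sentences: List[str], k: int) -> Counter:
    """Count every k-gram across a pool of whitespace-tokenized sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1))
    return counts


def distinct_k(sentences: List[str], k: int) -> float:
    """Number of unique k-grams divided by the total number of k-grams."""
    counts = kgram_counts(sentences, k)
    total = sum(counts.values())
    return len(counts) / total if total else 0.0


def entropy_k(sentences: List[str], k: int) -> float:
    """Entropy (in nats) of the empirical k-gram frequency distribution."""
    counts = kgram_counts(sentences, k)
    total = sum(counts.values())
    if not total:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


responses = ["i do not know", "i do not know", "i like green tea"]
print("Dist-1:", distinct_k(responses, 1))  # 7 unique unigrams out of 12
print("Dist-2:", distinct_k(responses, 2))  # 6 unique bigrams out of 9
print("Ent-1 :", entropy_k(responses, 1))
```

Unlike Dist-k, the entropy score changes when the same set of k-grams is used with a more or less skewed frequency profile, which is the weakness of Dist-k pointed out above. With k = 1, `distinct_k` coincides with the lexical diversity ratio of unique to total tokens.
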
Unlike traditional evaluation metrics based on heuristics, learnable metrics train machine learning models on human annotated datasets to learn a scoring function that reproduces human judgements. Fully-learnt metrics leverage existing datasets of human ratings to learn automated evaluation metrics that fit the human data distribution, and can be tuned to measure specific properties of the generated texts, such as fluency, style, grammaticality, fidelity, etc. Linear regression based on human judgements is used to learn a model for scoring system summaries [119]. RUSE [143] combines sentence embeddings in a multi-layer perceptron regressor model. ESIM [17], [101] feeds the encoded representations of the candidate and the reference sentence into a feedforward regressor. BLEURT [132] fine-tunes BERT [23] on human ratings datasets for similarity score prediction. MAUDE [144] is proposed for the evaluation of online dialogue conversations and leverages sentence representations from pre-trained BERT to train text encoders which can distinguish between valid dialogue responses and fake examples. BARTScore [162] formulates the evaluation of generated text as a text generation task from pre-trained language models and measures the weighted probability of the generated text given another text as input or output. Hybrid metrics combine learnt elements with human-defined logical rules, for example, contextual embeddings with token alignment rules. BERTscore [165] evaluates generated text against gold standard references using soft-string similarity matches (i.e. cosine similarity) computed on pre-trained contextualized BERT [23] token embeddings. MoverScore [171] combines contextualized representations of system and reference texts with semantic measures of distance computed using Word Mover’s Distance [75]; the metric is extended to evaluate multi-sentence texts [19].
Human and statistical evaluation are combined in HUSE [49], an evaluation framework which estimates the optimal error rate of predicting whether a piece of text is human-written or machine-generated. However, a limitation of learned evaluation metrics is that they generally fail to generalize well across different systems [14].
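
The soft-string matching used by embedding-based metrics such as BERTscore can be illustrated without a neural model. In the sketch below, small hand-written vectors stand in for contextual token embeddings (an assumption made purely for illustration); each token is greedily matched to its most similar counterpart by cosine similarity, and the matches are averaged into precision, recall and F1. Refinements of the published metric, such as importance weighting of tokens, are omitted, and the function names are ours.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_match_score(cand: List[Vector], ref: List[Vector]) -> Tuple[float, float, float]:
    """Precision, recall and F1 from greedy max-cosine token matching."""
    # Precision: how well each candidate token is supported by the reference.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy 2-d "embeddings": the candidate covers two reference tokens exactly
# and the third only partially.
candidate_vecs = [(1.0, 0.0), (0.0, 1.0)]
reference_vecs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
p_soft, r_soft, f_soft = greedy_match_score(candidate_vecs, reference_vecs)
print(f"P={p_soft:.3f} R={r_soft:.3f} F1={f_soft:.3f}")
```

Compared with exact n-gram overlap, a paraphrase whose tokens have nearby embeddings still receives credit, which is the motivation for this family of metrics.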

c) Utility constraints: A commonly used approach in the literature to assess whether generated texts have desirable attributes is to rely on an attribute classifier and measure the classification score, i.e. the fraction of outputs generated by the model having the desired attribute [55], [139], [85]. Adversarial evaluation [10], [64] employs an evaluator trained to distinguish machine-generated from human-written texts, analogous to the discriminator in GANs [44]. On this note, pre-trained attribute classifiers and class-specific discriminators measure how well the generated samples match the conditioning labels on attributes such as sentiment, tense, voice, mood and negation [93], [83], [12], and guarantee the accuracy of stylistic text transfer [169], [139]. GLEU [110] was originally proposed for grammatical error correction, and later adopted for the evaluation of text style transfer since both tasks require localized edits to the input sentence; GLEU is found to present a reasonable balance between target style match and content retention [147].
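
The classification score described above reduces to a simple fraction once a classifier is available. In the sketch below, `toy_sentiment_classifier` is a hypothetical keyword rule used only so the example runs on its own; a real evaluation would plug in a trained attribute classifier in its place.

```python
from typing import Callable, Iterable


def attribute_accuracy(
    outputs: Iterable[str], target: str, classify: Callable[[str], str]
) -> float:
    """Fraction of generated outputs that the classifier assigns to the target attribute."""
    outputs = list(outputs)
    if not outputs:
        raise ValueError("need at least one generated output")
    return sum(classify(text) == target for text in outputs) / len(outputs)


def toy_sentiment_classifier(text: str) -> str:
    """Hypothetical keyword rule standing in for a trained sentiment classifier."""
    positive_words = {"great", "good", "love"}
    return "positive" if positive_words & set(text.lower().split()) else "negative"


generated = [
    "I love this phone",
    "The battery is great",
    "It broke after a day",
    "good value overall",
]
score = attribute_accuracy(generated, "positive", toy_sentiment_classifier)
print(score)  # 0.75
```

The score inherits any errors of the classifier itself, so it is best read alongside a content-preservation measure rather than on its own.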

Readability metrics such as Flesch-Kincaid Grade Level [69] and Flesch Reading Ease [29] are used to account for simplicity and measure the reading difficulty of a piece of text. Both metrics are computed as linear combinations of the number of words per sentence and number of syllables per word with different weighting factors.
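
The two linear combinations can be written out directly using the standard published coefficients. The sketch assumes that word, sentence and syllable counts are already available (syllable counting is a separate problem that is not addressed here); the function names and the example counts are ours.

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Hypothetical passage: 100 words in 5 sentences with 150 syllables,
# i.e. 20 words per sentence and 1.5 syllables per word.
ease = flesch_reading_ease(words=100, sentences=5, syllables=150)
grade = flesch_kincaid_grade(words=100, sentences=5, syllables=150)
print(f"reading ease={ease:.3f} grade level={grade:.2f}")
```

Both scores depend only on sentence length and word length, so they capture surface simplicity but say nothing about whether a simplified output preserves the meaning of its source.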

All constraints: While automated evaluation helps assess generated texts quickly and cheaply, the use of automated evaluation metrics is dependent upon their correlation with human judgements of quality [30]. Human evaluations remain the gold-standard in natural language generation, and automated evaluation metrics can be used as a proxy for human judgements only when there is reasonable correlation with human decisions. Ideally, automated evaluations are carried out simultaneously with human annotation studies, and not as a replacement of human evaluations. In text style transfer, human evaluations are conducted to determine how accurately constrained text generation methods identify stylistic textual attributes in the source input and replace these with desired target attributes in generated sentences [147]. In conversational systems, responses generated by open-domain chatbots are evaluated across two dimensions: i) humanness, as a proxy for the fluency and coherence of the generated responses, and ii) attribute consistency, to determine whether the style and topic enforced by the generation model are well captured [99]. Human evaluations are also carried out to determine the plausibility of the generated response, as well as to measure its content richness and how much new information it adds to the conversation [3]. Outputs generated by neural conversational systems are also assessed for quality, style and topic to determine whether the acquisition of styles of famous personalities, characters, or professionals is achievable, and whether the topic of the conversation can be influenced in particular directions [153].

VII. DISCUSSION AND OPEN CHALLENGES

In what follows we review the main challenges associated with constrained text generation, outlining why these challenges have not been solved yet, and present the most promising research directions to focus on next.

In our view, constrained text generation is a more difficult problem compared to other instances of text generation. The difficulty arises from a multitude of factors, including lack of model expressiveness which makes it difficult for current models to incorporate constraints into the objective function, lack of suitable evaluation metrics to assess the extent to which constraints are satisfied (and this becomes even more challenging when there are multiple constraints present), difficulty in the constrained optimization of non-differentiable reward functions, and finally lack of constrained text generation datasets that are illustrative of a wide diversity of constraints. Due to these pressing unsolved issues, constrained text generation remains an open challenge in the research community. Advancing the state-of-the-art requires considerably more collective and focused effort. Below we identify the most promising directions for advancing the state-of-the-art for safe and robust constrained NLG.

Multiple constraint satisfaction: Most approaches proposed for constrained text generation focus on generating sentences that meet one single desired constraint; nevertheless, generating sequences that simultaneously satisfy multiple lexical constraints is an important open research problem in text generative models [90], [78], [53]. While incorporating one constraint is already hard enough due to lack of model expressiveness, incorporating multiple constraints poses significant challenges in terms of defining the loss function accounting for all the desired constraints, difficulty in optimizing it and evaluating whether each constraint is satisfied. Approaches that convert the multiple constraint satisfaction problem into allowing the inclusion of pre-specified lexical constraints at decoding time are not optimal either: on the one hand, decoding complexity increases exponentially or linearly in the number of constraints, and on the other hand forcing constraints at every step of the generation process impacts the quality and naturalness of generated texts [120]. Moreover, many model architectures are designed for sequential sentence generation only (vs. non-monotonic text generation) and it is non-trivial to impose decoding time constraints while maintaining optimal text generation quality [103].

Dynamically defined constraints: Current approaches to constrained text generation assume there is prior knowledge of the constrained textual attributes and the finite set of values these attributes can take on. Nevertheless, there are situations when it may be desirable to impose constraints dynamically, e.g., in conversational systems depending on the system user’s statements, reactions and emotions. When dynamically defining constraints, the main challenges are the lack of model expressiveness and of robust ways to evaluate whether these constraints are satisfied. In the literature, controlling the realization of a sentence based on another sentence’s syntax and semantics is a less explored setting for constrained text generation with dynamic constraints which does not require prior knowledge of all the values the control variable might take on [16]. To this end, disentangled latent space representations of syntax and semantics are essential for the manipulation of sentence attributes in tasks such as unsupervised paraphrase generation and syntax-transfer generation [6].

Generative reasoning: Current large-scale text generation models display an impressive ability to generate fluent texts; nevertheless, composing realistically plausible sentences in the presence of constraints remains a significant open challenge. This is illustrative of all challenges associated with constrained text generation, including lack of model expressiveness, lack of suitable evaluation metrics, difficulty in constrained optimization and lack of constrained text generation datasets. Nevertheless, endowing generative models with commonsense reasoning abilities is an important milestone towards advancing machine understanding and intelligence. The CommonGen [86] benchmark proposes the task of constrained text generation with generative commonsense reasoning, where given a set of concepts the task is to generate a coherent sentence describing an everyday scenario using the given concepts. To do this successfully, the generative model must reason over commonsense relations between the given concepts (relational reasoning), and infer novel combinations of familiar concepts (compositional generalization). Preliminary analysis shows that current state-of-the-art pre-trained models struggle at the task and generate implausible sentences by a large margin.

Attribute specific datasets: The lack of annotated datasets for attribute specific text generation constitutes a bottleneck in the development and adaptation of models for tasks that require fine-grained control over style and topics. For example, in dialogue systems the absence of attribute annotated conversational datasets that can be used for fine-tuning large scale pre-trained models limits control over the generated responses for a desired attribute [99]. Moreover, such attribute annotated datasets can help with the personalization of dialogue systems and make dialogues safe, supportive and engaging [136], [164]. Personalized dialogue agents that display consistent personalities and viewpoints overcome the unsatisfying experience of a persona-free chit-chat model. Nevertheless, imposing conversational goals on a dialogue agent for learning target-guided strategies requires keyword-augmented conversation datasets for learning how to steer the conversation towards a designated target subject [148]. The collection of datasets that capture a wide diversity of constraints and are representative of many real world situations is critical for advancing safe and robust constrained text generation. Existing benchmarks focused on politeness [98], formality [125], sentiment [139], and writing style [61] are rather limited in nature and do not offer fine-grained control over stylistic attributes. StylePTB [97] aims to allow compositional transfer over a wider range of fine-grained stylistic constructs, including lexical, semantic, stylistic and thematic transfers.

Rule constraints: While most research that is currently trying to address constrained text generation is focusing on the incorporation of pre-defined utility or lexical constraints, the satisfaction of rule based constraints is equally relevant, particularly when used to define format and syntactic conditions on the output. However, the lack of model expressiveness makes it challenging to incorporate rule based constraints into the loss function at training time. We encourage more effort in this direction, which is likely to open a plethora of new possibilities in how constraints are specified, incorporated and satisfied in models particularly designed for constrained neural text generation.

Evaluation of constrained text generation: In general, evaluation of text generative models is an open challenge. The field is missing robust automated evaluation metrics that correlate with human judgements across multiple dimensions of text quality. Evaluation of models for constrained text generation is currently done using the same flawed existing metrics commonly used in unconditional and conditional text generation evaluation, or in an informal way, oftentimes in the absence of a rigorous evaluation procedure. Human evaluation remains the gold standard way to assess text quality; however, designing evaluation metrics tailored specifically to assessing whether generated texts meet desired constraints, together with new benchmark datasets for the evaluation of constrained sequence generation, are important next steps [78].

Adversarial attacks: Adversarial examples exploit vulnerabilities in text generation models and represent an active research area. Adversarial triggers in the form of input-agnostic sequences of tokens concatenated to any input from a dataset can trigger a pre-trained language model to produce biased, racist and discriminatory outputs even when these models are carefully fine-tuned and optimized against adversarial triggers [152]. Gradient-based adversarial trigger phrase search techniques are used to generate input prompts to a pre-trained language model that induce biases in the generated output and allow the study of strategies for bias mitigation [141]. Constrained text generation models that are robust to adversarial attacks are needed for the beneficial use of machine learning and artificial intelligence technology in real world applications, as well as to mitigate any potential societal harms and biases associated with the deployment of large pre-trained language models.

While the above directions outline some of the most pressing research challenges associated with constrained text generation, it is nevertheless a non-exhaustive list of all research problems that need increased attention. Other important open challenges include the use of constrained text generation for personalized agents in a wide variety of contexts, such as in dialogue settings [164], and new benchmark datasets that are reflective of real-world constraints for both training/fine-tuning and evaluating constrained text generation models.

VIII. CONCLUSION

In this work, we have presented the reasons why constrained natural language generation is an important, yet highly challenging and largely unsolved research problem. Our first contribution consists in clarifying the difference between the ambiguous use of unconditional, conditional and constrained terms in the natural language generation literature, and drawing clear boundaries between these concepts by exemplifying instances of natural language generation tasks with their associated conditions and constraints. Among different paradigms of text generation, we consider constrained text generation to be particularly challenging (if not the most challenging), yet also extremely useful. We identify general reasons why constrained natural language generation deserves significantly more attention in the research community, including the lack of model expressiveness in incorporating constraints into the objective function at training time, difficulty in constrained optimization algorithms, the lack of suitable evaluation metrics for robustly assessing, comparing model outputs and claiming success in constrained natural language generation, as well as the lack of constrained text generation datasets that are representative of a wide range of real-world constraints for training and fine-tuning these models. We then survey a representative body of recent literature on constrained text generation using neural networks, presenting the main approaches and methods used, as well as their limitations. We hope our work can serve as an informative guide for both researchers and practitioners to become familiar with the current methodology and main challenges, as well as an advocate for advancing the state-of-the-art in constrained natural language generation. We invite future work in solving the outlined challenges for better, useful, safer, robust constrained text generation and evaluation.

References

[1] P. Anderson et al., "Guided Open Vocabulary Image Captioning with Constrained Beam Search," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2017, pp. 936-945.

[2] K. Bache, D. Newman, and P. Smyth, "Text-based measures of document diversity," in Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 23-31.

[3] A. Baheti et al., "Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2018, pp. 3970-3980.

[4] A. Balakrishnan et al., "Constrained Decoding for Neural NLG from Compositional Representations in Task-Oriented Dialogue," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 831-844.

[5] S. Banerjee and A. Lavie, "METEOR: An automatic metric for MT evaluation with improved correlation with human judgments," in Proc. ACL Workshop Intrinsic Extrinsic Evaluation Measures Mach. Trans. Summariz., 2005, pp. 65-72.

[6] Y. Bao et al., "Generating Sentences from Disentangled Syntactic and Semantic Spaces," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 6008-6019.

[7] Y. Bengio et al., "A neural probabilistic language model," J. Mach. Learn. Res., vol. 3, pp. 1137-1155, Feb. 2003.

[8] A. Borji, "Pros and cons of GAN evaluation measures," Comput. Vis. Image Understand., vol. 179, pp. 41-65, 2019.

[9] A. Bosselut et al., "Discourse-Aware Neural Rewards for Coherent Text Generation," in Proc. NAACL-HLT, 2018.

[10] S. R. Bowman et al., "Generating sentences from a continuous space," arXiv:1511.06349, 2015.

[11] T. B. Brown et al., "Language Models are Few-Shot Learners," arXiv:2005.14165, 2020.

[12] E. Bruni and R. Fernández, "Adversarial evaluation for open-domain dialogue generation," in Proc. 18th Annu. SIGDIAL Meeting Discourse Dialogue, 2017, pp. 284-288.

[13] Z. Cao et al., "Faithful to the original: Fact aware neural abstractive summarization," in Proc. AAAI Conf. Artif. Intell., 2018.

[14] A. T. Chaganty, S. Mussmann, and P. Liang, "The price of debiasing automatic metrics in natural language evaluation," arXiv:1807.02202, 2018.

[15] W. Chan et al., "KERMIT: Generative insertion-based modeling for sequences," arXiv:1906.01604, 2019.

[16] M. Chen et al., "Controllable Paraphrase Generation with a Syntactic Exemplar," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 5972-5984.

[17] Q. Chen et al., "Enhanced LSTM for Natural Language Inference," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 1657-1668.

[18] L. Choshen et al., "On the Weaknesses of Reinforcement Learning for Neural Machine Translation," arXiv:1907.01752, 2019.

[19] E. Clark, A. Celikyilmaz, and N. A. Smith, "Sentence mover's similarity: Automatic evaluation for multi-sentence texts," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2748-2760.

[20] E. Clark, Y. Ji, and N. A. Smith, "Neural text generation in stories using entity representations as context," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 2250-2260.

[21] J. Crego et al., "Systran's pure neural machine translation systems," arXiv:1610.05540, 2016.

[22] S. Dathathri et al., "Plug and Play Language Models: a Simple Approach to Controlled Text Generation," arXiv:1912.02164, 2019.

[23] J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language understanding," arXiv:1810.04805, 2018.

[24] A. Fan, D. Grangier, and M. Auli, "Controllable Abstractive Summarization," in Proc. Workshop Neural Mach. Trans. Generat., 2018, pp. 45-54.

[25] A. Fan, M. Lewis, and Y. Dauphin, "Hierarchical neural story generation," arXiv:1805.04833, 2018.

[26] A. Fan, M. Lewis, and Y. Dauphin, "Strategies for Structuring Story Generation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 2650-2660.

[27] W. Fedus, I. Goodfellow, and A. M. Dai, "MaskGAN: Better Text Generation via Filling in the Blanks," in Proc. Int. Conf. Learn. Representations, 2018.

[28] J. Ficler and Y. Goldberg, "Controlling Linguistic Style Aspects in Neural Language Generation," in Proc. Workshop Stylistic Variation, 2017, pp. 94-104.

[29] R. F. Flesch, How to Write Plain English: A Book for Lawyers and Consumers. New York, NY, USA: Harpercollins, 1979.

[30] M. Fomicheva and L. Specia, "Taking MT Evaluation Metrics to Extremes: Beyond Correlation with Human Judgments," Comput. Linguistics, vol. 45, no. 3, pp. 515-558, 2019.

[31] Y. Fu and Y. Feng, "Natural answer generation with heterogeneous memory," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 185-195.

[32] Z. Fu et al., "Style transfer in text: Exploration and evaluation," in Proc. AAAI Conf. Artif. Intell., 2018.

[33] M. Galley et al., "deltaBLEU: A discriminative metric for generation tasks with intrinsically diverse targets," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2015, pp. 445-450.

[34] J. Gao, M. Galley, and L. Li, Neural Approaches to Conversational AI: Question Answering, Task-Oriented Dialogues and Social Chatbots. Boston, MA, USA: Now Foundations and Trends, 2019.

[35] Y. Gao et al., "Reward learning for efficient reinforcement learning in extractive document summarisation," arXiv:1907.12894, 2019.

[36] A. Gatt and E. Krahmer, "Survey of the state of the art in natural language generation: Core tasks, applications and evaluation," J. Artif. Intell. Res., vol. 61, pp. 65-170, 2018.

[37] P. Gervás, "Computational approaches to storytelling and creativity," AI Mag., vol. 30, no. 3, p. 49, 2009.

[38] M. Ghazvininejad et al., "Generating topical poetry," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2016, pp. 1183-1191.

[39] M. Ghazvininejad et al., "Hafez: An interactive poetry generation system," in Proc. ACL, System Demonstrations, 2017, pp. 43-48.

[40] S. Ghosh et al., "Affect-LM: A neural language model for customizable affective text generation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 634-642.

[41] K. Gimpel et al., "A systematic exploration of diversity in machine translation," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2013, pp. 1100-1111.

[42] D. Gkatzia and S. Mahamood, "A snapshot of NLG evaluation practices 2005-2014," in Proc. 15th European Workshop Nat. Lang. Generat., 2015, pp. 57-60.

[43] S. Goldfarb-Tarrant, H. Feng, and N. Peng, "Plan, write, and revise: An interactive system for open-domain story generation," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics, 2019, pp. 89-97.

[44] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672-2680.

[45] J. Gu, Q. Liu, and K. Cho, "Insertion-based decoding with automatically inferred generation order," arXiv:1902.01370, 2019.

[46] J. Gu, C. Wang, and J. Zhao, "Levenshtein transformer," arXiv:1905.11006, 2019.

[47] K. Guu et al., "Generating sentences by editing prototypes," Trans. Assoc. Comput. Linguistics, vol. 6, pp. 437-450, 2018.

[48] H. H. Lee et al., "RecipeGPT: Generative pre-training based cooking recipe generation and evaluation system," in Proc. Companion Web Conf., 2020, pp. 181-184.

[49] T. B. Hashimoto, H. Zhang, and P. Liang, "Unifying human and statistical evaluation for natural language generation," arXiv:1904.02792, 2019.

[50] C. Hokamp and Q. Liu, "Lexically constrained decoding for sequence generation using grid beam search," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 1535-1546.

[51] A. Holtzman et al., "Learning to write with cooperative discriminators," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1638-1649.

[52] A. Holtzman et al., "The curious case of neural text degeneration," arXiv:1904.09751, 2019.

[53] L.-H. Hsieh, Y.-Y. Lee, and E.-P. Lim, "ENCONTER: Entity constrained progressive sequence generation via insertion-based transformer," arXiv:2103.09548, 2021.

[54] J. E. Hu et al., "Improved lexically constrained decoding for translation and monolingual rewriting," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2019, pp. 839-850.

[55] Z. Hu et al., "Toward controlled generation of text," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1587-1596.

[56] D. Ippolito et al., "Comparison of diverse decoding methods from conditional language models," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2018.

[57] M. Iyyer et al., "A neural network for factoid question answering over paragraphs," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2014, pp. 633-644.

[58] M. Iyyer et al., "Adversarial example generation with syntactically controlled paraphrase networks," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 1875-1885.

[59] N. Jaques et al., "Way off-policy batch deep reinforcement learning of implicit human preferences in dialog," arXiv:1907.00456, 2019.

[60] F. Jelinek et al., "Perplexity—A measure of the difficulty of speech recognition tasks," J. Acoust. Soc. Amer., vol. 62, no. S1, pp. S63-S63, 1977.

[61] H. Jhamtani et al., "Shakespearizing modern language using copy-enriched sequence-to-sequence models," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2017, p. 10.

[62] V. John et al., "Disentangled representation learning for non-parallel text style transfer," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 424-434.

[63] T. Kajiwara, "Negative lexically constrained decoding for paraphrase generation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2019, pp. 6047-6052.

[64] A. Kannan and O. Vinyals, "Adversarial evaluation of dialogue models," arXiv:1701.08198, 2017.

[65] N. S. Keskar et al., "CTRL: A conditional transformer language model for controllable generation," arXiv:1909.05858, 2019.

[66] M. Khalifa, H. Elsahar, and M. Dymetman, "A distributional approach to controlled text generation," arXiv:2012.11635, 2020.

[67] Y. Kikuchi et al., "Controlling output length in neural encoder-decoders," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2016.

[68] Y. Kim et al., "Adversarially regularized autoencoders for generating discrete structures," arXiv:1706.04223, 2017.

[69] J. P. Kincaid et al., "Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel," Rep. Research Branch, Chief of Naval Tech. Training, Naval Air Station Memphis, Millington, TN, USA, 1975.

[70] R. Knowles and P. Koehn, "Neural interactive translation prediction," in Proc. 12th Conf. Assoc. Mach. Trans. Americas, vol. 1, 2016, pp. 107-120.

[71] X. Kong et al., "An adversarial approach to high-quality, sentiment-controlled neural dialogue generation," arXiv:1901.07129, 2019.

[72] B. Kratzwald, A. Eigenmann, and S. Feuerriegel, "RankQA: Neural question answering with answer re-ranking," arXiv:1906.03008, 2019.

[73] B. Krause et al., "GeDi: Generative discriminator guided sequence generation," arXiv:2009.06367, 2020.

[74] J. Kreutzer, J. Uyheng, and S. Riezler, "Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2018, pp. 1777-1788.

[75] M. Kusner et al., "From word embeddings to document distances," in Proc. Int. Conf. Mach. Learn., 2015, pp. 957-966.

[76] P. Laban et al., "The summary loop: Learning to write abstractive summaries without examples," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2020, vol. 1.

[77] G. Lample et al., "Multiple-attribute text rewriting," presented at the Int. Conf. Learn. Representations, 2018.

[78] S. Latif et al., "Backward-forward sequence generative network for multiple lexical constraints," in Proc. Int. Conf. Artif. Intell. Appl. Innovations, Springer, 2020, pp. 39-50.

[79] C. van der Lee et al., "Best practices for the human evaluation of automatically generated text," in Proc. 12th Int. Conf. Nat. Lang. Generat., 2019, pp. 355-368.

[80] J. Lee, E. Mansimov, and K. Cho, "Deterministic non-autoregressive neural sequence modeling by iterative refinement," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2018, pp. 1173-1182.

[81] J. Li et al., "A diversity-promoting objective function for neural conversation models," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2016, pp. 110-119.

[82] J. Li et al., "A persona-based neural conversation model," in Proc. Annu. Meeting Assoc. Comput. Linguistics, 2016, pp. 994-1003.

[83] J. Li et al., "Adversarial learning for neural dialogue generation," arXiv:1701.06547, 2017.

[84] J. Li et al., "Deep reinforcement learning for dialogue generation," in Proc. Conf. Empirical Methods Nat. Lang. Process., 2016, pp. 1192-1202.

[85] J. Li et al., "Delete, retrieve, generate: A simple approach to sentiment and style transfer," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., 2018, pp. 1865-1874.

[86] B. Y. Lin et al., "CommonGen: A constrained text generation dataset towards generative commonsense reasoning," arXiv:1911.03705, 2019.

[87] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Text Summarization Branches Out, Barcelona, Spain, 2004, pp. 74-81.

[88] A. Liu et al., "DEXPERTS: Decoding-Time Controlled Text Generation with Experts and Anti-Experts," unpublished.

[89] C.-W. Liu et al., "How not to evaluate your dialogue system: An empirical study of unsupervised evaluation metrics for dialogue response generation," arXiv:1603.08023, 2016.

[90] D. Liu et al., "BFGAN: Backward and forward generative adversarial networks for lexically constrained sentence generation," IEEE/ACM Trans. Audio, Speech, Language Process., vol. 27, no. 12, pp. 2350-2361, Dec. 2019.

[91] P. J. Liu et al., "Generating Wikipedia by summarizing long sequences," arXiv:1801.10198, 2018.

[92] Z. Liu et al., "Rhetorically controlled encoder-decoder for modern Chinese poetry generation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 1992-2001.

[93] L. Logeswaran, H. Lee, and S. Bengio, "Content preserving text generation with attribute controls," presented at the Int. Conf. Learning Representations, Vancouver, BC, Canada, 2018.

[94] E. Loginova, S. Varanasi, and G. Neumann, "Towards multilingual neural question answering," in Proc. European Conf. Advances Databases Inf. Syst., Prague, Czech Republic, 2018, pp. 274-285.

[95] X. Lu et al., "NeuroLogic decoding: (Un)supervised neural text generation with predicate logic constraints," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., Online, 2021, pp. 4288-4299.

[96] M. Lucic et al., "Are GANs created equal? A large-scale study," in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, 2018, pp. 700-709.

[97] Y. Lyu et al., "StylePTB: A compositional benchmark for fine-grained controllable text style transfer," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., Online, 2021, pp. 2116-2138.

[98] A. Madaan et al., "Politeness transfer: A tag and generate approach," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Online, 2020, pp. 1869-1881.

[99] A. Madotto et al., "Plug-and-play conversational models," in Proc. Conf. Empirical Methods Nat. Lang. Process., Online, 2020, pp. 2422-2433.

[100] L. Martin et al., "Controllable sentence simplification," arXiv:1910.02677, 2019.

[101] N. Mathur, T. Baldwin, and T. Cohn, "Putting evaluation in context: Contextual embeddings improve machine translation evaluation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 2799-2808.

[102] H. Mei, M. Bansal, and M. R. Walter, "Coherent dialogue with attention-based language models," in Proc. AAAI Conf. Artif. Intell., Honolulu, HI, USA, 2017.

[103] N. Miao et al., "CGMH: Constrained sentence generation by Metropolis-Hastings sampling," in Proc. AAAI Conf. Artif. Intell., Honolulu, HI, USA, 2019, pp. 6834-6842.

[104] Y. Miao and P. Blunsom, "Language as a latent variable: Discrete generative models for sentence compression," in Proc. Conf. Empirical Methods Nat. Lang. Process., Austin, TX, USA, 2016, pp. 319-328.

[105] L. Mou et al., "Backward and forward language modeling for constrained sentence generation," arXiv:1512.06612, 2015.

[106] L. Mou et al., "Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation," arXiv:1607.00970, 2016.

[107] J. Mueller, D. Gifford, and T. Jaakkola, "Sequence to better sequence: Continuous revision of combinatorial structures," in Proc. Int. Conf. Mach. Learn., Sydney, NSW, Australia, 2017, pp. 2536-2544.

[108] R. Nallapati, F. Zhai, and B. Zhou, "SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents," in Proc. AAAI Conf. Artif. Intell., San Francisco, CA, USA, 2017.

[109] R. Nallapati et al., "Abstractive text summarization using sequence-to-sequence RNNs and beyond," arXiv:1602.06023, 2016.

[110] C. Napoles et al., "Ground truth for grammatical error correction metrics," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Beijing, China, 2015, pp. 588-593.

[111] N. I. Nikolov and R. Hahnloser, "Abstractive document summarization without parallel data," in Proc. 12th Int. Conf. Language Resources Evaluation, Marseille, France, 2020, pp. 6638-6644.

[112] D. Nishihara, T. Kajiwara, and Y. Arase, "Controllable text simplification with lexical constraint loss," in Proc. Annu. Meeting Assoc. Comput. Linguistics: Student Research Workshop, Florence, Italy, 2019, pp. 260-266.

[113] T. Niu and M. Bansal, "Polite dialogue generation without parallel data," Trans. Assoc. Comput. Linguistics, vol. 6, pp. 373-389, 2018.

[114] H. G. Oliveira, "A survey on intelligent poetry generation: Languages, features, techniques, reutilisation and evaluation," in Proc. 10th Int. Conf. Natural Lang. Generat., Santiago de Compostela, Spain, 2017, pp. 11-20.

[115] K. Papineni et al., "BLEU: a method for automatic evaluation of machine translation," in Proc. 40th Annu. Meeting Assoc. Comput. Linguistics, Philadelphia, PA, USA, 2002, pp. 311-318.

[116] R. Pasunuru, H. Guo, and M. Bansal, "DORB: Dynamically optimizing multiple rewards with bandits," in Proc. Conf. Empirical Methods Nat. Lang. Process., Online, 2020, pp. 7766-7780.

[117] R. Paulus, C. Xiong, and R. Socher, "A deep reinforced model for abstractive summarization," arXiv:1705.04304, 2017.

[118] N. Peng et al., "Towards controllable story generation," in Proc. Workshop Storytelling, New Orleans, LA, USA, 2018, pp. 43-49.

[119] M. Peyrard, T. Botschen, and I. Gurevych, "Learning to score system summaries for better content selection evaluation," in Proc. Workshop New Frontiers Summarization, Copenhagen, Denmark, 2017, pp. 74-84.

[120] M. Post and D. Vilar, "Fast lexically constrained decoding with dynamic beam allocation for neural machine translation," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., New Orleans, LA, USA, 2018, pp. 1314-1324.

[121] L. Qin et al., "Counterfactual story reasoning and generation," in Proc. Conf. Empirical Methods Nat. Lang. Process., Hong Kong, 2019, pp. 5046-5056.

[122] A. Radford et al., "Improving language understanding with unsupervised learning," OpenAI, San Francisco, CA, USA, Tech. Rep., 2018.

[123] A. Radford et al., "Language models are unsupervised multitask learners," OpenAI Blog, vol. 1, p. 8, 2019.

[124] M. Ranzato et al., "Sequence level training with recurrent neural networks," arXiv:1511.06732, 2015.

[125] S. Rao and J. Tetreault, "Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., New Orleans, LA, USA, 2018, pp. 129-140.

[126] A. M. Rush, S. Chopra, and J. Weston, "A neural attention model for abstractive sentence summarization," in Proc. Conf. Empirical Methods Nat. Lang. Process., Lisbon, Portugal, 2015, pp. 379-389.

[127] M. S. Sajjadi et al., "Assessing generative models via precision and recall," in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, 2018, pp. 5228-5237.

[128] A. Saleh et al., "Hierarchical reinforcement learning for open-domain dialog," arXiv:1909.07547, 2019.

[129] M. Schubotz et al., "Introducing MathQA - A math-aware question answering system," in Proc. ACM SIGIR Int. Conf. Theory Inf. Retrieval, Tianjin, China, 2018.

[130] A. See, P. J. Liu, and C. D. Manning, "Get to the point: Summarization with pointer-generator networks," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, 2017, pp. 1073-1083.

[131] A. See et al., "What makes a good conversation? How controllable attributes affect human judgments," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., Minneapolis, MN, USA, 2019, pp. 1702-1723.

[132] T. Sellam, D. Das, and A. P. Parikh, "BLEURT: Learning robust metrics for text generation," arXiv:2004.04696, 2020.

[133] R. Sennrich, B. Haddow, and A. Birch, "Controlling politeness in neural machine translation via side constraints," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., San Diego, CA, USA, 2016, pp. 35-40.

[134] I. V. Serban et al., "Building end-to-end dialogue systems using generative hierarchical neural network models," in Proc. AAAI Conf. Artif. Intell., Phoenix, AZ, USA, 2016.

[135] I. V. Serban et al., "A hierarchical latent variable encoder-decoder model for generating dialogues," in Proc. AAAI Conf. Artif. Intell., San Francisco, CA, USA, 2017.

[136] I. V. Serban et al., "A survey of available corpora for building data-driven dialogue systems," arXiv:1512.05742, 2015.

[137] L. Sha, "Gradient-guided unsupervised lexically constrained text generation," in Proc. Conf. Empirical Methods Nat. Lang. Process., Online, 2020, pp. 8692-8703.

[138] D. J. Shah, T. Schuster, and R. Barzilay, "Automatic fact-guided sentence modification," in Proc. AAAI Conf. Artif. Intell., New York, NY, USA, 2020, pp. 8791-8798.

[139] T. Shen et al., "Style transfer from non-parallel text by cross-alignment," in Proc. Adv. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 6830-6841.

[140] E. Sheng et al., "The woman worked as a babysitter: On biases in language generation," in Proc. Conf. Empirical Methods Nat. Lang. Process., Hong Kong, 2019, pp. 3407-3412.

[141] E. Sheng et al., "Towards controllable biases in language generation," in Proc. Conf. Empirical Methods Nat. Lang. Process.: Findings, Online, 2020, pp. 3239-3254.

[142] Z. Shi et al., "Toward diverse text generation with inverse reinforcement learning," in Proc. 27th Int. Joint Conf. Artif. Intell., Stockholm, Sweden, 2018, pp. 4361-4367.

[143] H. Shimanaka, T. Kajiwara, and M. Komachi, "RUSE: Regressor using sentence embeddings for automatic machine translation evaluation," in Proc. 3rd Conf. Mach. Trans.: Shared Task Papers, Brussels, Belgium, 2018, pp. 751-758.

[144] K. Sinha et al., "Learning an unreferenced metric for online dialogue evaluation," arXiv:2005.00583, 2020.

[145] J. van Stegeren and M. Theune, "Narrative generation in the wild: Methods from NaNoGenMo," in Proc. 2nd Workshop Storytelling, Hong Kong, 2019, pp. 65-74.

[146] M. Stern et al., "Insertion transformer: Flexible sequence generation via insertion operations," in Proc. Int. Conf. Mach. Learn., Long Beach, CA, USA, 2019, pp. 5976-5985.

[147] A. Sudhakar, B. Upadhyay, and A. Maheswaran, "Transforming delete, retrieve, generate approach for controlled text style transfer," arXiv:1908.09368, 2019.

[148] J. Tang et al., "Target-guided open-domain conversation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Florence, Italy, 2019, pp. 5624-5634.

[149] L. Theis, A. van den Oord, and M. Bethge, "A note on the evaluation of generative models," presented at the Int. Conf. Learning Representations, San Juan, Puerto Rico, 2016.

[150] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 5998-6008.

[151] A. K. Vijayakumar et al., "Diverse beam search: Decoding diverse solutions from neural sequence models," arXiv:1610.02424, 2016.

[152] E. Wallace et al., "Universal adversarial triggers for attacking and analyzing NLP," in Proc. Conf. Empirical Methods Nat. Lang. Process., Hong Kong, 2019.

[153] D. Wang et al., "Steering output style and topic in neural response generation," in Proc. Conf. Empirical Methods Nat. Lang. Process., Copenhagen, Denmark, 2017, pp. 2140-2150.

[154] W. Wang et al., "Topic-guided variational auto-encoder for text generation," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., Minneapolis, MN, USA, 2019, pp. 166-177.

[155] S. Welleck et al., "Non-monotonic sequential text generation," arXiv:1902.02192, 2019.

[156] G. Wiese, D. Weissenborn, and M. Neves, "Neural domain adaptation for biomedical question answering," in Proc. 21st Conf. Comput. Natural Lang. Learn., Vancouver, BC, Canada, 2017, pp. 281-289.

[157] S. Wiseman, S. M. Shieber, and A. M. Rush, "Challenges in data-to-document generation," arXiv:1707.08052, 2017.

[158] Y. Wu and B. Hu, "Learning to extract coherent summary via deep reinforcement learning," in Proc. AAAI Conf. Artif. Intell., New Orleans, LA, USA, 2018.

[159] J. Wuebker et al., "Models and inference for prefix-constrained machine translation," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Berlin, Germany, 2016, pp. 66-75.

[160] C. Xing et al., "Topic augmented neural response generation with a joint attention mechanism," arXiv:1606.08340, 2016.

[161] K. Yang and D. Klein, "FUDGE: Controlled text generation with future discriminators," in Proc. Conf. North Amer. Chapter Assoc. Comput. Linguistics: Human Lang. Technol., Online, 2021, pp. 3511-3535.

[162] W. Yuan, G. Neubig, and P. Liu, "BARTScore: Evaluating generated text as text generation," arXiv:2106.11520, 2021.

[163] B. Zhang et al., "Variational neural machine translation," in Proc. Conf. Empirical Methods Nat. Lang. Process., Austin, TX, USA, 2016, pp. 521-530.

[164] S. Zhang et al., "Personalizing dialogue agents: I have a dog, do you have pets too?" in Proc. Annu. Meeting Assoc. Comput. Linguistics, Melbourne, Australia, 2018, pp. 2204-2213.

[165] T. Zhang et al., "BERTScore: Evaluating text generation with BERT," arXiv:1904.09675, 2019.

[166] X. Zhang and M. Lapata, "Chinese poetry generation with recurrent neural networks," in Proc. Conf. Empirical Methods Nat. Lang. Process., Doha, Qatar, 2014, pp. 670-680.

[167] Y. Zhang et al., "Generating informative and diverse conversational responses via adversarial information maximization," in Proc. Adv. Neural Inf. Process. Syst., Montreal, QC, Canada, 2018, pp. 1810-1820.

[168] Y. Zhang et al., "POINTER: Constrained text generation via insertion-based generative pre-training," arXiv:2005.00558, 2020.

[169] Z. Zhang et al., "Style transfer as unsupervised machine translation," arXiv:1808.07894, 2018.

[170] J. Zhao et al., "Adversarially regularized autoencoders," in Proc. 35th Int. Conf. Mach. Learn., Stockholm, Sweden, 2018, pp. 9405-9420.

[171] W. Zhao et al., "MoverScore: Text generation evaluation with contextualized embeddings and earth mover distance," in Proc. Conf. Empirical Methods Nat. Lang. Process., Hong Kong, 2019, pp. 563-578.

[172] Y. Zheng et al., "Personalized dialogue generation with diversified traits," arXiv:1901.09672, 2019.

[173] C. Zhou and G. Neubig, "Multi-space variational encoder-decoders for semi-supervised labeled sequence transduction," in Proc. Annu. Meeting Assoc. Comput. Linguistics, Vancouver, BC, Canada, 2017, pp. 310-320.

[174] H. Zhou et al., "Emotional chatting machine: Emotional conversation generation with internal and external memory," in Proc. AAAI Conf. Artif. Intell., New Orleans, LA, USA, 2018.

[175] Y. Zhu et al., "Texygen: A benchmarking platform for text generation models," in Proc. 41st Int. ACM SIGIR Conf. Research Dev. Inf. Retrieval, Ann Arbor, MI, USA, 2018, pp. 1097-1100.

[176] D. M. Ziegler et al., "Fine-tuning language models from human preferences," arXiv:1909.08593, 2019.



Key: 2022 WhyisConstrainedNeuralLanguageG
Author: Qiaozhu Mei, Cristina Garbacea
Title: Why is Constrained Neural Language Generation Particularly Challenging?
DOI: 10.48550/arXiv.2206.05395
Year: 2022