2023 PreTrainPromptandPredictASystem


Subject Headings: Prompt Engineering, Tuning-Free Prompting.

Notes

Cited By

Quotes

Abstract

This article surveys and organizes research works in a new paradigm in natural language processing, which we dub “prompt-based learning.” Unlike traditional supervised learning, which trains a model to take in an input [math]\displaystyle{ x }[/math] and predict an output [math]\displaystyle{ y }[/math] as [math]\displaystyle{ P(y|x) }[/math], prompt-based learning is based on language models that model the probability of text directly. To use these models to perform prediction tasks, the original input [math]\displaystyle{ x }[/math] is modified using a template into a textual string prompt [math]\displaystyle{ x' }[/math] that has some unfilled slots, and then the language model is used to probabilistically fill the unfilled information to obtain a final string [math]\displaystyle{ \hat{x} }[/math], from which the final output [math]\displaystyle{ y }[/math] can be derived. This framework is powerful and attractive for a number of reasons: it allows the language model to be pre-trained on massive amounts of raw text, and by defining a new prompting function the model is able to perform few-shot or even zero-shot learning, adapting to new scenarios with few or no labeled data. In this article, we introduce the basics of this promising paradigm, describe a unified set of mathematical notations that can cover a wide variety of existing work, and organize existing work along several dimensions, e.g., the choice of pre-trained language models, prompts, and tuning strategies. To make the field more accessible to interested beginners, we not only provide a systematic review of existing works and a highly structured typology of prompt-based concepts but also release other resources, e.g., a website, NLPedia-Pretrain [1], that includes a constantly updated survey resource and paper list.

...

1 TWO SEA CHANGES IN NATURAL LANGUAGE PROCESSING

Fully supervised learning, where a task-specific model is trained solely on a dataset of input–output examples for the target task, has long played a central role in many machine learning tasks [60], and natural language processing (NLP) was no exception. Because such manually annotated datasets are ever-insufficient for learning high-quality models, early NLP models relied heavily on feature engineering (Table 1(a); e.g., Guyon et al. [39], Lafferty et al. [63], Och et al. [92], Zhang and Nivre [150]), where NLP researchers or engineers used their domain knowledge to define and extract salient features from raw data and provide models with the appropriate inductive bias to learn from this limited data. With the advent of neural network models for NLP, salient features were learned jointly with the training of the model itself [6, 16], and hence focus shifted to architecture engineering, where inductive bias was rather provided through the design of a suitable network architecture conducive to learning such features (Table 1(b); e.g., Bahdanau et al. [4], Chung et al. [15], Hochreiter and Schmidhuber [44], Kalchbrenner et al. [54], Kim [57], Vaswani et al. [137]).


Table 1. Four Paradigms in NLP


However, from 2017 to 2019 there was a sea change in the learning of NLP models, and this fully supervised paradigm is now playing an ever-shrinking role. Specifically, the standard shifted to the pre-train and fine-tune paradigm (Table 1(c); e.g., Dong et al. [22], Lewis et al. [69], Peters et al. [97], Radford et al. [104], Yang et al. [143]). In this paradigm, a model with a fixed architecture is pre-trained as a language model (LM), predicting the probability of observed textual data. Because the raw textual data necessary to train LMs is available in abundance, these LMs can be trained on large datasets, in the process learning robust general-purpose features of the language they are modeling. The pre-trained LM is then adapted to different downstream tasks by introducing additional parameters and fine-tuning them using task-specific objective functions. Within this paradigm, the focus turned mainly to objective engineering, designing the training objectives used at both the pre-training and fine-tuning stages. For example, Zhang et al. [148] show that introducing a loss function for predicting salient sentences from a document leads to a better pre-trained LM for text summarization. Notably, the main body of the pre-trained LM is generally (but not always; Peters et al. [98]) fine-tuned as well to make it more suitable for solving the downstream task.

Now, as of this writing in 2021, we are in the middle of a second sea change, in which the “pre-train, fine-tune” procedure is replaced by one we dub “pre-train, prompt, and predict.” In this paradigm, instead of adapting pre-trained LMs to downstream tasks via objective engineering, downstream tasks are reformulated to look more like those solved during the original LM training with the help of a textual prompt. For example, when recognizing the emotion of a social media post, “I missed the bus today,” we may continue with the prompt “I felt so ___” and ask the LM to fill the blank with an emotion-bearing word. Or if we choose the prompt “English: I missed the bus today. French: ___”, then an LM may be able to fill in the blank with a French translation. In this way, by selecting the appropriate prompts we can manipulate the model behavior so that the pre-trained LM itself can be used to predict the desired output, sometimes even without any additional task-specific training (Table 1(d); e.g., Brown et al. [9], Petroni et al. [100], Radford et al. [105], Schick and Schütze [120]). The advantage of this method is that, given a suite of appropriate prompts, a single LM trained in an entirely unsupervised fashion can be used to solve a great number of tasks [9, 131]. However, as with most conceptually enticing prospects, there is a catch: this method introduces the necessity for [[prompt engineering]], finding the most appropriate prompt to allow an LM to solve the task at hand.
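To make the cloze example above concrete, here is a minimal sketch using the Hugging Face transformers fill-mask pipeline; the specific model ("bert-base-uncased") and the exact mask-token formatting are illustrative assumptions, not details from the survey.

```python
from transformers import pipeline

# Minimal sketch of cloze-style prompting: a pre-trained masked LM fills the
# blank in the emotion-recognition prompt from the text above.
# "bert-base-uncased" is an illustrative model choice; any masked LM would do.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("I missed the bus today. I felt so [MASK]."):
    print(candidate["token_str"], candidate["score"])
```

Note that the LM fills the blank freely here; deriving a task label from the filled word is the subject of answer engineering (Section 4).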

This survey attempts to organize the current state of knowledge in this rapidly developing field by providing an overview and formal definition of prompting methods (Section 2). This is followed by an in-depth discussion of prompting methods, from basics such as prompt template engineering (Section 3) and prompt answer engineering (Section 4) to more advanced concepts such as multi-prompt learning methods (Section 5) and prompt-aware training methods (Section 6). We then organize the various applications to which prompt-based learning methods have been applied and discuss how they interact with the choice of prompting method (Section 7). Finally, we attempt to situate the current state of prompting methods in the research ecosystem, making connections to other research fields (Section 8) and suggesting some challenging problems that may be ripe for further research (Section 9).

Finally, to help beginners who are interested in this field learn more effectively, we highlight some systematic resources about prompt learning (as well as pre-training) provided both within this survey and on companion websites:

  • A website of prompt-based learning that contains frequent updates to this survey, related slides, and so on.
  • Figure 1: A typology of important concepts for prompt-based learning.
  • Tables 7 and 8: A systematic and comprehensive comparison among different prompting methods.
  • Table 5: An organization of commonly-used prompts.
  • Table 4: A timeline of prompt-based research works.
  • Table 1: A systematic and comprehensive comparison among different pre-trained LMs.

Fig. 1. Typology of prompting methods.

2 A FORMAL DESCRIPTION OF PROMPTING

2.1 Supervised Learning in NLP

...

6 TRAINING STRATEGIES FOR PROMPTING METHODS

With the methods in the above sections, it is now clear how to obtain an appropriate prompt (or prompts) and corresponding answers. We now discuss methods that explicitly train models in concert with prompting methods, as outlined in the “Training Strategies” section of Figure 1.

6.1 Training Settings

In many cases, prompting methods can be used without any explicit training of the LM for the downstream task, simply taking an LM that has been trained to predict the probability of text [math]\displaystyle{ P(\boldsymbol{x}) }[/math] and applying it as-is to fill the cloze or prefix prompts defined to specify the task. This is traditionally called the zero-shot setting [111], as there is zero training data for the task of interest.
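As a concrete illustration of the zero-shot setting, the following is a minimal sketch that scores a small hand-picked answer space at the mask position and maps the best-scoring word to a label; the model, prompt, and word-to-label mapping are all hypothetical choices for illustration, not prescriptions from the survey.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Zero-shot cloze prompting sketch: the pre-trained LM is applied as-is, with
# no gradient updates; only a hand-written prompt and answer space specify the task.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

prompt = "I missed the bus today. I felt so [MASK]."
verbalizer = {"sad": "negative", "happy": "positive"}  # hypothetical answer space

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Pick the highest-scoring answer word and map it to its label.
best = max(verbalizer, key=lambda w: logits[tokenizer.convert_tokens_to_ids(w)].item())
print(best, "->", verbalizer[best])
```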

However, there are also methods that use training data to train the model in concert with prompting methods. These consist of either full-data learning, where a reasonably large number of training examples are used to train the model, or few-shot learning [126], where a very small number of examples are used to train the model. Prompting methods are particularly useful in the latter case [9, 32, 117], as there are generally not enough training examples to fully specify the desired behavior, and thus using a prompt to push the model in the right direction is particularly effective.

One thing to note is that for many of the prompt template engineering methods described in Section 3, although annotated training samples are not explicitly used in the training of the downstream task model, they are often used in the construction or validation of the prompts that the downstream task will use. As noted by Perez et al. [96], this is arguably not true zero-shot learning with respect to the downstream task.

6.2 Parameter Update Methods

In prompt-based downstream task learning, there are usually two types of parameters, namely those from (1) pre-trained LMs and (2) prompts. Deciding which parameters to update is an important design decision, one that leads to different levels of applicability in different scenarios. We summarize five tuning strategies (as shown in Table 4) based on (i) whether the parameters of the underlying LM are tuned, (ii) whether there are additional prompt-related parameters, and (iii) if there are additional prompt-related parameters, whether those parameters are tuned.

Table 4. Characteristics of Different Tuning Strategies

Strategy | LM Params | Prompt Params: Additional | Prompt Params: Tuned | Example
Promptless Fine-tuning | Tuned | — | — | ELMo [97], BERT [20], BART [69]
Tuning-free Prompting | Frozen | ✗ | ✗ | GPT-3 [9], AutoPrompt [125], LAMA [100]
Fixed-LM Prompt Tuning | Frozen | ✓ | Tuned | Prefix-Tuning [71], Prompt-Tuning [67]
Fixed-prompt LM Tuning | Tuned | ✗ | ✗ | PET-TC [117], PET-Gen [118], LM-BFF [32]
Prompt+LM Tuning | Tuned | ✓ | Tuned | PADA [5], P-Tuning [77]

“Additional” denotes whether there are prompt-related parameters beyond the LM parameters, while “Tuned” denotes whether those parameters are updated.
Table 5. Other Research Topics Relevant to Prompting Methods

Prompt Concept | Relevant Topic | Commonality | Peculiarity
Prompt Ensembling [52, 117] | Ensemble Learning [133, 153] | Combine results of multiple systems to get better performance | In prompt ensembling, multiple predictions result from different prompt variants, in contrast to architecture or feature variations, each of which requires separate training.
Prompt Augmentation [9, 32] | Few-shot Learning [28, 127] | Use a few examples to learn generalized rules | Prompt augmentation is a specific subset of few-shot learning.
Prompt Augmentation [9, 32] | Larger-context Learning [11, 38] | Introduce larger context to aid the learning process | The additional information introduced in larger-context learning is not necessarily labeled data.
Discrete Prompt Search [52, 125] | Query Reformulation [90, 90] | Reformulate the input into a query form | Query reformulation commonly focuses on information extraction and question answering tasks, while prompt learning can be applied to a variety of NLP tasks.
Discrete Prompt Fine-tuning [32] | QA-based Multi-task Learning [70, 83] | Reformulate many tasks into a QA form | QA-based formulations aim to solve different tasks through question answering, while prompting additionally targets full use of pre-trained LMs.
Continuous Prompt Fine-tuning [23, 77] | Controlled Text Generation [56, 122, 146] | The input is augmented with additional signals to control the generation process | Controlled generation targets generation of a particular kind of text, while prompt learning uses prompts to specify the task itself.
Prompt-based Downstream Task Learning [117, 147] | Supervised Attention [75, 130] | Require an external hint about which parts of the information the model should focus on | Work on supervised attention usually targets salient information in an image or text, while prompt learning aims to utilize relevant knowledge from the pre-trained LM.
Prompt-based Downstream Task Learning [117, 147] | Data Augmentation [26, 109] | Improve downstream task performance by introducing additional samples | Data augmentation introduces additional training samples explicitly, while prompts can be regarded as highly condensed training samples [65].
6.2.1 Promptless Fine-tuning.

As mentioned in the Introduction, the pre-train and fine-tune strategy has been widely used in NLP since before the popularization of prompting methods. Here we refer to pre-training and fine-tuning without prompts as promptless fine-tuning, to contrast with the prompt-based learning methods introduced in the following sections. In this strategy, given a dataset for a task, all (or some [46, 98]) of the parameters of the pre-trained LM are updated via gradients induced from downstream training samples. Typical examples of pre-trained LMs tuned in this way include BERT [20] and RoBERTa [79]. This is a simple, powerful, and widely used method, but it may overfit or not learn stably on small datasets [21]. Models are also prone to catastrophic forgetting, where the LM loses its ability to do things that it was able to do before fine-tuning [84]. (A minimal sketch follows the list below.)

  • Advantages: Simplicity, no need for prompt design. Tuning all the LM parameters allows the model to fit to larger training datasets.
  • Disadvantages: LMs may overfit or not learn stably on smaller datasets.
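For contrast with the prompt-based strategies that follow, here is a minimal sketch of one promptless fine-tuning step, assuming a BERT-style classifier; the model name, label count, training example, and learning rate are illustrative assumptions.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Promptless fine-tuning sketch: every pre-trained parameter (plus a new
# classification head) is updated from downstream labels; no prompt is involved.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # all params tuned

batch = tokenizer(["I missed the bus today."], return_tensors="pt")
labels = torch.tensor([0])  # hypothetical gold label
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```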
6.2.2 Tuning-free Prompting.

Tuning-free prompting directly generates the answers without changing the parameters of the pre-trained LMs, based only on a prompt, as described in the simplest incarnation of prompting in Section 2. The input can optionally be augmented with answered prompts, as described in Section 5.2, and this combination of tuning-free prompting and prompt augmentation is also referred to as in-context learning [9]. Typical examples of tuning-free prompting include LAMA [100] and GPT-3 [9].
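Below is a minimal sketch of in-context learning in this sense: answered prompts are prepended to the query and no parameters are updated. GPT-2 stands in for a large LM, and the template and examples are assumptions made for illustration.

```python
from transformers import pipeline

# Tuning-free prompting with prompt augmentation (in-context learning):
# two answered prompts precede the query; the LM simply continues the text.
generator = pipeline("text-generation", model="gpt2")
prompt = (
    "Review: A wonderful film. Sentiment: positive\n"
    "Review: A total waste of time. Sentiment: negative\n"
    "Review: I missed the bus today. Sentiment:"
)
print(generator(prompt, max_new_tokens=2)[0]["generated_text"])
```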

6.2.3 Fixed-LM Prompt Tuning.

In the scenario where additional prompt-relevant parameters are introduced besides the parameters of the pre-trained LM, fixed-LM prompt tuning updates only the prompts’ parameters using the supervision signal obtained from the downstream training samples, while keeping the entire pre-trained LM unchanged. Typical examples are Prefix-Tuning [71] and Prompt-Tuning [67]. (A minimal sketch follows the list below.)

  • Advantages: Similarly to tuning-free prompting, it retains the knowledge in LMs and is suitable in few-shot scenarios, while often achieving superior accuracy to tuning-free prompting.
  • Disadvantages: Not applicable in zero-shot scenarios. While effective in few-shot scenarios, representation power is limited in large-data settings. Prompt engineering through choice of hyperparameters or seed prompts is necessary. Prompts are usually not human-interpretable or manipulable.
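Here is a minimal sketch in the spirit of Prompt-Tuning [67], assuming GPT-2 as the frozen LM; the prompt length, learning rate, and initialization are illustrative choices, not values from the cited papers.

```python
import torch
from transformers import AutoModelForCausalLM

# Fixed-LM prompt tuning sketch: the LM is frozen and only a small matrix of
# continuous "soft prompt" embeddings, prepended to the input, is trained.
model = AutoModelForCausalLM.from_pretrained("gpt2")
for p in model.parameters():          # keep the entire pre-trained LM unchanged
    p.requires_grad = False

n_prompt_tokens = 20                  # hypothetical prompt length
soft_prompt = torch.nn.Parameter(     # the only trainable parameters
    torch.randn(n_prompt_tokens, model.config.n_embd) * 0.02
)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

def train_step(input_ids, labels):
    tok_emb = model.get_input_embeddings()(input_ids)        # (B, T, D)
    batch = tok_emb.size(0)
    prefix = soft_prompt.unsqueeze(0).expand(batch, -1, -1)  # (B, P, D)
    inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
    # Pad labels with -100 so no loss is computed at the prompt positions.
    pad = torch.full((batch, n_prompt_tokens), -100, dtype=labels.dtype)
    out = model(inputs_embeds=inputs_embeds, labels=torch.cat([pad, labels], dim=1))
    out.loss.backward()               # gradients flow only into soft_prompt
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```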
6.2.4 Fixed-prompt LM Tuning.

Fixed-prompt LM tuning tunes the parameters of the LM, as in the standard pre-train and fine-tune paradigm, but additionally uses prompts with fixed parameters to specify the model behavior. This potentially leads to improvements, particularly in few-shot scenarios.

The most natural way to do so is to provide a discrete textual template that is applied to every training and test example. Typical examples include PET-TC [117], PET-Gen [118], and LM-BFF [32]. Logan IV et al. [48] more recently observe that prompt engineering can be reduced by allowing for a combination of prompt answer engineering and partial LM fine-tuning. For example, they define a very simple template, the null prompt, in which the input and mask are directly concatenated as “[X][Z]” without any template words, and find that this achieves competitive accuracy. (A PET-style sketch follows the list below.)

  • Advantages: Template or answer engineering more completely specify the task, allowing for more efficient learning, particularly in few-shot scenarios.
  • Disadvantages: Template or answer engineering are still required, although perhaps not as much as without prompting. LMs fine-tuned on one downstream task may not be effective on another one.
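Here is a minimal PET-style sketch of fixed-prompt LM tuning, in which a fixed discrete template and verbalizer specify the task and the LM parameters themselves are fine-tuned; the template, verbalizer words, and hyperparameters are illustrative assumptions rather than the cited methods' exact settings.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Fixed-prompt LM tuning sketch: the template and verbalizer stay fixed while
# all LM parameters are fine-tuned on the (few) labeled examples.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

template = "{text} It was [MASK]."            # hypothetical discrete template
verbalizer = {0: "terrible", 1: "great"}      # class id -> answer word

def train_step(text, label):
    inputs = tokenizer(template.format(text=text), return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    logits = model(**inputs).logits[0, mask_pos]
    target = tokenizer.convert_tokens_to_ids(verbalizer[label])
    loss = torch.nn.functional.cross_entropy(
        logits.unsqueeze(0), torch.tensor([target])
    )
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```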
6.2.5 Prompt+LM Tuning.

In this setting, there are prompt-relevant parameters that can be fine-tuned together with all or some of the parameters of the pre-trained LM. Representative examples include PADA [5] and P-Tuning [77]. Notably, this setting is very similar to the standard pre-train and fine-tune paradigm, but the addition of the prompt can provide additional bootstrapping at the start of model training. (A short variation on the earlier soft-prompt sketch follows the list below.)

  • Advantages: This is the most expressive method, likely suitable for high-data settings.
  • Disadvantages: Requires training and storing all parameters of the models. May overfit to small datasets.
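Prompt+LM tuning can be viewed as a small variation on the fixed-LM prompt tuning sketch above: the soft prompt stays trainable, but the LM is unfrozen and both parameter groups are optimized, typically with separate learning rates. The names below reuse that earlier sketch and are likewise assumptions.

```python
import torch

# Variation on the earlier fixed-LM prompt tuning sketch: unfreeze the LM so
# that both the soft prompt and the LM parameters receive gradient updates.
for p in model.parameters():
    p.requires_grad = True

# Separate learning rates: larger for the small prompt matrix, smaller for the LM.
optimizer = torch.optim.Adam(
    [{"params": [soft_prompt], "lr": 1e-3},
     {"params": model.parameters(), "lr": 1e-5}]
)
```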

7 APPLICATIONS

In previous sections, we examined prompting methods from the point of view of the mechanism of the method itself. In this section, we rather organize prompting methods from the point of view of which applications they have been applied to. We list these applications in Tables 7 and 8 and summarize them in the following sections.

...

References

Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig (2023). “Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing.” In: ACM Computing Surveys, 55(9).