2020 LanguageModelsAreFewShotLearner


Subject Headings: GPT-3, GPT Model, Few-Shot Learning, In-Context Learning.

Notes

Cited By

2023

  • chat
    • In the “Language Models are Few-Shot Learners” (LMaFSL) paper, the authors evaluated GPT-3 on a range of natural language processing tasks (a prompt-format sketch follows this list), which included:
      • Zero-Shot Language Modeling: GPT-3 was evaluated on its ability to predict the next word in a sequence, using the LAMBADA dataset.
      • One-Shot Translation: The model was tested on English-German, English-French, and English-Romanian translation tasks, using the WMT'14 and WMT'16 datasets.
      • Zero-Shot Summarization: GPT-3's summarization abilities were evaluated using the CNN/Daily Mail dataset.
      • One-Shot Sentiment Analysis: The model was tested on sentiment classification using the Stanford Sentiment Treebank and the IMDb datasets.
      • Zero-Shot Question Answering: GPT-3 was evaluated on its ability to answer questions using the Natural Questions and the LAMBADA datasets.
      • Commonsense Reasoning: The model's performance on commonsense reasoning was assessed using the Winograd Schema Challenge and the COPA dataset.
      • Reading Comprehension: GPT-3 was tested on the SuperGLUE benchmark, which includes subtasks such as BoolQ, MultiRC, ReCoRD, and WiC.
      • Text Completion: The model was evaluated on its ability to complete sentences using the LAMBADA dataset.
      • Natural Language Inference: GPT-3 was tested on the Multi-Genre Natural Language Inference (MNLI) and Recognizing Textual Entailment (RTE) tasks.
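
    The settings named above differ mainly in how the inference-time prompt is constructed. Below is a minimal Python sketch of the three prompt formats, using the English-to-French demonstrations from Figure 2.1 of the paper; the exact task phrasings used in the evaluations are given in the paper's Appendix G, so treat these strings as illustrative only.

        # Illustrative zero-, one-, and few-shot prompts for one task
        # (English-to-French translation). No gradient updates are involved;
        # the model is simply asked to complete each string.

        zero_shot = (
            "Translate English to French:\n"
            "cheese =>"
        )

        one_shot = (
            "Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "cheese =>"
        )

        few_shot = (
            "Translate English to French:\n"
            "sea otter => loutre de mer\n"
            "peppermint => menthe poivrée\n"
            "plush giraffe => girafe en peluche\n"
            "cheese =>"
        )

        for name, prompt in [("zero-shot", zero_shot),
                             ("one-shot", one_shot),
                             ("few-shot", few_shot)]:
            print(f"--- {name} ---\n{prompt}\n")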

Quotes

Abstract

Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions - something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model. GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic. At the same time, we also identify some datasets where GPT-3's few-shot learning still struggles, as well as some datasets where GPT-3 faces methodological issues related to training on large web corpora. Finally, we find that GPT-3 can generate samples of news articles which human evaluators have difficulty distinguishing from articles written by humans. We discuss broader societal impacts of this finding and of GPT-3 in general.

2 Approach

Our basic pre-training approach, including model, data, and training, is similar to the process described in [RWC+19], with relatively straightforward scaling up of the model size, dataset size and diversity, and length of training. Our use of in-context learning is also similar to [RWC+19], but in this work we systematically explore different settings for learning within the context. Therefore, we start this section by explicitly defining and contrasting the different settings that we will be evaluating GPT-3 on or could in principle evaluate GPT-3 on. These settings can be seen as lying on a spectrum of how much task-specific data they tend to rely on. Specifically, we can identify at least four points on this spectrum (see Figure 2.1 for an illustration):

• Fine-Tuning (FT) has been the most common approach in recent years, and involves updating the weights of a pre-trained model by training on a supervised dataset specific to the desired task. Typically thousands to hundreds of thousands of labeled examples are used. The main advantage of fine-tuning is strong performance on many benchmarks. The main disadvantages are the need for a new large dataset for every task, the potential for poor generalization out-of-distribution [MPL19], and the potential to exploit spurious features of the training data [GSL+18, NK19], potentially resulting in an unfair comparison with human performance. In this work we do not fine-tune GPT-3 because our focus is on task-agnostic performance, but GPT-3 can be fine-tuned in principle and this is a promising direction for future work.

• Few-Shot (FS) is the term we will use in this work to refer to the setting where the model is given a few demonstrations of the task at inference time as conditioning [RWC+19], but no weight updates are allowed. As shown in Figure 2.1, for a typical dataset an example has a context and a desired completion (for example an English sentence and the French translation), and few-shot works by giving K examples of context and completion, and then one final example of context, with the model expected to provide the completion. We typically set K in the range of 10 to 100 as this is how many examples can fit in the model’s context window (nctx = 2048). The main advantages of few-shot are a major reduction in the need for task-specific data and reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required. As indicated by the name, few-shot learning as described here for language models is related to few-shot learning as used in other contexts in ML [HYC01, VBL+16] – both involve learning based on a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. (A minimal sketch of this prompt assembly appears after the Figure 2.1 caption below.)

• One-Shot (1S) is the same as few-shot except that only one demonstration is allowed, in addition to a natural language description of the task, as shown in Figure 2.1. The reason to distinguish one-shot from few-shot and zero-shot (below) is that it most closely matches the way in which some tasks are communicated to humans. For example, when asking humans to generate a dataset on a human worker service (for example Mechanical Turk), it is common to give one demonstration of the task. By contrast it is sometimes difficult to communicate the content or format of a task if no examples are given.

• Zero-Shot (0S) is the same as one-shot except that no demonstrations are allowed, and the model is only given a natural language instruction describing the task. This method provides maximum convenience and potential for robustness, but it is also the most challenging setting, since in some cases it can be difficult even for humans to understand the format of a task without prior examples.

Figure 2.1: Zero-shot, one-shot and few-shot, contrasted with traditional fine-tuning. The panels above show four methods for performing a task with a language model – fine-tuning is the traditional method, whereas zero-, one-, and few-shot, which we study in this work, require the model to perform the task with only forward passes at test time. We typically present the model with a few dozen examples in the few-shot setting. Exact phrasings for all task descriptions, examples and prompts can be found in Appendix G.
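
As a complement to the definitions above, here is a minimal Python sketch of the few-shot prompt assembly: K demonstrations of context and completion are concatenated ahead of one final context, and the whole prompt must fit within the model's context window (nctx = 2048 tokens). The whitespace "tokenizer" and the demonstration data below are stand-ins introduced for illustration, not the paper's code or its actual BPE tokenizer.

    # Sketch of K-shot prompt construction (Section 2): concatenate K
    # demonstrations, then the final context; the model is expected to
    # produce the completion with a single forward pass and no weight
    # updates. Token counting here is a crude whitespace approximation.

    N_CTX = 2048  # GPT-3's context window, in tokens

    def num_tokens(text: str) -> int:
        # Assumption: whitespace split stands in for a real BPE tokenizer.
        return len(text.split())

    def build_few_shot_prompt(task_description, demos, query,
                              k=10, n_ctx=N_CTX, reserve=64):
        # Keep at most k demonstrations, then shrink further if the
        # prompt would overflow the context window (reserve leaves
        # room for the model's completion).
        lines = [task_description]
        for context, completion in demos[:k]:
            lines.append(f"{context} => {completion}")
        lines.append(f"{query} =>")
        while num_tokens("\n".join(lines)) > n_ctx - reserve and len(lines) > 2:
            del lines[-2]  # drop the last remaining demonstration
        return "\n".join(lines)

    demos = [("sea otter", "loutre de mer"), ("peppermint", "menthe poivrée")]
    print(build_few_shot_prompt("Translate English to French:", demos, "cheese"))

In the paper, K is typically 10 to 100, chosen by how many demonstrations fit in the context window; in this sketch, setting k=1 recovers the one-shot format and k=0 the zero-shot format (task description only).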

References

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. (2020). “Language Models Are Few-Shot Learners.”