2023 Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey

From GM-RKB

Subject Headings:

Notes

Cited By

Quotes

Abstract

Large, pre-trained language models (PLMs) such as BERT and GPT have drastically changed the Natural Language Processing (NLP) field. For numerous NLP tasks, approaches leveraging PLMs have achieved state-of-the-art performance. The key idea is to learn a generic, latent representation of language from a generic task once, then share it across disparate NLP tasks. Language modeling serves as that generic task, one for which abundant unlabeled text is available for self-supervised training at scale. This article presents the fundamental concepts of PLM architectures and a comprehensive view of the shift to PLM-driven NLP techniques. It surveys work applying the pre-training then fine-tuning, prompting, and text generation approaches. In addition, it discusses PLM limitations and suggested directions for future research.

1. Introduction

  • NOTE: This section introduces the significant impact of large pre-trained transformer-based language models on NLP, highlighting their role in creating a paradigm shift within the field.

2. Paradigm 1: Pre-Train then Fine-Tune

  • NOTE: The paper discusses the methodology of pre-training language models on large datasets and then fine-tuning them for specific NLP tasks. It includes sub-sections on the beginnings of this paradigm shift, modern pre-trained language models, pre-training corpora, and different fine-tuning approaches.

2.1 The Beginnings of the Paradigm Shift

While pre-training in machine learning and, in particular, computer vision has been studied since at least 2010 (Erhan et al., 2010; Yosinski et al., 2014; Huh et al., 2016), the technique did not gain traction in NLP until later in the decade, with the publication of Vaswani et al. (2017). The delay in uptake is partly due to the later arrival of deep neural models to NLP compared to computer vision, partly due to the difficulty of choosing a self-supervised task suitable for pre-training, and above all, due to the need for drastically larger model sizes and corpora in order to be effective for NLP tasks. We explore these aspects further in the discussion below.

The idea of pre-training on a language modeling task is quite old. Collobert and Weston (2008) first suggested pre-training a model on a number of tasks to learn features instead of hand-crafting them (the predominant approach at the time). Their version of language model pre-training, however, differed significantly from the methods we see today. They used language modeling as only one of many tasks in a multitask learning setting, along with other supervised tasks such as part-of-speech (POS) tagging, named entity recognition (NER) and semantic role labeling (SRL). Collobert and Weston proposed sharing the weights of their deepest convolutional layer – the word embeddings learned by the model – between the multiple training tasks and fine-tuning the weights of the two remaining feed-forward layers for each individual task.

  • 1 The exact formulation varies from the classic unidirectional language modeling (next word prediction) to cloze-style fill-in-the-blank, uncorrupting spans, and other variants (see Section 2.3).
  • 2 In self-supervised learning, the ground truth (e.g. the missing word) comes from the unlabeled text itself. This allows the pre-training to scale up with the near-infinite amount of text available on the web.
Figure 1: Three types of pre-trained language models. Model architecture illustrations are from Lewis et al. (2020). For the encoder-decoder model, the corruption strategy of document rotation is shown. Alternatives include sentence permutation, text infilling, token deletion/masking, etc.
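
To make the self-supervision idea concrete, the sketch below derives (corrupted input, target) pairs from raw text using a simple random-masking scheme; the masking rate and the [MASK] token are illustrative choices, not the exact corruption strategy of any particular model in Figure 1.

```python
import random

def make_masked_lm_example(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Derive a (corrupted input, targets) pair from unlabeled text: the targets
    are simply the original tokens that were masked out."""
    rng = random.Random(seed)
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            targets.append(tok)        # ground truth comes from the text itself
        else:
            corrupted.append(tok)
            targets.append(None)       # nothing to predict at this position
    return corrupted, targets

text = "the quick brown fox jumps over the lazy dog".split()
print(make_masked_lm_example(text))
```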

Pre-training and fine-tuning did not gain popularity in NLP until the advent of ELMo (Peters et al., 2018) and ULMFiT (Howard and Ruder, 2018). Both models are based on the Long Short-Term Memory (LSTM) architecture (Hochreiter and Schmidhuber, 1997), but differ in significant ways. ULMFiT pre-trains a three-layer LSTM on a standard language modeling objective, predicting the next token in a sequence. ELMo uses stacked bidirectional LSTMs that combine forward and backward language modeling objectives to capture context from both sides. Both proposed fine-tuning the language model layer by layer for downstream applications. Both studies also suggested adding classifier layers on top of the language model, which were fine-tuned alongside the language model layers. These changes, combined with the substantially larger model size and pre-training corpus size compared to previous models, allowed the pre-training then fine-tuning paradigm to succeed. Both ELMo and ULMFiT showed competitive or improved performance compared to the then-state-of-the-art for a number of tasks, demonstrating the value of language model pre-training on a large scale.

The pace of this paradigm shift picked up dramatically in late 2018, building on the Transformer architecture introduced by Vaswani et al. (2017), which can be used for language model pre-training. The Transformer’s multi-head self-attention mechanism allows every word to attend to all previous words (or, in bidirectional variants, to every word except the target), letting the model capture long-range dependencies efficiently without the expensive recurrent computation of LSTMs. Multiple layers of multi-head self-attention allow for increasingly expressive representations, useful for a range of NLP problems. As a result, nearly all popular language models, including GPT, BERT, BART (Lewis et al., 2020) and T5 (Raffel et al., 2020), are now based on the Transformer architecture. They also differ in a number of important ways, which we discuss in the following sections. For more details about the Transformer architecture, we refer the reader to the original paper or to the excellent tutorials available.
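
As a rough sketch of the mechanism (not a full Transformer block), the snippet below implements single-head scaled dot-product self-attention with a causal mask in PyTorch; real models add learned query/key/value projections, multiple heads, feed-forward layers, residual connections, and layer normalization.

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x):
    """Single-head scaled dot-product self-attention with a causal mask.
    x: (seq_len, d_model). Learned Q/K/V projections, multiple heads,
    residuals and layer norm are omitted for brevity."""
    seq_len, d_model = x.shape
    q, k, v = x, x, x                                   # identity "projections" for brevity
    scores = q @ k.T / d_model ** 0.5                   # (seq_len, seq_len) attention logits
    mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))    # each position sees only earlier positions
    weights = F.softmax(scores, dim=-1)                 # attention distribution per position
    return weights @ v                                  # contextualized representations

x = torch.randn(5, 16)                                  # 5 "tokens" with 16-dimensional vectors
out = causal_self_attention(x)                          # shape (5, 16)
```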

2.2 Modern Pre-Trained Language Models

  • NOTE: Discusses different classes of pre-trained language models like autoregressive models, masked language models, and encoder-decoder models.
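
As a concrete illustration of these three classes (assuming the Hugging Face transformers library and these particular checkpoints, which are not prescribed by the survey), each class has a corresponding auto-model wrapper:

```python
from transformers import (AutoModelForCausalLM, AutoModelForMaskedLM,
                          AutoModelForSeq2SeqLM)

# Autoregressive (decoder-only) model: predicts the next token left to right.
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")

# Masked language model (encoder-only): predicts tokens hidden in the input.
bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Encoder-decoder model: reconstructs or transforms a corrupted/conditioned input.
bart = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")
```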

2.3 Pre-Training Corpora

  • NOTE: Explores the types and sizes of corpora used for pre-training various language models.

2.4 Fine-Tuning: Applying PLMs to NLP Tasks

  • NOTE: This section describes how pre-trained language models are adapted or fine-tuned for specific NLP tasks.
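
A minimal sketch of the fine-tuning recipe, assuming the Hugging Face transformers library and a toy two-example sentiment "dataset": a freshly initialized classification head is placed on top of the pre-trained encoder, and all weights are updated on the downstream labels.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Pre-trained encoder plus a freshly initialized 2-way classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Tiny stand-in for a labeled downstream sentiment dataset.
texts = ["a wonderful film", "a tedious mess"]
labels = torch.tensor([1, 0])

model.train()
batch = tokenizer(texts, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)   # cross-entropy loss over the head's logits
outputs.loss.backward()                   # gradients reach the PLM's weights as well
optimizer.step()
```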

3. Paradigm 2: Prompt-based Learning

  • NOTE: Focuses on a different paradigm where a pre-trained language model is prompted in a way that the desired NLP task is reformulated into a task similar to the model's pre-training task.
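
For example, sentiment classification can be recast as a cloze (fill-in-the-blank) task so that it matches a masked LM's pre-training objective; the template and the "good"/"bad" verbalizer words below are illustrative choices, not ones taken from the survey.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# The template turns classification into the PLM's own pre-training task.
review = "The plot was predictable and the acting was flat."
prompt = f"{review} Overall, the movie was [MASK]."

# Verbalizer: compare the model's scores for the label words.
scores = {pred["token_str"]: pred["score"]
          for pred in fill_mask(prompt, targets=["good", "bad"])}
predicted_label = max(scores, key=scores.get)   # "bad" is expected for this review
print(scores, predicted_label)
```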

4. Paradigm 3: NLP as Text Generation

  • NOTE: Discusses the approach of framing NLP tasks as text generation problems, leveraging the capabilities of generative language models.
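
A small sketch of the text-to-text framing, assuming the Hugging Face transformers library and the t5-small checkpoint: the task is named in a textual prefix (T5's documented "summarize:" prefix) and the answer is produced as free-form text.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# The task is expressed entirely in text: a prefix names it, the output is free text.
article = ("Large pre-trained language models have changed NLP by allowing a single "
           "generic representation of language to be reused across many tasks.")
inputs = tokenizer("summarize: " + article, return_tensors="pt")
summary_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```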

5. Data Generation

  • NOTE: Covers how pre-trained language models are used to generate data for training augmentation and other NLP tasks.
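
One simple pattern for PLM-based data augmentation, sketched below under the assumption of a GPT-2 checkpoint used through the transformers pipeline API: a few labeled seed examples are written into a prompt, and the model's continuations are taken as candidate synthetic examples (which would normally be filtered or relabeled before training).

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A few labeled seed examples are written into a prompt; each continuation is a
# candidate synthetic example for the same label (filter before training on it).
prompt = ("Positive movie reviews:\n"
          "1. An absolute delight from start to finish.\n"
          "2. The cast is superb and the story is moving.\n"
          "3.")
candidates = generator(prompt, max_new_tokens=25, num_return_sequences=3,
                       do_sample=True)
for cand in candidates:
    print(cand["generated_text"][len(prompt):].strip())
```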

6. Limitations and Future Directions

  • NOTE: Discusses the current limitations of approaches using pre-trained language models and suggests potential areas for future research.

7. Conclusion

References

Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, and Dan Roth (2023). "Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey."