2020 BPEDropoutSimpleandEffectiveSubwordRegularization


Subject Headings: Byte Pair Encoding (BPE); BPE-Dropout.

Notes

Cited By

Quotes

Abstract

Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens. While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences; this may prevent a model from better learning the compositionality of words and from being robust to segmentation errors. So far, the only way to overcome this BPE imperfection, its deterministic nature, was to create another subword segmentation algorithm (sentencepiece). In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. We introduce BPE-dropout, a simple and effective subword regularization method based on and compatible with conventional BPE. It stochastically corrupts the segmentation procedure of BPE, which leads to producing multiple segmentations within the same fixed BPE framework. Using BPE-dropout during training and the standard BPE during inference improves translation quality up to 3 BLEU compared to BPE and up to 0.9 BLEU compared to the previous subword regularization.

1 Introduction

Using subword segmentation has become the de facto standard in Neural Machine Translation (bojar-etal-2018-findings; barrault-etal-2019-findings). Byte Pair Encoding (BPE) (sennrich-etal-2016-neural) is the dominant approach to subword segmentation. It keeps the common words intact while splitting the rare and unknown ones into a sequence of subword units. This potentially allows a model to make use of morphology, word composition and transliteration. BPE effectively deals with the open-vocabulary problem and is widely used due to its simplicity.


Figure 1: Segmentation process of the word ‘unrelated’ using (a) BPE, (b) BPE-dropout. Hyphens indicate possible merges (merges which are present in the merge table); merges performed at each iteration are shown in green, dropped – in red.

There is, however, a drawback to BPE's deterministic nature: it splits words into unique subword sequences, which means that for each word a model observes only one segmentation. Thus, a model is likely not to reach its full potential in exploiting morphology, learning the compositionality of words and being robust to segmentation errors. Moreover, as we will show further, subwords into which rare words are segmented end up poorly understood.

A natural way to handle this problem is to enable multiple segmentation candidates. This was initially proposed in sentencepiece as subword regularization, a regularization method which is implemented as on-the-fly data sampling and is not specific to the NMT architecture. Since standard BPE produces a single segmentation, to realize this regularization the author had to propose a new subword segmentation algorithm, different from BPE. However, the introduced approach is rather complicated: it requires training a separate segmentation unigram language model, using the EM and Viterbi algorithms, and forbids using conventional BPE.

In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. BPE builds a vocabulary of subwords and a merge table, which specifies which subwords have to be merged into a bigger subword, as well as the priority of the merges. During segmentation, words are first split into sequences of characters, then the learned merge operations are applied to merge the characters into larger, known symbols, until no further merge can be applied (Figure 1(a)). We introduce BPE-dropout, a subword regularization method based on and compatible with conventional BPE. It uses a vocabulary and a merge table built by BPE, but at each merge step, some merges are randomly dropped. This results in different segmentations for the same word (Figure 1(b)). Our method requires no segmentation training in addition to BPE and uses standard BPE at test time, and is therefore simple. BPE-dropout outperforms both BPE and sentencepiece on a wide range of translation tasks, and is therefore effective.

Our key contributions are as follows:

  • We introduce BPE-dropout – a simple and effective subword regularization method;
  • We show that our method outperforms both BPE and previous subword regularization on a wide range of translation tasks;
  • We analyze how training with BPE-dropout affects a model and show that it leads to a better quality of learned token embeddings and to a model being more robust to noisy input.

2 Background

In this section, we briefly describe BPE and the concept of subword regularization. We assume that our task is machine translation, where a model needs to predict the target sentence Y given the source sentence X, but the methods we describe are not task-specific.

2.1 Byte Pair Encoding (BPE)

To define a segmentation procedure, BPE (sennrich-etal-2016-neural) builds a token vocabulary and a merge table. The token vocabulary is initialized with the character vocabulary, and the merge table is initialized as an empty table. First, each word is represented as a sequence of tokens plus a special end-of-word symbol. Then, the method iteratively counts all pairs of tokens and merges the most frequent pair into a new token. This token is added to the vocabulary, and the merge operation is added to the merge table. This is done until the desired vocabulary size is reached.
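The learning procedure can be illustrated with a minimal toy sketch; the corpus, the number of merges, and the end-of-word marker "</w>" are illustrative choices, not the reference implementation:

```python
from collections import Counter

def learn_bpe(word_counts, num_merges):
    """Toy BPE learning: build a token vocabulary and a merge table
    from a {word: frequency} dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word symbol.
    words = {tuple(w) + ("</w>",): c for w, c in word_counts.items()}
    vocab = {ch for w in word_counts for ch in w} | {"</w>"}
    merge_table = []  # (left, right) pairs in priority order
    for _ in range(num_merges):
        # Count all pairs of adjacent tokens, weighted by word frequency.
        pair_counts = Counter()
        for symbols, count in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)  # most frequent pair
        merge_table.append(best)
        vocab.add(best[0] + best[1])
        # Replace every occurrence of the best pair with the merged token.
        new_words = {}
        for symbols, count in words.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_words[tuple(merged)] = count
        words = new_words
    return vocab, merge_table

vocab, merge_table = learn_bpe({"low": 5, "lower": 2, "newest": 6, "widest": 3},
                               num_merges=10)
```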

The resulting merge table specifies which subwords have to be merged into a bigger subword, as well as the priority of the merges. In this way, it defines the segmentation procedure. First, a word is split into distinct characters plus the end of word symbol. Then, the pair of adjacent tokens which has the highest priority is merged. This is done iteratively until no merge from the table is available (Figure 1(a)).
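Continuing the toy sketch above, the segmentation procedure can be written as follows (bpe_segment and the priority lookup are illustrative names; the merge with the lowest index in the table has the highest priority):

```python
def bpe_segment(word, merge_table):
    """Toy BPE segmentation: repeatedly apply the highest-priority merge
    from the table until no merge is applicable."""
    priority = {pair: rank for rank, pair in enumerate(merge_table)}
    symbols = list(word) + ["</w>"]
    while True:
        # All adjacent pairs that are present in the merge table.
        candidates = [(priority[pair], i)
                      for i, pair in enumerate(zip(symbols, symbols[1:]))
                      if pair in priority]
        if not candidates:
            return symbols
        _, i = min(candidates)  # lowest rank = highest priority
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

print(bpe_segment("lowest", merge_table))  # e.g. ['low', 'est</w>'] with the toy table
```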

2.2 Subword regularization

Subword regularization (sentencepiece) is a training algorithm which integrates multiple segmentation candidates. Instead of maximizing the log-likelihood, this algorithm maximizes the log-likelihood marginalized over different segmentation candidates. Formally,

L = ∑_{(X,Y)∈D} E_{x∼P(x|X), y∼P(y|Y)} log P(y|x, θ),   (1)

where x and y are sampled segmentation candidates for sentences X and Y respectively, P(x|X) and P(y|Y) are the probability distributions the candidates are sampled from, and θ is the set of model parameters. In practice, at each training step only one segmentation candidate is sampled.
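A schematic sketch of this single-sample approximation of Eq. (1); sample_segmentation and neg_log_prob are placeholder callables standing in for the segmentation sampler and the NMT model's loss, not objects defined in the paper:

```python
def subword_regularized_loss(batch, sample_segmentation, neg_log_prob):
    """Schematic single-sample estimate of the objective in Eq. (1):
    one segmentation candidate is drawn per sentence at each training step."""
    loss = 0.0
    for X, Y in batch:              # (source, target) sentence pairs from D
        x = sample_segmentation(X)  # x ~ P(x | X)
        y = sample_segmentation(Y)  # y ~ P(y | Y)
        loss += neg_log_prob(y, x)  # -log P(y | x, theta)
    return loss
```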

Since standard BPE segmentation is deterministic, to realize this regularization sentencepiece proposed a new subword segmentation algorithm. The introduced approach requires training a separate segmentation unigram language model to predict the probability of each subword, the EM algorithm to optimize the vocabulary, and the Viterbi algorithm to sample segmentations.

Subword regularization was shown to achieve significant improvements over the method using a single subword sequence. However, the proposed method is rather complicated and forbids using conventional BPE. This may prevent practitioners from using subword regularization.

3 Our Approach: BPE-Dropout

We show that to realize subword regularization it is not necessary to reject BPE, since multiple segmentation candidates can be generated within the BPE framework. We introduce BPE-dropout, a method which exploits the innate ability of BPE to be stochastic. It alters the segmentation procedure while keeping the original BPE merge table. During segmentation, at each merge step some merges are randomly dropped with probability p.
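A minimal toy sketch of this modified procedure, reusing the helpers from the Section 2.1 sketches; the stopping rule when no merge survives a step, the default value of p, and the use of Python's random module are illustrative assumptions, not the authors' reference implementation:

```python
import random

def bpe_dropout_segment(word, merge_table, p=0.1):
    """Toy BPE-dropout: like bpe_segment, but at each merge step every
    applicable merge is dropped (skipped) with probability p."""
    priority = {pair: rank for rank, pair in enumerate(merge_table)}
    symbols = list(word) + ["</w>"]
    while True:
        # Applicable merges that survive dropout at this step.
        survivors = [(priority[pair], i)
                     for i, pair in enumerate(zip(symbols, symbols[1:]))
                     if pair in priority and random.random() >= p]
        if not survivors:
            return symbols
        _, i = min(survivors)  # highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

# Repeated calls can yield different segmentations of the same word:
print(bpe_dropout_segment("lowest", merge_table, p=0.3))
print(bpe_dropout_segment("lowest", merge_table, p=0.3))
```

In this sketch, p = 0 recovers the standard BPE segmentation, while p = 1 leaves every word split into characters.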

References

Ivan Provilkov, Dmitrii Emelianenko, and Elena Voita (2020). "BPE-Dropout: Simple and Effective Subword Regularization".