Byte-Pair Encoding (BPE) Algorithm

From GM-RKB
Jump to navigation Jump to search

A Byte-Pair Encoding (BPE) Algorithm is a dictionary encoding algorithm that can be implemented by a BPE-based tokenization system to solve a BPE-based tokenization task (to replace frequent byte-pairs with a single unused byte).



References

2021

  • (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Byte_pair_encoding Retrieved:2021-5-2.
    • Byte pair encoding[1] or digram coding is a simple form of data compression in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur within that data. A table of the replacements is required to rebuild the original data. The algorithm was first described publicly by Philip Gage in a February 1994 article "A New Algorithm for Data Compression" in the C Users Journal. A variant of the technique has shown to be useful in several natural language processing (NLP) applications, such as Google's SentencePiece, and OpenAI's GPT-3.
  1. Gage, Philip (1994). "A New Algorithm for Data Compression". The C User Journal.

2020

Algorithm 1 Byte-pair encoding (Sennrich et al., 2016; Gage, 1994)
1: Input: set of strings $D$, target vocab size $k$

2: procedure BPE$\left(D, k\right)$

3:   $V \leftarrow$ all unique characters in $D$ (about 4,000 in English Wikipedia)

4:   while $\vert V \vert < k$ do ⊳ Merge tokens

5:     $t_L, t_R \leftarrow$ Most frequent bigram in $D$

6:     $t_{NEW} \leftarrow t_L + t_R$ ⊳ Make new token

7:     $V \leftarrow V + [t_{NEW}]$

8:      Replace each occurrence of $t_L$, $t_R$ in $D$ with $t_{NEW}$

9:   end while

10:    return $V$

11:    end procedure

 :: BPE tokenization takes the vocabulary $V$ containing ordered merges and applies them to new text in the same order as they occurred during vocabulary construction.

2020

2016

2016 NeuralMachineTranslationofRareW Fig1.png
Figure 1: BPE merge operations learned from dictionary {‘low’, ‘lowest’, ‘newer’, ‘wider’}.
  1. The only symbols that will be unknown at test time are unknown characters, or symbols of which all occurrences in the training text have been merged into larger symbols, like "safeguar", which has all occurrences in our training text merged into "safeguard". We observed no such symbols at test time, but the issue could be easily solved by recursively reversing specific merges until all symbols are known.

2017

  • (Heinzerling & Strube, 2017) ⇒ Benjamin Heinzerling, and Michael Strube. (2017). “BPEmb: Tokenization-free Pre-trained Subword Embeddings in 275 Languages.” In: arXiv preprint arXiv:1710.02187.
    • QUOTE: ... Byte Pair Encoding is a variable-length encoding that views text as a sequence of symbols and iteratively merges the most frequent symbol pair into a new symbol. E.g., encoding an English text might consist of first merging the most frequent symbol pair $t h$ into a new symbol $th$, then merging the pair $th e$ into the in the next iteration, and so on. The number of merge operations o determines if the resulting encoding mostly creates short character sequences (e.g. $o = 1000$) or if it includes symbols for many frequently occurring words, e.g. o = 30; 000 (cf. Table 1). Since the BPE algorithm works with any sequence of symbols, it requires no preprocessing and can be applied to untokenized text. …

1994