Byte-Pair Encoding (BPE) Tokenization System


A Byte-Pair Encoding (BPE) Tokenization System is a dictionary encoding system that implements a BPE tokenization algorithm to solve a BPE tokenization task (iteratively replacing the most frequent pair of adjacent bytes with a single unused byte).
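As a sketch of the core idea, the toy Python function below repeatedly merges the most frequent adjacent byte pair into an unused byte value, in the spirit of Gage's original compression formulation of BPE; the function name, default merge count, and return shape are illustrative, not taken from any particular library:

```python
from collections import Counter

def bpe_compress(data: bytes, num_merges: int = 3):
    """Toy byte-pair encoding: repeatedly replace the most frequent
    adjacent byte pair with a single byte value unused in the data."""
    seq = list(data)
    unused = [b for b in range(256) if b not in set(seq)]
    merges = {}  # new byte value -> (left byte, right byte) it stands for
    for _ in range(num_merges):
        if len(seq) < 2 or not unused:
            break  # nothing left to merge, or no free byte values
        pairs = Counter(zip(seq, seq[1:]))
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; merging would not help
        new = unused.pop()
        merges[new] = (a, b)
        out, i = [], 0
        while i < len(seq):  # rewrite the sequence, merging (a, b) -> new
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

seq, merges = bpe_compress(b"aaabdaaabac")
print(len(b"aaabdaaabac"), "->", len(seq))  # 11 -> 5 symbols after 3 merges
print(merges)  # e.g. {255: (97, 97), 254: (255, 97), 253: (254, 98)}
```

Tokenizers used in language models apply the same merge procedure, but learn the merge table on a training corpus and then reuse it to split new text into subword tokens rather than to compress data.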



References

2022

  • https://dugas.ch/artificial_curiosity/GPT_architecture.html
    • QUOTE:
      • Note: For efficiency, GPT-3 actually uses byte-level Byte Pair Encoding (BPE) tokenization. What this means is that "words" in the vocabulary are not full words, but groups of characters (for byte-level BPE, bytes) which occur often in text. Using the GPT-3 Byte-level BPE tokenizer, "Not all heroes wear capes" is split into tokens "Not" "all" "heroes" "wear" "cap" "es", which have ids 3673, 477, 10281, 5806, 1451, 274 in the vocabulary. Here is a very good introduction to the subject, and a github implementation so you can try it yourself.
      • 2022 edit: OpenAI now has a tokenizer tool, which allows you to type some text and see how it gets broken down into tokens. [1] ...
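The tokenization described in the quote can be reproduced with OpenAI's open-source tiktoken library. The sketch below assumes tiktoken is installed and that the r50k_base encoding corresponds to the GPT-3-era tokenizer (an assumption worth checking against the tokenizer tool mentioned above); note that in byte-level BPE the leading space belongs to the token, so the decoded pieces are "Not", " all", " heroes", and so on:

```python
import tiktoken  # assumes: pip install tiktoken

# r50k_base is, by assumption here, the byte-level BPE encoding
# used by the original GPT-3 models.
enc = tiktoken.get_encoding("r50k_base")

ids = enc.encode("Not all heroes wear capes")
print(ids)  # per the quote: [3673, 477, 10281, 5806, 1451, 274]
print([enc.decode([i]) for i in ids])  # ['Not', ' all', ' heroes', ' wear', ' cap', 'es']
```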