Byte-Pair Encoding (BPE) Tokenization System


A Byte-Pair Encoding (BPE) Tokenization System is a dictionary encoding system that implements a BPE tokenization algorithm to solve a BPE tokenization task (iteratively replacing the most frequent pair of adjacent bytes with a single unused byte).
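As a sketch of the core idea, the toy Python function below repeatedly merges the most frequent adjacent byte pair into an unused byte value, in the spirit of Gage's original compression formulation of BPE; the function name, default merge count, and return shape are illustrative, not taken from any particular library:

```python
from collections import Counter

def bpe_compress(data: bytes, num_merges: int = 3):
    """Toy byte-pair encoding: repeatedly replace the most frequent
    adjacent byte pair with a single byte value unused in the data."""
    seq = list(data)
    unused = [b for b in range(256) if b not in set(seq)]
    merges = {}  # new byte value -> (left byte, right byte) it stands for
    for _ in range(num_merges):
        if len(seq) < 2 or not unused:
            break  # nothing left to merge, or no free byte values
        pairs = Counter(zip(seq, seq[1:]))
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats; merging would not help
        new = unused.pop()
        merges[new] = (a, b)
        out, i = [], 0
        while i < len(seq):  # rewrite the sequence, merging (a, b) -> new
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(new)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, merges

seq, merges = bpe_compress(b"aaabdaaabac")
print(len(b"aaabdaaabac"), "->", len(seq))  # 11 -> 5 symbols after 3 merges
print(merges)  # e.g. {255: (97, 97), 254: (255, 97), 253: (254, 98)}
```

Tokenizers used in language models apply the same merge procedure, but learn the merge table on a training corpus and then reuse it to split new text into subword tokens rather than to compress data.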



References

2022

  • https://dugas.ch/artificial_curiosity/GPT_architecture.html
    • QUOTE:
      • Note: For efficiency, GPT-3 actually uses byte-level Byte Pair Encoding (BPE) tokenization. What this means is that "words" in the vocabulary are not full words, but groups of characters (for byte-level BPE, bytes) which occur often in text. Using the GPT-3 Byte-level BPE tokenizer, "Not all heroes wear capes" is split into tokens "Not" "all" "heroes" "wear" "cap" "es", which have ids 3673, 477, 10281, 5806, 1451, 274 in the vocabulary. Here is a very good introduction to the subject, and a github implementation so you can try it yourself.
      • 2022 edit: OpenAI now has a tokenizer tool, which allows you to type some text and see how it gets broken down into tokens. [1] ...
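The tokenization described in the quote can be reproduced with OpenAI's open-source tiktoken library. The sketch below assumes tiktoken is installed and that the r50k_base encoding corresponds to the GPT-3-era tokenizer (an assumption worth checking against the tokenizer tool mentioned above); note that in byte-level BPE the leading space belongs to the token, so the decoded pieces are "Not", " all", " heroes", and so on:

```python
import tiktoken  # assumes: pip install tiktoken

# r50k_base is, by assumption here, the byte-level BPE encoding
# used by the original GPT-3 models.
enc = tiktoken.get_encoding("r50k_base")

ids = enc.encode("Not all heroes wear capes")
print(ids)  # per the quote: [3673, 477, 10281, 5806, 1451, 274]
print([enc.decode([i]) for i in ids])  # ['Not', ' all', ' heroes', ' wear', ' cap', 'es']
```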