Overview
BPE (Byte Pair Encoding) is the most common tokenization method for modern LLMs (including the GPT family). It strikes a balance between character-level and word-level tokenization.
Advantages
- Handles Out-of-Vocabulary (OOV) words: Can represent any word by breaking it into smaller sub-units.
- Efficiency: Common words become single tokens, while rare words are split into meaningful pieces (e.g., 'unhappy' -> 'un' + 'happy').
Process
Training starts with a vocabulary of individual characters and repeatedly merges the most frequent adjacent pair of symbols into a new token, until the desired vocabulary size is reached.
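The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: it uses a toy whitespace-split corpus, ignores byte-level details and tie-breaking rules used by real implementations, and all function names are my own.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from individual characters; repeatedly merge the most frequent pair.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges

merges = train_bpe("low low low lower lowest", 3)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

On this toy corpus, 'l'+'o' is merged first, then 'lo'+'w', so the frequent word "low" becomes a single token while rarer forms like "lowest" remain split into sub-units.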