Overview

Byte Pair Encoding (BPE) is the most common tokenization method for modern LLMs (including the GPT models). It strikes a balance between character-level and word-level tokenization.

Advantages

  • Handles Out-of-Vocabulary (OOV) words: Can represent any word by breaking it into smaller sub-units.
  • Efficiency: Common words become single tokens, while rare words are split into meaningful pieces (e.g., 'unhappy' -> 'un' + 'happy').
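The 'unhappy' example can be made concrete with a sketch of how a trained BPE tokenizer segments a word: start from characters and apply the learned merges in order. The merge list below is hypothetical, chosen only so the word splits into 'un' + 'happy'.

```python
def bpe_segment(word, merges):
    """Split a word into subword tokens by applying learned merges in order."""
    tokens = list(word)
    for a, b in merges:
        i = 0
        while i < len(tokens) - 1:
            if tokens[i] == a and tokens[i + 1] == b:
                # Collapse the adjacent pair into one merged token.
                tokens[i:i + 2] = [a + b]
            else:
                i += 1
    return tokens

# Hypothetical merge list that happens to build 'un' and 'happy'.
merges = [('h', 'a'), ('p', 'p'), ('ha', 'pp'), ('happ', 'y'), ('u', 'n')]
print(bpe_segment("unhappy", merges))  # → ['un', 'happy']
```

Because every single character stays in the base vocabulary, even a word with no matching merges still tokenizes, just less compactly, which is why BPE never produces an OOV failure.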

Process

The algorithm starts with individual characters as the vocabulary and repeatedly merges the most frequent adjacent pair of symbols, adding the merged symbol to the vocabulary, until a desired vocabulary size is reached.
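The merge loop above can be sketched in a few lines. This is a minimal illustration on a toy corpus (the word list and frequencies are made up for the example), not a production implementation:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    counts = Counter()
    for word, freq in corpus.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of the pair with its merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq
            for word, freq in corpus.items()}

# Toy corpus: words pre-split into characters, mapped to frequencies.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(4):  # target vocabulary size determines the number of merges
    counts = get_pair_counts(corpus)
    best = max(counts, key=counts.get)  # most frequent adjacent pair
    corpus = merge_pair(corpus, best)
    merges.append(best)

print(merges)  # learned merges, e.g. ('e', 's') first, then ('es', 't'), ...
```

Each iteration grows the vocabulary by one symbol; the recorded merge list is what a tokenizer later replays, in the same order, to segment new text.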

Related Terms