Overview
BPE (Byte Pair Encoding) is the most common tokenization method for modern LLMs (including the GPT family). It strikes a balance between character-level and word-level tokenization.
Advantages
- Handles Out-of-Vocabulary (OOV) words: Can represent any word by breaking it into smaller sub-units.
- Efficiency: Common words become single tokens, while rare words are split into meaningful pieces (e.g., 'unhappy' -> 'un' + 'happy').
Process
Training starts with a vocabulary of individual characters and repeatedly merges the most frequent adjacent pair of symbols into a new token, until the desired vocabulary size is reached.
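The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: it uses a toy whitespace-split corpus, ignores byte-level details and tie-breaking rules used by real implementations, and all function names are my own.

```python
from collections import Counter

def get_pair_counts(words):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

def train_bpe(corpus, num_merges):
    # Start from individual characters; repeatedly merge the most frequent pair.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges

merges = train_bpe("low low low lower lowest", 3)
print(merges)  # e.g. [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

On this toy corpus, 'l'+'o' is merged first, then 'lo'+'w', so the frequent word "low" becomes a single token while rarer forms like "lowest" remain split into sub-units.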