Overview
AI models don't read words the way humans do. Tokenization converts text into a sequence of integer IDs, each representing a common character pattern, subword, or whole word.
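The mapping from text to integers can be sketched with a toy greedy tokenizer. The vocabulary below is hand-built purely for illustration; real tokenizers learn their vocabularies from large corpora.

```python
# Illustrative only: a toy subword tokenizer with a hand-built vocabulary.
# Real tokenizers (e.g. BPE, WordPiece) learn these entries from data.
VOCAB = {"token": 0, "iza": 1, "tion": 2, " ": 3, "in": 4, "action": 5}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest known substring at each position."""
    ids = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try longest match first
            piece = text[i:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

print(tokenize("tokenization in action"))  # -> [0, 1, 2, 3, 4, 3, 5]
```

Note how "tokenization" is split into three subword pieces ("token", "iza", "tion") rather than one word-level token.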
Token Limits
LLMs have a 'context window': a hard limit on the number of tokens they can process at once. Because this budget is fixed, efficient tokenization lets more text fit into the same window.
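Working against a context window amounts to budgeting tokens. The sketch below uses a hypothetical limit and approximates one token per whitespace-separated word; both are assumptions for illustration, since real token counts are model-specific.

```python
# Sketch of context-window budgeting.
# ASSUMPTIONS: an 8-token window (real models allow thousands) and a
# crude one-token-per-word estimate (real tokenizers split differently).
CONTEXT_WINDOW = 8

def approx_tokens(text: str) -> int:
    return len(text.split())

def fits(prompt: str, reserved_for_output: int = 2) -> bool:
    """Check whether the prompt leaves room for the model's reply."""
    return approx_tokens(prompt) + reserved_for_output <= CONTEXT_WINDOW

print(fits("a short prompt"))  # True: 3 + 2 <= 8
print(fits("a much longer prompt that keeps going and going"))  # False: 9 + 2 > 8
```

Reserving part of the window for the model's output is the key point: a prompt that exactly fills the window leaves no room for a reply.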
Common Tokenizers
- Byte Pair Encoding (BPE)
- WordPiece
- SentencePiece
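The core idea behind BPE, the first tokenizer listed above, can be shown in a few lines: repeatedly find the most frequent adjacent symbol pair in a corpus and merge it into a new symbol. This is a minimal sketch of a single training pass; a real implementation repeats it until the vocabulary reaches a target size.

```python
from collections import Counter

def most_frequent_pair(words: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across all words and return the top one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Replace every occurrence of the pair with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Tiny example corpus, each word split into characters to start.
corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
print(pair, corpus)
```

WordPiece differs mainly in how it scores candidate merges (likelihood rather than raw frequency), and SentencePiece operates on raw text without requiring pre-split words.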