Overview
AI models don't read words the way humans do. Tokenization converts text into a sequence of integer IDs, each representing a common character pattern, subword, or whole word.
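The mapping from text to integers can be sketched with a toy greedy tokenizer. The vocabulary below is hand-built purely for illustration; real tokenizers learn their vocabularies from large corpora.

```python
# Illustrative only: a toy subword tokenizer with a hand-built vocabulary.
# Real tokenizers (e.g. BPE, WordPiece) learn these entries from data.
VOCAB = {"token": 0, "iza": 1, "tion": 2, " ": 3, "in": 4, "action": 5}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest known substring at each position."""
    ids = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):  # try longest match first
            piece = text[i:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return ids

print(tokenize("tokenization in action"))  # -> [0, 1, 2, 3, 4, 3, 5]
```

Note how "tokenization" is split into three subword pieces ("token", "iza", "tion") rather than one word-level token.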
Token Limits
LLMs have a 'context window': a hard limit on the number of tokens they can process at once. Because this budget is fixed, efficient tokenization lets more text fit into the same window.
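Working against a context window amounts to budgeting tokens. The sketch below uses a hypothetical limit and approximates one token per whitespace-separated word; both are assumptions for illustration, since real token counts are model-specific.

```python
# Sketch of context-window budgeting.
# ASSUMPTIONS: an 8-token window (real models allow thousands) and a
# crude one-token-per-word estimate (real tokenizers split differently).
CONTEXT_WINDOW = 8

def approx_tokens(text: str) -> int:
    return len(text.split())

def fits(prompt: str, reserved_for_output: int = 2) -> bool:
    """Check whether the prompt leaves room for the model's reply."""
    return approx_tokens(prompt) + reserved_for_output <= CONTEXT_WINDOW

print(fits("a short prompt"))  # True: 3 + 2 <= 8
print(fits("a much longer prompt that keeps going and going"))  # False: 9 + 2 > 8
```

Reserving part of the window for the model's output is the key point: a prompt that exactly fills the window leaves no room for a reply.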
Common Tokenizers
- Byte Pair Encoding (BPE)
- WordPiece
- SentencePiece
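The core idea behind BPE, the first tokenizer listed above, can be shown in a few lines: repeatedly find the most frequent adjacent symbol pair in a corpus and merge it into a new symbol. This is a minimal sketch of a single training pass; a real implementation repeats it until the vocabulary reaches a target size.

```python
from collections import Counter

def most_frequent_pair(words: list[list[str]]) -> tuple[str, str]:
    """Count adjacent symbol pairs across all words and return the top one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words: list[list[str]], pair: tuple[str, str]) -> list[list[str]]:
    """Replace every occurrence of the pair with one merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Tiny example corpus, each word split into characters to start.
corpus = [list("lower"), list("lowest"), list("low")]
pair = most_frequent_pair(corpus)
corpus = merge_pair(corpus, pair)
print(pair, corpus)
```

WordPiece differs mainly in how it scores candidate merges (likelihood rather than raw frequency), and SentencePiece operates on raw text without requiring pre-split words.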