Overview

AI models don't read words like humans do. Tokenization converts text into a sequence of integers that represent common character patterns, subwords, or words.
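As an illustration, the sketch below tokenizes a word by greedy longest-match against a tiny hand-made vocabulary. The vocabulary and integer IDs are invented for demonstration; real tokenizers learn vocabularies of tens of thousands of entries from large corpora.

```python
def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position
    and emit its integer ID. Raises if no entry matches."""
    ids = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # longest candidate first
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry matches at position {i}")
    return ids

# Toy vocabulary: three subword pieces cover the whole word.
TOY_VOCAB = {"token": 0, "iza": 1, "tion": 2}

print(tokenize("tokenization", TOY_VOCAB))  # → [0, 1, 2]
```

Note how a single word becomes several subword tokens: pieces that occur often in training text get their own vocabulary entries, while rarer words are split into smaller fragments.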

Token Limits

LLMs have a 'context window' defined by the maximum number of tokens they can process at once. The window size is fixed for a given model, so the more compactly a tokenizer encodes text, the more content fits inside it.
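A common pattern is to check a prompt's size before sending it to a model. The sketch below uses a rough rule of thumb of about 4 characters per token for English text; the 8192-token window and the amount reserved for output are arbitrary example values, not any specific model's limits.

```python
# Assumed example values, not tied to any specific model.
CONTEXT_WINDOW = 8192       # hypothetical maximum tokens per request
CHARS_PER_TOKEN = 4         # rough average for English text

def estimate_tokens(text: str) -> int:
    """Crude token-count estimate; real counts require the model's tokenizer."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_window(prompt: str, reserved_for_output: int = 512) -> bool:
    """Check that the prompt plus room for the model's reply fits the window."""
    return estimate_tokens(prompt) + reserved_for_output <= CONTEXT_WINDOW

print(fits_in_window("Summarize this paragraph."))  # → True
print(fits_in_window("x" * 40_000))                 # → False
```

In practice you would count tokens with the model's own tokenizer rather than estimating, since character-per-token ratios vary widely across languages and content types.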

Common Tokenizers

  • Byte Pair Encoding (BPE)
  • WordPiece
  • SentencePiece
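The core idea behind BPE, the first tokenizer listed above, is to repeatedly merge the most frequent adjacent pair of symbols in a corpus. Below is a minimal sketch of one merge step; the tiny corpus and its frequencies are invented for illustration.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by
    word frequency, and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Invented corpus: words split into characters, mapped to frequencies.
corpus = {("l", "o", "w"): 5, ("l", "o", "g"): 2, ("n", "e", "w"): 3}

pair = most_frequent_pair(corpus)   # ("l", "o") occurs 7 times
corpus = merge_pair(corpus, pair)   # "l","o" becomes the new symbol "lo"
print(corpus)
```

A full BPE trainer simply repeats this step until it reaches a target vocabulary size, recording each merge so the same sequence can be replayed when tokenizing new text. WordPiece and SentencePiece differ mainly in how they score candidate merges and handle raw input, not in this overall loop.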

Related Terms