Overview

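Developed by Google, SentencePiece is a subword tokenizer used by models such as T5 and Llama. Unlike conventional BPE or WordPiece pipelines, it does not require a pre-tokenization step (such as splitting on spaces) and instead trains directly on raw text. This makes it well suited to languages such as Chinese and Japanese, which do not separate words with spaces.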
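As a small illustration of why whitespace pre-tokenization is a problem, the sketch below uses plain Python rather than SentencePiece itself. The helper name `whitespace_pretokenize` and the sample sentences are made up for this example.

```python
def whitespace_pretokenize(text):
    """Split on whitespace, as a conventional pre-tokenization step might."""
    return text.split()

english = "This is a pen"
japanese = "これはペンです"  # same meaning, written without spaces

# English falls apart into word-like chunks that a subword model can refine.
print(whitespace_pretokenize(english))

# Japanese comes back as a single undivided chunk, so the split adds nothing.
print(whitespace_pretokenize(japanese))
```

A tokenizer that depends on this step would need a separate, language-specific word segmenter for the second sentence.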

Key Innovation

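It treats the space character as an ordinary symbol, escaping it as the meta symbol '▁' (U+2581, which looks like an underscore but is a distinct character). Because whitespace is kept inside the tokens, the original (normalized) text can be reconstructed from them without ambiguity.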
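The idea can be sketched in a few lines of plain Python. This is a toy, not the SentencePiece implementation: the greedy longest-match segmenter, the hand-written vocabulary, and the names `to_pieces` and `from_pieces` are assumptions made for illustration (real SentencePiece learns its vocabulary with BPE or a unigram language model, and by default also normalizes the text first).

```python
SPACE_MARKER = "\u2581"  # the "▁" symbol that stands in for a space

def to_pieces(text, vocab):
    """Toy segmenter: mark spaces, then greedily match the longest vocab entry."""
    marked = text.replace(" ", SPACE_MARKER)
    pieces = []
    i = 0
    while i < len(marked):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(marked), i, -1):
            if marked[i:j] in vocab or j == i + 1:
                pieces.append(marked[i:j])
                i = j
                break
    return pieces

def from_pieces(pieces):
    """Decoding is plain concatenation followed by restoring the spaces."""
    return "".join(pieces).replace(SPACE_MARKER, " ")

toy_vocab = {"Hello", SPACE_MARKER + "wor", "ld"}  # hypothetical vocabulary
text = "Hello world"
pieces = to_pieces(text, toy_vocab)
print(pieces)               # ['Hello', '▁wor', 'ld']
print(from_pieces(pieces))  # Hello world
```

Because the space travels inside the piece '▁wor', decoding needs no rules about where spaces belong. The round trip in this toy holds for any input that does not already contain the marker character.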

Benefits

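  • Language-independent: the same procedure applies whether or not a language puts spaces between words.
  • No need for language-specific pre-tokenizers or segmentation rules.
  • Efficient and robust: segmentation needs no external tools, and decoding needs no heuristics to restore spacing.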

Related Terms