Overview
Developed by Google, SentencePiece is used by models such as T5 and Llama. It is not a separate algorithm from BPE: it is a tokenization framework that implements BPE and unigram language-model segmentation directly on raw text. Unlike typical BPE or WordPiece pipelines, it requires no pre-tokenization step (such as splitting on spaces), which makes it well suited to languages without explicit word boundaries, such as Chinese or Japanese.
Key Innovation
It treats the space character as an ordinary symbol, replacing it with the metasymbol '▁' (U+2581, often mistaken for an underscore). Because whitespace survives in the token stream, the original text can be reconstructed exactly from the tokens; SentencePiece calls this lossless tokenization.
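The round trip can be sketched in a few lines of plain Python. This is an illustration of the space-handling idea only, not the library's API: the `pre_segment` and `detokenize` helpers are hypothetical names, and a trained model would choose the subword splits rather than the arbitrary slices used here.

```python
# Sketch of SentencePiece's space handling: spaces become the metasymbol
# '▁' (U+2581) before segmentation, so detokenization is a pure string
# operation and the original text is recovered exactly.

META = "\u2581"  # '▁', the symbol used in place of spaces

def pre_segment(text: str) -> str:
    # Map spaces to the metasymbol; the real library then learns
    # subword units over this string (illustrative only).
    return text.replace(" ", META)

def detokenize(tokens: list[str]) -> str:
    # Concatenate pieces and map the metasymbol back to spaces.
    return "".join(tokens).replace(META, " ")

text = "Hello world again"
encoded = pre_segment(text)  # 'Hello▁world▁again'
# Pretend a trained model split this into arbitrary subword pieces:
tokens = [encoded[:3], encoded[3:9], encoded[9:]]
assert detokenize(tokens) == text  # lossless round trip
```

Because the metasymbol travels with the tokens, no language-specific detokenizer is needed: joining the pieces and restoring spaces is always sufficient.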
Benefits
- Truly language-independent: the same pipeline works for any language, with no word-boundary assumptions.
- No language-specific pre-tokenization rules or normalizers to maintain.
- Lossless: detokenization is simple string concatenation, with no ambiguity about whitespace.