Overview
Developed by Google, SentencePiece is used by models such as T5 and Llama. It is not a separate algorithm from BPE: it is a tokenization framework that implements BPE and unigram language-model segmentation directly on raw text. Unlike typical BPE or WordPiece pipelines, it requires no pre-tokenization step (such as splitting on spaces), which makes it well suited to languages without explicit word boundaries, such as Chinese or Japanese.
Key Innovation
It treats the space character as an ordinary symbol, replacing it with the metasymbol '▁' (U+2581, often mistaken for an underscore). Because whitespace survives in the token stream, the original text can be reconstructed exactly from the tokens; SentencePiece calls this lossless tokenization.
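The round trip can be sketched in a few lines of plain Python. This is an illustration of the space-handling idea only, not the library's API: the `pre_segment` and `detokenize` helpers are hypothetical names, and a trained model would choose the subword splits rather than the arbitrary slices used here.

```python
# Sketch of SentencePiece's space handling: spaces become the metasymbol
# '▁' (U+2581) before segmentation, so detokenization is a pure string
# operation and the original text is recovered exactly.

META = "\u2581"  # '▁', the symbol used in place of spaces

def pre_segment(text: str) -> str:
    # Map spaces to the metasymbol; the real library then learns
    # subword units over this string (illustrative only).
    return text.replace(" ", META)

def detokenize(tokens: list[str]) -> str:
    # Concatenate pieces and map the metasymbol back to spaces.
    return "".join(tokens).replace(META, " ")

text = "Hello world again"
encoded = pre_segment(text)  # 'Hello▁world▁again'
# Pretend a trained model split this into arbitrary subword pieces:
tokens = [encoded[:3], encoded[3:9], encoded[9:]]
assert detokenize(tokens) == text  # lossless round trip
```

Because the metasymbol travels with the tokens, no language-specific detokenizer is needed: joining the pieces and restoring spaces is always sufficient.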
Benefits
- Truly language-independent: the same pipeline works for any language, with no word-boundary assumptions.
- No language-specific pre-tokenization rules or normalizers to maintain.
- Lossless: detokenization is simple string concatenation, with no ambiguity about whitespace.