Overview
WordPiece was developed at Google and is the tokenizer used by BERT. Like BPE, it breaks words into subword units, which lets a fixed-size vocabulary cover rare and out-of-vocabulary words without falling back to character-level tokens.
Key Feature
It uses a '##' prefix to mark a subword that continues a word rather than starting one (e.g., 'playing' might be tokenized as ['play', '##ing']). At tokenization time, each word is split greedily: the longest vocabulary entry matching the start of the word is taken first, and the remainder is matched with '##'-prefixed pieces.
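The greedy longest-match-first procedure can be sketched as follows. This is a minimal illustration, not BERT's actual implementation; the tiny vocabulary is hypothetical (real BERT vocabularies contain roughly 30,000 entries).

```python
def wordpiece_tokenize(word, vocab):
    """Split a word into subwords, marking continuations with '##'."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest possible piece first, shrinking until one is in vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of the same word
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matched: whole word becomes unknown
        tokens.append(match)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration.
vocab = {"play", "##ing", "##ed", "un"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
```

Note that the greedy strategy means tokenization requires no search over segmentations: each word is processed left to right in a single pass.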
Comparison with BPE
While BPE merges the most frequent pair of symbols at each training step, WordPiece chooses the merge that most increases the likelihood of the training corpus. In practice this means scoring each candidate pair by its joint frequency divided by the product of the frequencies of its parts, which favors pairs that occur together far more often than chance.
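The difference between the two merge criteria can be shown with toy numbers. All counts below are hypothetical, chosen only to illustrate how the likelihood-based score can prefer a rarer pair over a more frequent one:

```python
# Hypothetical pair and unit frequencies from an imaginary corpus.
pair_counts = {("t", "h"): 90, ("q", "u"): 10}
unit_counts = {"t": 500, "h": 400, "q": 10, "u": 60}

def wordpiece_score(pair):
    """score(a, b) = count(ab) / (count(a) * count(b))"""
    a, b = pair
    return pair_counts[pair] / (unit_counts[a] * unit_counts[b])

# BPE would merge ('t', 'h'): it is the more frequent pair (90 vs 10).
bpe_choice = max(pair_counts, key=pair_counts.get)

# WordPiece merges ('q', 'u'): 'q' and 'u' almost always occur together,
# so merging them yields a larger likelihood gain despite the lower count.
wordpiece_choice = max(pair_counts, key=wordpiece_score)

print(bpe_choice)        # ('t', 'h')
print(wordpiece_choice)  # ('q', 'u')
```

Here score('t','h') = 90 / (500 * 400) = 0.00045, while score('q','u') = 10 / (10 * 60) ≈ 0.0167, so the likelihood criterion picks the tightly bound pair.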