Overview

WordPiece was developed by Google and is the tokenizer used for BERT. Like BPE, it breaks words into sub-units to handle a wide variety of text efficiently.

Key Feature

It uses a '##' prefix to indicate that a subword is a continuation of a previous token (e.g., 'playing' might be tokenized as ['play', '##ing']).
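At inference time, WordPiece tokenizes a word by greedily matching the longest vocabulary entry from the left, prefixing mid-word pieces with '##'. A minimal sketch of that matching loop, using a made-up toy vocabulary rather than BERT's real one:

```python
def wordpiece_tokenize(word, vocab):
    # Greedy longest-match-first tokenization, WordPiece style.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the window until the substring is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matches: whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary for illustration only.
vocab = {"play", "##ing", "##ed", "un"}
print(wordpiece_tokenize("playing", vocab))  # → ['play', '##ing']
```

Note the fallback: if any span of the word cannot be matched, the entire word maps to the unknown token, which is how BERT's tokenizer behaves as well.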

Comparison with BPE

While BPE greedily merges the most frequent symbol pair at each training step, WordPiece scores each candidate pair by how much merging it would increase the likelihood of the training data — roughly, the pair's frequency divided by the product of its parts' frequencies. This favors pairs whose parts rarely appear apart, so the learned subwords tend to be more informative units.
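The difference in merge selection can be shown with toy counts (the frequencies below are invented for illustration):

```python
# BPE picks the most frequent pair; WordPiece scores each pair by
# freq(pair) / (freq(a) * freq(b)), a likelihood-based criterion.
pair_freq = {("e", "s"): 90, ("q", "u"): 30}       # toy pair counts
unit_freq = {"e": 500, "s": 400, "q": 30, "u": 40}  # toy unit counts

def bpe_pick(pairs):
    # BPE: highest raw pair frequency wins.
    return max(pairs, key=pairs.get)

def wordpiece_pick(pairs, units):
    # WordPiece: highest frequency relative to the parts' frequencies wins.
    return max(pairs, key=lambda p: pairs[p] / (units[p[0]] * units[p[1]]))

print(bpe_pick(pair_freq))                   # → ('e', 's')
print(wordpiece_pick(pair_freq, unit_freq))  # → ('q', 'u')
```

Here 'e' and 's' co-occur often but are each very common on their own, so BPE merges them first; 'q' and 'u' co-occur almost every time either appears, so WordPiece prefers that merge.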

Related Terms