Overview

Because Transformers process all tokens in a sequence simultaneously (unlike RNNs), they have no inherent sense of word order. Positional encoding adds a unique signal to each token's embedding to indicate its position.

How it Works

In the original Transformer, sine and cosine functions of different frequencies generate a position-dependent vector that is added to each token's embedding. Because the encoding at position pos + k can be written as a linear function of the encoding at position pos, this scheme makes it easy for the model to attend to relative positions, not just absolute ones.
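A minimal sketch of the sinusoidal scheme, following the standard formulas PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name and the example dimensions are illustrative choices, not part of any library API:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position signals."""
    positions = np.arange(seq_len)[:, np.newaxis]   # shape (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]        # shape (1, d_model)
    # Each even/odd pair of dimensions shares one frequency: 1 / 10000^(2i/d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates                # broadcast to (seq_len, d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])           # even dimensions use sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])           # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=16)
# In a model, the matrix is simply added to the token embeddings:
# inputs = token_embeddings + pe
```

Every value lies in [-1, 1], so the signal stays on the same scale as typical embeddings, and the fixed formula extrapolates to sequence lengths never seen during training.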

Importance

Without positional encoding, the model would treat 'The dog bit the man' and 'The man bit the dog' as the same bag of words, since self-attention alone is permutation-invariant.
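The order-invariance problem can be demonstrated directly: with a toy embedding table (random values, purely illustrative) and no positional signal, any order-insensitive pooling of the two sentences produces identical vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical embedding table for a four-word vocabulary
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
emb = rng.normal(size=(len(vocab), 8))

def encode(sentence: str) -> np.ndarray:
    """Look up an embedding for each token; no position information added."""
    return np.stack([emb[vocab[w]] for w in sentence.lower().split()])

a = encode("The dog bit the man")
b = encode("The man bit the dog")

# Both sentences contain the same multiset of tokens, so a
# permutation-invariant summary cannot tell them apart:
print(np.allclose(a.sum(axis=0), b.sum(axis=0)))  # True
```

Adding a position-dependent vector to each row of `a` and `b` before pooling breaks this symmetry, which is exactly what positional encoding provides.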