Overview
Self-attention (also called intra-attention) lets each word in a sentence 'attend' to every other word, helping the model capture context and relationships within the sequence.
Example
In the sentence 'The animal didn't cross the street because it was too tired,' self-attention helps the model realize that 'it' refers to the 'animal,' not the 'street.'
Components
Each input is projected into three vectors: a Query, a Key, and a Value. The attention score for a word is computed by comparing its Query with the Keys of every word in the sequence (including itself); the scores are then normalized with a softmax and used to take a weighted sum of the Values, producing that word's output representation.
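The Query/Key/Value mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration of scaled dot-product self-attention, not a production implementation; the projection matrices Wq, Wk, Wv and the toy dimensions are hypothetical choices for the example.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project each input row into Query, Key, and Value vectors.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Compare each Query with every Key (scaled dot products).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over each row turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted sum of all Value vectors.
    return weights @ V, weights

# Toy setup: 5 tokens, embedding dimension 8 (illustrative values).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Each row of `weights` sums to 1, so row i shows how strongly token i attends to every token in the sequence, and `out` has one context-mixed vector per token.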