Overview
Instead of performing a single attention computation over the full embedding, multi-head attention splits the queries, keys, and values into several lower-dimensional 'heads' that attend in parallel. Each head can learn to focus on a different type of relationship (e.g., one head on syntactic structure, another on semantic meaning).
Benefits
- Richness: Captures a more diverse set of relationships within the data.
- Stability: Concatenating the heads' outputs and projecting them makes the model more robust than relying on a single attention distribution.
Implementation
Multi-head attention is a core component of the Transformer architecture used in models like GPT and BERT: the input is projected into queries, keys, and values, each head runs scaled dot-product attention on its own slice, and the heads' outputs are concatenated and passed through a final output projection.
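The mechanism above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random, there is no masking, batching, or bias terms, and the function and variable names (`multi_head_attention`, `W_q`, etc.) are chosen here for clarity rather than taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Scaled dot-product attention run over several heads in parallel."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input into queries, keys, and values
    q, k, v = x @ W_q, x @ W_k, x @ W_v

    # Split d_model into (num_heads, d_head) and move heads to the front
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)

    # Each head attends independently: scores have shape (num_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model) and mix with the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 16, 4, 4
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (4, 16) — same shape as the input
```

Note that the per-head dimension is `d_model // num_heads`, so the total cost is comparable to a single full-width attention; the parallel heads add representational diversity, not extra width.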