Overview

Instead of performing a single attention computation over the full model dimension, multi-head attention runs several attention 'heads' in parallel, each with its own learned projections of the queries, keys, and values. Each head can therefore learn to focus on a different type of relationship (e.g., one head tracking syntactic structure, another semantic meaning).
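Each head performs the same underlying operation: scaled dot-product attention. A minimal NumPy sketch of a single head (assuming random example inputs; the function name is illustrative, not from any particular library):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V for one head."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) similarities
    scores -= scores.max(axis=-1, keepdims=True)    # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))  # 4 tokens, d_k = 8
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs this computation once per head, each on its own projected slice of the input.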

Benefits

  • Richness: Captures a more diverse set of relationships within the data.
  • Stability: Concatenating the outputs of multiple heads (followed by a linear projection) makes the model more robust; if one head learns an unhelpful pattern, the others can compensate.

Implementation

Multi-head attention is a core component of the Transformer architecture used in models like GPT and BERT. In the standard formulation, the model dimension d_model is split across h heads, so each head attends over a smaller d_model / h slice and the total compute stays close to that of single-head attention at full dimensionality.
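The split-attend-concatenate pipeline described above can be sketched in a few lines of NumPy. This is a simplified illustration (no batching, masking, or biases), and the weight matrices here are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Split d_model across num_heads, attend in parallel, then concatenate."""
    seq_len, d_model = X.shape
    d_head = d_model // num_heads
    # Project, then reshape to (num_heads, seq_len, d_head)
    Q = (X @ W_q).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores) @ V                          # each head attends independently
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)  # concatenate heads
    return concat @ W_o                                  # final output projection

rng = np.random.default_rng(0)
d_model, seq_len, h = 16, 5, 4
X = rng.standard_normal((seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) for _ in range(4))
out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads=h)
print(out.shape)  # (5, 16)
```

Note the output shape matches the input shape, which is what allows attention layers to be stacked in a Transformer.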

Related Terms