Overview
Instead of performing a single attention computation over the full embedding, multi-head attention splits the queries, keys, and values into several lower-dimensional 'heads' that attend in parallel. Each head can learn to focus on a different type of relationship (e.g., one head on syntactic structure, another on semantic meaning).
Benefits
- Richness: Captures a more diverse set of relationships within the data.
- Stability: Concatenating the heads' outputs and projecting them makes the model more robust than relying on a single attention distribution.
Implementation
Multi-head attention is a core component of the Transformer architecture used in models like GPT and BERT: the input is projected into queries, keys, and values, each head runs scaled dot-product attention on its own slice, and the heads' outputs are concatenated and passed through a final output projection.
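The mechanism above can be sketched in plain NumPy. This is a minimal illustration, not a production implementation: the projection matrices are random, there is no masking, batching, or bias terms, and the function and variable names (`multi_head_attention`, `W_q`, etc.) are chosen here for clarity rather than taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Scaled dot-product attention run over several heads in parallel."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    # Project the input into queries, keys, and values
    q, k, v = x @ W_q, x @ W_k, x @ W_v

    # Split d_model into (num_heads, d_head) and move heads to the front
    def split(t):
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)

    # Each head attends independently: scores have shape (num_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate heads back to (seq_len, d_model) and mix with the output projection
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
d_model, seq_len, num_heads = 16, 4, 4
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)  # (4, 16) — same shape as the input
```

Note that the per-head dimension is `d_model // num_heads`, so the total cost is comparable to a single full-width attention; the parallel heads add representational diversity, not extra width.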