Overview
Weight decay is closely related to L2 regularization: an extra penalty proportional to the squared magnitude of the weights is added to the loss, which is equivalent to shrinking every weight toward zero at each update step. For plain SGD the two formulations coincide exactly; for adaptive optimizers such as Adam they differ, which is why variants like AdamW apply the decay directly to the weights. By penalizing large weights, weight decay keeps the weights small, yielding a simpler, smoother model that is less likely to overfit.
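The shrink-toward-zero effect can be sketched in a few lines. This is a minimal illustration, not any particular library's implementation: `sgd_step` is a hypothetical helper showing one SGD update with an L2 penalty folded into the gradient.

```python
def sgd_step(w, grad, lr=0.1, weight_decay=0.01):
    # The gradient of the penalty 0.5 * weight_decay * w**2 is weight_decay * w,
    # so each update multiplies w by (1 - lr * weight_decay): a geometric "decay".
    return w - lr * (grad + weight_decay * w)

w = 5.0
for _ in range(100):
    w = sgd_step(w, grad=0.0)  # zero data gradient: only the decay acts
print(w)  # strictly smaller than 5.0, decaying toward zero
```

With the data gradient set to zero, the weight shrinks by the same factor every step, which makes the "decay" in the name literal.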
Why it Works
Large weights often indicate that the model is leaning heavily on a few specific features, or on noise in the training data. Small weights encourage the model to spread its reliance more evenly across features, so no single input can dominate the prediction.
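A toy example makes the "spread evenly" effect concrete. The data below is an assumption for illustration: a single target fit with two copies of the same feature. Without the penalty, gradient descent preserves whatever imbalance the weights start with; with it, the duplicate weights are pulled toward equal, smaller values.

```python
def fit(lam, steps=5000, lr=0.01):
    w1, w2 = 0.9, 0.1          # deliberately unbalanced starting weights
    x, y = 1.0, 1.0            # one data point with a duplicated feature
    for _ in range(steps):
        err = (w1 * x + w2 * x) - y
        # Squared-error gradient plus the L2 penalty gradient 2 * lam * w
        w1 -= lr * (2 * err * x + 2 * lam * w1)
        w2 -= lr * (2 * err * x + 2 * lam * w2)
    return w1, w2

print(fit(lam=0.0))   # no penalty: the initial imbalance persists
print(fit(lam=0.1))   # with penalty: weights converge to near-equal values
```

The penalty term is what breaks the tie: the loss alone is indifferent to how the total weight is split between the two copies, but the sum of squares is smallest when it is split evenly.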