Overview

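Momentum is a technique for gradient-based optimization that helps the optimizer 'roll' past small bumps in the loss surface (shallow local minima) and speed up in directions where the gradient stays consistent from step to step. It is inspired by a ball rolling down a hill, which gathers speed as it goes rather than stopping at every dip.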

How it Works

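Instead of using only the current gradient, each update is a weighted average of the current gradient and the previous update. This running 'velocity' builds up along directions where successive gradients agree and largely cancels where they alternate in sign, which tends to make convergence faster and less oscillatory.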
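
The update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the names momentum_step, minimize, and final_velocity are hypothetical, and the defaults lr = 0.1 (learning rate) and beta = 0.9 (weight on the previous velocity) are arbitrary choices for the demo. It uses the weighted-average form, v = beta * v + (1 - beta) * grad. Another common convention drops the (1 - beta) factor, which mainly rescales the effective learning rate.

```python
def momentum_step(w, v, grad, lr=0.1, beta=0.9):
    """Return (new_w, new_v) after one momentum update.

    The velocity is a weighted average of the previous velocity and the
    current gradient; the parameter then moves against the velocity.
    lr and beta are illustrative defaults, not recommended values.
    """
    v_new = beta * v + (1.0 - beta) * grad
    w_new = w - lr * v_new
    return w_new, v_new


def minimize(grad_fn, w0, steps=300, lr=0.1, beta=0.9):
    """Run `steps` momentum updates from w0, starting with zero velocity."""
    w, v = w0, 0.0
    for _ in range(steps):
        w, v = momentum_step(w, v, grad_fn(w), lr, beta)
    return w


def final_velocity(grads, beta=0.9):
    """Velocity after feeding a fixed sequence of gradients (w is ignored)."""
    v = 0.0
    for g in grads:
        _, v = momentum_step(0.0, v, g, beta=beta)
    return v


if __name__ == "__main__":
    # f(w) = w**2 has gradient 2*w and its minimum at w = 0.
    print(minimize(lambda w: 2.0 * w, w0=5.0))
    # Gradients that agree build up velocity; alternating ones mostly cancel.
    print(final_velocity([1.0] * 50))
    print(final_velocity([1.0, -1.0] * 25))
```

In this sketch, a run of identical gradients drives the velocity toward the full gradient value, while gradients that flip sign every step leave only a small residual velocity. That is the mechanism behind both the speed-up in consistent directions and the reduced oscillation.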

Related Terms