Overview

SGD (Stochastic Gradient Descent) is the workhorse of neural network training. Unlike standard (batch) Gradient Descent, which computes the gradient over the entire dataset before each update, SGD estimates the gradient from a single randomly chosen example, or from a small 'mini-batch' of examples.
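The single-example update described above can be sketched as follows. This is a minimal illustration on a hypothetical linear least-squares problem (the data, learning rate, and step count are all assumptions for the demo, not part of the original text):

```python
import numpy as np

# Hypothetical setup: fit weights w to synthetic linear data by
# minimizing squared error with plain (single-sample) SGD.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.01 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.05
for step in range(2000):
    i = rng.integers(len(X))             # pick one random example
    grad = 2 * (X[i] @ w - y[i]) * X[i]  # gradient of (x_i . w - y_i)^2
    w -= lr * grad                       # SGD update: w <- w - lr * grad
```

Each step is cheap (one example), at the cost of a noisy gradient estimate; that noise is exactly the "stochastic" part of the name.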

Why use SGD?

  • Efficiency: Each update processes only one or a few examples, so updates are far cheaper than a full pass over a large dataset.
  • Escaping Local Minima: The randomness can help the model jump out of poor local solutions.

Variants

  • Mini-batch SGD: Uses a small group of examples (most common).
  • Adam: An adaptive method that maintains a separate effective learning rate for each parameter, using running estimates of the gradient and its square.
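The mini-batch variant above differs from single-sample SGD only in that each update averages the gradient over a small batch. A minimal sketch, reusing the same hypothetical least-squares setup (batch size, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

# Hypothetical mini-batch SGD: shuffle the data each epoch, then
# update on batch-averaged gradients instead of single samples.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

w = np.zeros(3)
lr, batch_size = 0.1, 16
for epoch in range(50):
    perm = rng.permutation(len(X))       # reshuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # gradient of mean squared error over the batch
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)
        w -= lr * grad
```

Averaging over a batch reduces gradient noise relative to single-sample SGD while keeping each update much cheaper than a full-dataset pass, which is why mini-batch SGD is the most common choice in practice.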

Related Terms