Overview
Stochastic Gradient Descent (SGD) is the workhorse of neural network training. Whereas standard (batch) Gradient Descent computes the gradient over the entire dataset for each parameter update, SGD estimates it from a single randomly chosen example or a small 'mini-batch.'
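As a minimal sketch of the idea, the loop below runs single-example SGD on a toy linear-regression problem with squared loss. The function name `sgd_step`, the learning rate, and the synthetic data are all illustrative choices, not part of any particular library.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01):
    """One SGD update on a single example for a linear model
    with squared loss: L = 0.5 * (w @ x - y)**2."""
    grad = (w @ x - y) * x  # gradient of the loss w.r.t. w
    return w - lr * grad

rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])   # target weights to recover
w = np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)       # draw one random example
    y = w_true @ x               # noiseless label
    w = sgd_step(w, x, y, lr=0.05)
# w is now close to w_true
```

Each iteration touches one example, so the cost per update is constant regardless of dataset size; that is the trade that makes SGD practical at scale.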
Why use SGD?
- Efficiency: each update touches only a handful of examples, so updates are far cheaper on large datasets.
- Escaping local minima: the noise introduced by random sampling can help the model jump out of poor local minima and saddle points.
Variants
- Mini-batch SGD: averages the gradient over a small group of examples per update (the most common variant in practice).
- Adam: an adaptive method that maintains a per-parameter learning rate using running estimates of the gradient's first and second moments.
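The two variants above can be sketched side by side. `minibatch_grad` averages the squared-loss gradient over a batch for a linear model, and `adam_step` implements the standard Adam update rule (Kingma & Ba, 2015); the function names and hyperparameter defaults are illustrative.

```python
import numpy as np

def minibatch_grad(w, X, y):
    """Mean squared-loss gradient over a mini-batch for a linear model."""
    residual = X @ w - y             # shape (batch,)
    return X.T @ residual / len(y)   # average over the batch

def adam_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes derived from
    exponential moving averages of the gradient and its square."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad       # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)            # bias correction for zero init
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)
```

Note how Adam's denominator shrinks the step for parameters with consistently large gradients and enlarges it for rarely-updated ones, which is what "adjusts the learning rate for each parameter" means concretely.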