Overview
Stochastic Gradient Descent (SGD) is the workhorse of neural network training. Whereas standard (batch) Gradient Descent computes the gradient over the entire dataset for each parameter update, SGD estimates it from a single randomly chosen example or a small 'mini-batch.'
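As a minimal sketch of the idea, the loop below runs single-example SGD on a toy linear-regression problem with squared loss. The function name `sgd_step`, the learning rate, and the synthetic data are all illustrative choices, not part of any particular library.

```python
import numpy as np

def sgd_step(w, x, y, lr=0.01):
    """One SGD update on a single example for a linear model
    with squared loss: L = 0.5 * (w @ x - y)**2."""
    grad = (w @ x - y) * x  # gradient of the loss w.r.t. w
    return w - lr * grad

rng = np.random.default_rng(0)
w_true = np.array([2.0, -3.0])   # target weights to recover
w = np.zeros(2)
for _ in range(2000):
    x = rng.normal(size=2)       # draw one random example
    y = w_true @ x               # noiseless label
    w = sgd_step(w, x, y, lr=0.05)
# w is now close to w_true
```

Each iteration touches one example, so the cost per update is constant regardless of dataset size; that is the trade that makes SGD practical at scale.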
Why use SGD?
- Efficiency: each update touches only a handful of examples, so updates are far cheaper on large datasets.
- Escaping local minima: the noise introduced by random sampling can help the model jump out of poor local minima and saddle points.
Variants
- Mini-batch SGD: averages the gradient over a small group of examples per update (the most common variant in practice).
- Adam: an adaptive method that maintains a per-parameter learning rate using running estimates of the gradient's first and second moments.
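The two variants above can be sketched side by side. `minibatch_grad` averages the squared-loss gradient over a batch for a linear model, and `adam_step` implements the standard Adam update rule (Kingma & Ba, 2015); the function names and hyperparameter defaults are illustrative.

```python
import numpy as np

def minibatch_grad(w, X, y):
    """Mean squared-loss gradient over a mini-batch for a linear model."""
    residual = X @ w - y             # shape (batch,)
    return X.T @ residual / len(y)   # average over the batch

def adam_step(w, grad, state, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: per-parameter step sizes derived from
    exponential moving averages of the gradient and its square."""
    m, v, t = state
    t += 1
    m = b1 * m + (1 - b1) * grad       # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2    # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)            # bias correction for zero init
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, (m, v, t)
```

Note how Adam's denominator shrinks the step for parameters with consistently large gradients and enlarges it for rarely-updated ones, which is what "adjusts the learning rate for each parameter" means concretely.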