Overview
In very deep networks, backpropagation multiplies the gradient by each layer's local derivative as it flows back through the layers. If those derivatives are small (as in the saturating regions of the Sigmoid or Tanh functions), the product can shrink toward zero, effectively stopping the early layers from learning. This is known as the vanishing gradient problem.
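To see why the shrinkage is so severe, note that the sigmoid's derivative peaks at 0.25, so every sigmoid layer can scale the backward gradient by at most a factor of 0.25. A minimal NumPy sketch (the depth of 50 is an illustrative choice, not from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The sigmoid derivative is sigma(x) * (1 - sigma(x)), which is
# largest at x = 0 where it equals exactly 0.25.
x = 0.0
local_grad = sigmoid(x) * (1.0 - sigmoid(x))  # 0.25

# Chaining 50 such layers multiplies the upstream gradient by at
# most 0.25**50 -- effectively zero for the earliest layers.
depth = 50
chained = local_grad ** depth

print(local_grad)  # 0.25
print(chained)     # ~7.9e-31
```

Even in this best case, the early layers receive gradients around 1e-30 of their original magnitude, which is why Sigmoid/Tanh stacks deeper than a handful of layers historically failed to train.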
Solutions
- ReLU Activation: Has a constant derivative of 1 for positive inputs, so it doesn't saturate there.
- Batch Normalization: Normalizes layer inputs, keeping activations in a range where gradients remain usable.
- Residual Connections (ResNets): Add identity shortcuts so gradients can flow past layers unchanged.
- LSTMs/GRUs: Use gating mechanisms to preserve gradient flow across many time steps in recurrent networks.
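The residual idea in particular can be sketched in a few lines. The block below is an illustrative NumPy toy (the names `residual_block` and the tiny-weight setup are assumptions for the demo, not from the text): computing y = x + f(x) means the identity path carries the signal, and in backprop the gradient, around the transformation f.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def residual_block(x, W):
    # y = x + f(x): the skip connection adds the input directly to
    # the block's output, so d y / d x = I + J_f. Even if f's
    # Jacobian J_f vanishes, gradients still pass through the
    # identity term.
    return x + relu(W @ x)

rng = np.random.default_rng(0)
x = rng.normal(size=4)
W = rng.normal(size=(4, 4)) * 0.01  # tiny weights make f(x) nearly zero

y = residual_block(x, W)
# With f contributing almost nothing, the output stays close to the
# input -- the skip path alone preserves the signal.
print(y - x)
```

This is why stacking many such blocks trains stably: the worst case for gradient flow is the identity, not zero.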