Stochastic Optimization
Stochastic optimization minimizes an objective using noisy estimates of its gradient or value, updating parameters from random subsets of data or random perturbations rather than the full, exact objective.
Definition
Stochastic optimization is a family of iterative methods that update parameter estimates using random, unbiased estimates of an objective or its gradient, enabling optimization when the full objective is too expensive to evaluate or is only observed with noise.
Scope
This topic covers stochastic approximation in the Robbins-Monro tradition, stochastic gradient descent and its mini-batch and momentum variants, the step-size (learning-rate) schedules that control convergence, the trade-off between noise and computational cost, and convergence guarantees. Its role in fitting large-scale statistical and machine-learning models is emphasized.
Core questions
- How can noisy gradient estimates drive convergence to an optimum?
- What step-size schedules guarantee convergence in the Robbins-Monro framework?
- How does mini-batching trade noise against computational cost per step?
- Why is stochastic optimization essential for very large data sets?
Key concepts
- Stochastic approximation
- Mini-batch gradient
- Learning-rate schedule
- Unbiased gradient estimate
- Step-size decay
- Almost-sure convergence
Key theories
- Stochastic approximation
- The Robbins-Monro scheme finds the root of an unknown function from noisy measurements by taking small steps whose sizes decrease at a prescribed rate, converging almost surely under conditions on the step-size sequence.
- Stochastic gradient methods
- Replacing the full gradient by an unbiased estimate from a random data subset yields cheap updates whose averaged trajectory descends the objective, with learning-rate schedules balancing convergence speed against the variance of the noise.
Clinical relevance
Stochastic gradient methods make it possible to fit models to data sets too large to process at once, and they are the dominant optimization strategy for training neural networks and large-scale regression, where computing the full gradient every step would be prohibitive.
History
Robbins and Monro introduced stochastic approximation in 1951 to find roots from noisy observations, and Kiefer and Wolfowitz adapted it to optimization soon after; the explosion of large-scale machine learning revived these ideas as stochastic gradient descent and its many modern variants.
Key figures
- Herbert Robbins
- Sutton Monro
- Harold Kushner
- Jack Kiefer
Related topics
Seminal works
- robbins1951
- kushner2003
Frequently asked questions
- Why use noisy gradients instead of the exact gradient?
- Computing the exact gradient over millions of data points is expensive. A gradient estimated from a small random batch is far cheaper and, although noisy, still points downhill on average, so many cheap noisy steps can beat a few exact ones.
- Why does the step size usually shrink over time?
- Decreasing the step size damps the gradient noise as the iterates approach the optimum, which is what the Robbins-Monro conditions require for convergence. A step size that stays too large makes the estimate bounce around the solution.