Why use noisy gradients instead of the exact gradient?

Computing the exact gradient over millions of data points is expensive. A gradient estimated from a small random batch is far cheaper and, although noisy, still points downhill on average, so many cheap noisy steps can beat a few exact ones.

Why does the step size usually shrink over time?

Decreasing the step size damps the gradient noise as the iterates approach the optimum, which is what the Robbins-Monro conditions require for convergence. A step size that stays too large makes the estimate bounce around the solution.

Stochastic Optimization

Stochastic optimization minimizes an objective using noisy estimates of its gradient or value, updating parameters from random subsets of data or random perturbations rather than the full, exact objective.

PaperMind से विषय खोजेंजल्द हीFind papers & topics

Tools & resources

स्लाइड डाउनलोड करें

Learn & explore

वीडियोजल्द ही

Definition

Stochastic optimization is a family of iterative methods that update parameter estimates using random, unbiased estimates of an objective or its gradient, enabling optimization when the full objective is too expensive to evaluate or is only observed with noise.

Scope

This topic covers stochastic approximation in the Robbins-Monro tradition, stochastic gradient descent and its mini-batch and momentum variants, the step-size (learning-rate) schedules that control convergence, the trade-off between noise and computational cost, and convergence guarantees. Its role in fitting large-scale statistical and machine-learning models is emphasized.

Core questions

How can noisy gradient estimates drive convergence to an optimum?
What step-size schedules guarantee convergence in the Robbins-Monro framework?
How does mini-batching trade noise against computational cost per step?
Why is stochastic optimization essential for very large data sets?

Key concepts

Stochastic approximation
Mini-batch gradient
Learning-rate schedule
Unbiased gradient estimate
Step-size decay
Almost-sure convergence

Key theories

Stochastic approximation: The Robbins-Monro scheme finds the root of an unknown function from noisy measurements by taking small steps whose sizes decrease at a prescribed rate, converging almost surely under conditions on the step-size sequence.
Stochastic gradient methods: Replacing the full gradient by an unbiased estimate from a random data subset yields cheap updates whose averaged trajectory descends the objective, with learning-rate schedules balancing convergence speed against the variance of the noise.

Clinical relevance

Stochastic gradient methods make it possible to fit models to data sets too large to process at once, and they are the dominant optimization strategy for training neural networks and large-scale regression, where computing the full gradient every step would be prohibitive.

History

Robbins and Monro introduced stochastic approximation in 1951 to find roots from noisy observations, and Kiefer and Wolfowitz adapted it to optimization soon after; the explosion of large-scale machine learning revived these ideas as stochastic gradient descent and its many modern variants.

Key figures

Herbert Robbins
Sutton Monro
Harold Kushner
Jack Kiefer

Seminal works

robbins1951
kushner2003

Frequently asked questions

Why use noisy gradients instead of the exact gradient?: Computing the exact gradient over millions of data points is expensive. A gradient estimated from a small random batch is far cheaper and, although noisy, still points downhill on average, so many cheap noisy steps can beat a few exact ones.
Why does the step size usually shrink over time?: Decreasing the step size damps the gradient noise as the iterates approach the optimum, which is what the Robbins-Monro conditions require for convergence. A step size that stays too large makes the estimate bounce around the solution.