Stochastic Optimization — SGD and Variants

Stochastic Optimization (SGD and Variants) · Also known as: Stokastik Optimizasyon (SGD & Varyantları), stochastic gradient descent, SGD, Adam, RMSProp, AdaGrad

Stochastic optimization is a family of iterative methods that minimize an objective function by computing gradients on randomly sampled subsets of data — mini-batches — rather than on the entire dataset at once. Pioneered by Robbins and Monro in 1951 as stochastic approximation, the approach became the standard engine for training large-scale machine-learning models through variants such as SGD with momentum, AdaGrad, RMSProp, and Adam.
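The mini-batch update described above can be written compactly. The notation here is an assumption introduced for illustration and does not come from the page: \theta_t are the parameters at step t, \eta_t is the learning rate, B_t is the sampled mini-batch, and \ell is the per-example loss.

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta_t \, g_t,
\qquad
g_t \;=\; \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \, \ell(\theta_t;\, x_i, y_i)

% If B_t is sampled uniformly at random, g_t is an unbiased estimate
% of the full-data gradient:
\mathbb{E}\left[ g_t \right] \;=\; \nabla_\theta \, \frac{1}{N} \sum_{i=1}^{N} \ell(\theta_t;\, x_i, y_i)

% Classical Robbins-Monro step-size conditions:
\sum_{t=1}^{\infty} \eta_t = \infty,
\qquad
\sum_{t=1}^{\infty} \eta_t^{2} < \infty
```

The step-size conditions say, roughly, that the steps must stay large enough to reach the optimum from any starting point yet shrink fast enough for the gradient noise to average out.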

When to use it

Stochastic optimization applies whenever you need to minimise a differentiable (or sub-differentiable) loss function over a large dataset where computing exact gradients on the full dataset at every step is computationally prohibitive. It is the standard approach for training neural networks, linear models with large data, and any parametric model fitted by gradient descent. The objective must admit a gradient or subgradient; the method does not apply to non-differentiable black-box problems. Learning rate selection is critical — too large a rate causes divergence, too small a rate slows convergence.
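As a minimal sketch of the idea, the following fits a one-variable linear model by mini-batch SGD using only the Python standard library. The function name, the synthetic data, and the hyperparameter values are illustrative assumptions, not part of the method description above.

```python
import random


def minibatch_sgd(xs, ys, lr=0.05, batch_size=8, epochs=200, seed=0):
    """Fit y ~ w * x + b by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w = 0.0
            grad_b = 0.0
            for i in batch:
                err = (w * xs[i] + b) - ys[i]
                grad_w += 2.0 * err * xs[i]
                grad_b += 2.0 * err
            # Step along the negative of the mini-batch-averaged gradient.
            w -= lr * grad_w / len(batch)
            b -= lr * grad_b / len(batch)
    return w, b


# Hypothetical noise-free data generated from y = 3x + 1.
xs = [i / 10.0 for i in range(-20, 21)]
ys = [3.0 * x + 1.0 for x in xs]
w, b = minibatch_sgd(xs, ys)
```

Each update touches at most `batch_size` examples, so the cost per step does not grow with the size of the dataset.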

Strengths & limitations

Strengths

Scales to arbitrarily large datasets because each update uses only a mini-batch, not the full data.
Gradient noise from mini-batch sampling can help escape shallow local minima and saddle points.
Adaptive variants (Adam, RMSProp) reduce the burden of manual learning-rate tuning by adjusting per-parameter step sizes automatically.
Supported by all major deep-learning frameworks, with well-established convergence theory for convex objectives.

Limitations

Learning rate is a critical hyperparameter; poor choices lead to divergence or extremely slow convergence.
Mini-batch gradient noise can prevent convergence to an exact minimum — the optimiser typically oscillates near the optimum rather than settling precisely.
Saddle points and poor local minima are genuine risks in non-convex landscapes, though gradient noise and momentum help mitigate them.
Adam and other adaptive methods can converge to sharper minima that generalise worse than the flatter minima found by plain SGD with careful tuning.
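The oscillation near the optimum can be seen on a toy problem. The sketch below minimises f(x) = 0.5 * x^2 with artificially noisy gradients and compares a constant learning rate with a decaying one. The objective, the noise model, and both schedules are assumptions chosen purely for illustration.

```python
import random


def noisy_sgd_tail_error(lr_schedule, steps=5000, tail=1000,
                         noise_std=1.0, seed=0):
    """Run SGD on f(x) = 0.5 * x**2 with Gaussian gradient noise.

    Returns the mean squared distance from the optimum (x = 0)
    over the last `tail` steps.
    """
    rng = random.Random(seed)
    x = 5.0
    tail_sq = 0.0
    for t in range(steps):
        noisy_grad = x + rng.gauss(0.0, noise_std)  # true gradient is x
        x -= lr_schedule(t) * noisy_grad
        if t >= steps - tail:
            tail_sq += x * x
    return tail_sq / tail


# Constant step size: the iterate keeps jittering around the optimum.
constant_error = noisy_sgd_tail_error(lambda t: 0.1)
# Decaying step size: the jitter shrinks as the steps get smaller.
decayed_error = noisy_sgd_tail_error(lambda t: 0.1 / (1.0 + 0.01 * t))
```

Under these assumptions the decayed schedule ends much closer to the optimum, which is one reason learning-rate schedules are commonly paired with SGD.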

Frequently asked

Which optimiser should I choose — SGD, Adam, or something else?

Adam is a reasonable default for most deep learning tasks: it is relatively insensitive to the learning rate and converges quickly. SGD with momentum and a tuned learning-rate schedule often achieves slightly better final generalisation on image classification benchmarks, but requires more careful tuning. AdaGrad suits sparse-gradient problems (e.g., NLP with bag-of-words features), while RMSProp is a common choice for training recurrent networks.
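To make the differences between these optimisers concrete, here is a hedged sketch of one update step for each, written in plain Python on lists of floats. The function names, the `state` dictionary convention, and the default hyperparameters are assumptions for illustration; framework implementations differ in details such as weight decay and where epsilon is placed.

```python
import math


def sgd_momentum_step(theta, grad, state, lr=0.1, beta=0.9):
    """SGD with (heavy-ball) momentum: step along a running sum of gradients."""
    v = state.setdefault("v", [0.0] * len(theta))
    for i, g in enumerate(grad):
        v[i] = beta * v[i] + g
        theta[i] -= lr * v[i]


def adagrad_step(theta, grad, state, lr=0.1, eps=1e-8):
    """AdaGrad: divide by the root of the accumulated squared gradients."""
    s = state.setdefault("s", [0.0] * len(theta))
    for i, g in enumerate(grad):
        s[i] += g * g
        theta[i] -= lr * g / (math.sqrt(s[i]) + eps)


def rmsprop_step(theta, grad, state, lr=0.01, rho=0.9, eps=1e-8):
    """RMSProp: like AdaGrad, but with an exponential moving average."""
    s = state.setdefault("s", [0.0] * len(theta))
    for i, g in enumerate(grad):
        s[i] = rho * s[i] + (1.0 - rho) * g * g
        theta[i] -= lr * g / (math.sqrt(s[i]) + eps)


def adam_step(theta, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: bias-corrected moving averages of the gradient and its square."""
    m = state.setdefault("m", [0.0] * len(theta))
    v = state.setdefault("v", [0.0] * len(theta))
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    for i, g in enumerate(grad):
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / (1.0 - beta1 ** t)
        v_hat = v[i] / (1.0 - beta2 ** t)
        theta[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


# Two parameters whose gradients differ by five orders of magnitude.
adam_theta = [1.0, 1.0]
adam_step(adam_theta, [100.0, 0.001], {})
adam_moves = [1.0 - p for p in adam_theta]

# Two momentum steps with a constant gradient of 1.0.
mom_theta = [0.0]
mom_state = {}
sgd_momentum_step(mom_theta, [1.0], mom_state)
sgd_momentum_step(mom_theta, [1.0], mom_state)
```

In this sketch the first Adam step moves both parameters by roughly the learning rate even though their gradients differ enormously in scale, which is the per-parameter normalisation that makes adaptive methods less sensitive to gradient magnitude.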

How do I choose the learning rate?

The learning rate is usually the most important hyperparameter. Common starting points are 1e-3 for Adam and 1e-1 for SGD; refine from there with a learning rate finder or a grid search. A warm-up phase followed by cosine or step decay often improves results over a fixed rate.
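One possible form of a warm-up plus cosine-decay schedule is sketched below. The function name, the linear warm-up shape, and the default values are assumptions for illustration; frameworks ship their own scheduler classes with differing conventions.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr=1e-3,
                     warmup_steps=100, min_lr=0.0):
    """Linear warm-up to `peak_lr`, then cosine decay down to `min_lr`."""
    if step < warmup_steps:
        # Ramp up linearly; the last warm-up step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Learning rate at a few points of a hypothetical 1000-step run.
schedule = [warmup_cosine_lr(s, total_steps=1000) for s in (0, 99, 550, 1000)]
```

The warm-up keeps early updates small while the moment estimates and activations are still unreliable, and the cosine phase shrinks the step smoothly towards the end of training.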

What mini-batch size should I use?

Batch sizes between 32 and 256 are common. Smaller batches introduce more gradient noise (which can aid generalisation) and allow more frequent updates; larger batches reduce noise but may generalise less well and require more careful learning-rate scaling. When you multiply the batch size by a factor k, multiply the learning rate by k as a starting point (the linear scaling rule), then re-tune.
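The linear scaling rule is simple arithmetic; a small helper makes it explicit. The function name and the example numbers are hypothetical, and the result is only a starting point for further tuning.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: keep the ratio lr / batch_size constant."""
    return base_lr * new_batch_size / base_batch_size


# A rate tuned for batch size 256, transferred to batch size 512.
scaled_lr = scale_learning_rate(0.1, base_batch_size=256, new_batch_size=512)
```

Doubling the batch size from 256 to 512 therefore doubles the suggested learning rate from 0.1 to 0.2.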

How do I know when to stop training?

Monitor validation loss and stop when it ceases to improve for a set number of consecutive epochs — a technique called early stopping. This is more reliable than stopping at a fixed number of epochs and guards against overfitting.
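Early stopping needs only a few lines of bookkeeping. The class below is a minimal sketch with hypothetical names (`EarlyStopping`, `patience`, `min_delta`), and the validation losses are made-up numbers used to show the control flow.

```python
class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Hypothetical validation losses, one per epoch.
val_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.73, 0.5]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, loss in enumerate(val_losses):
    if stopper.update(epoch, loss):
        stopped_at = epoch
        break
```

In practice the model weights from `best_epoch` are usually saved and restored, because the final epochs before the stop are by construction no better than the best one. The example also shows the trade-off of a small patience: training stops before the later improvement to 0.5 is ever seen.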

Sources

Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method. Annals of Mathematical Statistics, 22(3), 400-407. DOI: 10.1214/aoms/1177729586
Kingma, D.P. & Ba, J. (2015). Adam: A Method for Stochastic Optimization. International Conference on Learning Representations (ICLR 2015).

How to cite this page

ScholarGate. (2026, June 1). Stochastic Optimization (SGD and Variants). ScholarGate. https://scholargate.app/en/optimization/stochastic-optimization
