Stochastic Gradient Descent (SGD)

Stochastic Gradient Descent (SGD) Optimization Algorithm · Also known as: SGD, online gradient descent, incremental gradient descent, mini-batch gradient descent, stochastic approximation gradient method

Stochastic Gradient Descent (SGD) is a first-order iterative optimization algorithm, rooted in the stochastic approximation framework introduced by Robbins and Monro in 1951, that minimizes an objective function by updating model parameters using the gradient computed on a single randomly selected training example (or a small mini-batch) at each step. It is the core optimization engine behind modern machine learning and deep learning, enabling the training of models on datasets too large to fit in memory.
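The update described above can be written compactly as follows. The notation is an assumption introduced here for illustration and is not taken from the source: θ_t denotes the parameters after t updates, η_t the learning rate, ℓ(θ; z_i) the loss on the i-th training example, and B_t the randomly drawn mini-batch. Taking |B_t| = 1 gives the single-example variant.

```latex
F(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(\theta; z_i),
\qquad
\theta_{t+1} = \theta_t - \eta_t \, \frac{1}{|B_t|} \sum_{i \in B_t} \nabla_\theta \, \ell(\theta_t; z_i),
\qquad
B_t \subseteq \{1, \dots, n\} \text{ drawn at random}
```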

When to use it

SGD is the method of choice whenever the training set is large enough that computing the full gradient for every update is too costly, which covers virtually all modern neural network training and large-scale linear model fitting. It applies to any loss that is differentiable (or sub-differentiable) with respect to the parameters. The standard analysis further assumes that data points are sampled independently and that the stochastic gradient is an unbiased estimate of the full gradient. For strongly convex objectives, SGD with a suitably decaying step size achieves an O(1/T) convergence rate after T updates; for general non-convex objectives (e.g., deep networks) it tends to find useful stationary points in practice, though no global optimality guarantee exists. Mini-batch SGD (batch sizes of 32–512 are typical) is usually preferred over pure single-example SGD because it gives lower-variance gradient estimates and exploits parallelism on modern hardware. When the dataset is small enough for full-batch gradient descent and the objective is well-conditioned, quasi-Newton methods such as L-BFGS, which approximate second-order information, may converge faster.
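The claim that mini-batches give lower-variance gradient estimates while staying unbiased can be checked empirically. The following is a minimal sketch under stated assumptions: the one-parameter least-squares problem, the synthetic data, and all function names are hypothetical choices made for illustration, and batches are drawn with replacement so that the draws are independent.

```python
import random
import statistics

def per_example_grads(w, xs, ys):
    """Gradient of 0.5 * (w*x - y)**2 with respect to w, one value per example."""
    return [(w * x - y) * x for x, y in zip(xs, ys)]

def minibatch_grad(grads, batch_size, rng):
    """Average of `batch_size` per-example gradients drawn uniformly with replacement."""
    return statistics.fmean(rng.choice(grads) for _ in range(batch_size))

def estimator_stats(grads, batch_size, trials, rng):
    """Empirical mean and variance of the mini-batch gradient estimator."""
    samples = [minibatch_grad(grads, batch_size, rng) for _ in range(trials)]
    return statistics.fmean(samples), statistics.pvariance(samples)

rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(1000)]
ys = [3.0 * x + rng.gauss(0.0, 0.5) for x in xs]   # synthetic data, true slope 3
grads = per_example_grads(0.0, xs, ys)              # gradients evaluated at w = 0
full_grad = statistics.fmean(grads)                 # exact full-batch gradient

for b in (1, 32, 512):
    mean_b, var_b = estimator_stats(grads, b, trials=2000, rng=rng)
    print(f"batch={b:4d}  mean={mean_b:+.3f}  variance={var_b:.5f}  "
          f"(full gradient {full_grad:+.3f})")
```

In this setup the empirical mean should stay close to the full gradient for every batch size, while the variance should shrink roughly in proportion to 1/batch size.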

Strengths & limitations

Strengths

Scales to arbitrarily large datasets because each update requires only one example or a mini-batch, so memory and compute per step do not grow with the dataset size.
Built-in noise provides implicit regularization and helps escape shallow local minima and saddle points in non-convex landscapes.
Extremely general: applicable to any differentiable loss, including cross-entropy, mean squared error, hinge loss, and custom objectives.
Forms the foundation of nearly all modern optimizers (Adam, RMSProp, AdaGrad, Nesterov momentum) — understanding SGD is prerequisite to understanding them.
Straightforward to implement and parallelize across GPUs using mini-batches.

Limitations

Gradient estimates are noisy, which causes the loss to fluctuate rather than decrease monotonically, making convergence harder to diagnose.
Highly sensitive to the learning rate: too large causes divergence; too small makes training prohibitively slow.
Convergence to a global minimum is not guaranteed for non-convex objectives such as deep neural networks.
A single global learning rate scales every parameter dimension equally, making plain SGD poorly suited to loss surfaces whose curvature differs greatly along different axes, which motivates adaptive-rate variants.
Requires careful tuning of the learning rate schedule; poor schedules substantially degrade final performance.
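The learning-rate and curvature limitations listed above can be illustrated on a toy problem. This is a minimal sketch, not a training recipe: it uses exact gradients (no sampling noise) on an assumed two-dimensional quadratic with curvatures 1 and 100, so that the effect of the step size is isolated. The function name and all constants are hypothetical.

```python
def run_gd(lr, steps=100, curvatures=(1.0, 100.0), start=(1.0, 1.0)):
    """Gradient descent on f(w) = 0.5 * sum(c_i * w_i**2); the gradient is c_i * w_i."""
    w = list(start)
    for _ in range(steps):
        # the same learning rate is applied to every coordinate
        w = [wi - lr * c * wi for wi, c in zip(w, curvatures)]
    return w

# For this quadratic, coordinate i is stable only when lr < 2 / c_i,
# so the steep direction (c = 100) caps the usable rate at 0.02.
for lr in (0.0001, 0.019, 0.021):
    w1, w2 = run_gd(lr)
    print(f"lr={lr}: w1={w1:.3e}, w2={w2:.3e}")
```

With the rate just above 0.02 the steep coordinate blows up; just below it, the steep coordinate converges quickly while the flat coordinate is still far from zero after 100 steps; with a very small rate neither coordinate makes much progress.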

Frequently asked

What is the difference between SGD, mini-batch SGD, and full-batch gradient descent?

Full-batch gradient descent computes the exact gradient over all n training examples per update — accurate but O(n) cost per step. Pure SGD uses exactly one randomly chosen example per update — O(1) cost but very noisy. Mini-batch SGD uses a small random subset (typically 32–512 examples) — a practical compromise that reduces noise while keeping cost manageable and exploiting GPU parallelism. In modern usage, 'SGD' almost always means mini-batch SGD.
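The three regimes differ only in how many examples contribute to each update, which a single batch_size parameter can express. The sketch below assumes a one-parameter linear model and synthetic data. The function sgd_fit and its defaults are hypothetical and are not taken from any library.

```python
import random

def sgd_fit(xs, ys, batch_size, lr=0.05, epochs=20, seed=0):
    """Fit y = w * x by minimizing squared error with (mini-batch) SGD.

    batch_size = 1        -> pure SGD
    batch_size = len(xs)  -> full-batch gradient descent
    anything in between   -> mini-batch SGD
    """
    rng = random.Random(seed)
    n = len(xs)
    w = 0.0
    indices = list(range(n))
    for _ in range(epochs):
        rng.shuffle(indices)                       # new random order each epoch
        for start in range(0, n, batch_size):
            batch = indices[start:start + batch_size]
            # average gradient of 0.5 * (w*x - y)**2 over the batch
            grad = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.0) for _ in range(512)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]   # synthetic data, true slope 2

for b in (1, 32, len(xs)):
    print(f"batch_size={b:3d}  w={sgd_fit(xs, ys, b):.3f}")
```

With the same number of epochs, the full-batch setting performs only one update per epoch, so under these arbitrary settings it ends further from the true slope than the other two, which perform many cheaper updates per pass over the data.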

How should I set the learning rate?

There is no universal answer, but a common practice is to start with a moderate value (e.g., 0.01 or 0.1), monitor the training loss curve, and apply a decay schedule such as step decay or cosine annealing, optionally with warm restarts. Learning rate finders (sweeping rates over a few batches and observing the loss) can identify a good initial value. The Robbins–Monro theory requires step sizes whose sum diverges while the sum of their squares stays finite, which implies that the rate decays to zero; in practice the decay is often applied gently to avoid premature slowdown.
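The schedules named above can each be written as a function from the step counter to a learning rate. The following is a minimal sketch with hypothetical function names and default values. Deep learning frameworks ship their own schedulers, whose exact parameterization may differ from this one.

```python
import math

def step_decay(base_lr, step, drop_every, factor=0.1):
    """Multiply the rate by `factor` once every `drop_every` steps."""
    return base_lr * factor ** (step // drop_every)

def cosine_annealing(base_lr, step, total_steps, min_lr=0.0):
    """Decay smoothly from base_lr to min_lr along half a cosine period."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def cosine_warm_restarts(base_lr, step, cycle_len, min_lr=0.0):
    """Restart the cosine schedule from base_lr every `cycle_len` steps (fixed cycle length)."""
    return cosine_annealing(base_lr, step % cycle_len, cycle_len, min_lr)

for step in (0, 25, 50, 75, 100):
    print(step,
          round(step_decay(0.1, step, drop_every=30), 5),
          round(cosine_annealing(0.1, step, total_steps=100), 5),
          round(cosine_warm_restarts(0.1, step, cycle_len=50), 5))
```

Step decay changes the rate abruptly at fixed intervals, cosine annealing lowers it smoothly over the whole run, and the warm-restart variant periodically jumps back to the base rate.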

Why does SGD sometimes generalize better than Adam?

Adaptive-rate optimizers like Adam often converge faster but can generalize worse, particularly on image classification tasks. The higher noise in SGD, especially at small batch sizes, is thought to act as implicit regularization, helping the model find flatter minima that generalize better. Several empirical studies (e.g., Wilson et al., 2017) have documented this phenomenon, which is one reason SGD with momentum is still often preferred for training computer vision models such as ResNet.
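For readers unfamiliar with the momentum variant mentioned above, the following sketch shows one common (heavy-ball) formulation. It is an illustration under assumptions: the function name, the defaults, and the toy objective are hypothetical, and an exact gradient stands in for the mini-batch gradient that would be used in training.

```python
def sgd_momentum(grad_fn, w0, lr=0.1, momentum=0.9, steps=200):
    """Heavy-ball momentum: accumulate a velocity, then step along it."""
    w, velocity = w0, 0.0
    for _ in range(steps):
        # exponentially decaying sum of past gradients
        velocity = momentum * velocity + grad_fn(w)
        w = w - lr * velocity
    return w

# Toy objective f(w) = 0.5 * w**2, whose gradient is w and whose minimum is at 0.
print(sgd_momentum(lambda w: w, w0=1.0))
```

Unlike Adam, this update still applies one global learning rate to every parameter; momentum only smooths the direction of travel across steps.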

Does SGD guarantee finding the global minimum?

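Only for convex objectives: under the Robbins–Monro step-size conditions and standard assumptions such as bounded gradient variance, SGD converges to the global minimum value, and strict convexity additionally makes the minimizer unique. For non-convex objectives (virtually all deep networks), the available guarantees are weaker: under smoothness assumptions the expected gradient norm goes to zero, so the iterates approach a stationary point, which may be a local minimum or a saddle point rather than the global minimum. In practice, the solutions found are usually good local minima and are often good enough for state-of-the-art performance.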
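A small convex example makes the step-size conditions concrete. The sketch below is an illustration chosen here, not taken from the source: it minimizes the expected value of 0.5 * (w - z)**2 over random samples z, whose minimizer is the mean of z, using step sizes 1/t. These steps have a divergent sum and a convergent sum of squares, so they satisfy the Robbins–Monro conditions.

```python
import random
import statistics

def sgd_mean(samples):
    """SGD on 0.5 * (w - z)**2 with step size 1/t, using one sample z per step."""
    w = 0.0
    for t, z in enumerate(samples, start=1):
        grad = w - z            # gradient of 0.5 * (w - z)**2 at the current w
        w -= (1.0 / t) * grad   # decaying step size 1/t
    return w

rng = random.Random(0)
samples = [rng.gauss(5.0, 2.0) for _ in range(10_000)]   # synthetic data, true mean 5
print(sgd_mean(samples), statistics.fmean(samples))
```

With this particular step size each update reduces to w = (1 - 1/t) * w + z / t, so the iterate after t steps is exactly the running average of the samples seen so far, and it converges to the true mean as more samples arrive.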

Sources

Robbins, H. & Monro, S. (1951). A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3), 400–407. DOI: 10.1214/aoms/1177729586
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning (Ch. 8). MIT Press. ISBN: 978-0-262-03561-3

How to cite this page

ScholarGate. (2026, June 3). Stochastic Gradient Descent (SGD) Optimization Algorithm. ScholarGate. https://scholargate.app/en/machine-learning/stochastic-gradient-descent

Referenced by

Federated Learning Online Federated Learning Online Gaussian Process Online Linear Regression Policy Gradient Regularized Online Learning

Related reference concepts

Stochastic Optimization Backpropagation and Optimization Hyperparameter Optimization Deep Learning Policy Gradient Methods Bias-Variance and Overfitting
