SGD with Momentum / Adam Optimizer

Stochastic Gradient Descent with Momentum and Adaptive Moment Estimation (Adam) · Also known as: Adam, Adam optimizer, SGD with momentum, momentum SGD, adaptive gradient optimizer, first-order stochastic optimizer

Stochastic Gradient Descent (SGD) with momentum and its adaptive descendant Adam are the foundational parameter-update algorithms used to train virtually every modern deep learning model. Momentum SGD was formalised by Polyak (1964) and brought into neural network training by Rumelhart, Hinton, and Williams (1986). Adam, introduced by Kingma and Ba at ICLR 2015, extended the momentum idea by also maintaining a running average of squared gradients, producing per-parameter adaptive learning rates that make it the default optimizer in contemporary deep learning practice.
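The two update rules described above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names `momentum_step` and `adam_step`, the list-of-floats parameter representation, and the toy objective f(x, y) = x^2 + 10y^2 are all assumptions made for the example. The Adam step follows the form given by Kingma and Ba (2015); the momentum step uses one common convention in which the velocity accumulates raw gradients.

```python
import math

def momentum_step(theta, grad, velocity, lr=0.01, mu=0.9):
    """One SGD-with-momentum update on a list of floats (illustrative helper)."""
    new_velocity = [mu * v + g for v, g in zip(velocity, grad)]
    new_theta = [p - lr * v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count (illustrative helper)."""
    # Running averages of the gradient and of the squared gradient.
    m = [beta1 * mi + (1 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * g * g for vi, g in zip(v, grad)]
    # Bias correction for the zero initialisation of m and v.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [p - lr * mh / (math.sqrt(vh) + eps)
             for p, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

def grad_f(theta):
    """Gradient of the toy objective f(x, y) = x^2 + 10 y^2."""
    x, y = theta
    return [2 * x, 20 * y]

# A single Adam step from (1, 1) with the default learning rate.
first_theta, _, _ = adam_step([1.0, 1.0], grad_f([1.0, 1.0]),
                              [0.0, 0.0], [0.0, 0.0], t=1)

# Run both optimizers for a few hundred steps on the toy objective.
theta_m, vel = [1.0, 1.0], [0.0, 0.0]
theta_a, m, v = [1.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 501):
    theta_m, vel = momentum_step(theta_m, grad_f(theta_m), vel)
    theta_a, m, v = adam_step(theta_a, grad_f(theta_a), m, v, t, lr=0.05)
```

On the first step Adam moves every parameter by almost exactly α, whatever the gradient scale, because the bias-corrected ratio of the two moment estimates reduces to g/|g|. That is why `first_theta` is close to (0.999, 0.999) even though the two gradient components differ by a factor of ten.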

When to use it

Adam is a reasonable default optimizer for training feedforward networks, convolutional networks, recurrent networks, transformers, and other differentiable parametric models, especially when the gradients are noisy or sparse. It suits settings with large datasets and mini-batch training, high-dimensional parameter spaces, and heterogeneous gradient magnitudes across layers. Momentum SGD is preferred in some computer vision benchmarks, particularly when training from scratch to convergence with a carefully tuned learning rate schedule, because it has been found to generalise slightly better than Adam in certain regimes. Both methods are first-order: they use only gradient (not Hessian) information, so the main requirement is that the objective function is differentiable with respect to the parameters. Unlike second-order methods, neither needs any curvature information.

Strengths & limitations

Strengths

Per-parameter adaptive learning rates remove the need to hand-tune separate rates for each layer or parameter group.
Computationally efficient: memory and per-step compute cost are O(n), linear in the number of parameters n.
Bias correction keeps the moment estimates correctly scaled from the very first iteration; uncorrected exponential moving averages are initialised at zero and stay biased toward zero early in training.
Robust to a wide range of hyperparameter choices; the defaults (α=0.001, β1=0.9, β2=0.999, ε=1e-8) work well across many architectures.
Handles sparse gradients effectively, making it suitable for embedding layers and NLP models where many weights receive zero gradient on any given mini-batch.
Momentum SGD with a cosine annealing or one-cycle schedule often achieves state-of-the-art generalisation in image classification, providing a strong alternative to Adam when compute budget permits thorough tuning.
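The bias-correction point in the list above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `ema_estimates` and the constant gradient stream of 1.0 are assumptions chosen so the true mean is known.

```python
def ema_estimates(grads, beta=0.9):
    """Return (raw, corrected) moving-average estimates after each step."""
    m = 0.0
    raw, corrected = [], []
    for t, g in enumerate(grads, start=1):
        m = beta * m + (1 - beta) * g          # zero-initialised running average
        raw.append(m)
        corrected.append(m / (1 - beta ** t))  # Adam-style bias correction
    return raw, corrected

# Hypothetical gradient stream whose true mean is exactly 1.0.
raw, corrected = ema_estimates([1.0] * 5)
```

With β = 0.9 the raw estimate after one step is 0.1, a tenfold underestimate of the true mean, while the corrected estimate equals 1.0 at every step.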

Limitations

Adam is not guaranteed to converge: Reddi et al. (2018) constructed simple convex problems on which the original algorithm fails to converge to the optimum, and proposed AMSGrad as a fix.
The adaptive learning rates can cause Adam to generalise slightly worse than well-tuned SGD with momentum on some image classification benchmarks, a phenomenon studied by Wilson et al. (2017).
Adam adds memory overhead: two vectors (m and v), each the same size as the parameter vector, must be stored. That is twice the optimizer state of momentum SGD, which keeps a single velocity vector; plain SGD keeps none.
Sensitivity to the global learning rate α remains: an improperly chosen α can cause training instability or extremely slow convergence despite the per-parameter adaptation.
Weight decay in Adam is not equivalent to L2 regularisation because the adaptive scaling interacts with the decay term; AdamW (Loshchilov & Hutter, 2019) corrects this by decoupling weight decay from the gradient update.
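The memory point in the list above comes down to counting per-parameter state vectors: none for plain SGD, one velocity vector for momentum SGD, and two (m and v) for Adam. A rough sketch of that arithmetic follows; the helper name `optimizer_state_bytes`, the one-million-parameter model, and the 4-byte (float32) storage are assumptions for illustration, and real frameworks may keep additional state.

```python
# Number of parameter-sized state vectors each optimizer keeps.
STATE_SLOTS = {"sgd": 0, "momentum": 1, "adam": 2}

def optimizer_state_bytes(num_params, optimizer, bytes_per_value=4):
    """Approximate optimizer-state memory, excluding the parameters themselves."""
    return STATE_SLOTS[optimizer] * num_params * bytes_per_value

# Hypothetical model with one million float32 parameters.
num_params = 1_000_000
momentum_bytes = optimizer_state_bytes(num_params, "momentum")
adam_bytes = optimizer_state_bytes(num_params, "adam")
```

Under these assumptions Adam holds 8 MB of state against 4 MB for momentum SGD, on top of the 4 MB taken by the parameters.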

Frequently asked

Should I use Adam or SGD with momentum for training a new deep learning model?

Adam is usually the safer default because it needs less learning rate tuning and trains well across most architectures. SGD with momentum can achieve slightly better generalisation on some image classification benchmarks when the learning rate schedule is carefully tuned (e.g. one-cycle or cosine annealing), but it takes more effort to configure. For NLP and transformer models, AdamW is the standard.

What is the difference between Adam and AdamW?

Standard Adam applies L2 weight decay by adding λθ to the gradient before the adaptive scaling. The decay term is then divided by the same per-parameter factor as the gradient, so the effective regularisation strength varies from parameter to parameter: weights with a history of large gradients are decayed less. AdamW decouples weight decay from the gradient update by shrinking each parameter directly (subtracting αλθ in common implementations) instead of routing the decay through the adaptive step. This separation has been shown to improve regularisation and is the recommended variant for fine-tuning large pretrained models.
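The difference is easiest to see on a single scalar weight. The sketch below is an illustration under stated assumptions, not any library's implementation: `adam_l2_step` and `adamw_step` are hypothetical helper names, the decay coefficient `wd` plays the role of λ, and the AdamW decay term is written as lr * wd * theta, which is one common convention.

```python
import math

def adam_l2_step(theta, grad, m, v, t, lr=0.001, wd=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularisation: the decay term is folded into the gradient."""
    g = grad + wd * theta                      # decay passes through the adaptive scaling
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, grad, m, v, t, lr=0.001, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW-style step: the decay acts on the weight, outside the adaptive scaling."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta - adaptive - lr * wd * theta, m, v

# A weight of 1.0 that receives zero loss gradient on the first step.
theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1)
```

With zero loss gradient, the L2 variant still moves the weight by nearly a full step of size lr (to about 0.999), because the adaptive scaling normalises away the magnitude of λ. The decoupled variant shrinks the weight by the factor (1 - lr * wd), giving 0.99999.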

Why does Adam sometimes generalise worse than SGD?

Wilson et al. (2017) showed empirically that adaptive gradient methods can find solutions that generalise worse than those found by SGD with momentum, even when training performance is comparable. One proposed explanation is that the per-parameter scaling lets Adam fit certain directions of the loss surface more aggressively, ending in sharper minima than the flatter ones SGD tends to reach. This effect is most pronounced in image classification and is less evident in NLP tasks where sparse gradients make adaptivity more beneficial.

How should I set the learning rate for Adam?

The default of α=0.001 works well for training from scratch in many settings, but it is always worth verifying with a brief learning rate range test. For fine-tuning pretrained transformers, values in the range 1e-5 to 5e-5 are typical. Combining Adam with a warmup schedule (linearly increasing α for the first 5–10% of training steps) followed by cosine decay is a robust strategy for both computer vision and NLP tasks.
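The warmup-plus-cosine strategy described above can be written as a function of the step index. This is a minimal sketch under assumptions: the name `warmup_cosine_lr`, the 10% warmup fraction, and the decay floor `min_lr` are illustrative choices, and deep learning frameworks ship their own scheduler classes with different interfaces.

```python
import math

def warmup_cosine_lr(step, total_steps, peak_lr=1e-3, warmup_frac=0.1, min_lr=0.0):
    """Learning rate at a 0-based step: linear warmup, then cosine decay."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Schedule for a hypothetical 1000-step run.
schedule = [warmup_cosine_lr(s, 1000) for s in range(1000)]
```

For this 1000-step run the rate climbs linearly to 1e-3 over the first 100 steps, falls to half the peak midway through the decay phase (step 550), and approaches `min_lr` at the end.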

Sources

Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR 2015). arXiv:1412.6980.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536. DOI: 10.1038/323533a0
Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17. DOI: 10.1016/0041-5553(64)90137-5
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Ch. 8: Optimization for Training Deep Models). MIT Press. ISBN: 978-0-262-03561-3

How to cite this page

ScholarGate. (2026, June 3). Stochastic Gradient Descent with Momentum and Adaptive Moment Estimation (Adam). ScholarGate. https://scholargate.app/en/deep-learning/stochastic-gradient-descent-with-momentum-adam-optimizer
