Batch Normalization

Batch Normalization (Normalizing Layer Activations per Mini-Batch) · Also known as: BatchNorm, BN, batch norm, mini-batch normalization, internal covariate shift reduction

Batch Normalization is a training technique introduced by Sergey Ioffe and Christian Szegedy in 2015 that normalizes the pre-activation outputs of each layer using the mean and variance computed over the current mini-batch, then applies a learnable scale and shift. By stabilizing the distribution of inputs to each layer throughout training, it is intended to reduce internal covariate shift (the change in a layer's input distribution as the preceding layers are updated), which enables higher learning rates and makes deep networks train faster and more reliably.
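The transformation described above can be sketched in a few lines of NumPy. This is a minimal illustration of the training-time forward pass only, assuming a two-dimensional input of shape (batch size, features); the function name, the eps value, and the example data are assumptions made for this sketch, not part of the original paper's notation.

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature (column) of x over the mini-batch (rows).

    x:     array of shape (batch_size, num_features), pre-activations
    gamma: learnable scale, shape (num_features,)
    beta:  learnable shift, shape (num_features,)
    eps:   small constant for numerical stability (value is an assumption)
    """
    mean = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                      # per-feature (biased) mini-batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)  # zero mean, approximately unit variance
    return gamma * x_hat + beta              # learnable scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # hypothetical pre-activations
y = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gamma set to one and beta to zero, each column of y has a mean of approximately zero and a standard deviation of approximately one, regardless of the scale of the input.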

When to use it

Batch Normalization is applicable to virtually any feedforward or convolutional neural network trained with mini-batch stochastic gradient descent. It is most beneficial for deep architectures (typically more than four layers), where internal covariate shift is severe, and it is essentially standard practice in convolutional networks for image classification, object detection, and image generation. The technique assumes a sufficiently large mini-batch (typically 32 or more examples) so that the batch statistics are reasonable estimates of the population statistics. With very small batches the estimates become noisy, and Layer Normalization or Group Normalization may be preferable. Batch Normalization is also less effective in recurrent networks, where sequence lengths vary across examples; Layer Normalization is the preferred alternative there.
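The dependence on batch size can be illustrated with a small simulation. The sketch below assumes a single feature drawn from a standard normal population and measures how much the mini-batch mean fluctuates from batch to batch; the function name and the chosen batch sizes are hypothetical.

```python
import numpy as np

def batch_mean_spread(batch_size, num_batches=2000, seed=0):
    """Standard deviation of the mini-batch mean across many random batches,
    for one feature drawn from a standard normal population."""
    rng = np.random.default_rng(seed)
    batches = rng.normal(size=(num_batches, batch_size))
    return batches.mean(axis=1).std()

# The spread shrinks roughly as 1 / sqrt(batch_size).
for n in (2, 8, 32, 128):
    print(n, round(batch_mean_spread(n), 3))
```

Under these assumptions the estimate of the mean from a batch of 2 is about four times noisier than from a batch of 32, which is one way to see why small batches make the batch statistics unreliable.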

Strengths & limitations

Strengths

Substantially reduces internal covariate shift, enabling the use of larger learning rates and faster convergence.
Acts as a regulariser, often reducing or eliminating the need for dropout in convolutional networks.
Reduces sensitivity to weight initialisation, making networks easier to tune.
Enables training of very deep networks that would otherwise suffer from vanishing or exploding gradients.
Computationally inexpensive relative to the training speedup it provides.

Limitations

Requires a sufficiently large mini-batch; with very small batches (roughly fewer than 8 to 16 examples) the batch statistics are so noisy that their regularising effect becomes harmful rather than helpful.
Introduces a discrepancy between training behaviour (uses batch statistics) and inference behaviour (uses running estimates), which must be managed carefully when fine-tuning or transferring models.
Not directly suited to recurrent or autoregressive architectures where sequence lengths vary across examples in a batch.
Adds two learnable parameters per feature per normalised layer, and slightly increases memory usage to store running statistics.
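The parameter and memory overhead mentioned in the last limitation is simple to tally. The sketch below counts it for a hypothetical network with four normalised layers; the channel widths are made up for illustration.

```python
def batch_norm_overhead(num_features):
    """Count the values a batch-norm layer stores for `num_features` features."""
    learnable = 2 * num_features      # gamma and beta, trained by backpropagation
    running_stats = 2 * num_features  # running mean and variance, not trained by gradients
    return learnable, running_stats

channels = [64, 128, 256, 512]        # hypothetical layer widths
learnable = sum(batch_norm_overhead(c)[0] for c in channels)
buffers = sum(batch_norm_overhead(c)[1] for c in channels)
print(learnable, buffers)
```

For these widths the layers add 1920 learnable parameters and 1920 stored statistics, which is typically small next to the weights of the layers being normalised.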

Frequently asked

Should Batch Normalization be placed before or after the activation function?

The original Ioffe & Szegedy paper places it immediately before the activation function — that is, after the linear transformation but before applying ReLU or another non-linearity. In practice, both orderings are used and the difference is often small; however, placing it before the activation is the conventional default unless a specific architecture specifies otherwise.
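The two orderings can be written side by side. This is a minimal NumPy sketch assuming a single fully connected layer with ReLU; the shapes and random weights are hypothetical, and the learnable scale and shift are omitted for brevity.

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Normalization only; the learnable scale and shift are left out of this sketch.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 8))   # hypothetical mini-batch: 32 examples, 8 inputs
W = rng.normal(size=(8, 16))   # hypothetical weights of the linear transformation

z = x @ W                          # linear transformation (pre-activation)
bn_before = relu(batch_norm(z))    # ordering used in the original paper
bn_after = batch_norm(relu(z))     # alternative ordering

# A constant bias added before batch normalization is cancelled by the mean subtraction.
bias_is_redundant = np.allclose(batch_norm(z + 3.0), batch_norm(z))
```

One practical consequence of the first ordering is that the preceding linear layer needs no bias term of its own, since the shift parameter beta takes over that role.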

Why does Batch Normalization behave differently during training and inference?

During training, the mean and variance are computed from the current mini-batch, introducing a form of stochastic regularisation. At inference, using mini-batch statistics would make predictions depend on what other samples happen to be in the batch, which is undesirable. Instead, running estimates of the population mean and variance accumulated during training are used, making inference deterministic. In most frameworks this is controlled by setting the model to evaluation mode (e.g., model.eval() in PyTorch).
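The two modes can be made concrete with a small NumPy layer. This is a simplified sketch, not a reproduction of any framework's implementation: the class name, the momentum and eps defaults, and the use of an exponential moving average are assumptions, and details of the averaging scheme differ between frameworks.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch-norm layer with separate training and inference paths."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)
        self.beta = np.zeros(num_features)
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, x):
        if self.training:
            # Training: use the statistics of the current mini-batch ...
            mean, var = x.mean(axis=0), x.var(axis=0)
            # ... and fold them into the running estimates.
            self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * mean
            self.running_var = (1 - self.momentum) * self.running_var + self.momentum * var
        else:
            # Inference: use the accumulated running estimates only.
            mean, var = self.running_mean, self.running_var
        x_hat = (x - mean) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(num_features=3)
for _ in range(200):                  # stand-in for training: accumulate statistics
    bn(rng.normal(loc=2.0, scale=0.5, size=(64, 3)))

bn.training = False                   # plays the role of model.eval() in PyTorch
sample = np.full((1, 3), 2.0)
alone = bn(sample)
in_batch = bn(np.vstack([sample, rng.normal(size=(7, 3))]))[:1]
```

In inference mode the output for the sample is the same whether it is processed alone or alongside seven unrelated examples, which is the determinism the answer above refers to.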

What is the role of the learnable parameters gamma and beta?

After normalisation to zero mean and unit variance, the layer has lost the ability to represent any other mean or scale. Gamma and beta restore this capacity: gamma scales the normalised activation and beta shifts it. Learned by backpropagation, these parameters allow the network to represent the identity transformation when normalisation is unhelpful, preserving the network's full expressive power.
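The identity case can be checked numerically. The sketch below assumes a hypothetical batch of activations and shows that one particular choice of gamma and beta, the batch standard deviation and the batch mean, exactly undoes the normalisation.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(128, 5))  # hypothetical activations

mean, var = x.mean(axis=0), x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)            # normalised activations

# Setting gamma = sqrt(var + eps) and beta = mean recovers the original values.
gamma = np.sqrt(var + eps)
beta = mean
recovered = gamma * x_hat + beta

print(np.allclose(recovered, x))
```

In practice the network is free to learn any other values of gamma and beta; the identity is simply one point in the space the two parameters make reachable.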

When should I use Layer Normalization or Group Normalization instead?

Layer Normalization normalises across the feature dimension for each individual example, making it independent of batch size; it is the preferred choice for recurrent networks and Transformers. Group Normalization divides channels into groups and normalises within each group, performing well with small mini-batches where batch statistics are unreliable. Use Batch Normalization when mini-batch sizes are comfortably above 16 and the architecture is a standard feedforward or convolutional network.
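The difference between the three methods comes down to which axes the statistics are computed over. The sketch below assumes an image-style input of shape (batch, channels, height, width) and omits the learnable scale and shift; the helper names and the group count are hypothetical.

```python
import numpy as np

def normalize(x, axes, eps=1e-5):
    """Subtract the mean and divide by the standard deviation over `axes`."""
    mean = x.mean(axis=axes, keepdims=True)
    var = x.var(axis=axes, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def batch_norm(x):
    # One set of statistics per channel, shared across the batch.
    return normalize(x, axes=(0, 2, 3))

def layer_norm(x):
    # One set of statistics per example, over all of its features.
    return normalize(x, axes=(1, 2, 3))

def group_norm(x, num_groups):
    # One set of statistics per example and per group of channels.
    n, c, h, w = x.shape
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    return normalize(grouped, axes=(2, 3, 4)).reshape(n, c, h, w)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 5, 5))  # hypothetical (batch, channels, height, width)

# Compare the result for the first example when processed in a batch of 4
# against the result when it is processed on its own.
ln_same = np.allclose(layer_norm(x)[:1], layer_norm(x[:1]))
gn_same = np.allclose(group_norm(x, 4)[:1], group_norm(x[:1], 4))
bn_same = np.allclose(batch_norm(x)[:1], batch_norm(x[:1]))
```

Here ln_same and gn_same are true while bn_same is false: only batch normalisation (using batch statistics) produces an output that depends on the other examples in the batch.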

Sources

Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), PMLR 37, 448–456.
Goodfellow, I., Bengio, Y. & Courville, A. (2016). Deep Learning (Ch. 8). MIT Press. ISBN: 978-0-262-03561-3.
Ioffe, S. & Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv preprint arXiv:1502.03167.

How to cite this page

ScholarGate. (2026, June 3). Batch Normalization (Normalizing Layer Activations per Mini-Batch). ScholarGate. https://scholargate.app/en/deep-learning/batch-normalization
