# What Is Batch Normalization?
Batch normalization is applied to individual layers, or optionally, to all of them: in each training iteration, we first normalize the inputs to the batch normalization layer by subtracting their mean and dividing by their standard deviation, where both are estimated from the statistics of the current minibatch.
Next, we apply a scale coefficient and an offset to recover the lost degrees of freedom.
Note that if we tried to apply batch normalization with minibatches of size 1, we would not be able to learn anything: after subtracting the means, each hidden unit would take the value 0. Consequently, the choice of batch size is even more significant with batch normalization than without it; batch normalization works best for moderate minibatch sizes in the 50–100 range.

Denote by $B$ a minibatch and let $x \in B$ be an input to batch normalization ($BN$). In this case batch normalization is defined as follows:
$$
BN(x) = \gamma \times \frac{x-\mu_{B}}{\sigma_{B}} + \beta
$$
Here $\mu_{B}$ is the sample mean and $\sigma_{B}$ is the sample standard deviation of the minibatch. The *scale parameter* $\gamma$ and the *shift parameter* $\beta$ have the same shape as $x$, are applied elementwise, and are parameters that need to be learned as part of model training. Note that batch normalization layers function differently in *training mode* than in *prediction mode*.

# Batch Normalization Layers

Batch normalization implementations for fully connected layers and convolutional layers are slightly different.

## Fully Connected Layers

Denoting the input to the fully connected layer by $x$, the affine transformation by $Wx + b$ (with the weight parameter $W$ and the bias parameter $b$), and the activation function by $\phi$, we can express the computation of a batch-normalization-enabled, fully connected layer output $h$ as follows:
$$
h = \phi(BN(Wx+b))
$$
Batch normalization is usually applied before the activation function, but it can also be applied after it. Moreover, there is no need to use batch normalization and dropout simultaneously.
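As a minimal sketch of the fully connected case above, assuming PyTorch, a ReLU activation for $\phi$, and illustrative shapes (the variable names `X`, `W`, `b`, `gamma`, `beta` and the small `eps` added for numerical stability are assumptions, not taken from the text):

```python
import torch

# Sketch: batch normalization applied to the pre-activation of a fully connected layer.
torch.manual_seed(0)

batch_size, in_features, out_features = 64, 20, 10
X = torch.randn(batch_size, in_features)          # minibatch B
W = torch.randn(in_features, out_features) * 0.1  # weight parameter
b = torch.zeros(out_features)                     # bias parameter
gamma = torch.ones(out_features)                  # learnable scale parameter
beta = torch.zeros(out_features)                  # learnable shift parameter
eps = 1e-5                                        # assumed constant for numerical stability

Z = X @ W + b                                     # affine transformation Wx + b
mu = Z.mean(dim=0)                                # per-feature sample mean over the minibatch
var = Z.var(dim=0, unbiased=False)                # per-feature sample variance over the minibatch
Z_hat = (Z - mu) / torch.sqrt(var + eps)          # normalize
H = torch.relu(gamma * Z_hat + beta)              # scale, shift, then activation phi = ReLU
print(H.shape)                                    # torch.Size([64, 10])
```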
## Convolutional Layers

Similarly, with convolutional layers, we can apply batch normalization after the convolution but before the nonlinear activation function. The key difference from batch normalization in fully connected layers is that we apply the operation on a per-channel basis, across all spatial locations. In other words, batch normalization is applied along the channel dimension; the channels play a role analogous to the features of a fully connected layer.
Assume that our minibatches contain $m$ examples and that for each channel, the output of the convolution has height $p$ and width $q$. For convolutional layers, we carry out each batch normalization over the $m \cdot p \cdot q$ elements per output channel simultaneously. Each channel has its own scale $\gamma$ and shift $\beta$ parameters.
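A rough sketch of this per-channel computation, again assuming PyTorch; the layout `(m, channels, p, q)` and all variable names are illustrative assumptions:

```python
import torch

# Sketch: per-channel batch normalization of a convolutional output.
torch.manual_seed(0)

m, channels, p, q = 32, 16, 8, 8
Y = torch.randn(m, channels, p, q)            # convolution output before activation
gamma = torch.ones(1, channels, 1, 1)         # one scale parameter per channel
beta = torch.zeros(1, channels, 1, 1)         # one shift parameter per channel
eps = 1e-5                                    # assumed constant for numerical stability

# Statistics are computed over the m * p * q elements of each output channel.
mu = Y.mean(dim=(0, 2, 3), keepdim=True)      # shape (1, channels, 1, 1)
var = Y.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
Y_hat = gamma * (Y - mu) / torch.sqrt(var + eps) + beta
print(Y_hat.shape)                            # torch.Size([32, 16, 8, 8])
```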
# Layer Normalization
Note that in the context of convolutions, batch normalization is well defined even for minibatches of size 1: after all, we have all the locations across an image to average over. Consequently, the mean and variance are well defined, even if computed from a single observation. This consideration led to the introduction of layer normalization. It works just like batch normalization, only that it is applied to one observation at a time.
For an $n$-dimensional vector $x$, layer normalization is given by
$$
x \rightarrow LN(x) = \frac{x - \mu}{\sigma}, \quad \textrm{where} \quad \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \textrm{and} \quad \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2,
$$
with the mean $\mu$ and standard deviation $\sigma$ computed over the coordinates of the single observation $x$.
Layer normalization does not depend on the minibatch size, and it is independent of whether we are in training or prediction mode.
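A minimal sketch of layer normalization under the same PyTorch assumption (shapes, names, and the `eps` term are illustrative):

```python
import torch

# Sketch: layer normalization normalizes each example over its own n features,
# so no minibatch statistics are involved.
torch.manual_seed(0)

n = 8
x = torch.randn(4, n)                          # 4 independent examples, n features each
eps = 1e-5                                     # assumed constant for numerical stability

mu = x.mean(dim=-1, keepdim=True)              # per-example mean over the n features
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_ln = (x - mu) / torch.sqrt(var + eps)        # same result whether the batch has 4 examples or 1

# Because the statistics come from a single observation, training and prediction
# behave identically; there is no running mean or variance to track.
print(x_ln.mean(dim=-1))                       # approximately zero per example
print(x_ln.std(dim=-1, unbiased=False))        # approximately one per example
```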
# Conclusion: Differences Between Batch Normalization and Layer Normalization
Batch normalization normalizes each feature across the different samples of a minibatch, while layer normalization normalizes the different features within a single sample.
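The contrast can be made concrete with a small sketch, assuming PyTorch and a toy `(samples, features)` matrix:

```python
import torch

# Sketch: for a (samples, features) matrix, batch normalization averages each
# feature over the batch dimension, while layer normalization averages each
# sample over the feature dimension.
X = torch.arange(12, dtype=torch.float32).reshape(3, 4)   # 3 samples, 4 features
eps = 1e-5                                                 # assumed stability constant

# Batch norm: one mean/std per feature (column), shared across samples.
bn = (X - X.mean(dim=0)) / torch.sqrt(X.var(dim=0, unbiased=False) + eps)

# Layer norm: one mean/std per sample (row), shared across features.
ln = (X - X.mean(dim=1, keepdim=True)) / torch.sqrt(
    X.var(dim=1, unbiased=False, keepdim=True) + eps)

print(bn)  # each column has approximately zero mean and unit variance
print(ln)  # each row has approximately zero mean and unit variance
```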
# What Can Batch Normalization Achieve?

Batch normalization can speed up convergence, which allows us to use a larger learning rate.