
Layer Normalization, and how to compute its Jacobian for Backpropagation?

Step-by-step implementation in Python

5 min read · Dec 19, 2021


In this regularization technique, we normalize the outputs of a layer. How do we do that? Let us take a look.

You can download the Jupyter Notebook from here.

Note — This post uses many concepts from the previous chapters, so it is recommended that you have a look at the previous posts.

Back to the previous post

Back to the first post

5.5.1 Layer Normalization, Part I

Suppose we have x with entries x_1, x_2, x_3.

Then, 'y' is

y_i = (x_i - mean) / (std + ε)

where,

mean = (x_1 + x_2 + ... + x_N) / N

and,

std = sqrt( ((x_1 - mean)² + (x_2 - mean)² + ... + (x_N - mean)²) / N )

Here, N = 3 because we have 3 entries in x. ε is a tiny constant (10^-100 in the code below) that prevents division by zero.
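For instance, with an illustrative vector x = (1, 2, 3): mean = 2, std = sqrt(2/3) ≈ 0.816, and y ≈ (-1.225, 0, 1.225).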

We can implement the normalization function in Python like this

import numpy as np                             # importing NumPy
np.random.seed(42)

def normalize(x):                              # Normalization
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)

    return (x - mean) / (std + 10**-100)
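For instance, applying it to the illustrative 3-entry vector from above gives an output with zero mean and unit standard deviation:

x_demo = np.array([[1.0], [2.0], [3.0]])       # illustrative 3-entry input
normalize(x_demo)                              # approximately [[-1.2247], [0.], [1.2247]]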

Now, the most important question is: how do we compute its derivative/Jacobian?

Similar to the Softmax function, every output y_i depends on every input x_j, because both 'mean' and 'std' depend on all the entries of x.

So, the Jacobian 'J' for the normalization is the N x N matrix of partial derivatives

J = | ∂y_1/∂x_1   ∂y_1/∂x_2   ∂y_1/∂x_3 |
    | ∂y_2/∂x_1   ∂y_2/∂x_2   ∂y_2/∂x_3 |
    | ∂y_3/∂x_1   ∂y_3/∂x_2   ∂y_3/∂x_3 |

Let us start finding each term in this Jacobian.

Starting with the first term, i.e., the diagonal entry ∂y_i/∂x_i,

we know that,

y_i = (x_i - mean) / std

So,

∂mean/∂x_i = 1/N

and,

∂std/∂x_i = (x_i - mean) / (N * std)

So, applying the quotient rule,

∂y_i/∂x_i = [ (1 - 1/N) * std - (x_i - mean) * (x_i - mean) / (N * std) ] / std²

which gives,

∂y_i/∂x_i = (1 - 1/N) / std - (x_i - mean)² / (N * std³)

Therefore, we have

∂y_i/∂x_i = (N - 1) / (N * std) - (x_i - mean)² / (N * std³)

Now let us calculate the second term, i.e., the off-diagonal entry ∂y_i/∂x_j with i ≠ j. Here ∂x_i/∂x_j = 0, so the quotient rule gives

∂y_i/∂x_j = [ (0 - 1/N) * std - (x_i - mean) * (x_j - mean) / (N * std) ] / std²

which gives,

∂y_i/∂x_j = -1 / (N * std) - (x_i - mean) * (x_j - mean) / (N * std³)

After finding every term in the Jacobian, we have a symmetric matrix whose (i, j) entry is

J_ij = (N * δ_ij - 1) / (N * std) - (x_i - mean) * (x_j - mean) / (N * std³)

where δ_ij = 1 when i = j and 0 otherwise.

We can reduce it to this compact form

J = (N * I - 1) / (N * std) - (x - mean)(x - mean)ᵀ / (N * std³)

where I is the N x N identity matrix and the scalar 1 is subtracted from every entry.
def normalize_dash(x):                         # Jacobian of the normalization
    N = x.shape[0]
    I = np.eye(N)
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)

    return ((N * I - 1) / (N * std + 10**-100)) - (
        (x - mean).dot((x - mean).T) / (N * std**3 + 10**-100))

Let us have a look at an example

x = np.array([[0.2], [0.5], [1.2], [-1.6], [0.5]])
x
normalize(x)
normalize_dash(x)
normalize_dash(x) == normalize_dash(x).T

We can see that the Normalization Jacobian is symmetric.
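As a quick sanity check, we can also compare the analytical Jacobian with a numerical one obtained by central finite differences. The helper below is a minimal sketch, and the step size h is just an illustrative small value:

def numerical_jacobian(x, h = 1e-6):           # finite-difference Jacobian of normalize
    N = x.shape[0]
    J = np.zeros((N, N))
    for j in range(N):
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[j] += h                         # nudge the j-th entry up
        x_minus[j] -= h                        # nudge the j-th entry down
        J[:, j] = ((normalize(x_plus) - normalize(x_minus)) / (2 * h)).ravel()
    return J

np.allclose(normalize_dash(x), numerical_jacobian(x), atol = 1e-5)   # should be True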

We also do scaling and shifting. By that we mean, we multiply the normalized layer element-wise by one parameter and then add another parameter.

The multiplication parameter is called 'gamma' and the addition parameter is called 'beta'. 'gamma' is initialized with ones and 'beta' is initialized with zeros. They are parameters that we will update via an optimizer.

gamma = np.ones(shape = x.shape)                 # gamma for scaling
gamma
beta = np.zeros(shape = x.shape)                 # beta for shifting
beta
scaled_shifted = gamma * normalize(x) + beta
scaled_shifted
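To see how this Jacobian is used in backpropagation, here is a minimal sketch. It assumes 'dL_dout' holds the upstream gradient of the loss with respect to the scaled and shifted output; the name and its random values are only illustrative, and the full implementation follows in the next post.

dL_dout = np.random.randn(*x.shape)              # illustrative upstream gradient

dL_dgamma = dL_dout * normalize(x)               # gradient for the scale parameter 'gamma'
dL_dbeta = dL_dout                               # gradient for the shift parameter 'beta'
dL_dx = normalize_dash(x).T.dot(gamma * dL_dout) # chain rule through the normalization

Because the Jacobian is symmetric, the transpose in the last line does not change the result, but it makes the chain rule explicit.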

I hope you now understand how Layer Normalization works and how to compute its Jacobian for backpropagation. In the next post, we will implement Layer Normalization in ANNs.

If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.

I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 5.5.2 Layer Normalization, Part II.
