Layer Normalization applied on a Neural Network

Step by step implementation in Python

neuralthreads
7 min read · Dec 19, 2021

In this post, we will see how to apply Layer Normalization which we studied in the last post.

You can download the Jupyter Notebook from here.

Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.

5.5.2 Layer Normalization, Part II

We will use two different layers for Normalization and scaling-shifting.

Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.

First, the activation function for the hidden layers is the ReLU function.
Second, the activation function for the output layer is the Softmax function.
Third, the loss function is the Categorical cross-entropy loss, CE.
Fourth, we will use the SGD Optimizer with a learning rate of 0.01.
Fifth, for the first hidden layer, Layer Normalization will be applied after the ReLU activation function, and for the second hidden layer, it will be applied before the ReLU activation function.

‘g_H1’ and ‘b_H1’ represent the gamma(s) and beta(s) for the first hidden layer, with elements ‘g_H1₁’, ‘g_H1₂’, …, ‘b_H1₁’…, and ‘g_H2’ and ‘b_H2’ represent the gamma(s) and beta(s) for the second hidden layer, with elements ‘g_H2₁’, ‘g_H2₂’, …, ‘b_H2₁’… In the code they are named gamma_H1, beta_H1, gamma_H2 and beta_H2.

Now, let us have a look at the steps.

Step 1 - A forward feed, like we did in the previous post, with Layer Normalization
Step 2 - Initializing SGD Optimizer
Step 3 - Entering the training loop
    Step 3.1 - Forward feed to see loss before training
    Step 3.2 - Calculating gradients via Backpropagation
    Step 3.3 - Updating weights, biases, gammas and betas via SGD Optimizer
Step 4 - A forward feed with Layer Normalization to verify that the loss has been reduced and to see how close predicted values are to true values

Note — In Step 4 we will use the trained values of the gammas and betas, because they are learned parameters that, like the weights and biases, store what the network has learned.

Step 1 — Forward feed with Layer Normalization

import numpy as np                          # importing NumPy
np.random.seed(42)

input_nodes = 5                             # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                           # Inputs

y = np.array([[0], [1], [0], [0]])
y                                           # Outputs

def relu(x, leak = 0):                      # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                 # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x):                             # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                        # Softmax derivative (Jacobian)
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def normalize(x):                           # Layer Normalization
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)
    return (x - mean) / (std + 10**-100)

def normalize_dash(x):                      # Normalization derivative (Jacobian)
    N = x.shape[0]
    I = np.eye(N)
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)
    return ((N * I - 1) / (N * std + 10**-100)) - (
        (x - mean).dot((x - mean).T) / (N * std**3 + 10**-100))

def cross_E(y_true, y_pred):                # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):           # CE derivative
    return -y_true/(y_pred + 10**-100)

w1 = np.random.random(size = (hidden_1_nodes, input_nodes))     # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                      # b1
gamma_H1 = np.ones(shape = (hidden_1_nodes, 1))                 # gamma_H1
beta_H1 = np.zeros(shape = (hidden_1_nodes, 1))                 # beta_H1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))  # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                      # b2
gamma_H2 = np.ones(shape = (hidden_2_nodes, 1))                 # gamma_H2
beta_H2 = np.zeros(shape = (hidden_2_nodes, 1))                 # beta_H2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes))    # w3
b3 = np.zeros(shape = (output_nodes, 1))                        # b3

in_hidden_1 = w1.dot(x) + b1                                    # forward feed
activated_H1 = relu(in_hidden_1)
normalized_H1 = normalize(activated_H1)                         # normalization after ReLU
out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

in_hidden_2 = w2.dot(out_hidden_1) + b2
normalized_H2 = normalize(in_hidden_2)                          # normalization before ReLU
scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
out_hidden_2 = relu(scaled_shifted_H2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                       # y_hat
y                                           # y
cross_E(y, y_hat)                           # loss
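As a quick sanity check on the normalization step, the sketch below applies the same normalize function to a made-up column of values (the array z is a hypothetical example, not data from the post) and confirms that the result has a mean of about 0 and a standard deviation of about 1. It also shows that with the initial values gamma = 1 and beta = 0, the scaling-shifting step leaves the normalized values unchanged.

```python
import numpy as np

def normalize(x):
    # Same definition as in the post: normalize across the nodes of one layer
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / (std + 10**-100)

# Hypothetical values for a layer with 5 nodes (assumption, for illustration only)
z = np.array([[0.2], [1.5], [-0.7], [3.1], [0.9]])
z_norm = normalize(z)

print(z_norm.mean())   # very close to 0
print(z_norm.std())    # very close to 1

# Initial gamma and beta, as in the post: ones and zeros
gamma = np.ones((5, 1))
beta = np.zeros((5, 1))
out = gamma * z_norm + beta

# With gamma = 1 and beta = 0 the scaling-shifting step is the identity
print(np.allclose(out, z_norm))
```

During training, gamma and beta move away from 1 and 0, so the layer is free to undo or reshape the normalization where that helps reduce the loss.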

Step 2 — Initializing SGD Optimizer

learning_rate = 0.01

Step 3 — Enter training loop

epochs = 200

Step 3.1 — Forward feed to see loss before training

for epoch in range(epochs):

    #----------------forward feed to see loss-----------------------

    in_hidden_1 = w1.dot(x) + b1
    activated_H1 = relu(in_hidden_1)
    normalized_H1 = normalize(activated_H1)
    out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

    in_hidden_2 = w2.dot(out_hidden_1) + b2
    normalized_H2 = normalize(in_hidden_2)
    scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
    out_hidden_2 = relu(scaled_shifted_H2)

    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = softmax(in_output_layer)

    loss = cross_E(y, y_hat)

    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')

Step 3.2 — Calculating gradients via Backpropagation

Now, before going to Backpropagation, I want you to notice two things.

First, the gamma(s) are analogous to weights. The only difference is that they are multiplied one-on-one, i.e., one gamma per node. When we jump back through the gamma(s), we simply multiply the incoming gradient by the gamma(s) element-wise, just as we multiply by the derivative of the sigmoid or ReLU.

For the first hidden layer

And, for the second hidden layer

Second, the beta(s) are analogous to biases. The gradients for beta(s) will be calculated in the same way as we did for biases.
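The two observations above can be written out as follows. This is a sketch in notation introduced here for illustration: L is the loss, n_i is the normalized value at node i, s_i is the scaled-shifted value, and o_i is the output of the hidden layer. These symbols are assumptions and do not appear in the original post.

```latex
% First hidden layer (normalization after ReLU): o_i = \gamma_i n_i + \beta_i
\frac{\partial L}{\partial \gamma_i} = \frac{\partial L}{\partial o_i}\, n_i,
\qquad
\frac{\partial L}{\partial \beta_i} = \frac{\partial L}{\partial o_i},
\qquad
\frac{\partial L}{\partial n_i} = \frac{\partial L}{\partial o_i}\, \gamma_i

% Second hidden layer (normalization before ReLU):
% s_i = \gamma_i n_i + \beta_i, \quad o_i = \mathrm{ReLU}(s_i)
\frac{\partial L}{\partial \gamma_i} = \frac{\partial L}{\partial o_i}\, \mathrm{ReLU}'(s_i)\, n_i,
\qquad
\frac{\partial L}{\partial \beta_i} = \frac{\partial L}{\partial o_i}\, \mathrm{ReLU}'(s_i),
\qquad
\frac{\partial L}{\partial n_i} = \frac{\partial L}{\partial o_i}\, \mathrm{ReLU}'(s_i)\, \gamma_i
```

These match the code: grad_gamma_H1 is error_grad_H1 * normalized_H1, while grad_gamma_H2 carries the extra relu_dash(scaled_shifted_H2) factor because the ReLU comes after the scaling-shifting in the second hidden layer.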

If you have any doubts, then I suggest you go through every step of ‘Jumping Back’ with the help of the architecture diagram above.

    #--------------Calculating gradients via Backpropagation------------

    error_upto_softmax = np.sum(cross_E_grad(y, y_hat) *
                                softmax_dash(in_output_layer), axis = 0).reshape((-1, 1))

    grad_w3 = error_upto_softmax .dot( out_hidden_2.T )

    grad_b3 = error_upto_softmax

    #------------------------------------

    error_grad_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))

    grad_beta_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2)

    grad_gamma_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2) * normalized_H2

    error_upto_normalize_H2 = np.sum(error_grad_H2 *
                                     relu_dash(scaled_shifted_H2) * gamma_H2 *
                                     normalize_dash(in_hidden_2),
                                     axis = 0).reshape((-1, 1))

    grad_w2 = error_upto_normalize_H2 .dot( out_hidden_1.T )

    grad_b2 = error_upto_normalize_H2

    #-----------------------------------

    error_grad_H1 = np.sum(error_upto_normalize_H2 * w2, axis = 0).reshape((-1, 1))

    grad_beta_H1 = error_grad_H1

    grad_gamma_H1 = error_grad_H1 * normalized_H1

    error_upto_normalize_H1 = np.sum(error_grad_H1 * gamma_H1 *
                                     normalize_dash(activated_H1), axis = 0).reshape((-1, 1))

    # parentheses make the order explicit: element-wise product first, then the outer product with x
    grad_w1 = (error_upto_normalize_H1 * relu_dash(in_hidden_1)) .dot( x.T )

    grad_b1 = error_upto_normalize_H1 * relu_dash(in_hidden_1)
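One way to gain confidence in the first backpropagation step is to compare it with a known closed form. For a Softmax output with Categorical cross-entropy and a one-hot target, the gradient with respect to the inputs of the output layer simplifies to y_hat - y. The sketch below checks this with the functions from the post; the vector z is a hypothetical stand-in for in_output_layer, not a value from the training run.

```python
import numpy as np

def softmax(x):
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):
    # Softmax Jacobian, same definition as in the post
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def cross_E_grad(y_true, y_pred):
    return -y_true / (y_pred + 10**-100)

# Hypothetical inputs to the output layer (assumption, for illustration only)
z = np.array([[0.3], [-1.2], [2.0], [0.5]])
y = np.array([[0], [1], [0], [0]])      # one-hot target, as in the post
y_hat = softmax(z)

# Same expression as in the training loop
error_upto_softmax = np.sum(cross_E_grad(y, y_hat) * softmax_dash(z),
                            axis=0).reshape((-1, 1))

# Compare with the closed form y_hat - y
print(np.allclose(error_upto_softmax, y_hat - y))
```

The Jacobian form used in the post is more general, since it does not rely on this particular pairing of loss and activation.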

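The Normalization Jacobian is analogous to the Softmax Jacobian. Each normalized value depends on every input of the layer through the mean and the standard deviation, so the derivative is an N × N matrix instead of an element-wise factor.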
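The normalize_dash function can be checked numerically. The sketch below compares it against central finite differences on a made-up input column (the array x_test and the step size h are assumptions chosen for illustration). Agreement between the two suggests that the analytic Jacobian is consistent with the normalize function.

```python
import numpy as np

def normalize(x):
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return (x - mean) / (std + 10**-100)

def normalize_dash(x):
    # Analytic Jacobian, same definition as in the post
    N = x.shape[0]
    I = np.eye(N)
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    return ((N * I - 1) / (N * std + 10**-100)) - (
        (x - mean).dot((x - mean).T) / (N * std**3 + 10**-100))

# Hypothetical input column with 5 nodes (assumption, for illustration only)
x_test = np.array([[0.4], [1.3], [-0.6], [2.2], [0.1]])
N = x_test.shape[0]
analytic = normalize_dash(x_test)

h = 1e-6                                  # finite-difference step
numeric = np.zeros((N, N))
for j in range(N):
    step = np.zeros((N, 1))
    step[j, 0] = h
    # column j holds the change of every normalized value when input j is nudged
    diff = normalize(x_test + step) - normalize(x_test - step)
    numeric[:, j] = (diff / (2 * h)).ravel()

print(np.allclose(analytic, numeric, atol=1e-6))

# Shifting every input by the same amount does not change the normalized values,
# so each row of the Jacobian sums to (approximately) zero
print(np.allclose(analytic.sum(axis=1), 0))
```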

Step 3.3 — Using SGD Optimizer to update weights, biases, gammas and betas

    update_w1 = - learning_rate * grad_w1
    w1 += update_w1                                 # w1

    update_b1 = - learning_rate * grad_b1
    b1 += update_b1                                 # b1

    update_gamma_H1 = - learning_rate * grad_gamma_H1
    gamma_H1 += update_gamma_H1                     # gamma_H1

    update_beta_H1 = - learning_rate * grad_beta_H1
    beta_H1 += update_beta_H1                       # beta_H1

    update_w2 = - learning_rate * grad_w2
    w2 += update_w2                                 # w2

    update_b2 = - learning_rate * grad_b2
    b2 += update_b2                                 # b2

    update_gamma_H2 = - learning_rate * grad_gamma_H2
    gamma_H2 += update_gamma_H2                     # gamma_H2

    update_beta_H2 = - learning_rate * grad_beta_H2
    beta_H2 += update_beta_H2                       # beta_H2

    update_w3 = - learning_rate * grad_w3
    w3 += update_w3                                 # w3

    update_b3 = - learning_rate * grad_b3
    b3 += update_b3                                 # b3
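Every parameter above follows the same rule, so the ten updates can also be expressed as one loop over named parameters. The sketch below is only an alternative layout of the same SGD step. The dictionaries params and grads and the small arrays in them are hypothetical and are not part of the original notebook.

```python
import numpy as np

learning_rate = 0.01

# Hypothetical parameters and gradients (assumption, for illustration only)
params = {
    'w':     np.array([[0.5, -0.2]]),
    'gamma': np.ones((1, 1)),
    'beta':  np.zeros((1, 1)),
}
grads = {
    'w':     np.array([[0.1, 0.3]]),
    'gamma': np.array([[-0.4]]),
    'beta':  np.array([[0.2]]),
}

# Plain SGD: every parameter moves against its gradient, scaled by the learning rate
for name in params:
    params[name] += -learning_rate * grads[name]

print(params['w'])
print(params['gamma'])
print(params['beta'])
```

Gammas and betas are updated in exactly the same way as weights and biases, which is why they fit into the same loop.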

The training loop will run 200 times.

This is a small screenshot after the training.

Step 4 — A forward feed to see how close predicted values are to true values

in_hidden_1 = w1.dot(x) + b1                    # forward feed
activated_H1 = relu(in_hidden_1)
normalized_H1 = normalize(activated_H1)
out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1
in_hidden_2 = w2.dot(out_hidden_1) + b2
normalized_H2 = normalize(in_hidden_2)
scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
out_hidden_2 = relu(scaled_shifted_H2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)
y_hat                                           # predicted values
y                                               # true values
cross_E(y, y_hat)                               # loss

I hope now you understand how to apply Layer Normalization in Neural Networks.


Continue to the next post — 5.6 Batch Training.
