Layer Normalization applied to a Neural Network
Step-by-step implementation in Python
In this post, we will see how to apply the Layer Normalization that we studied in the last post.
You can download the Jupyter Notebook from here.
Note — This post builds on the previous chapters, so it is recommended that you have a look at the previous posts first.
5.5.2 Layer Normalization, Part II
We will use two different layers for Normalization and scaling-shifting.
Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.
First, the activation function for the hidden layers is the ReLU function.
Second, the activation function for the output layer is the Softmax function.
Third, the loss function used is Categorical cross-entropy loss, CE.
Fourth, we will use the SGD Optimizer with a learning rate of 0.01.
Fifth, for the first hidden layer, Layer Normalization will be applied after the ReLU activation function, and for the second hidden layer, Layer Normalization will be applied before the ReLU activation function (see the short sketch right after this note).
‘g_H1’ and ‘b_H1’ represent the gamma(s) and beta(s) for the first hidden layer, with elements ‘g_H1₁’, ‘g_H1₂’, …, ‘b_H1₁’, …, and ‘g_H2’ and ‘b_H2’ represent the gamma(s) and beta(s) for the second hidden layer, with elements ‘g_H2₁’, ‘g_H2₂’, …, ‘b_H2₁’, … In the code they appear as gamma_H1, beta_H1, gamma_H2 and beta_H2.
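In short, the two hidden layers use Layer Normalization at two different positions. Here is a minimal sketch of the ordering, written with the same variable names that the code below uses:

# Hidden layer 1: weighted sum -> ReLU -> normalize -> scale and shift
in_hidden_1 = w1.dot(x) + b1
out_hidden_1 = gamma_H1 * normalize(relu(in_hidden_1)) + beta_H1

# Hidden layer 2: weighted sum -> normalize -> scale and shift -> ReLU
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(gamma_H2 * normalize(in_hidden_2) + beta_H2)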
Now, let us have a look at the steps.
Step 1 - A forward feed like we did in the previous post, with Layer Normalization
Step 2 - Initializing SGD Optimizer
Step 3 - Entering the training loop
Step 3.1 - Forward feed to see the loss before training
Step 3.2 - Calculating gradients via Backpropagation
Step 3.3 - Updating weights, biases, gammas and betas via SGD Optimizer
Step 4 - A forward feed with Layer Normalization to verify that the loss has been reduced and to see how close the predicted values are to the true values
Note — In Step 4 we use the trained values of the gammas and betas because, like the weights and biases, they also store the learned pattern or information.
Step 1 — Forward feed with Layer Normalization
import numpy as np # importing NumPy
np.random.seed(42)

input_nodes = 5 # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x # Inputs

y = np.array([[0], [1], [0], [0]])
y # Outputs
def relu(x, leak = 0): # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0): # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x): # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x): # Softmax derivative (Jacobian)
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def normalize(x): # Layer Normalization
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)
    return (x - mean) / (std + 10**-100)

def normalize_dash(x): # Normalization derivative (Jacobian)
    N = x.shape[0]
    I = np.eye(N)
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)
    return ((N * I - 1) / (N * std + 10**-100)) - ( (x - mean).dot((x - mean).T) / (N * std**3 + 10**-100) )

def cross_E(y_true, y_pred): # CE loss
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred): # CE derivative
    return -y_true / (y_pred + 10**-100)
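As a quick sanity check (an extra snippet, not part of the original notebook; v and v_hat are just illustrative names), you can confirm that normalize really produces zero mean and unit standard deviation for a column vector:

v = np.array([[1.], [2.], [3.], [4.], [5.]]) # any column vector
v_hat = normalize(v)
print(v_hat.mean(axis = 0)) # approximately 0
print(v_hat.std(axis = 0)) # approximately 1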
w1 = np.random.random(size = (hidden_1_nodes, input_nodes)) # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1)) # b1

gamma_H1 = np.ones(shape = (hidden_1_nodes, 1)) # gamma_H1
beta_H1 = np.zeros(shape = (hidden_1_nodes, 1)) # beta_H1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes)) # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1)) # b2

gamma_H2 = np.ones(shape = (hidden_2_nodes, 1)) # gamma_H2
beta_H2 = np.zeros(shape = (hidden_2_nodes, 1)) # beta_H2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes)) # w3
b3 = np.zeros(shape = (output_nodes, 1)) # b3
in_hidden_1 = w1.dot(x) + b1 # forward feed
activated_H1 = relu(in_hidden_1)
normalized_H1 = normalize(activated_H1)
out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

in_hidden_2 = w2.dot(out_hidden_1) + b2
normalized_H2 = normalize(in_hidden_2)
scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
out_hidden_2 = relu(scaled_shifted_H2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat # y_hat
y # y
cross_E(y, y_hat) # loss
Step 2 — Initializing SGD Optimizer
learning_rate = 0.01
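There is nothing more to initialize for plain SGD. Every trainable parameter, i.e., the weights, biases, gammas and betas, will be updated in Step 3.3 with the same rule:

parameter = parameter - learning_rate * gradient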
Step 3 — Entering the training loop
epochs = 200
Step 3.1 — Forward feed to see loss before training
for epoch in range(epochs):

    #----------------forward feed to see loss-----------------------
    in_hidden_1 = w1.dot(x) + b1
    activated_H1 = relu(in_hidden_1)
    normalized_H1 = normalize(activated_H1)
    out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

    in_hidden_2 = w2.dot(out_hidden_1) + b2
    normalized_H2 = normalize(in_hidden_2)
    scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
    out_hidden_2 = relu(scaled_shifted_H2)

    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = softmax(in_output_layer)

    loss = cross_E(y, y_hat)
    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')
Now, before going to Backpropagation, I want you to notice two things.
First, the gamma(s) are analogous to weights. The only difference is that each gamma multiplies its node one-on-one instead of being part of a dot product. When we jump back through the scale-shift, we simply take the gamma(s) into our gradients, just as we take the sigmoid or ReLU derivatives when we jump back through an activation.

For the first hidden layer, where the scale-shift is the last operation,

grad_gamma_H1 = error_grad_H1 * normalized_H1

And, for the second hidden layer, where the scale-shift sits before the ReLU,

grad_gamma_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2) * normalized_H2

Second, the beta(s) are analogous to biases. The gradients for the beta(s) are calculated in the same way as for the biases, i.e., grad_beta_H1 = error_grad_H1 and grad_beta_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2).
If you have any doubts, then I suggest you go through every step of ‘Jumping Back’ with the help of the architecture diagram above.
Step 3.2 — Calculating gradients via Backpropagation

    #--------------Calculating gradients via Backpropagation------------
    error_upto_softmax = np.sum(cross_E_grad(y, y_hat) * softmax_dash(in_output_layer),
                                axis = 0).reshape((-1, 1))

    grad_w3 = error_upto_softmax.dot( out_hidden_2.T )
    grad_b3 = error_upto_softmax

    #------------------------------------

    error_grad_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))

    grad_beta_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2)
    grad_gamma_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2) * normalized_H2

    error_upto_normalize_H2 = np.sum(error_grad_H2 * relu_dash(scaled_shifted_H2) * gamma_H2
                                     * normalize_dash(in_hidden_2), axis = 0).reshape((-1, 1))

    grad_w2 = error_upto_normalize_H2.dot( out_hidden_1.T )
    grad_b2 = error_upto_normalize_H2

    #-----------------------------------

    error_grad_H1 = np.sum(error_upto_normalize_H2 * w2, axis = 0).reshape((-1, 1))

    grad_beta_H1 = error_grad_H1
    grad_gamma_H1 = error_grad_H1 * normalized_H1

    error_upto_normalize_H1 = np.sum(error_grad_H1 * gamma_H1 * normalize_dash(activated_H1),
                                     axis = 0).reshape((-1, 1))

    grad_w1 = (error_upto_normalize_H1 * relu_dash(in_hidden_1)).dot( x.T )
    grad_b1 = error_upto_normalize_H1 * relu_dash(in_hidden_1)
The Normalization Jacobian is handled just like the Softmax Jacobian because, by its definition, each normalized value depends on all of the layer's values.
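Written out, the Jacobian that normalize_dash computes is, for a layer with N nodes, mean μ, standard deviation σ and normalized output z = normalize(x):

∂z_i/∂x_j = (N δ_ij - 1) / (N σ) - (x_i - μ)(x_j - μ) / (N σ³)

Every normalized output depends on every input of its layer, exactly like every Softmax output depends on every input, which is why we sum over the Jacobian while jumping back.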
Step 3.3 — Using SGD Optimizer to update weights, biases, gammas and betas
    update_w1 = - learning_rate * grad_w1
    w1 += update_w1 # w1

    update_b1 = - learning_rate * grad_b1
    b1 += update_b1 # b1

    update_gamma_H1 = - learning_rate * grad_gamma_H1
    gamma_H1 += update_gamma_H1 # gamma_H1

    update_beta_H1 = - learning_rate * grad_beta_H1
    beta_H1 += update_beta_H1 # beta_H1

    update_w2 = - learning_rate * grad_w2
    w2 += update_w2 # w2

    update_b2 = - learning_rate * grad_b2
    b2 += update_b2 # b2

    update_gamma_H2 = - learning_rate * grad_gamma_H2
    gamma_H2 += update_gamma_H2 # gamma_H2

    update_beta_H2 = - learning_rate * grad_beta_H2
    beta_H2 += update_beta_H2 # beta_H2

    update_w3 = - learning_rate * grad_w3
    w3 += update_w3 # w3

    update_b3 = - learning_rate * grad_b3
    b3 += update_b3 # b3
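If you prefer, the same Step 3.3 can be written more compactly by looping over (parameter, gradient) pairs. This is only an equivalent rewrite of the updates above, not what the notebook itself does:

    for param, grad in [(w1, grad_w1), (b1, grad_b1),
                        (gamma_H1, grad_gamma_H1), (beta_H1, grad_beta_H1),
                        (w2, grad_w2), (b2, grad_b2),
                        (gamma_H2, grad_gamma_H2), (beta_H2, grad_beta_H2),
                        (w3, grad_w3), (b3, grad_b3)]:
        param -= learning_rate * grad # in-place update modifies the original arrays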
The training loop will run 200 times. If you watch the loss printed at each epoch, you can see it shrinking as the training goes on.
Step 4 — A forward feed to see how close the predicted values are to the true values
in_hidden_1 = w1.dot(x) + b1 # forward feed
activated_H1 = relu(in_hidden_1)
normalized_H1 = normalize(activated_H1)
out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

in_hidden_2 = w2.dot(out_hidden_1) + b2
normalized_H2 = normalize(in_hidden_2)
scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
out_hidden_2 = relu(scaled_shifted_H2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat # predicted values
y # true values
cross_E(y, y_hat) # loss
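If you want a one-line check (an extra snippet, not in the original post), the predicted class should now match the true class:

print(np.argmax(y_hat) == np.argmax(y)) # True once the loss is small enough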
I hope you now understand how to apply Layer Normalization in Neural Networks.
If you like this post, then please subscribe to my YouTube channel, neuralthreads, and join me on Reddit.
I will be uploading new interactive videos to the YouTube channel soon, and I will be happy to help you with any doubts on Reddit.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.