Layer Normalization applied to a Neural Network

Step-by-step implementation in Python

neuralthreads
Dec 19, 2021

In this post, we will see how to apply Layer Normalization which we studied in the last post.

You can download the Jupyter Notebook from here.

Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.

Back to the previous post

Back to the first post

5.5.2 Layer Normalization, Part II

We will treat Normalization and scaling-shifting as two separate steps, i.e., two different layers.

Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.

First, the activation function for the hidden layers is the ReLU function.
Second, the activation function for the output layer is the Softmax function.
Third, the loss function used is Categorical cross-entropy loss, CE.
Fourth, we will use the SGD Optimizer with a learning rate of 0.01.
Fifth, for the first hidden layer, Layer Normalization is applied after the ReLU activation function, and for the second hidden layer, Layer Normalization is applied before the ReLU activation function.

‘g_H1’ and ‘b_H1’ represent the gamma(s) and beta(s) for the first hidden layer, with elements ‘g_H1₁’, ‘g_H1₂’, …, ‘b_H1₁’, …, and ‘g_H2’ and ‘b_H2’ represent the gamma(s) and beta(s) for the second hidden layer, with elements ‘g_H2₁’, ‘g_H2₂’, …, ‘b_H2₁’, …
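To make the fifth point concrete, here is a minimal sketch (using made-up toy vectors, not the actual network values used later) of how the two hidden layers differ in the order of ReLU, Normalization, and scaling-shifting:

import numpy as np

z1 = np.array([[0.4], [-1.2], [2.5]])                   # toy pre-activation for hidden layer 1 (3 nodes)
z2 = np.array([[1.0], [0.2], [-0.7], [3.1], [0.0]])     # toy pre-activation for hidden layer 2 (5 nodes)

gamma_1, beta_1 = np.ones((3, 1)), np.zeros((3, 1))     # one gamma and one beta per node
gamma_2, beta_2 = np.ones((5, 1)), np.zeros((5, 1))

def layer_norm(v):                                      # same normalization used later in the post
    return (v - v.mean()) / (v.std() + 10**-100)

out_1 = gamma_1 * layer_norm(np.maximum(z1, 0)) + beta_1    # hidden layer 1: ReLU -> normalize -> scale-shift
out_2 = np.maximum(gamma_2 * layer_norm(z2) + beta_2, 0)    # hidden layer 2: normalize -> scale-shift -> ReLU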

Now, let us have a look at the steps.

Step 1 - A forward feed like we did in the previous post, with Layer Normalization
Step 2 - Initializing the SGD Optimizer
Step 3 - Entering the training loop
Step 3.1 - Forward feed to see loss before training
Step 3.2 - Calculating gradients via Backpropagation
Step 3.3 - Updating weights, biases, gammas and betas via the SGD Optimizer
Step 4 - A forward feed with Layer Normalization to verify that the loss has been reduced and to see how close the predicted values are to the true values

Note — We will keep the trained values of the gammas and betas for the final forward feed because, like the weights and biases, they also store part of the learned pattern.

Step 1 — Forward feed with Layer Normalization

import numpy as np                          # importing NumPy
np.random.seed(42)

input_nodes = 5                             # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                           # Inputs

y = np.array([[0], [1], [0], [0]])
y                                           # Outputs

def relu(x, leak = 0):                      # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                 # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x):                             # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                        # Softmax derivative (Jacobian)
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def normalize(x):                           # Layer Normalization
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)
    return (x - mean) / (std + 10**-100)

def normalize_dash(x):                      # Normalization derivative (Jacobian)
    N = x.shape[0]
    I = np.eye(N)
    mean = x.mean(axis = 0)
    std = x.std(axis = 0)
    return ((N * I - 1) / (N * std + 10**-100)) - ( (x - mean).dot((x - mean).T) / (N * std**3 + 10**-100) )

def cross_E(y_true, y_pred):                # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):           # CE derivative
    return -y_true / (y_pred + 10**-100)

w1 = np.random.random(size = (hidden_1_nodes, input_nodes))       # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                        # b1
gamma_H1 = np.ones(shape = (hidden_1_nodes, 1))                   # gamma_H1
beta_H1 = np.zeros(shape = (hidden_1_nodes, 1))                   # beta_H1
w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))    # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                        # b2
gamma_H2 = np.ones(shape = (hidden_2_nodes, 1))                   # gamma_H2
beta_H2 = np.zeros(shape = (hidden_2_nodes, 1))                   # beta_H2
w3 = np.random.random(size = (output_nodes, hidden_2_nodes))      # w3
b3 = np.zeros(shape = (output_nodes, 1))                          # b3

in_hidden_1 = w1.dot(x) + b1                                      # forward feed
activated_H1 = relu(in_hidden_1)
normalized_H1 = normalize(activated_H1)
out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

in_hidden_2 = w2.dot(out_hidden_1) + b2
normalized_H2 = normalize(in_hidden_2)
scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
out_hidden_2 = relu(scaled_shifted_H2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                       # y_hat

y                                           # y

cross_E(y, y_hat)                           # loss
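As a quick sanity check (not part of the original notebook), you can confirm that each normalized vector has mean approximately 0 and standard deviation approximately 1, and that the softmax output sums to 1:

print(normalized_H1.mean(), normalized_H1.std())    # should be approximately 0 and 1
print(normalized_H2.mean(), normalized_H2.std())    # should be approximately 0 and 1
print(y_hat.sum())                                  # softmax output sums to 1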

Step 2 — Initializing SGD Optimizer

learning_rate = 0.01

Step 3 — Entering the training loop

epochs = 200

Step 3.1 — Forward feed to see loss before training

for epoch in range(epochs):

    #----------------forward feed to see loss-----------------------

    in_hidden_1 = w1.dot(x) + b1
    activated_H1 = relu(in_hidden_1)
    normalized_H1 = normalize(activated_H1)
    out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

    in_hidden_2 = w2.dot(out_hidden_1) + b2
    normalized_H2 = normalize(in_hidden_2)
    scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
    out_hidden_2 = relu(scaled_shifted_H2)

    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = softmax(in_output_layer)

    loss = cross_E(y, y_hat)

    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')

Now, before going to Backpropagation, I want you to notice two things.
First, the gamma(s) are analogous to weights. The only difference is that each gamma multiplies a single node's normalized value one-on-one, rather than connecting every node of one layer to every node of the next. When we jump back through the gamma(s), we simply include the gamma(s) in our gradients, just as we include the local derivative when jumping back through the sigmoid or ReLU.

For the first hidden layer, out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1, so the derivative with respect to gamma_H1 is normalized_H1, and when jumping back through this step we multiply by gamma_H1.

And, for the second hidden layer, scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2, so the derivative with respect to gamma_H2 is normalized_H2, and when jumping back through this step we multiply by gamma_H2.

Second, the beta(s) are analogous to biases. The gradients for the beta(s) will be calculated in the same way as we did for the biases.

If you have any doubts, then I suggest you go through every step of ‘Jumping Back’ with the help of the architecture diagram above.
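In other words, for the scaling-shifting step out = gamma * z + beta, the local derivative with respect to gamma is the normalized value z, and with respect to beta it is 1. A tiny numerical nudge on a made-up vector (purely illustrative, not the network values) confirms this:

z = np.array([[0.5], [-1.0], [1.5]])        # a made-up "normalized" vector
gamma = np.array([[1.2], [0.8], [1.0]])     # made-up gammas
beta = np.array([[0.1], [0.0], [-0.2]])     # made-up betas
eps = 10**-6

out = gamma * z + beta
print(((gamma + eps) * z + beta - out) / eps)    # approximately z, the derivative w.r.t. gamma
print((gamma * z + (beta + eps) - out) / eps)    # approximately 1, the derivative w.r.t. beta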

    #--------------Calculating gradients via Backpropagation------------

    error_upto_softmax = np.sum(cross_E_grad(y, y_hat) * softmax_dash(in_output_layer), axis = 0).reshape((-1, 1))

    grad_w3 = error_upto_softmax.dot(out_hidden_2.T)

    grad_b3 = error_upto_softmax

    #------------------------------------

    error_grad_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))

    grad_beta_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2)

    grad_gamma_H2 = error_grad_H2 * relu_dash(scaled_shifted_H2) * normalized_H2

    error_upto_normalize_H2 = np.sum(error_grad_H2 * relu_dash(scaled_shifted_H2) * gamma_H2 * normalize_dash(in_hidden_2), axis = 0).reshape((-1, 1))

    grad_w2 = error_upto_normalize_H2.dot(out_hidden_1.T)

    grad_b2 = error_upto_normalize_H2

    #-----------------------------------

    error_grad_H1 = np.sum(error_upto_normalize_H2 * w2, axis = 0).reshape((-1, 1))

    grad_beta_H1 = error_grad_H1

    grad_gamma_H1 = error_grad_H1 * normalized_H1

    error_upto_normalize_H1 = np.sum(error_grad_H1 * gamma_H1 * normalize_dash(activated_H1), axis = 0).reshape((-1, 1))

    grad_w1 = (error_upto_normalize_H1 * relu_dash(in_hidden_1)).dot(x.T)

    grad_b1 = error_upto_normalize_H1 * relu_dash(in_hidden_1)

The Normalization Jacobian is analogous to the Softmax Jacobian: both functions map a whole vector to a whole vector, so every output element depends on every input element, and jumping back through them requires a full Jacobian matrix rather than an element-wise derivative.
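If you want to convince yourself that normalize_dash really returns this Jacobian, a small finite-difference check (not part of the original notebook, and meant to be run outside the training loop) compares it with numerically nudged outputs of normalize:

v = np.random.random(size = (4, 1))              # a random test vector
eps = 10**-6

J_analytic = normalize_dash(v)                   # Jacobian from the formula above

J_numeric = np.zeros((4, 4))
for j in range(4):                               # nudge one input at a time
    v_bumped = v.copy()
    v_bumped[j, 0] += eps
    J_numeric[:, j] = ((normalize(v_bumped) - normalize(v)) / eps).flatten()

print(np.allclose(J_analytic, J_numeric, atol = 10**-4))    # should print True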

Step 3.3 — Using the SGD Optimizer to update the weights, biases, gammas and betas

    update_w1 = - learning_rate * grad_w1
    w1 += update_w1                                  # w1

    update_b1 = - learning_rate * grad_b1
    b1 += update_b1                                  # b1

    update_gamma_H1 = - learning_rate * grad_gamma_H1
    gamma_H1 += update_gamma_H1                      # gamma_H1

    update_beta_H1 = - learning_rate * grad_beta_H1
    beta_H1 += update_beta_H1                        # beta_H1

    update_w2 = - learning_rate * grad_w2
    w2 += update_w2                                  # w2

    update_b2 = - learning_rate * grad_b2
    b2 += update_b2                                  # b2

    update_gamma_H2 = - learning_rate * grad_gamma_H2
    gamma_H2 += update_gamma_H2                      # gamma_H2

    update_beta_H2 = - learning_rate * grad_beta_H2
    beta_H2 += update_beta_H2                        # beta_H2

    update_w3 = - learning_rate * grad_w3
    w3 += update_w3                                  # w3

    update_b3 = - learning_rate * grad_b3
    b3 += update_b3                                  # b3
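Every one of these ten update blocks applies the same rule, parameter = parameter - learning_rate * gradient. If you prefer, the updates can be factored into a small helper that loops over (parameter, gradient) pairs; this is just a sketch of an alternative with a hypothetical sgd_step function, not what the notebook does:

def sgd_step(params, grads, learning_rate):          # hypothetical helper, not in the notebook
    for param, grad in zip(params, grads):
        param -= learning_rate * grad                # in-place update keeps the original arrays

# would be called inside the training loop, after the gradients are computed
sgd_step([w1, b1, gamma_H1, beta_H1, w2, b2, gamma_H2, beta_H2, w3, b3],
         [grad_w1, grad_b1, grad_gamma_H1, grad_beta_H1, grad_w2, grad_b2,
          grad_gamma_H2, grad_beta_H2, grad_w3, grad_b3],
         learning_rate)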

The training loop will run 200 times.


Step 4 — A forward feed to see how close the predicted values are to the true values

in_hidden_1 = w1.dot(x) + b1                    # forward feed
activated_H1 = relu(in_hidden_1)
normalized_H1 = normalize(activated_H1)
out_hidden_1 = gamma_H1 * normalized_H1 + beta_H1

in_hidden_2 = w2.dot(out_hidden_1) + b2
normalized_H2 = normalize(in_hidden_2)
scaled_shifted_H2 = gamma_H2 * normalized_H2 + beta_H2
out_hidden_2 = relu(scaled_shifted_H2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                           # predicted values

y                                               # true values

cross_E(y, y_hat)                               # loss
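To read the output as a class prediction (a small addition, not in the original notebook), compare the index of the largest value in y_hat with the index of the 1 in y; once the loss is small, they should agree:

predicted_class = np.argmax(y_hat)      # index of the largest predicted probability
true_class = np.argmax(y)               # index of the 1 in the one-hot target
print(predicted_class, true_class)      # should match once training has reduced the loss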

I hope now you understand how to apply Layer Normalization in Neural Networks.

If you like this post, then please subscribe to my YouTube channel, neuralthreads, and join me on Reddit.

I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 5.6 Batch Training.
