Backpropagation — Made super easy for you, Part 1
Step by step implementation in Python
This is the post you all have been waiting for. In this post, we will go through Backpropagation, often considered the most complex part of Deep Learning, yet it becomes very simple when done in an organized manner. After this post, you will never look at Backpropagation the same way again. I guarantee it.
At the end of this post, we will learn to calculate gradients via a game called ‘Jumping Back’, which has a few rules. This game will help you visualize how Backpropagation works and how simple and easy it actually is.
You can download the Jupyter Notebook from here.
Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.
5.2.1 Backpropagation in ANNs — Part 1
In this post, we will learn how to use Backpropagation to calculate gradients which we will use to update weights and biases to reduce the loss via some Optimizer.
Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.
Before going forward, a few things first (the formulas are written out just below):
First, the activation function for the first hidden layer is ReLU with leak = 0.1
Second, the activation function for the second hidden layer and the output layer is the Sigmoid function
Third, the loss function used is Mean Squared Error (MSE)
Fourth, we will use the SGD Optimizer with learning rate = 0.01
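For reference, these four choices correspond to the following formulas (the code versions appear in Step 1):

$$\text{ReLU}(z) = \begin{cases} z, & z > 0 \\ 0.1\,z, & z \le 0 \end{cases} \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

$$L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \hat{y}_i\big)^2 \qquad w \leftarrow w - 0.01 \cdot \frac{\partial L}{\partial w}$$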
Now, let us look at the steps we will follow:
Step 1 - A forward feed, like we did in the previous post
Step 2 - Initializing the SGD Optimizer
Step 3 - Entering the training loop
Step 3.1 - A forward feed to see the loss before training
Step 3.2 - Using Backpropagation to calculate gradients
Step 3.3 - Using the SGD Optimizer to update the weights and biases
Step 4 - A forward feed to verify that the loss has been reduced and to see how close the predicted values are to the true values
Let us do it in Python.
Step 1 — A forward feed like we did in the previous post
import numpy as np                   # importing NumPy
np.random.seed(42)

input_nodes = 5                      # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                    # Inputs

y = np.random.randint(1, 100, size = (output_nodes, 1)) / 100
y                                    # Outputs
This time along with the activation functions and the loss function, we will also define their derivatives.
def relu(x, leak = 0):                      # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                 # ReLU derivative
    return np.where(x <= 0, leak, 1)

def sig(x):                                 # Sigmoid
    return 1/(1 + np.exp(-x))

def sig_dash(x):                            # Sigmoid derivative
    return sig(x) * (1 - sig(x))

def mse(y_true, y_pred):                    # MSE
    return np.mean((y_true - y_pred)**2)

def mse_grad(y_true, y_pred):               # MSE derivative
    N = y_true.shape[0]
    return -2*(y_true - y_pred)/N
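As an optional sanity check, we can compare each analytical derivative against a numerical finite-difference estimate; eps and the test points below are arbitrary values chosen only for this illustration.

eps = 1e-6                                   # small step for finite differences
t = np.array([[0.3], [-0.7], [1.2]])         # arbitrary test points

num_sig_dash = (sig(t + eps) - sig(t - eps)) / (2 * eps)
print(np.allclose(num_sig_dash, sig_dash(t)))                  # True

num_relu_dash = (relu(t + eps, leak = 0.1) - relu(t - eps, leak = 0.1)) / (2 * eps)
print(np.allclose(num_relu_dash, relu_dash(t, leak = 0.1)))    # True

y_true = np.array([[0.2], [0.9]])
y_pred = np.array([[0.5], [0.4]])
num_mse_grad = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):             # perturb one prediction at a time
    up, down = y_pred.copy(), y_pred.copy()
    up[i] += eps
    down[i] -= eps
    num_mse_grad[i] = (mse(y_true, up) - mse(y_true, down)) / (2 * eps)
print(np.allclose(num_mse_grad, mse_grad(y_true, y_pred)))     # True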
w1 = np.random.random(size = (hidden_1_nodes, input_nodes))      # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                       # b1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))   # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                       # b2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes))     # w3
b3 = np.zeros(shape = (output_nodes, 1))                         # b3

in_hidden_1 = w1.dot(x) + b1                                     # forward feed
out_hidden_1 = relu(in_hidden_1, leak = 0.1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = sig(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = sig(in_output_layer)

y_hat                                                            # y_hat
y                                                                # y
mse(y, y_hat)                                                    # MSE loss
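To keep the matrix shapes straight (they will matter a lot during Backpropagation), here is a quick optional check of what we expect at each stage:

print(w1.shape, b1.shape)              # (3, 5) (3, 1)
print(w2.shape, b2.shape)              # (5, 3) (5, 1)
print(w3.shape, b3.shape)              # (4, 5) (4, 1)
print(out_hidden_1.shape)              # (3, 1)
print(out_hidden_2.shape)              # (5, 1)
print(y_hat.shape, y.shape)            # (4, 1) (4, 1)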
Step 2 — Initializing SGD Optimizer
learning_rate = 0.01
Step 3 — Entering the training loop
We will call the training loops ‘epochs’. We will have 10,000 training loops.
epochs = 10000
Step 3.1 — A forward feed to see loss before training
We will print the loss before training in every epoch to see that it keeps decreasing from one epoch to the next.
for epoch in range(epochs):

    #----------------------Forward Propagation--------------------------
    in_hidden_1 = w1.dot(x) + b1
    out_hidden_1 = relu(in_hidden_1, leak = 0.1)
    in_hidden_2 = w2.dot(out_hidden_1) + b2
    out_hidden_2 = sig(in_hidden_2)
    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = sig(in_output_layer)

    loss = mse(y, y_hat)
    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')
Step 3.2 — Calculating gradients via Backpropagation
Now, the question is how to update the weights and biases.

For example, take the weight w3₁₁. If we can calculate

$$\frac{\partial L}{\partial w3_{11}}$$

then we can use the SGD Optimizer to update it like this:

$$w3_{11} \leftarrow w3_{11} - \text{learning rate} \cdot \frac{\partial L}{\partial w3_{11}}$$

Or we can do better. For every weight in w3, we can do this:

$$w3_{ij} \leftarrow w3_{ij} - \text{learning rate} \cdot \frac{\partial L}{\partial w3_{ij}}$$

It can be rearranged as

$$W_3 \leftarrow W_3 - \text{learning rate} \cdot \text{grad\_w3}$$

where grad_w3 is the matrix of all the partial derivatives ∂L/∂w3ᵢⱼ, with the same shape as w3, i.e., (4, 5).

Let us start by finding the first term, i.e., ∂L/∂w3₁₁.

We know that

$$L = \frac{1}{4}\sum_{i=1}^{4}\big(y_i - \hat{y}_i\big)^2$$

and

$$\hat{y}_i = \sigma\big(I\_OL_i\big), \qquad I\_OL_i = \sum_{j=1}^{5} w3_{ij}\, O\_H2_j + b3_i$$

So we can write

$$\frac{\partial L}{\partial w3_{11}} = \sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\,\frac{\partial I\_OL_i}{\partial w3_{11}}$$

We also know that only I_OL₁ depends on w3₁₁, so

$$\frac{\partial I\_OL_i}{\partial w3_{11}} = 0 \quad \text{for } i \neq 1$$

i.e., these terms are 0 (zero). We are left with

$$\frac{\partial L}{\partial w3_{11}} = \frac{\partial L}{\partial \hat{y}_1}\,\sigma'\big(I\_OL_1\big)\,\frac{\partial I\_OL_1}{\partial w3_{11}}$$

or

$$\frac{\partial L}{\partial w3_{11}} = \frac{-2}{4}\big(y_1 - \hat{y}_1\big)\cdot \sigma'\big(I\_OL_1\big)\cdot O\_H2_1$$

Like this, we can find every term in grad_w3, and the whole matrix reduces to

$$\text{grad\_w3} = \Big[\text{mse\_grad}(y, \hat{y}) \odot \sigma'\big(I\_OL\big)\Big]\cdot O\_H2^{\,T}$$

Note — here ⊙ denotes element-wise multiplication and · the dot product (drawn as a circle with a dot in the figures).
In a single line, we have calculated all the gradients for weights in w3. Backpropagation is very simple when done in an organized fashion.
Similarly, we can calculate the gradients for b3. For every bias in b3, we can do this:

$$b3_{i} \leftarrow b3_{i} - \text{learning rate} \cdot \frac{\partial L}{\partial b3_{i}}$$

It can be rearranged as

$$B_3 \leftarrow B_3 - \text{learning rate} \cdot \text{grad\_b3}$$

where grad_b3 is the column of partial derivatives ∂L/∂b3ᵢ, with the same shape as b3, i.e., (4, 1).

Let us start by finding the first term, i.e., ∂L/∂b3₁.

We know that

$$L = \frac{1}{4}\sum_{i=1}^{4}\big(y_i - \hat{y}_i\big)^2$$

and

$$\hat{y}_i = \sigma\big(I\_OL_i\big), \qquad I\_OL_i = \sum_{j=1}^{5} w3_{ij}\, O\_H2_j + b3_i$$

So we can write

$$\frac{\partial L}{\partial b3_{1}} = \sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\,\frac{\partial I\_OL_i}{\partial b3_{1}}$$

We also know that only I_OL₁ depends on b3₁, so

$$\frac{\partial I\_OL_i}{\partial b3_{1}} = 0 \quad \text{for } i \neq 1$$

i.e., these terms are 0 (zero). And since ∂I_OL₁/∂b3₁ = 1, we have

$$\frac{\partial L}{\partial b3_{1}} = \frac{-2}{4}\big(y_1 - \hat{y}_1\big)\cdot \sigma'\big(I\_OL_1\big)$$

Like this, we can find every term in grad_b3, and the whole column reduces to

$$\text{grad\_b3} = \text{mse\_grad}(y, \hat{y}) \odot \sigma'\big(I\_OL\big)$$
Again, in a single line, we have calculated the gradients for b3.
grad_w3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) .dot( out_hidden_2.T )   # grad_w3

grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer)                          # grad_b3
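If you want to convince yourself that this one line really gives ∂L/∂w3₁₁ and its friends, you can perturb a single weight and compare against a finite-difference estimate; loss_for below is a small helper written only for this check, not part of the training loop.

def loss_for(w3_test):                              # loss as a function of w3 only
    i_ol = w3_test.dot(out_hidden_2) + b3
    return mse(y, sig(i_ol))

eps = 1e-6
w3_up, w3_down = w3.copy(), w3.copy()
w3_up[0, 0] += eps                                  # perturb w3_11 up and down
w3_down[0, 0] -= eps

numerical = (loss_for(w3_up) - loss_for(w3_down)) / (2 * eps)
print(numerical, grad_w3[0, 0])                     # the two numbers should match closely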
We can develop a trick via a game we will call ‘Jumping Back’.
Suppose we start from the true value ‘y’.
Now we jump back and notice that we have crossed the loss line, so we now have the loss gradient in the gradient variables.
As we jump back again, we cross the activation function line, so the gradient variables will also pick up the activation function derivative.
Now we have reached the weights w3 and biases b3, so the gradients collected so far will be used to update b3.
And, for the weights w3, we will have a dot product with the transpose of whatever we have on the other end of the weights.
Once again, we can see the gradients in Python
grad_w3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) .dot( out_hidden_2.T )   # grad_w3

grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer)                          # grad_b3
Now, let us talk about updating the weights and biases in w2 and b2.
Suppose we can sum all the gradients up to the output of the second hidden layer, i.e., up to w3 and b3, into shape (-1, 1). Then we are in exactly the same situation as just after jumping the loss line.
We will call those gradients ‘error_grad_upto_H2’.
As we jump back, we cross the activation function line, so the gradient variables will also pick up the activation function derivative.
As you can see, we have reached w2 and b2, so the gradients collected so far will be used to update b2.
And, for the weights w2, we will have a dot product with the transpose of whatever we have on the other end of the weights w2.
Now let us try to do this in an analytical way.
The trick is the same. If we can calculate

$$\frac{\partial L}{\partial w2_{11}}$$

then we can update it with SGD like this:

$$w2_{11} \leftarrow w2_{11} - \text{learning rate} \cdot \frac{\partial L}{\partial w2_{11}}$$

Or, for every weight in w2, we can do this:

$$w2_{km} \leftarrow w2_{km} - \text{learning rate} \cdot \frac{\partial L}{\partial w2_{km}}$$

It can be rearranged as

$$W_2 \leftarrow W_2 - \text{learning rate} \cdot \text{grad\_w2}$$

where grad_w2 is the matrix of all the partial derivatives ∂L/∂w2ₖₘ, with the same shape as w2, i.e., (5, 3).

Let us start by finding the first term, i.e., ∂L/∂w2₁₁.

We know that

$$L = \frac{1}{4}\sum_{i=1}^{4}\big(y_i - \hat{y}_i\big)^2$$

and

$$\hat{y}_i = \sigma\big(I\_OL_i\big), \qquad I\_OL_i = \sum_{k=1}^{5} w3_{ik}\, O\_H2_k + b3_i$$

So we can write

$$\frac{\partial L}{\partial w2_{11}} = \sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\,\frac{\partial I\_OL_i}{\partial w2_{11}}$$

We also know that

$$O\_H2_k = \sigma\big(I\_H2_k\big), \qquad I\_H2_k = \sum_{m=1}^{3} w2_{km}\, O\_H1_m + b2_k$$

So, we can write

$$\frac{\partial I\_OL_i}{\partial w2_{11}} = \sum_{k=1}^{5} w3_{ik}\,\sigma'\big(I\_H2_k\big)\,\frac{\partial I\_H2_k}{\partial w2_{11}}$$

We also know that only I_H2₁ depends on w2₁₁, so

$$\frac{\partial I\_H2_k}{\partial w2_{11}} = 0 \quad \text{for } k \neq 1$$

i.e., these terms are 0 (zero). We have

$$\frac{\partial I\_OL_i}{\partial w2_{11}} = w3_{i1}\,\sigma'\big(I\_H2_1\big)\, O\_H1_1$$

From this and the expression for ∂L/∂w2₁₁ above, we have

$$\frac{\partial L}{\partial w2_{11}} = \left[\sum_{i=1}^{4}\frac{-2}{4}\big(y_i - \hat{y}_i\big)\,\sigma'\big(I\_OL_i\big)\, w3_{i1}\right]\cdot \sigma'\big(I\_H2_1\big)\cdot O\_H1_1$$

Like this, we can find every term in grad_w2, but I will not show you the monstrous matrix of grad_w2. We will skip it and I will directly show you the reduced form, which is

$$\text{grad\_w2} = \Big[\text{error\_grad\_upto\_H2} \odot \sigma'\big(I\_H2\big)\Big]\cdot O\_H1^{\,T}$$

You can see that the first term, the sum in square brackets, is from the first portion, error_grad_upto_H2; the second term is from the second portion, σ′(I_H2); and the third or last term is from the last portion, O_H1ᵀ.
Now the question is how to calculate ‘error_grad_upto_H2’, i.e., how to sum all the gradients up to w3 and b3 into shape (-1, 1).

Actually, we already did it when we derived the reduced form of the gradients for w2 and b2. I will show it to you.

Broadcast the loss gradient times the sigmoid derivative, mse_grad(y, ŷ) ⊙ σ′(I_OL), which has shape (4, 1), with the weights w3, which have shape (4, 5). After broadcasting, entry (i, j) of the resulting (4, 5) matrix is

$$\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\, w3_{ij}$$

After taking the sum along axis = 0, we have 5 numbers, and the j-th one is

$$\sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\, w3_{ij}$$

And, after reshaping it to (-1, 1), we have a column of shape (5, 1): this is error_grad_upto_H2.

These reshaped gradients in shape (-1, 1) then take a dot product with the transpose of O_H1, and the first entry of the column is exactly the first term which we calculated for w2₁₁, i.e., the sum in square brackets above.
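In code, this broadcast, sum, and reshape sequence looks like this, written step by step only to expose the shapes (the one-liner in the code block further below computes exactly the same thing):

step1 = mse_grad(y, y_hat) * sig_dash(in_output_layer)       # shape (4, 1)
step2 = step1 * w3                                            # broadcast to shape (4, 5)
step3 = np.sum(step2, axis = 0)                               # shape (5,)
error_grad_upto_H2 = step3.reshape((-1, 1))                   # shape (5, 1)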
Like this, we can also calculate the gradients for b2, but there we will simply use the rules of the game we developed, ‘Jumping Back’, since we have already seen how it works.
So, the gradients for w2 and b2 are
error_grad_upto_H2 = np.sum(mse_grad(y, y_hat) * sig_dash(in_output_layer) * w3,
                            axis = 0).reshape((-1, 1))                        # error grad upto H2

grad_w2 = error_grad_upto_H2 * sig_dash(in_hidden_2) .dot( out_hidden_1.T )   # grad w2

grad_b2 = error_grad_upto_H2 * sig_dash(in_hidden_2)                          # grad b2
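The same finite-difference trick from before also works here if you want to verify grad_w2; loss_for_w2 is again just a throwaway helper written only for this check.

def loss_for_w2(w2_test):                           # loss as a function of w2 only
    i_h2 = w2_test.dot(out_hidden_1) + b2
    i_ol = w3.dot(sig(i_h2)) + b3
    return mse(y, sig(i_ol))

eps = 1e-6
w2_up, w2_down = w2.copy(), w2.copy()
w2_up[0, 0] += eps                                  # perturb w2_11 up and down
w2_down[0, 0] -= eps
print((loss_for_w2(w2_up) - loss_for_w2(w2_down)) / (2 * eps), grad_w2[0, 0])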
Now we will calculate the gradients for w1 and b1, and for that, we will go through every step of the game ‘Jumping Back’.
But first, let us state the rules of the game ‘Jumping Back’:
Rule 1 - If we cross a line, then we have to include its gradient in the gradient variables
Rule 2 - After a jump, if the shape of the gradients is not (-1, 1), then we take the sum along axis = 0 and reshape it to (-1, 1)
We can see that Rule 2 is used whenever we jump back across the weights. We have already used Rule 2 above, and the short sketch below shows both rules written as a single loop.
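Here is a minimal sketch of the two rules as one loop over the layers of this particular network. The layers list and the loop are illustrative only (the post keeps everything explicit), but they compute exactly the gradients we derive step by step below.

# Illustrative only: the 'Jumping Back' rules as one loop, going from the
# output layer back to the first hidden layer.
# Each tuple: (weights, pre-activation, layer input, activation derivative).
layers = [
    (w3, in_output_layer, out_hidden_2, sig_dash),
    (w2, in_hidden_2,     out_hidden_1, sig_dash),
    (w1, in_hidden_1,     x,            lambda z: relu_dash(z, leak = 0.1)),
]

grad = mse_grad(y, y_hat)                          # Rule 1: we crossed the loss line
gradients = []
for w, z, layer_in, act_dash in layers:
    grad = grad * act_dash(z)                      # Rule 1: we crossed the activation line
    gradients.append((grad.dot(layer_in.T), grad))           # (grad_w, grad_b) for this layer
    grad = np.sum(grad * w, axis = 0).reshape((-1, 1))       # Rule 2: we crossed the weights

# gradients[0] matches (grad_w3, grad_b3), gradients[1] matches (grad_w2, grad_b2),
# gradients[2] matches (grad_w1, grad_b1); the final Rule 2 step is computed but unused.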
So, let us now calculate the gradients for w1 and b1. We will start from the beginning for better understanding.
We start from the true value ‘y’.
As we jump back, we will have the loss gradient in our gradient variables because we cross the loss line.
Now we jump back crossing the sigmoid line, so the gradients will now also have the sigmoid derivative.
Now we have reached w3 and b3, so the gradients collected so far are used to update b3,
and for the weights w3, we will have a dot product with the transpose of whatever we have on the other end of the weights.
grad_w3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) .dot( out_hidden_2.T )   # grad_w3

grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer)                          # grad_b3
Now, if we jump back, we are crossing the weights w3. We will sum all the gradients up to here into shape (-1, 1) after broadcasting the gradients up to I_OL with the weights w3.
As we jump back again, we cross the sigmoid line, so the gradients will have the sigmoid derivative.
Now we have reached w2 and b2, so the gradients collected so far will be used to update b2.
And, for the weights w2, we will have a dot product with the transpose of whatever we have on the other end of the weights.
error_grad_upto_H2 = np.sum(mse_grad(y, y_hat) * sig_dash(in_output_layer) * w3,
                            axis = 0).reshape((-1, 1))                        # error grad upto H2

grad_w2 = error_grad_upto_H2 * sig_dash(in_hidden_2) .dot( out_hidden_1.T )   # grad w2

grad_b2 = error_grad_upto_H2 * sig_dash(in_hidden_2)                          # grad b2
Now, if we jump back, we are crossing the weights w2. We will sum all the gradients up to here into shape (-1, 1) after broadcasting the gradients up to I_H2 with the weights w2.
Now, if we jump back again, we cross the ReLU activation line, so the gradients will have the ReLU derivative in them.
Now we have reached w1 and b1, so the gradients collected so far are used to update b1,
and for the weights w1, we will have a dot product with the transpose of whatever we have on the other end of the weights.
error_grad_upto_H1 = np.sum(error_grad_upto_H2 * sig_dash(in_hidden_2) * w2,
                            axis = 0).reshape((-1, 1))                           # error grad upto H1

grad_w1 = error_grad_upto_H1 * relu_dash(in_hidden_1, leak = 0.1) .dot( x.T )    # grad w1

grad_b1 = error_grad_upto_H1 * relu_dash(in_hidden_1, leak = 0.1)                # grad b1
Step 3.3 — Using SGD Optimizer to update the weights and biases
update_w1 = - learning_rate * grad_w1
w1 += update_w1 # w1
update_b1 = - learning_rate * grad_b1
b1 += update_b1 # b1
update_w2 = - learning_rate * grad_w2
w2 += update_w2 # w2
update_b2 = - learning_rate * grad_b2
b2 += update_b2 # b2
update_w3 = - learning_rate * grad_w3
w3 += update_w3 # w3
update_b3 = - learning_rate * grad_b3
b3 += update_b3 # b3
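If you prefer, the six pairs of update lines can also be written as one small loop; the in-place `-=` modifies the same NumPy arrays, so it is equivalent to the updates above.

params    = [w1, b1, w2, b2, w3, b3]
gradients = [grad_w1, grad_b1, grad_w2, grad_b2, grad_w3, grad_b3]
for p, g in zip(params, gradients):
    p -= learning_rate * g          # in-place SGD step: p = p - learning_rate * grad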
This training procedure will run epochs times, i.e., 10,000 times, because epochs = 10000. In the notebook output, you can watch the printed loss shrinking epoch after epoch as the training runs.
Step 4 — A forward feed to verify that the loss is reduced and to see how close predicted values are to true values
in_hidden_1 = w1.dot(x) + b1                    # forward feed
out_hidden_1 = relu(in_hidden_1, leak = 0.1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = sig(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = sig(in_output_layer)

y_hat                                           # predicted values
y                                               # true values
mse(y, y_hat)                                   # loss
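To see how close we got, you can also print the true and predicted values side by side:

for true_val, pred_val in zip(y.ravel(), y_hat.ravel()):
    print(f'true: {true_val:.2f}    predicted: {pred_val:.4f}')

print('loss after training:', mse(y, y_hat))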
And finally, we have learned how to implement Backpropagation. I hope you now understand it.
So, you can now apply Backpropagation in your Neural Networks with any number of layers, any number of nodes, and any activation function except Softmax.
Things are different with the Softmax function, which we will see in the next post.
If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.
I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.
Continue to the next post — 5.2.2 Backpropagation in ANNs — Part 2.