Backpropagation — Made super easy for you, Part 1
Step by step implementation in Python
This is the post you all have been waiting for. In this post, we will go through Backpropagation, often considered the most complex part of Deep Learning, yet it becomes very simple when done in an organized manner. After this post, you will never look at Backpropagation the same way again. I guarantee it.
At the end of this post, we will learn to calculate gradients via a game called ‘Jumping Back’, which has a few rules. This game will help you visualize how Backpropagation works and how simple and easy it actually is.
You can download the Jupyter Notebook from here.
Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.
5.2.1 Backpropagation in ANNs — Part 1
In this post, we will learn how to use Backpropagation to calculate gradients which we will use to update weights and biases to reduce the loss via some Optimizer.
Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.
Before going forward, a few things first (the formulas are written out just below):
First, the activation function for the first hidden layer is ReLU with leak = 0.1
Second, the activation function for the second hidden layer and the output layer is the Sigmoid function
Third, the loss function used is Mean Squared Error (MSE)
Fourth, we will use the SGD Optimizer with learning rate = 0.01
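For reference, these four choices correspond to the following formulas (the code versions appear in Step 1):

$$\text{ReLU}(z) = \begin{cases} z, & z > 0 \\ 0.1\,z, & z \le 0 \end{cases} \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$$

$$L_{\text{MSE}} = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - \hat{y}_i\big)^2 \qquad w \leftarrow w - 0.01 \cdot \frac{\partial L}{\partial w}$$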
Now, let us look at the steps we will follow:
Step 1 - A forward feed, like we did in the previous post
Step 2 - Initializing the SGD Optimizer
Step 3 - Entering the training loop
Step 3.1 - A forward feed to see the loss before training
Step 3.2 - Using Backpropagation to calculate gradients
Step 3.3 - Using the SGD Optimizer to update the weights and biases
Step 4 - A forward feed to verify that the loss has been reduced and to see how close the predicted values are to the true values
Let us do it in Python.
Step 1 — A forward feed like we did in the previous post
import numpy as np                   # importing NumPy
np.random.seed(42)

input_nodes = 5                      # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                    # Inputs

y = np.random.randint(1, 100, size = (output_nodes, 1)) / 100
y                                    # Outputs
This time along with the activation functions and the loss function, we will also define their derivatives.
def relu(x, leak = 0):                      # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                 # ReLU derivative
    return np.where(x <= 0, leak, 1)

def sig(x):                                 # Sigmoid
    return 1/(1 + np.exp(-x))

def sig_dash(x):                            # Sigmoid derivative
    return sig(x) * (1 - sig(x))

def mse(y_true, y_pred):                    # MSE
    return np.mean((y_true - y_pred)**2)

def mse_grad(y_true, y_pred):               # MSE derivative
    N = y_true.shape[0]
    return -2*(y_true - y_pred)/N
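As an optional sanity check, we can compare each analytical derivative against a numerical finite-difference estimate; eps and the test points below are arbitrary values chosen only for this illustration.

eps = 1e-6                                   # small step for finite differences
t = np.array([[0.3], [-0.7], [1.2]])         # arbitrary test points

num_sig_dash = (sig(t + eps) - sig(t - eps)) / (2 * eps)
print(np.allclose(num_sig_dash, sig_dash(t)))                  # True

num_relu_dash = (relu(t + eps, leak = 0.1) - relu(t - eps, leak = 0.1)) / (2 * eps)
print(np.allclose(num_relu_dash, relu_dash(t, leak = 0.1)))    # True

y_true = np.array([[0.2], [0.9]])
y_pred = np.array([[0.5], [0.4]])
num_mse_grad = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):             # perturb one prediction at a time
    up, down = y_pred.copy(), y_pred.copy()
    up[i] += eps
    down[i] -= eps
    num_mse_grad[i] = (mse(y_true, up) - mse(y_true, down)) / (2 * eps)
print(np.allclose(num_mse_grad, mse_grad(y_true, y_pred)))     # True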
w1 = np.random.random(size = (hidden_1_nodes, input_nodes))      # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                       # b1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))   # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                       # b2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes))     # w3
b3 = np.zeros(shape = (output_nodes, 1))                         # b3

in_hidden_1 = w1.dot(x) + b1                                     # forward feed
out_hidden_1 = relu(in_hidden_1, leak = 0.1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = sig(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = sig(in_output_layer)

y_hat                                                            # y_hat
y                                                                # y
mse(y, y_hat)                                                    # MSE loss
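To keep the matrix shapes straight (they will matter a lot during Backpropagation), here is a quick optional check of what we expect at each stage:

print(w1.shape, b1.shape)              # (3, 5) (3, 1)
print(w2.shape, b2.shape)              # (5, 3) (5, 1)
print(w3.shape, b3.shape)              # (4, 5) (4, 1)
print(out_hidden_1.shape)              # (3, 1)
print(out_hidden_2.shape)              # (5, 1)
print(y_hat.shape, y.shape)            # (4, 1) (4, 1)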
Step 2 — Initializing SGD Optimizer
learning_rate = 0.01
Step 3 — Entering the training loop
We will call the training loops ‘epochs’. We will have 10,000 training loops.
epochs = 10000
Step 3.1 — A forward feed to see loss before training
We will print the loss before training in every epoch to see that it keeps decreasing from one epoch to the next.
for epoch in range(epochs):

    #----------------------Forward Propagation--------------------------
    in_hidden_1 = w1.dot(x) + b1
    out_hidden_1 = relu(in_hidden_1, leak = 0.1)
    in_hidden_2 = w2.dot(out_hidden_1) + b2
    out_hidden_2 = sig(in_hidden_2)
    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = sig(in_output_layer)

    loss = mse(y, y_hat)
    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')
Step 3.2 — Calculating gradients via Backpropagation
Now, the question is how to update the weights and biases.

For example, take the weight w3₁₁. If we can calculate

$$\frac{\partial L}{\partial w3_{11}}$$

then we can use the SGD Optimizer to update it like this:

$$w3_{11} \leftarrow w3_{11} - \text{learning rate} \cdot \frac{\partial L}{\partial w3_{11}}$$

Or we can do better. For every weight in w3, we can do this:

$$w3_{ij} \leftarrow w3_{ij} - \text{learning rate} \cdot \frac{\partial L}{\partial w3_{ij}}$$

It can be rearranged as

$$W_3 \leftarrow W_3 - \text{learning rate} \cdot \text{grad\_w3}$$

where grad_w3 is the matrix of all the partial derivatives ∂L/∂w3ᵢⱼ, with the same shape as w3, i.e., (4, 5).

Let us start by finding the first term, i.e., ∂L/∂w3₁₁.

We know that

$$L = \frac{1}{4}\sum_{i=1}^{4}\big(y_i - \hat{y}_i\big)^2$$

and

$$\hat{y}_i = \sigma\big(I\_OL_i\big), \qquad I\_OL_i = \sum_{j=1}^{5} w3_{ij}\, O\_H2_j + b3_i$$

So we can write

$$\frac{\partial L}{\partial w3_{11}} = \sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\,\frac{\partial I\_OL_i}{\partial w3_{11}}$$

We also know that only I_OL₁ depends on w3₁₁, so

$$\frac{\partial I\_OL_i}{\partial w3_{11}} = 0 \quad \text{for } i \neq 1$$

i.e., these terms are 0 (zero). We are left with

$$\frac{\partial L}{\partial w3_{11}} = \frac{\partial L}{\partial \hat{y}_1}\,\sigma'\big(I\_OL_1\big)\,\frac{\partial I\_OL_1}{\partial w3_{11}}$$

or

$$\frac{\partial L}{\partial w3_{11}} = \frac{-2}{4}\big(y_1 - \hat{y}_1\big)\cdot \sigma'\big(I\_OL_1\big)\cdot O\_H2_1$$

Like this, we can find every term in grad_w3, and the whole matrix reduces to

$$\text{grad\_w3} = \Big[\text{mse\_grad}(y, \hat{y}) \odot \sigma'\big(I\_OL\big)\Big]\cdot O\_H2^{\,T}$$

Note — here ⊙ denotes element-wise multiplication and · the dot product (drawn as a circle with a dot in the figures).
In a single line, we have calculated all the gradients for weights in w3. Backpropagation is very simple when done in an organized fashion.
Similarly, we can calculate the gradients for b3. For every bias in b3, we can do this:

$$b3_{i} \leftarrow b3_{i} - \text{learning rate} \cdot \frac{\partial L}{\partial b3_{i}}$$

It can be rearranged as

$$B_3 \leftarrow B_3 - \text{learning rate} \cdot \text{grad\_b3}$$

where grad_b3 is the column of partial derivatives ∂L/∂b3ᵢ, with the same shape as b3, i.e., (4, 1).

Let us start by finding the first term, i.e., ∂L/∂b3₁.

We know that

$$L = \frac{1}{4}\sum_{i=1}^{4}\big(y_i - \hat{y}_i\big)^2$$

and

$$\hat{y}_i = \sigma\big(I\_OL_i\big), \qquad I\_OL_i = \sum_{j=1}^{5} w3_{ij}\, O\_H2_j + b3_i$$

So we can write

$$\frac{\partial L}{\partial b3_{1}} = \sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\,\frac{\partial I\_OL_i}{\partial b3_{1}}$$

We also know that only I_OL₁ depends on b3₁, so

$$\frac{\partial I\_OL_i}{\partial b3_{1}} = 0 \quad \text{for } i \neq 1$$

i.e., these terms are 0 (zero). And since ∂I_OL₁/∂b3₁ = 1, we have

$$\frac{\partial L}{\partial b3_{1}} = \frac{-2}{4}\big(y_1 - \hat{y}_1\big)\cdot \sigma'\big(I\_OL_1\big)$$

Like this, we can find every term in grad_b3, and the whole column reduces to

$$\text{grad\_b3} = \text{mse\_grad}(y, \hat{y}) \odot \sigma'\big(I\_OL\big)$$
Again, in a single line, we have calculated the gradients for b3.
grad_w3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) .dot( out_hidden_2.T )   # grad_w3

grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer)                          # grad_b3
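If you want to convince yourself that this one line really gives ∂L/∂w3₁₁ and its friends, you can perturb a single weight and compare against a finite-difference estimate; loss_for below is a small helper written only for this check, not part of the training loop.

def loss_for(w3_test):                              # loss as a function of w3 only
    i_ol = w3_test.dot(out_hidden_2) + b3
    return mse(y, sig(i_ol))

eps = 1e-6
w3_up, w3_down = w3.copy(), w3.copy()
w3_up[0, 0] += eps                                  # perturb w3_11 up and down
w3_down[0, 0] -= eps

numerical = (loss_for(w3_up) - loss_for(w3_down)) / (2 * eps)
print(numerical, grad_w3[0, 0])                     # the two numbers should match closely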
We can develop a trick via a game we will call ‘Jumping Back’.
Suppose we start from the true value ‘y’.
Now we jump back and notice that we have crossed the loss line, so we now have the loss gradient in the gradient variables.
As we jump back again, we cross the activation function line, so the gradient variables will also pick up the activation function derivative.
Now we have reached the weights w3 and biases b3, so the gradients collected so far will be used to update b3.
And, for the weights w3, we will have a dot product with the transpose of whatever we have on the other end of the weights.
Once again, we can see the gradients in Python
grad_w3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) .dot( out_hidden_2.T )   # grad_w3

grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer)                          # grad_b3
Now, let us talk about updating the weights and biases in w2 and b2.
Suppose we can sum all the gradients up to the output of the second hidden layer, i.e., up to w3 and b3, into shape (-1, 1). Then we are in exactly the same situation as just after jumping the loss line.
We will call those gradients ‘error_grad_upto_H2’.
As we jump back, we cross the activation function line, so the gradient variables will also pick up the activation function derivative.
As you can see, we have reached w2 and b2, so the gradients collected so far will be used to update b2.
And, for the weights w2, we will have a dot product with the transpose of whatever we have on the other end of the weights w2.
Now let us try to do this in an analytical way.
The trick is the same. If we can calculate

$$\frac{\partial L}{\partial w2_{11}}$$

then we can update it with SGD like this:

$$w2_{11} \leftarrow w2_{11} - \text{learning rate} \cdot \frac{\partial L}{\partial w2_{11}}$$

Or, for every weight in w2, we can do this:

$$w2_{km} \leftarrow w2_{km} - \text{learning rate} \cdot \frac{\partial L}{\partial w2_{km}}$$

It can be rearranged as

$$W_2 \leftarrow W_2 - \text{learning rate} \cdot \text{grad\_w2}$$

where grad_w2 is the matrix of all the partial derivatives ∂L/∂w2ₖₘ, with the same shape as w2, i.e., (5, 3).

Let us start by finding the first term, i.e., ∂L/∂w2₁₁.

We know that

$$L = \frac{1}{4}\sum_{i=1}^{4}\big(y_i - \hat{y}_i\big)^2$$

and

$$\hat{y}_i = \sigma\big(I\_OL_i\big), \qquad I\_OL_i = \sum_{k=1}^{5} w3_{ik}\, O\_H2_k + b3_i$$

So we can write

$$\frac{\partial L}{\partial w2_{11}} = \sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\,\frac{\partial I\_OL_i}{\partial w2_{11}}$$

We also know that

$$O\_H2_k = \sigma\big(I\_H2_k\big), \qquad I\_H2_k = \sum_{m=1}^{3} w2_{km}\, O\_H1_m + b2_k$$

So, we can write

$$\frac{\partial I\_OL_i}{\partial w2_{11}} = \sum_{k=1}^{5} w3_{ik}\,\sigma'\big(I\_H2_k\big)\,\frac{\partial I\_H2_k}{\partial w2_{11}}$$

We also know that only I_H2₁ depends on w2₁₁, so

$$\frac{\partial I\_H2_k}{\partial w2_{11}} = 0 \quad \text{for } k \neq 1$$

i.e., these terms are 0 (zero). We have

$$\frac{\partial I\_OL_i}{\partial w2_{11}} = w3_{i1}\,\sigma'\big(I\_H2_1\big)\, O\_H1_1$$

From this and the expression for ∂L/∂w2₁₁ above, we have

$$\frac{\partial L}{\partial w2_{11}} = \left[\sum_{i=1}^{4}\frac{-2}{4}\big(y_i - \hat{y}_i\big)\,\sigma'\big(I\_OL_i\big)\, w3_{i1}\right]\cdot \sigma'\big(I\_H2_1\big)\cdot O\_H1_1$$

Like this, we can find every term in grad_w2, but I will not show you the monstrous matrix of grad_w2. We will skip it and I will directly show you the reduced form, which is

$$\text{grad\_w2} = \Big[\text{error\_grad\_upto\_H2} \odot \sigma'\big(I\_H2\big)\Big]\cdot O\_H1^{\,T}$$

You can see that the first term, the sum in square brackets, is from the first portion, error_grad_upto_H2; the second term is from the second portion, σ′(I_H2); and the third or last term is from the last portion, O_H1ᵀ.
Now the question is how to calculate ‘error_grad_upto_H2’, i.e., how to sum all the gradients up to w3 and b3 into shape (-1, 1).

Actually, we already did it when we derived the reduced form of the gradients for w2 and b2. I will show it to you.

Broadcast the loss gradient times the sigmoid derivative, mse_grad(y, ŷ) ⊙ σ′(I_OL), which has shape (4, 1), with the weights w3, which have shape (4, 5). After broadcasting, entry (i, j) of the resulting (4, 5) matrix is

$$\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\, w3_{ij}$$

After taking the sum along axis = 0, we have 5 numbers, and the j-th one is

$$\sum_{i=1}^{4}\frac{\partial L}{\partial \hat{y}_i}\,\sigma'\big(I\_OL_i\big)\, w3_{ij}$$

And, after reshaping it to (-1, 1), we have a column of shape (5, 1): this is error_grad_upto_H2.

These reshaped gradients in shape (-1, 1) then take a dot product with the transpose of O_H1, and the first entry of the column is exactly the first term which we calculated for w2₁₁, i.e., the sum in square brackets above.
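In code, this broadcast, sum, and reshape sequence looks like this, written step by step only to expose the shapes (the one-liner in the code block further below computes exactly the same thing):

step1 = mse_grad(y, y_hat) * sig_dash(in_output_layer)       # shape (4, 1)
step2 = step1 * w3                                            # broadcast to shape (4, 5)
step3 = np.sum(step2, axis = 0)                               # shape (5,)
error_grad_upto_H2 = step3.reshape((-1, 1))                   # shape (5, 1)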
Like this, we can also calculate the gradients for b2, but there we will simply use the rules of the game we developed, ‘Jumping Back’, since we have already seen how it works.
So, the gradients for w2 and b2 are
error_grad_upto_H2 = np.sum(mse_grad(y, y_hat) * sig_dash(in_output_layer) * w3,
                            axis = 0).reshape((-1, 1))                        # error grad upto H2

grad_w2 = error_grad_upto_H2 * sig_dash(in_hidden_2) .dot( out_hidden_1.T )   # grad w2

grad_b2 = error_grad_upto_H2 * sig_dash(in_hidden_2)                          # grad b2
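The same finite-difference trick from before also works here if you want to verify grad_w2; loss_for_w2 is again just a throwaway helper written only for this check.

def loss_for_w2(w2_test):                           # loss as a function of w2 only
    i_h2 = w2_test.dot(out_hidden_1) + b2
    i_ol = w3.dot(sig(i_h2)) + b3
    return mse(y, sig(i_ol))

eps = 1e-6
w2_up, w2_down = w2.copy(), w2.copy()
w2_up[0, 0] += eps                                  # perturb w2_11 up and down
w2_down[0, 0] -= eps
print((loss_for_w2(w2_up) - loss_for_w2(w2_down)) / (2 * eps), grad_w2[0, 0])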
Now we will calculate the gradients for w1 and b1, and for that, we will go through every step of the game ‘Jumping Back’.
But first, let us state the rules of the game ‘Jumping Back’:
Rule 1 - If we cross a line, then we have to include its gradient in the gradient variables
Rule 2 - After a jump, if the shape of the gradients is not (-1, 1), then we take the sum along axis = 0 and reshape it to (-1, 1)
We can see that Rule 2 is used whenever we jump back across the weights. We have already used Rule 2 above, and the short sketch below shows both rules written as a single loop.
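Here is a minimal sketch of the two rules as one loop over the layers of this particular network. The layers list and the loop are illustrative only (the post keeps everything explicit), but they compute exactly the gradients we derive step by step below.

# Illustrative only: the 'Jumping Back' rules as one loop, going from the
# output layer back to the first hidden layer.
# Each tuple: (weights, pre-activation, layer input, activation derivative).
layers = [
    (w3, in_output_layer, out_hidden_2, sig_dash),
    (w2, in_hidden_2,     out_hidden_1, sig_dash),
    (w1, in_hidden_1,     x,            lambda z: relu_dash(z, leak = 0.1)),
]

grad = mse_grad(y, y_hat)                          # Rule 1: we crossed the loss line
gradients = []
for w, z, layer_in, act_dash in layers:
    grad = grad * act_dash(z)                      # Rule 1: we crossed the activation line
    gradients.append((grad.dot(layer_in.T), grad))           # (grad_w, grad_b) for this layer
    grad = np.sum(grad * w, axis = 0).reshape((-1, 1))       # Rule 2: we crossed the weights

# gradients[0] matches (grad_w3, grad_b3), gradients[1] matches (grad_w2, grad_b2),
# gradients[2] matches (grad_w1, grad_b1); the final Rule 2 step is computed but unused.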
So, let us now calculate the gradients for w1 and b1. We will start from the beginning for better understanding.
We start from the true value ‘y’.
As we jump back, we will have the loss gradient in our gradient variables because we cross the loss line.
Now we jump back crossing the sigmoid line, so the gradients will now also have the sigmoid derivative.
Now we have reached w3 and b3, so the gradients collected so far are used to update b3,
and for the weights w3, we will have a dot product with the transpose of whatever we have on the other end of the weights.
grad_w3 = mse_grad(y, y_hat) * sig_dash(in_output_layer) .dot( out_hidden_2.T )   # grad_w3

grad_b3 = mse_grad(y, y_hat) * sig_dash(in_output_layer)                          # grad_b3
Now, if we jump back, we are crossing the weights w3. We will sum all the gradients up to here into shape (-1, 1) after broadcasting the gradients up to I_OL with the weights w3.
As we jump back again, we cross the sigmoid line, so the gradients will have the sigmoid derivative.
Now we have reached w2 and b2, so the gradients collected so far will be used to update b2.
And, for the weights w2, we will have a dot product with the transpose of whatever we have on the other end of the weights.
error_grad_upto_H2 = np.sum(mse_grad(y, y_hat) * sig_dash(in_output_layer) * w3,
                            axis = 0).reshape((-1, 1))                        # error grad upto H2

grad_w2 = error_grad_upto_H2 * sig_dash(in_hidden_2) .dot( out_hidden_1.T )   # grad w2

grad_b2 = error_grad_upto_H2 * sig_dash(in_hidden_2)                          # grad b2
Now, if we jump back, we are crossing the weights w2. We will sum all the gradients up to here into shape (-1, 1) after broadcasting the gradients up to I_H2 with the weights w2.
Now, if we jump back again, we cross the ReLU activation line, so the gradients will have the ReLU derivative in them.
Now we have reached w1 and b1, so the gradients collected so far are used to update b1,
and for the weights w1, we will have a dot product with the transpose of whatever we have on the other end of the weights.
error_grad_upto_H1 = np.sum(error_grad_upto_H2 * sig_dash(in_hidden_2) * w2,
                            axis = 0).reshape((-1, 1))                           # error grad upto H1

grad_w1 = error_grad_upto_H1 * relu_dash(in_hidden_1, leak = 0.1) .dot( x.T )    # grad w1

grad_b1 = error_grad_upto_H1 * relu_dash(in_hidden_1, leak = 0.1)                # grad b1
Step 3.3 — Using SGD Optimizer to update the weights and biases
update_w1 = - learning_rate * grad_w1
w1 += update_w1 # w1
update_b1 = - learning_rate * grad_b1
b1 += update_b1 # b1
update_w2 = - learning_rate * grad_w2
w2 += update_w2 # w2
update_b2 = - learning_rate * grad_b2
b2 += update_b2 # b2
update_w3 = - learning_rate * grad_w3
w3 += update_w3 # w3
update_b3 = - learning_rate * grad_b3
b3 += update_b3 # b3
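If you prefer, the six pairs of update lines can also be written as one small loop; the in-place `-=` modifies the same NumPy arrays, so it is equivalent to the updates above.

params    = [w1, b1, w2, b2, w3, b3]
gradients = [grad_w1, grad_b1, grad_w2, grad_b2, grad_w3, grad_b3]
for p, g in zip(params, gradients):
    p -= learning_rate * g          # in-place SGD step: p = p - learning_rate * grad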
This training procedure will run epochs times, i.e., 10,000 times, because epochs = 10000. In the notebook output, you can watch the printed loss shrinking epoch after epoch as the training runs.
Step 4 — A forward feed to verify that the loss is reduced and to see how close predicted values are to true values
in_hidden_1 = w1.dot(x) + b1                    # forward feed
out_hidden_1 = relu(in_hidden_1, leak = 0.1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = sig(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = sig(in_output_layer)

y_hat                                           # predicted values
y                                               # true values
mse(y, y_hat)                                   # loss
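To see how close we got, you can also print the true and predicted values side by side:

for true_val, pred_val in zip(y.ravel(), y_hat.ravel()):
    print(f'true: {true_val:.2f}    predicted: {pred_val:.4f}')

print('loss after training:', mse(y, y_hat))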
And finally, we have learned how to implement Backpropagation. I hope you now understand it.
So, you can now apply Backpropagation in your Neural Networks with any number of layers, any number of nodes, and any activation function except Softmax.
Things are different with the Softmax function, which we will see in the next post.
If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.
I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.
Continue to the next post — 5.2.2 Backpropagation in ANNs — Part 2.