Backpropagation — Made super easy for you, Part 2
Step by step implementation in Python
In this post, we will go through Backpropagation with the Softmax function. By the end of this post, you will know how to implement Backpropagation with every activation function and every loss function we have seen in this course.
You can download the Jupyter Notebook from here.
Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.
5.2.2 Backpropagation in ANNs — Part 2
Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.
First, the activation function for the first hidden layer is the Sigmoid function.
Second, the activation function for the second hidden layer and the output layer is the Softmax function.
Third, the loss function used is Categorical cross-entropy loss (CE).
Fourth, we will use the SGD with Momentum optimizer with learning rate = 0.01 and momentum = 0.9.
Note — It is not recommended to use the Softmax function in hidden layers, because it makes the nodes of that layer linearly dependent. But, to show that Backpropagation is no big deal even in such a case, we are using it in the second hidden layer.
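Why linearly dependent? Softmax outputs always sum to 1, so any one node of such a layer is completely determined by the others:

\sum_i \mathrm{softmax}(x)_i = \sum_i \frac{e^{x_i}}{\sum_j e^{x_j}} = 1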
Now, let us look at the steps which we will follow here.
Step 1 - A forward feed like we did in the previous post
Step 2 - Initializing SGD with Momentum Optimizer
Step 3 - Entering the training loop
Step 3.1 - A forward feed to see loss before training
Step 3.2 - Using Backpropagation to calculate gradients
Step 3.3 - Using SGD with Momentum Optimizer to update weights and biases
Step 4 - A forward feed to verify that the loss has been reduced and to see how close predicted values are to true values
Let us do it in Python.
Step 1 — A forward feed like we did in the previous post
import numpy as np                                # importing NumPy
np.random.seed(42)

input_nodes = 5                                   # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                                 # Inputs

y = np.array([[0], [1], [0], [0]])
y                                                 # Outputs
def sig(x):                                       # Sigmoid
    return 1/(1 + np.exp(-x))

def sig_dash(x):                                  # Sigmoid derivative
    return sig(x) * (1 - sig(x))

def softmax(x):                                   # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                              # Softmax derivative (Jacobian)
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def cross_E(y_true, y_pred):                      # CE loss
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):                 # CE loss derivative
    return -y_true/(y_pred + 10**-100)
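To connect softmax_dash with the math: if s = softmax(x), the matrix it returns is the Softmax Jacobian,

J_{ij} = \frac{\partial s_i}{\partial x_j} = s_i\,(\delta_{ij} - s_j)

which is exactly what softmax(x) * (I - softmax(x).T) builds through NumPy broadcasting.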
w1 = np.random.random(size = (hidden_1_nodes, input_nodes))     # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                      # b1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))  # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                      # b2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes))    # w3
b3 = np.zeros(shape = (output_nodes, 1))                        # b3

in_hidden_1 = w1.dot(x) + b1                      # forward feed
out_hidden_1 = sig(in_hidden_1)

in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = softmax(in_hidden_2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                             # y_hat
y                                                 # y
cross_E(y, y_hat)                                 # CE loss
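If you want a quick optional check of the dimensions before moving on, printing the shapes of the intermediate arrays should give exactly the layer sizes listed above:

print(x.shape)                # (5, 1) inputs
print(out_hidden_1.shape)     # (3, 1) first hidden layer output
print(out_hidden_2.shape)     # (5, 1) second hidden layer output
print(y_hat.shape)            # (4, 1) predicted output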
Step 2 — Initializing SGD with Momentum Optimizer
learning_rate = 0.01                              # learning rate
momentum = 0.9                                    # momentum

update_w1 = np.zeros(w1.shape)                    # Initializing updates with 0
update_b1 = np.zeros(b1.shape)
update_w2 = np.zeros(w2.shape)
update_b2 = np.zeros(b2.shape)
update_w3 = np.zeros(w3.shape)
update_b3 = np.zeros(b3.shape)
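These zero buffers feed the update rule we will apply in Step 3.3, which for each parameter w is

update \leftarrow -\,\eta\,\nabla_w L \;+\; \beta \cdot update, \qquad w \leftarrow w + update

with learning rate η = 0.01 and momentum β = 0.9. Starting every update buffer at zero means the very first step is plain SGD.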
Step 3 — Entering the training loop
epochs = 1000
Step 3.1 — A forward feed to see loss before training
We will print the loss at the start of every epoch to see that it is decreasing as training progresses.
for epoch in range(epochs):

    #----------------------Forward Propagation--------------------------

    in_hidden_1 = w1.dot(x) + b1
    out_hidden_1 = sig(in_hidden_1)

    in_hidden_2 = w2.dot(out_hidden_1) + b2
    out_hidden_2 = softmax(in_hidden_2)

    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = softmax(in_output_layer)

    loss = cross_E(y, y_hat)

    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')
Step 3.2 — Calculating gradients via Backpropagation
You must be wondering what is different in Backpropagation when we use the Softmax function. Everything is the same, but there are a few extra terms. Let us see them by doing exactly what we did in the previous post.
We will start by talking about weight w3₁₁.
We have to calculate
Then we can update it with the SGD with Momentum optimizer, or update every weight in w3 in matrix form as we did in the previous post.
We know that,
and,
So, we can write
We also know that,
This thing is different from the previous post.
So, we can write
We also know that,
So these terms are 0 (Zero)
We have
or,
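Putting the steps above together in one place (a sketch using the notation from the forward feed, where in_OL = w3 · O_H2 + b3 and ŷ = softmax(in_OL)):

\frac{\partial CE}{\partial w3_{11}}
= \sum_{k=1}^{4} \frac{\partial CE}{\partial \hat{y}_k}\,
  \frac{\partial \hat{y}_k}{\partial in\_OL_{1}}\,
  \frac{\partial in\_OL_{1}}{\partial w3_{11}}
= \left( \sum_{k=1}^{4} \frac{\partial CE}{\partial \hat{y}_k}\, \hat{y}_k(\delta_{k1} - \hat{y}_1) \right) O\_H2_{1}

The "extra terms" are the sum over k: unlike the Sigmoid case, every output of the Softmax depends on every input, so the off-diagonal entries of the Jacobian also contribute.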
Like this, we can find every term in grad_w3
The matrix will be very big, but I will show you it in reduced form.
Now you may be wondering what 'error_upto_softmax' is. If you remember Rule number 2 of the game 'Jumping Back', we have to take the sum along axis = 0 and then reshape it to (-1, 1) if the gradients after the jump are not in shape (-1, 1).
After crossing the Softmax line, our gradients are not in the shape (-1, 1). That is because of the definition of the Softmax derivative, or Jacobian.
So, here is 'error_upto_softmax'
You can try to broadcast the loss gradient with the first column of the Softmax Jacobian and then sum it. It will be the same.
Let us see it, though in a somewhat cluttered form. After broadcasting, we have this
After taking sum we will have something like this
And, after reshaping it to (-1, 1) we will have this
These reshaped gradients in shape (-1, 1) will have a dot product with the transpose of O_H2, and the term in the blue box is the big term which we calculated for w3₁₁, i.e., the sum of the loss gradient broadcast with the first column of the Softmax Jacobian.
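If you want to verify that claim numerically, a small optional check like this (it only uses the forward-feed values we already have) should print the same number twice:

jacobian = softmax_dash(in_output_layer)                      # (4, 4) Softmax Jacobian
loss_grad = cross_E_grad(y, y_hat)                            # (4, 1) CE gradient

term_via_first_column = np.sum(loss_grad * jacobian[:, [0]])  # broadcast with the first column, then sum
full_error = np.sum(loss_grad * jacobian, axis = 0).reshape((-1, 1))

print(term_via_first_column)      # the big term for w3_11 ...
print(full_error[0, 0])           # ... equals the first entry of 'error_upto_softmax'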
With the understanding of the Softmax function derivative or Jacobian in Backpropagation, let us find all the gradients with the help of the game ‘Jumping Back’.
The rules of the game are
Rule 1 - If we cross a line, then we have to include the gradients of that line
Rule 2 - After a jump, if the shape of the gradients is not (-1, 1), then we will take the sum along axis = 0 and reshape it to (-1, 1)
We start from true value ‘y’
As we go back, we cross the loss line, so the gradient variables will have the Categorical cross-entropy loss gradients.
Jumping back, we cross the Softmax line. Because of the Jacobian of the Softmax function, we will take the sum along axis = 0 and then reshape it to (-1, 1) after broadcasting.
Now we have reached weights w3 and bias b3. So, the gradients till now will be used to update b3.
And, for weights w3, we will have a dot product with whatever we have on the other end of the weights.
#------------Gradient Calculations via Back Propagation-------------

error_upto_softmax = np.sum(cross_E_grad(y, y_hat) * softmax_dash(in_output_layer),
                            axis = 0).reshape((-1, 1))        # error upto softmax

grad_w3 = error_upto_softmax.dot(out_hidden_2.T)              # grad w3
grad_b3 = error_upto_softmax                                  # grad b3
Now, if we jump back, we cross the weights w3. We will sum all the gradients up to here in shape (-1, 1) after broadcasting the gradients up to I_OL with the weights w3.
After jumping back again, we have crossed the Softmax line. We will again use Rule number 2 and store the gradients in shape (-1, 1) in 'error_upto_softmax_H2'.
Since we have reached weights w2 and biases b2, the gradients till now will be used to update b2.
And, for weights w2, we will have a dot product with whatever we have on the other end of the weights w2.
error_grad_upto_H2 = np.sum(error_upto_softmax * w3,
                            axis = 0).reshape((-1, 1))        # error grad upto H2

error_upto_softmax_H2 = np.sum(error_grad_upto_H2 * softmax_dash(in_hidden_2),
                               axis = 0).reshape((-1, 1))     # error upto softmax H2

grad_w2 = error_upto_softmax_H2.dot(out_hidden_1.T)           # grad w2
grad_b2 = error_upto_softmax_H2                               # grad b2
Now, if we jump back, we cross the weights w2. We will sum all the gradients up to here in shape (-1, 1) after broadcasting the gradients up to I_H2 with weights w2.
Now, if we jump back, we will cross the sigmoid line, so, the gradients will have the sigmoid derivative.
Now we have reached weights w1 and biases b1. Gradients till now will be used to update b1.
And for weights w1, we will have a dot product with the transpose of whatever we have on the other end of the weights.
error_grad_upto_H1 = np.sum(error_upto_softmax_H2 * w2,
                            axis = 0).reshape((-1, 1))        # error grad upto H1

grad_w1 = (error_grad_upto_H1 * sig_dash(in_hidden_1)).dot(x.T)   # grad w1
grad_b1 = error_grad_upto_H1 * sig_dash(in_hidden_1)              # grad b1
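If you want an optional sanity check at this point, every gradient should have the same shape as the parameter it is about to update:

assert grad_w1.shape == w1.shape and grad_b1.shape == b1.shape
assert grad_w2.shape == w2.shape and grad_b2.shape == b2.shape
assert grad_w3.shape == w3.shape and grad_b3.shape == b3.shape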
Step 3.3 — Using SGD with Momentum Optimizer to update the weights and biases
#----------Updating weights and biases via SGD Momentum---------

update_w1 = - learning_rate * grad_w1 + momentum * update_w1
w1 += update_w1 # w1
update_b1 = - learning_rate * grad_b1 + momentum * update_b1
b1 += update_b1 # b1
update_w2 = - learning_rate * grad_w2 + momentum * update_w2
w2 += update_w2 # w2
update_b2 = - learning_rate * grad_b2 + momentum * update_b2
b2 += update_b2 # b2
update_w3 = - learning_rate * grad_w3 + momentum * update_w3
w3 += update_w3 # w3
update_b3 = - learning_rate * grad_b3 + momentum * update_b3
b3 += update_b3 # b3
The training loop will run 1,000 times.
This is a small screenshot after the training.
Step 4 — A forward feed to verify that the loss is reduced and to see how close predicted values are to true values
in_hidden_1 = w1.dot(x) + b1                      # forward feed
out_hidden_1 = sig(in_hidden_1)

in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = softmax(in_hidden_2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                             # predicted values
y                                                 # true values
cross_E(y, y_hat)                                 # loss
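If you also want to read the prediction off as a class, the index of the largest entry of y_hat should match the index of the 1 in y once the loss has dropped far enough (a small optional check):

print(np.argmax(y_hat))           # predicted class index
print(np.argmax(y))               # true class index, which is 1 here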
This ends Backpropagation for this network. I hope that Backpropagation no longer seems that complicated to you.
If you are looking for Batch training then you will have to wait till the sixth post in this chapter.
If you like this post then please subscribe to my YouTube channel neuralthreads and join me on Reddit.
I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.