Backpropagation — Made super easy for you, Part 2

Step by step implementation in Python

neuralthreads
11 min read · Dec 12, 2021

In this post, we will go through Backpropagation with the Softmax function. By the end of this post, you will know how to implement Backpropagation with every activation function and every loss function we have seen in this course.

You can download the Jupyter Notebook from here.

Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.

Back to the previous post

Back to the first post

5.2.2 Backpropagation in ANNs — Part 2

Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.

First, the activation function for the first hidden layer is the Sigmoid function.
Second, the activation function for the second hidden layer and the output layer is the Softmax function.
Third, the loss function used is Categorical cross-entropy loss, CE.
Fourth, we will use the SGD with Momentum optimizer with a learning rate = 0.01 and momentum = 0.9.

Note — It is not recommended to use the Softmax function in hidden layers because it will make the nodes linearly dependent. But, to show that Backpropagation is no big deal even in such a case, we are using it in the second hidden layer.
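To see why: the outputs of the Softmax function always sum to 1, so any one node of a Softmax hidden layer is completely determined by the others. Here is a quick illustration of that fact (my own snippet, not part of the notebook for this post):

import numpy as np
np.random.seed(0)

z = np.random.randn(5, 1)                  # any pre-activations
s = np.exp(z) / np.sum(np.exp(z))          # softmax of z

print(np.sum(s))                           # prints 1.0 (up to floating point),
                                           # so the 5 outputs are linearly dependent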

Now, let us look at the steps which we will do here

Step 1 - A forward feed like we did in the previous post
Step 2 - Initializing SGD with Momentum Optimizer
Step 3 - Entering the training loop
Step 3.1 - A forward feed to see loss before training
Step 3.2 - Using Backpropagation to calculate gradients
Step 3.3 - Using SGD with Momentum Optimizer to update weights
and biases
Step 4 - A forward feed to verify that the loss has been reduced
and to see how close predicted values are to true values

Let us do it in Python.

Step 1 — A forward feed like we did in the previous post

import numpy as np                          # importing NumPy
np.random.seed(42)

input_nodes = 5                             # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                           # Inputs

y = np.array([[0], [1], [0], [0]])
y                                           # Outputs

def sig(x):                                 # Sigmoid
    return 1/(1 + np.exp(-x))

def sig_dash(x):                            # Sigmoid derivative
    return sig(x) * (1 - sig(x))

def softmax(x):                             # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                        # Softmax derivative (Jacobian)
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def cross_E(y_true, y_pred):                # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):           # CE derivative
    return -y_true/(y_pred + 10**-100)

w1 = np.random.random(size = (hidden_1_nodes, input_nodes))     # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                      # b1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))  # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                      # b2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes))    # w3
b3 = np.zeros(shape = (output_nodes, 1))                        # b3

in_hidden_1 = w1.dot(x) + b1                # forward feed
out_hidden_1 = sig(in_hidden_1)

in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = softmax(in_hidden_2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                       # y_hat

y                                           # y

cross_E(y, y_hat)                           # CE loss
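If you want to confirm that everything matches the 5, 3, 5, and 4 node architecture, a quick shape check like this one (my own addition) can be run in the notebook:

print(x.shape)                             # (5, 1) inputs
print(w1.shape, b1.shape)                  # (3, 5) (3, 1)
print(w2.shape, b2.shape)                  # (5, 3) (5, 1)
print(w3.shape, b3.shape)                  # (4, 5) (4, 1)
print(y_hat.shape)                         # (4, 1) predictions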

Step 2 — Initializing SGD with Momentum Optimizer

learning_rate = 0.01                        # learning rate
momentum = 0.9                              # momentum

update_w1 = np.zeros(w1.shape)              # Initializing updates with 0
update_b1 = np.zeros(b1.shape)
update_w2 = np.zeros(w2.shape)
update_b2 = np.zeros(b2.shape)
update_w3 = np.zeros(w3.shape)
update_b3 = np.zeros(b3.shape)
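For reference, the update rule that these variables will implement in Step 3.3 can be written in LaTeX notation as (v stands for the update_* variables):

v_t = \mu \, v_{t-1} - \eta \, \nabla_{w} E
w_t = w_{t-1} + v_t

with learning rate \eta = 0.01 and momentum \mu = 0.9.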

Step 3 — Entering training loop

epochs = 1000

Step 3.1 — A forward feed to see loss before training

We will print loss before training every time to see that it is reducing after each training epoch.

for epoch in range(epochs):

    #----------------------Forward Propagation--------------------------

    in_hidden_1 = w1.dot(x) + b1
    out_hidden_1 = sig(in_hidden_1)
    in_hidden_2 = w2.dot(out_hidden_1) + b2
    out_hidden_2 = softmax(in_hidden_2)
    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = softmax(in_output_layer)

    loss = cross_E(y, y_hat)
    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')

Step 3.2 — Calculating gradients via Backpropagation

You must be thinking about what is different in Backpropagation when we use the Softmax function. Everything is the same, but there are a few extra terms. Let us see them by doing exactly what we did in the previous post.

We will start by talking about weight w3₁₁.
We have to calculate the gradient of the loss with respect to it, i.e., ∂E/∂w3₁₁.

Then we can update it with the SGD with Momentum optimizer, or update every weight in w3 at once in matrix form, as we did in the previous post.

We know that,

and,

So, we can write

We also know that,

This is the part that is different from the previous post.

So, we can write

We also know that,

So these terms are 0 (Zero)

We have

or,

Like this, we can find every term in grad_w3
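Written out explicitly in LaTeX notation, the term we have been building for w3₁₁ is the following (a reconstruction sketch in the notation of this series, where I_OL is the input to the output layer and O_H2 is the output of the second hidden layer):

\frac{\partial E}{\partial w3_{11}}
  = \sum_{k=1}^{4} \frac{\partial E}{\partial \hat{y}_k}
    \cdot \frac{\partial \hat{y}_k}{\partial I\_OL_{1}}
    \cdot \frac{\partial I\_OL_{1}}{\partial w3_{11}}
  = \left( \sum_{k=1}^{4} \frac{\partial E}{\partial \hat{y}_k}
    \cdot \frac{\partial \hat{y}_k}{\partial I\_OL_{1}} \right) \cdot O\_H2_{1}

Every \hat{y}_k = \mathrm{softmax}_k(I\_OL) depends on I\_OL_1, so all four terms of the sum survive; these are the extra terms compared to the sigmoid case. And since I\_OL_j for j \neq 1 does not contain w3_{11}, those derivatives are the terms that are 0, which is why only O\_H2_1 remains as the last factor.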

The matrix will be very big, but I will show it to you in reduced form.

Now you might be wondering what ‘error_upto_softmax’ is. If you remember Rule number 2 of the game ‘Jumping Back’, we have to take a sum along axis = 0 and then reshape it to (-1, 1) if the gradients after the jump are not in shape (-1, 1).

After crossing the Softmax line, our gradients are not in the shape (-1, 1), and that is because of the definition of the Softmax derivative, or Jacobian.

So, here is ‘error_upto_softmax’

You can try to broadcast the loss gradient with the first column of the Softmax jacobian and then sum it. It will be the same.

Let us see it, though in a somewhat condensed form. After broadcasting, we have this

After taking sum we will have something like this

And, after reshaping it to (-1, 1) we will have this

These reshaped gradients in shape (-1, 1) will have a dot product with the transpose of O_H2, and the term in the blue box is actually the big term which we calculated for w3₁₁, i.e.,
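If you want to verify this equivalence numerically, a small check along these lines (my own sketch, reusing the functions and variables from Step 1) can be added to the notebook:

jacobian = softmax_dash(in_output_layer)                   # shape (4, 4)
loss_grad = cross_E_grad(y, y_hat)                         # shape (4, 1)

# broadcasting the loss gradient over the Jacobian and summing along axis = 0
# gives the same result as a dot product with the transposed Jacobian
via_broadcast = np.sum(loss_grad * jacobian, axis = 0).reshape((-1, 1))
via_transpose = jacobian.T.dot(loss_grad)

print(np.allclose(via_broadcast, via_transpose))           # True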

With the understanding of the Softmax function derivative or Jacobian in Backpropagation, let us find all the gradients with the help of the game ‘Jumping Back’.

The rules of the game are

Rule 1 - If we cross a line, then we have to include the gradients
Rule 2 - After a jump, if the shape of gradients is not (-1, 1)
then we will take sum along axis = 0 and then reshape it to
(-1, 1)

We start from true value ‘y’

As we go back we cross the loss line, so, in the gradient variables, we will have Categorical cross-entropy loss gradients.

Jumping back, we cross the softmax line. Because of the Jacobian of the Softmax function, we will take sum along axis = 0 and then reshape it to (-1, 1) after broadcasting.

Now we have reached weights w3 and bias b3. So, the gradients till now will be used to update b3

And, for weights w3, we will have a dot product with whatever we have on the other end of the weights.

    #------------Gradient Calculations via Back Propagation-------------

    error_upto_softmax = np.sum(cross_E_grad(y, y_hat) * softmax_dash(in_output_layer), axis = 0).reshape((-1, 1))
    # error upto softmax

    grad_w3 = error_upto_softmax.dot(out_hidden_2.T)              # grad w3

    grad_b3 = error_upto_softmax                                  # grad b3

Now if we jump back, we cross the weights w3. We will sum all the gradients up to here in shape (-1, 1) after broadcasting the gradients up to I_OL with weights w3.

After jumping back again, we have crossed the Softmax line. We will again use Rule number 2 and store the gradients in shape (-1, 1) in ‘error_upto_softmax_H2’

Since we have reached weights w2 and biases b2, so, gradients till now will be used to update b2

And, for weights w2, we will have a dot product with whatever we have on the other end of the weights w2.

    error_grad_upto_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))
    # error grad upto H2

    error_upto_softmax_H2 = np.sum(error_grad_upto_H2 * softmax_dash(in_hidden_2), axis = 0).reshape((-1, 1))
    # error upto softmax H2

    grad_w2 = error_upto_softmax_H2.dot(out_hidden_1.T)           # grad w2

    grad_b2 = error_upto_softmax_H2                               # grad b2
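As a side note (my own observation, not from the original post), this broadcast-and-sum jump over the weights is the same as a dot product with the transposed weight matrix, which is the form most textbooks use. You can check it right after the gradient calculations:

# the jump back over w3 is a product with the transposed weights;
# the same holds for the jump over w2 further below
print(np.allclose(error_grad_upto_H2, w3.T.dot(error_upto_softmax)))    # True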

Now, if we jump back, we cross the weights w2. We will sum all the gradients up to here in shape (-1, 1) after broadcasting the gradients up to I_H2 with weights w2.

Now, if we jump back, we will cross the sigmoid line, so, the gradients will have the sigmoid derivative.

Now we have reached weights w1 and biases b1. Gradients till now will be used to update b1.

And for weights w1, we will have a dot product with the transpose of whatever we have on the other end of the weights.

    error_grad_upto_H1 = np.sum(error_upto_softmax_H2 * w2, axis = 0).reshape((-1, 1))
    # error grad upto H1

    grad_w1 = (error_grad_upto_H1 * sig_dash(in_hidden_1)).dot(x.T)    # grad w1

    grad_b1 = error_grad_upto_H1 * sig_dash(in_hidden_1)               # grad b1

Step 3.3 — Using SGD with Momentum Optimizer to update the weights and biases

    #----------Updating weights and biases via SGD Momentum---------

    update_w1 = - learning_rate * grad_w1 + momentum * update_w1
    w1 += update_w1                                               # w1

    update_b1 = - learning_rate * grad_b1 + momentum * update_b1
    b1 += update_b1                                               # b1

    update_w2 = - learning_rate * grad_w2 + momentum * update_w2
    w2 += update_w2                                               # w2

    update_b2 = - learning_rate * grad_b2 + momentum * update_b2
    b2 += update_b2                                               # b2

    update_w3 = - learning_rate * grad_w3 + momentum * update_w3
    w3 += update_w3                                               # w3

    update_b3 = - learning_rate * grad_b3 + momentum * update_b3
    b3 += update_b3                                               # b3

The training loop will run 1,000 times.

This is a small screenshot after the training.

Step 4 — A forward feed to verify that the loss is reduced and to see how close predicted values are to true values

in_hidden_1 = w1.dot(x) + b1                      # forward feed
out_hidden_1 = sig(in_hidden_1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = softmax(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)
y_hat                                             # predicted values

y                                                 # true values

cross_E(y, y_hat)                                 # loss

This ends Backpropagation. I hope that it no longer seems that complicated to you.

If you are looking for Batch training then you will have to wait till the sixth post in this chapter.

If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.

I will be uploading new interactive videos soon on the YouTube channel. And I will be happy to help you with any doubts on Reddit.

Many thanks for your support and feedback.

If you like this course, then you can support me. It would mean a lot to me.

Continue to the next post — 5.3 L1 and L2 regularization.
