Dropout — Regularization technique that clicked in Geoffrey Hinton’s mind at a bank

Step by step implementation in Python

neuralthreads
Dec 19, 2021

Dropout is one of the most popular regularization techniques used in training Neural Networks.

In this technique, we randomly shut off some nodes in the hidden layers.

You can download the Jupyter Notebook from here.

Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.

Back to the previous post

Back to the first post

5.4 Dropout

Shutting off a node means that its output is forced to zero (0); which nodes are shut off is chosen randomly in each training loop.
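For a first intuition, here is a tiny sketch (the values and names are made up for illustration) of what shutting a node looks like in NumPy: the layer output is simply multiplied element-wise by a 0/1 mask.

import numpy as np

layer_output = np.array([[0.7], [0.3], [0.9]])   # outputs of three hidden nodes
mask = np.array([[1], [0], [1]])                 # 0 means that node is shut off

print(layer_output * mask)                       # the second node now outputs 0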

“I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting.” — Geoffrey Hinton

Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.

First, the activation function for the hidden layers is the ReLU function.
Second, the activation function for the output layer is the Softmax function.
Third, the loss function used is Categorical cross-entropy loss, CE.
Fourth, we will use the SGD Optimizer with a learning rate of 0.01.
Fifth, for the first hidden layer the dropout will be applied after the ReLU activation function, and for the second hidden layer the dropout will be applied before the ReLU activation function.

Now, let us have a look at the steps.

Step 1 - A forward feed like we did in the previous post but without Dropout
Step 2 - Initializing SGD Optimizer
Step 3 - Initializing Dropout states
Step 4 - Entering the training loop
Step 4.1 - A forward feed to see the loss without dropout before training
Step 4.2 - A forward feed with dropout to train with the current dropout state
Step 4.3 - Using Backpropagation to calculate gradients
Step 4.4 - Using SGD Optimizer to update weights and biases
Step 5 - A forward feed without Dropout to verify that the loss has been reduced and to see how close the predicted values are to the true values

Now, you must be wondering what these Dropout states are that I am talking about.
They will be explained clearly when we reach Step 3.

Step 1 — A forward feed without Dropout

import numpy as np                              # importing NumPy
np.random.seed(42)

input_nodes = 5                                 # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

x = np.random.randint(1, 100, size = (input_nodes, 1)) / 100
x                                               # Inputs

y = np.array([[0], [1], [0], [0]])
y                                               # Outputs

def relu(x, leak = 0):                          # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                     # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x):                                 # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                            # Softmax derivative
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def cross_E(y_true, y_pred):                    # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):               # CE derivative
    return -y_true / (y_pred + 10**-100)

w1 = np.random.random(size = (hidden_1_nodes, input_nodes))      # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                       # b1
w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))   # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                       # b2
w3 = np.random.random(size = (output_nodes, hidden_2_nodes))     # w3
b3 = np.zeros(shape = (output_nodes, 1))                         # b3

in_hidden_1 = w1.dot(x) + b1                    # forward feed
out_hidden_1 = relu(in_hidden_1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                           # y_hat
y                                               # y
cross_E(y, y_hat)                               # loss without Dropout

Step 2 — Initializing SGD Optimizer

learning_rate = 0.01

Step 3 — Initializing Dropout states

We know that in Dropout we shut off some nodes, i.e., the output of each such node is forced to zero (0).

Let us have a look at how that works.

First, we decide what fraction of nodes we want to shut off. We will call this fraction ‘drop_H1’ for the first hidden layer and set it to 1/3.

What this means is that in every training loop we will shut off 1 node, chosen randomly, in the first hidden layer.

Let us start by shutting the third node off.

In Python, we can write the state of this hidden layer as np.array([[1], [1], [0]]), where 1 means the node is active and 0 means it is shut off.

This is telling us that the third node is shut off.

We will call this array ‘drop_H1_state’ for the first hidden layer.

There are only three possible states for the first hidden layer:

np.array([[0], [1], [1]])
np.array([[1], [0], [1]])
np.array([[1], [1], [0]])

And any of these states can be used randomly via ‘np.random.shuffle’, because this function shuffles the array in place along axis = 0 and can give us any one of the three states above.

We also have to do one more thing. We will divide every scalar in ‘drop_H1_state’ by (1 - drop_H1).

This is to keep the sum of the scalars in the layer after dropout close to the sum of the scalars before dropout, as the quick check below shows.
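A quick check of that claim, with made-up numbers: with drop_H1 = 1/3, dividing the surviving outputs by (1 - drop_H1) keeps the average sum over the three possible states equal to the original sum.

import numpy as np

drop_H1 = 1/3
layer_output = np.array([[0.7], [0.3], [0.9]])

states = [np.array([[0], [1], [1]]),
          np.array([[1], [0], [1]]),
          np.array([[1], [1], [0]])]

print(layer_output.sum())                                        # 1.9
sums = [(layer_output * s / (1 - drop_H1)).sum() for s in states]
print(np.mean(sums))                                             # 1.9 on average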

So, finally, we have for the first hidden layer

drop_H1 = (1/3)                                 # fraction of nodes to shut

drop_H1_state = np.array([[1], [1], [0]])       # Dropout state initialized
drop_H1_state

drop_H1_state / (1 - drop_H1)                   # divided by (1 - drop_H1)

Let us see how it will be used in the Network. It is simple: in the forward pass we multiply the output of the hidden layer element-wise by drop_H1_state / (1 - drop_H1), so the shut node contributes nothing and the surviving nodes are scaled up.

And for backpropagation, we notice that the derivative of this operation with respect to the layer output is just the same factor drop_H1_state / (1 - drop_H1), so the shut node also passes no gradient back.
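Here is a small self-contained sketch of both points, with made-up numbers:

import numpy as np

drop_H1 = 1/3
drop_H1_state = np.array([[1], [1], [0]])              # third node shut off

out_hidden_1 = np.array([[0.8], [0.2], [0.5]])         # some hidden layer output
dropout_1 = out_hidden_1 * drop_H1_state / (1 - drop_H1)
print(dropout_1)                  # third entry is 0, the others are scaled by 1.5

# backward pass: the derivative of dropout_1 w.r.t. out_hidden_1 is the same
# factor drop_H1_state / (1 - drop_H1), so it just multiplies the incoming error
incoming_error = np.array([[0.1], [0.3], [0.7]])       # made-up upstream gradient
error_out_hidden_1 = incoming_error * drop_H1_state / (1 - drop_H1)
print(error_out_hidden_1)         # the shut node passes no gradient back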

Similarly, we can do this for the second hidden layer. In the second hidden layer, we will shut 2 nodes.

drop_H2 = (2/5)                                 # fraction of nodes to shut

drop_H2_state = np.array([[1], [1], [1], [0], [0]])
drop_H2_state                                   # Dropout state initialized

drop_H2_state / (1 - drop_H2)                   # divided by (1 - drop_H2)

Step 4 — Enter training loop

epochs = 200

Step 4.1 — A forward feed without Dropout to see the loss

We will print the loss at the start of every epoch to confirm that it keeps reducing as training goes on.

for epoch in range(epochs):

    #-------------------------Forward Propagation-----------------------

    #-----------------loss of sample without dropout----------------

    in_hidden_1 = w1.dot(x) + b1
    out_hidden_1 = relu(in_hidden_1)
    in_hidden_2 = w2.dot(out_hidden_1) + b2
    out_hidden_2 = relu(in_hidden_2)
    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = softmax(in_output_layer)

    loss = cross_E(y, y_hat)
    print(f'loss before training is {loss} -- epoch number {epoch + 1}')
    print('\n')

Step 4.2 — Forward feed with dropout to calculate gradients with nodes shut-off

    #----------Forward propagation for training with dropout------------

    in_hidden_1 = w1.dot(x) + b1
    out_hidden_1 = relu(in_hidden_1)

    #-----------------first dropout after activation-----

    np.random.shuffle(drop_H1_state)                       # random state for H1
    dropout_1 = out_hidden_1 * drop_H1_state / (1 - drop_H1)
    # Dropout applied on first hidden layer

    #----------------------------

    in_hidden_2 = w2.dot(dropout_1) + b2

    #---------------second dropout before activation---------

    np.random.shuffle(drop_H2_state)                       # random state for H2
    dropout_2 = in_hidden_2 * drop_H2_state / (1 - drop_H2)
    # Dropout applied on second hidden layer

    #-------------------------------

    out_hidden_2 = relu(dropout_2)

    #--------------------------------

    in_output_layer = w3.dot(out_hidden_2) + b3
    y_hat = softmax(in_output_layer)

Step 4.3 — Calculating gradients via Backpropagation

I suggest that you go through ‘Jumping Back’, which we saw in 5.2.1 — Backpropagation in ANNs, Part 1, to understand what we have done below. Believe me, all we have added is an extra term for the dropout derivative, which will be very clear if you use ‘Jumping Back’ with the diagram of the NN architecture above.

    #-----------Gradient Calculations via Back Propagation--------------

    error_upto_softmax = np.sum(cross_E_grad(y, y_hat) * softmax_dash(in_output_layer), axis = 0).reshape((-1, 1))

    grad_w3 = error_upto_softmax .dot( out_hidden_2.T )

    grad_b3 = error_upto_softmax

    #-----------------------------------------

    error_grad_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))

    grad_w2 = (error_grad_H2 * relu_dash(dropout_2) * (drop_H2_state / (1 - drop_H2))) .dot( dropout_1.T )

    grad_b2 = error_grad_H2 * relu_dash(dropout_2) * (drop_H2_state / (1 - drop_H2))

    #-----------------------------------------

    error_grad_H1 = np.sum(error_grad_H2 * relu_dash(dropout_2) * (drop_H2_state / (1 - drop_H2)) * w2, axis = 0).reshape((-1, 1))

    grad_w1 = (error_grad_H1 * (drop_H1_state / (1 - drop_H1)) * relu_dash(in_hidden_1)) .dot( x.T )

    grad_b1 = error_grad_H1 * (drop_H1_state / (1 - drop_H1)) * relu_dash(in_hidden_1)
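As a standalone aside (separate from the training loop above), here is a small check, with made-up values, that the extra drop_H2_state / (1 - drop_H2) factor really is the only thing dropout adds to the chain rule. It compares the analytic gradient of a toy masked layer with a finite-difference estimate; the mask, weights, and loss here are hypothetical, not the ones from this post.

import numpy as np
np.random.seed(0)

p = 2/5                                              # drop fraction
mask = np.array([[1], [1], [1], [0], [0]])           # a fixed dropout state
x_toy = np.random.random((3, 1))
w_toy = np.random.random((5, 3))

def relu_toy(z):
    return np.where(z <= 0, 0.0, z)

def toy_loss(w):
    z = w.dot(x_toy)                                 # pre-activation
    d = z * mask / (1 - p)                           # dropout before activation
    return np.sum(relu_toy(d))                       # toy scalar loss

# analytic gradient: relu derivative times the dropout factor, times the input
z = w_toy.dot(x_toy)
d = z * mask / (1 - p)
grad_w = (np.where(d <= 0, 0.0, 1.0) * mask / (1 - p)).dot(x_toy.T)

# finite-difference check on one weight
eps = 1e-6
w_plus = w_toy.copy()
w_plus[0, 0] += eps
numeric = (toy_loss(w_plus) - toy_loss(w_toy)) / eps
print(grad_w[0, 0], numeric)                         # the two numbers agree closely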

Step 4.4 — Using SGD Optimizer to update weights and biases

    #----------------Updating weights and biases with SGD---------------

    update_w1 = - learning_rate * grad_w1
    w1 += update_w1                                        # w1

    update_b1 = - learning_rate * grad_b1
    b1 += update_b1                                        # b1

    update_w2 = - learning_rate * grad_w2
    w2 += update_w2                                        # w2

    update_b2 = - learning_rate * grad_b2
    b2 += update_b2                                        # b2

    update_w3 = - learning_rate * grad_w3
    w3 += update_w3                                        # w3

    update_b3 = - learning_rate * grad_b3
    b3 += update_b3                                        # b3

The training loop will run 200 times.


Step 5 — A forward feed without dropout to see how close predicted values are to true values

in_hidden_1 = w1.dot(x) + b1                      # forward feed
out_hidden_1 = relu(in_hidden_1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)
y_hat                                           # predicted values
y                                               # true values
cross_E(y, y_hat)                               # CE loss

Note — We do not use dropout when we predict values, because applying dropout outside training would mean dropping part of the patterns that the network learned during training. Dropout is only used in training, to generalize patterns and avoid overfitting.
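As a sketch of how this is often organized in code (this helper is hypothetical, not from the post), a dropout forward pass can take a training flag: it masks and rescales only while training, and leaves the layer untouched at prediction time. Note that this version samples each node independently (a Bernoulli mask) instead of shuffling a fixed state vector as we did above.

import numpy as np
np.random.seed(42)

def dropout_forward(layer_output, drop_fraction, training = True):
    if not training:                        # prediction: leave the layer untouched
        return layer_output
    keep = 1 - drop_fraction
    mask = (np.random.random(layer_output.shape) < keep).astype(float)
    return layer_output * mask / keep       # rescale so the expected sum is preserved

out = np.array([[0.7], [0.3], [0.9], [0.4], [0.6]])
print(dropout_forward(out, 2/5, training = True))    # some nodes zeroed, the rest rescaled
print(dropout_forward(out, 2/5, training = False))   # unchanged at prediction time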

As a bonus, let us take a look at how the gradients were affected by using Dropout.

drop_H1_state        # state for the first hidden layer during the last training step
drop_H2_state        # state for the second hidden layer during the last training step

This means that the first node is shut off in the first hidden layer, and the third and fourth nodes are shut off in the second hidden layer.


Looking at grad_w1 and grad_b1 reveals

grad_w1
grad_b1

We know that the first row in grad_w1 corresponds to the incoming weights of the first node of the first hidden layer. Since that node was shut off, the whole first row of grad_w1 and the first entry of grad_b1 are zero.

Looking at grad_w2 and grad_b2 reveals

grad_w2
grad_b2

The first column in grad_w2 corresponds to the weights going out of the first node in the first hidden layer, and the third and fourth rows correspond to the incoming weights of the third and fourth nodes in the second hidden layer.

So, the first column and the third and fourth rows of grad_w2 are zero, as are the third and fourth entries of grad_b2.

Looking at grad_w3 reveals

grad_w3

The third and fourth columns are for the outgoing weights from the third and fourth nodes in the second hidden layer, so the third and fourth columns of grad_w3 are zero.

This was how the Neural Network was trained for the last training instance.

Notes

First, every training loop will have different nodes shut off.
Second, with dropout applied before the activation (as in the second hidden layer), using the sigmoid activation function will not behave in the same way as ReLU, because sigmoid of 0 is not 0. It is 0.5, whereas ReLU of 0 is 0, so the masked nodes are not really shut off. The short sketch after these notes illustrates this.
Third, with optimizers other than SGD, the update variables will not have zero rows and columns the way the gradients do, because those updates also use some portion of the previous updates.
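A tiny illustration of the second note (the numbers are made up): if a mask has already zeroed a pre-activation, ReLU keeps that node at 0, but sigmoid turns it into 0.5, so the node is not really shut off.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.where(z <= 0, 0.0, z)

masked_pre_activation = np.array([[1.2], [0.0], [0.7]])   # second node was masked to 0

print(relu(masked_pre_activation))      # the masked node stays at 0
print(sigmoid(masked_pre_activation))   # the masked node becomes 0.5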

I hope now you understand how to implement Dropout regularization in Neural Networks.

If you like this post then please subscribe to my youtube channel neuralthreads and join me on Reddit.

I will be uploading new interactive videos soon on the youtube channel. And I will be happy to help you with any doubt on Reddit.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 5.5.1 Layer Normalization, Part I.
