Validation Set — To validate our model and for early stopping to avoid Overfitting

Step by step implementation in Python

neuralthreads
Dec 25, 2021

In this post, we will talk about the benefits of using a Validation set. It is used to see how well our model is training and to decide when we need to stop training to avoid Overfitting. This will be illustrated at the end of the post.

You can download the Jupyter Notebook from here.

Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.

Back to the previous post

Back to the first post

5.7 Validation Set

For demonstration purposes, we will have a dataset of 500 samples. Generally, 20% of the dataset is used as a Validation set, but you can use a different fraction if you like. We will use the last 20% of the samples as our Validation set. Then we will break the remaining 400 training samples into 16 mini-batches of 25 samples each.

It is to be noted that we will keep the Validation set the same throughout the training, i.e., we will first split the dataset into training and validation sets and then shuffle only the training set for training purposes, as the sketch below shows.
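A minimal sketch of that idea, reusing the variable names defined in Step 1 below (the full code later shuffles a concatenated array instead, which has the same effect):

train_x, val_set_x = x[:train_N], x[train_N:]        # fixed split, done once
train_y, val_set_y = y[:train_N], y[train_N:]

for epoch in range(epochs):
    permutation = np.random.permutation(train_N)     # reshuffle only the training part
    train_x, train_y = train_x[permutation], train_y[permutation]
    # ... mini-batch training on train_x / train_y ...
    # val_set_x / val_set_y are never shuffled or trained on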

Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.

A few things

First, the activation function for the hidden layers is the ReLU function.
Second, the activation function for the output layer is the Softmax function.
Third, the loss function used is Categorical cross-entropy loss, CE.
Fourth, we will use the RMSprop Optimizer with learning rate = 0.01, rho = 0.9, and epsilon = 10**-8 (a sketch of the update rule follows this list).
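For reference, this is the RMSprop update we will apply to every weight matrix and bias vector. A minimal sketch, written as a hypothetical helper function for one parameter (the actual code in Step 3.3.2 writes these two lines out explicitly for w1, b1, w2, b2, w3, and b3):

import numpy as np

def rmsprop_update(w, grad, accumulator, learning_rate = 0.01, rho = 0.9, epsilon = 10**-8):
    # accumulator holds the running average of squared gradients for this parameter
    accumulator = rho * accumulator + (1 - rho) * grad**2
    w = w - learning_rate * grad / (accumulator**0.5 + epsilon)
    return w, accumulator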

Now, let us take a look at the steps

Step 1 - Generating 500 random samples, splitting into training and
validation set, and a forward feed for any 1 training
sample
Step 2 - Initializing RMSprop Optimizer
Step 3 - Entering the training loop and a list for storing training
and validation loss for each epoch
Step 3.1 - Initializing avg_loss variable for current training
loop
Step 3.2 - Shuffling of the Dataset
Step 3.3 - Going through each batch
Step 3.3.1 - Initializing loss and gradient variables in which
we will store the values for current batch
Step 3.3.1.1 - Forward feed for the sample in current batch
Step 3.3.1.2 - Collecting loss and gradients
Step 3.3.2 - Updating weights and biases via RMSprop Optimizer
with the mean of gradients collected
Step 3.3.3 - Printing average loss of the batch
Step 3.4 - Printing average loss of the batches for the current
training loop and storing it in the list
Step 3.5 - Collecting average loss of samples in the Validation
set
Step 4 - Forward feed for any 1 sample to see that loss has been
reduced

You can see that every step is the same as in the previous post, i.e., Batch Training, except that this time we have an extra Step 3.5 for the Validation set; a sketch of that step follows.
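Conceptually, Step 3.5 is just a forward feed over the whole Validation set with no gradient computation and no weight updates. A minimal sketch, written here as a hypothetical helper that reuses relu, softmax, and cross_E from Step 1 (the code below keeps these lines inline inside the training loop instead):

def validation_loss(val_set_x, val_set_y, w1, b1, w2, b2, w3, b3):
    # forward feed only; nothing is updated
    val_loss = 0
    for val in range(len(val_set_x)):
        out_hidden_1 = relu(w1.dot(val_set_x[val]) + b1)
        out_hidden_2 = relu(w2.dot(out_hidden_1) + b2)
        y_hat = softmax(w3.dot(out_hidden_2) + b3)
        val_loss += cross_E(val_set_y[val], y_hat)
    return val_loss / len(val_set_x)                  # average CE loss per validation sample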

Step 1 - Generating 500 random samples, splitting into training and validation set, and a forward feed for any 1 training sample

import numpy as np                          # importing NumPy
np.random.seed(42)
input_nodes = 5 # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4
N = 500                                         # number of samples
val_split = 0.2                                 # validation split
train_N = int(N * (1 - val_split))
train_N                                         # samples in training set
train_val = int(N * val_split)
train_val # samples in validation set
batch_size = 25 # batch size
number_batches = int(train_N/batch_size) # number of batches
x = np.random.randint(100, 500, size = (N, input_nodes, 1)) / 1000
x # Inputs
y = []
for i in range(int(N/4)):
    y.append([[1], [0], [0], [0]])
    y.append([[0], [1], [0], [0]])
    y.append([[0], [0], [1], [0]])
    y.append([[0], [0], [0], [1]])
y = np.array(y).reshape((N, output_nodes, 1))
y # Outputs
train_x = x[:train_N]                           # Training set X
train_y = y[:train_N] # Training set Y
val_set_x = x[train_N:] # Validation set X
val_set_y = y[train_N:] # Validation set Y
def relu(x, leak = 0):                          # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                     # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x):                                 # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                            # Softmax derivative

    I = np.eye(x.shape[0])

    return softmax(x) * (I - softmax(x).T)

def cross_E(y_true, y_pred):                    # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):               # CE derivative
    return -y_true/(y_pred + 10**-100)
w1 = np.random.random(size = (hidden_1_nodes, input_nodes))     # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1)) # b1
w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes)) # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1)) # b2
w3 = np.random.random(size = (output_nodes, hidden_2_nodes)) # w3
b3 = np.zeros(shape = (output_nodes, 1)) # b3
in_hidden_1 = w1.dot(train_x[0]) + b1             # forward feed
out_hidden_1 = relu(in_hidden_1) # for any 1 sample
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)
y_hat                                           # predicted value
train_y[0]                                      # true value
cross_E(train_y[0], y_hat)                      # CE loss

Step 2 — Initializing RMSprop Optimizer

learning_rate = 0.01                         # learning rate
rho = 0.9 # rho
epsilon = 10**-8 # epsilon
accumulator_w1 = np.zeros(w1.shape) # Initializing
# accumulator with 0
accumulator_b1 = np.zeros(b1.shape)
accumulator_w2 = np.zeros(w2.shape)
accumulator_b2 = np.zeros(b2.shape)
accumulator_w3 = np.zeros(w3.shape)
accumulator_b3 = np.zeros(b3.shape)

Step 3 - Entering the training loop and a list for storing training and validation loss for each epoch

data = np.concatenate((train_x,train_y), axis = 1)
data.shape
epochs = 10
loss_accumulator = []        # list for average loss for each epoch
val_loss_accumulator = [] # list for validation set loss

Step 3.1 - Initializing avg_loss variable for current training loop
Step 3.2 - Shuffling of the Dataset

for epoch in range(epochs):

    avg_loss = 0                                # Step 3.1

    np.random.shuffle(data)                     # Step 3.2
    train_x = data[:, :5, :]
    train_y = data[:, 5:, :]

Step 3.3 - Going through each batch
Step 3.3.1 - Initializing loss and gradient variables in which we will store the values for the current batch

    for batch in range(number_batches):         # Step 3.3

        loss = 0                                # Step 3.3.1

        sample = batch * batch_size

        grad_w3 = 0
        grad_b3 = 0

        grad_w2 = 0
        grad_b2 = 0

        grad_w1 = 0
        grad_b1 = 0

Step 3.3.1.1 - Forward feed for the sample in the current batch

        for iteration in range(batch_size):     # Step 3.3.1.1

            #------------Forward propagation in batch--------------

            in_hidden_1 = w1.dot(train_x[sample]) + b1
            out_hidden_1 = relu(in_hidden_1)
            in_hidden_2 = w2.dot(out_hidden_1) + b2
            out_hidden_2 = relu(in_hidden_2)
            in_output_layer = w3.dot(out_hidden_2) + b3
            y_hat = softmax(in_output_layer)

Step 3.3.1.2 - Collecting loss and gradients

Note — We are not storing average loss in the variable but loss. We will divide it by the batch size at the end of the loop to get the average loss.

            #------------Collecting loss and gradients-------------

            loss += cross_E(train_y[sample], y_hat)      # Step 3.3.1.2

            #-------------------------------------------------------

            error_upto_softmax = np.sum(cross_E_grad(train_y[sample], y_hat) * softmax_dash(in_output_layer), axis = 0).reshape((-1, 1))

            grad_w3 += error_upto_softmax .dot( out_hidden_2.T )
            grad_b3 += error_upto_softmax

            #-------------------------------------------------------

            error_grad_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))

            grad_w2 += error_grad_H2 * relu_dash(in_hidden_2) .dot( out_hidden_1.T )
            grad_b2 += error_grad_H2 * relu_dash(in_hidden_2)

            #-------------------------------------------------------

            error_grad_H1 = np.sum(error_grad_H2 * relu_dash(in_hidden_2) * w2, axis = 0).reshape((-1, 1))

            grad_w1 += error_grad_H1 * relu_dash(in_hidden_1) .dot( train_x[sample].T )
            grad_b1 += error_grad_H1 * relu_dash(in_hidden_1)

            sample += 1

Step 3.3.2 - Updating weights and biases via RMSprop Optimizer with the mean of gradients collected

        #---------Updating weights and biases with RMSprop----------
        # Step 3.3.2

        accumulator_w1 = rho * accumulator_w1 + (1 - rho) * (grad_w1/batch_size)**2
        update_w1 = - learning_rate * (grad_w1/batch_size) / (accumulator_w1**0.5 + epsilon)
        w1 += update_w1                         # w1

        accumulator_b1 = rho * accumulator_b1 + (1 - rho) * (grad_b1/batch_size)**2
        update_b1 = - learning_rate * (grad_b1/batch_size) / (accumulator_b1**0.5 + epsilon)
        b1 += update_b1                         # b1

        accumulator_w2 = rho * accumulator_w2 + (1 - rho) * (grad_w2/batch_size)**2
        update_w2 = - learning_rate * (grad_w2/batch_size) / (accumulator_w2**0.5 + epsilon)
        w2 += update_w2                         # w2

        accumulator_b2 = rho * accumulator_b2 + (1 - rho) * (grad_b2/batch_size)**2
        update_b2 = - learning_rate * (grad_b2/batch_size) / (accumulator_b2**0.5 + epsilon)
        b2 += update_b2                         # b2

        accumulator_w3 = rho * accumulator_w3 + (1 - rho) * (grad_w3/batch_size)**2
        update_w3 = - learning_rate * (grad_w3/batch_size) / (accumulator_w3**0.5 + epsilon)
        w3 += update_w3                         # w3

        accumulator_b3 = rho * accumulator_b3 + (1 - rho) * (grad_b3/batch_size)**2
        update_b3 = - learning_rate * (grad_b3/batch_size) / (accumulator_b3**0.5 + epsilon)
        b3 += update_b3                         # b3

Step 3.3.3 - Printing average loss of the batch

        #-------------------------------------------------------

        avg_loss += loss/batch_size

        print(f'average loss of batch number {batch + 1} is {loss/batch_size} -- epoch number {epoch + 1}')
        print('\n')                             # Step 3.3.3

Step 3.4 - Printing average loss of the batches for the current training loop and storing it in the list

    print('-------------------------')
    print(f'average loss of batches is {avg_loss/number_batches} -- epoch number {epoch + 1}')
    loss_accumulator.append(avg_loss/number_batches)
    print('-------------------------')          # Step 3.4

Step 3.5 - Collecting average loss of samples in the Validation set

    val_loss = 0

    for val in range(train_val):

        in_hidden_1 = w1.dot(val_set_x[val]) + b1
        out_hidden_1 = relu(in_hidden_1)
        in_hidden_2 = w2.dot(out_hidden_1) + b2
        out_hidden_2 = relu(in_hidden_2)
        in_output_layer = w3.dot(out_hidden_2) + b3
        y_hat = softmax(in_output_layer)

        #------------------Collecting validation loss--------------
        val_loss += cross_E(val_set_y[val], y_hat)       # Step 3.5

    print(f'average loss of validation set is {val_loss/(val + 1)} -- epoch number {epoch + 1}')
    val_loss_accumulator.append(val_loss/(val + 1))
    print('-------------------------')
    print('\n')

The training loop will run 10 times.

This is a small screenshot after the training.

Step 4 - Forward feed for any 1 sample to see that loss has been reduced

in_hidden_1 = w1.dot(train_x[0]) + b1             # forward feed
out_hidden_1 = relu(in_hidden_1) # for any 1 sample
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)
in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)
y_hat                                           # predicted value
train_y[0]                                      # true value
cross_E(train_y[0], y_hat)                      # CE loss

It is to be noted that the dataset here is randomly generated, so the explanation in the previous post also applies here.

Now let us see how we can use the Validation set for early stopping. For that, we will plot a graph using the lists in which we stored the average training and validation losses.

The y-axis will represent loss and the x-axis will represent epoch number.

import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 22})

EPOCH = [i for i in range(1, epochs + 1)]        # x-axis

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

ax = plt.axes()

ax.plot(EPOCH, loss_accumulator, label='Training loss')
ax.plot(EPOCH, val_loss_accumulator, label='Validation loss')
                                                 # y-axis
ax.set_title('Training and Validation loss')
ax.legend()
plt.show()

We can see that 8 epochs are enough for training here.

Suppose we run the training loop for 100 epochs instead; then we will have this graph.

From this, we can say that the Neural Network is actually memorizing the data (rote learning) rather than generalizing from it. That is why the training loss keeps decreasing while the validation loss increases: the training has nothing to do with the Validation set. By stopping the training at around 8 or 10 epochs we can prevent overfitting.
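A simple way to automate this is patience-based early stopping on the validation loss. A minimal sketch, assuming it is placed at the end of the epoch loop right after Step 3.5; the patience value and the best_weights bookkeeping are illustrative additions, not part of the code above:

best_val_loss = float('inf')                 # before the training loop
patience = 3                                 # stop after 3 epochs without improvement
epochs_without_improvement = 0

# at the end of each epoch, after appending to val_loss_accumulator:
current_val_loss = val_loss_accumulator[-1]
if current_val_loss < best_val_loss:
    best_val_loss = current_val_loss
    best_weights = (w1.copy(), b1.copy(), w2.copy(), b2.copy(), w3.copy(), b3.copy())
    epochs_without_improvement = 0
else:
    epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        w1, b1, w2, b2, w3, b3 = best_weights    # roll back to the best epoch
        break                                    # leave the training loop early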

So, I hope you now understand the purpose of the Validation set.

With this post, we have covered almost everything needed to learn about MLPs, or simple ANNs, in Deep Learning. Next, we will take a look at the UCI White Wine Quality dataset; working on it will be a good exercise to finish the chapter. We will also cover Accuracy and the Confusion Matrix in that post.

But before that, let us take a look at Multiple Inputs and Multiple Outputs, i.e., Multiple Input Layers and Multiple Output Layers.

If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.

I will be uploading new interactive videos to the YouTube channel soon, and I will be happy to help you with any doubts on Reddit.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post — 5.8 Multiple Inputs and Multiple Outputs.
