Batch Training — Mean of gradients of a mini-batch used in Optimizers
Step by step implementation in Python
In this post, we will see how batch training is done. The idea is to break the whole dataset into mini-batches and, for each mini-batch, update the weights using the mean of the gradients collected over its samples.
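To make the idea concrete before we build the full network, here is a tiny self-contained sketch (a toy linear model with a squared-error loss, not the network from this post) of one update made from the mean of the per-sample gradients of a mini-batch:
import numpy as np

np.random.seed(0)
w = np.zeros(3)                                     # toy weight vector
lr = 0.1                                            # toy learning rate
xb = np.random.random((25, 3))                      # one mini-batch of 25 samples
yb = xb @ np.array([1.0, 2.0, 3.0])                 # toy targets

grads = [2 * (xb[i] @ w - yb[i]) * xb[i] for i in range(len(xb))]   # one gradient per sample
w = w - lr * np.mean(grads, axis = 0)               # one update per mini-batch, using the mean gradient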
You can download the Jupyter Notebook from here.
Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.
5.6 Batch Training
For demonstration purposes, we will have a dataset of 500 samples, which we will break into 20 mini-batches of 25 samples each.
One thing to note is that in each training loop we will shuffle the data so that the neural network doesn’t memorize the repetitive pattern in ‘y’. You will see this repetitive pattern when we generate it.
Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.
A few things
First, the activation function for the hidden layers is the ReLU function.
Second, the activation function for the output layer is the Softmax function.
Third, the loss function used is Categorical Cross-Entropy loss (CE).
Fourth, we will use the RMSprop Optimizer with learning rate = 0.01, rho = 0.9, and epsilon = 10**-8. Its update rule is recapped just below.
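As a quick recap of the RMSprop update from the previous post (here g stands for the mean gradient of a mini-batch, and the toy shape is only for illustration):
import numpy as np

rho, learning_rate, epsilon = 0.9, 0.01, 10**-8
w = np.random.random((3, 5))                        # a toy weight matrix
accumulator = np.zeros(w.shape)                     # accumulator starts at 0

g = np.random.random(w.shape)                       # stand-in for the mean mini-batch gradient
accumulator = rho * accumulator + (1 - rho) * g**2  # running average of squared gradients
w += - learning_rate * g / (accumulator**0.5 + epsilon)   # RMSprop update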
Now, let us take a look at the steps
Step 1 - Generating 500 random samples and a forward feed for any 1 sample
Step 2 - Initializing RMSprop Optimizer
Step 3 - Entering the training loop
    Step 3.1 - Initializing avg_loss variable for the current training loop
    Step 3.2 - Shuffling of the dataset
    Step 3.3 - Going through each batch
        Step 3.3.1 - Initializing loss and gradient variables in which we will store the values for the current batch
            Step 3.3.1.1 - Forward feed for the sample in the current batch
            Step 3.3.1.2 - Collecting loss and gradients
        Step 3.3.2 - Updating weights and biases via RMSprop Optimizer with the mean of gradients collected
        Step 3.3.3 - Printing average loss of the batch
    Step 3.4 - Printing average loss of the batches for the current training loop
Step 4 - Forward feed for any 1 sample to see that loss has been reduced
Step 1 — Generating 500 random samples and a forward feed for any 1 sample
import numpy as np                                  # importing NumPy
np.random.seed(42)

input_nodes = 5                                     # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

N = 500                                             # number of samples
batch_size = 25                                     # batch size
number_batches = int(N/batch_size)                  # number of batches

x = np.random.randint(100, 500, size = (N, input_nodes, 1)) / 1000
x                                                   # Inputs

y = []
for i in range(int(N/4)):
    y.append([[1], [0], [0], [0]])
    y.append([[0], [1], [0], [0]])
    y.append([[0], [0], [1], [0]])
    y.append([[0], [0], [0], [1]])

y = np.array(y).reshape((N, output_nodes, 1))
y                                                   # Outputs
Note — ‘x’ and ‘y’ are 3-D tensors, as they are collections, or batches, of 2-D tensors. We already discussed that we are keeping the layers vertical, i.e., in shape (-1, 1), which is a 2-D tensor.
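A quick shape check (the numbers follow from the code above) makes this concrete:
x.shape                                             # (500, 5, 1) -- 500 input column vectors
y.shape                                             # (500, 4, 1) -- 500 one-hot column vectors
x[0].shape                                          # (5, 1) -- a single sample kept vertical, i.e., shape (-1, 1)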
def relu(x, leak = 0):                              # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                         # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x):                                     # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                                # Softmax derivative
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def cross_E(y_true, y_pred):                        # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):                   # CE derivative
    return -y_true/(y_pred + 10**-100)
w1 = np.random.random(size = (hidden_1_nodes, input_nodes))      # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                       # b1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))   # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                       # b2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes))     # w3
b3 = np.zeros(shape = (output_nodes, 1))                         # b3

in_hidden_1 = w1.dot(x[0]) + b1                     # forward feed
out_hidden_1 = relu(in_hidden_1)                    # for any 1 sample

in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                               # predicted value
y[0]                                                # true value
cross_E(y[0], y_hat)                                # CE loss
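If you want to verify the forward feed dimensions, the shapes chain like this (all of them follow from the definitions above):
w1.dot(x[0]).shape                                  # (3, 5) dot (5, 1) -> (3, 1), matches b1
w2.dot(out_hidden_1).shape                          # (5, 3) dot (3, 1) -> (5, 1), matches b2
w3.dot(out_hidden_2).shape                          # (4, 5) dot (5, 1) -> (4, 1), matches b3 and y[0]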
Step 2 — Initializing RMSprop Optimizer
learning_rate = 0.01                                # learning rate
rho = 0.9                                           # rho
epsilon = 10**-8                                    # epsilon

accumulator_w1 = np.zeros(w1.shape)                 # initializing accumulators with 0
accumulator_b1 = np.zeros(b1.shape)
accumulator_w2 = np.zeros(w2.shape)
accumulator_b2 = np.zeros(b2.shape)
accumulator_w3 = np.zeros(w3.shape)
accumulator_b3 = np.zeros(b3.shape)
Step 3 — Entering the training loop
data = np.concatenate((x, y), axis = 1)
data.shape
epochs = 10
Step 3.1 — Initializing avg_loss variable for current training loop
Step 3.2 — Shuffling of the dataset
for epoch in range(epochs):

    avg_loss = 0                                    # Step 3.1

    np.random.shuffle(data)                         # Step 3.2
    x = data[:, :5, :]
    y = data[:, 5:, :]
Note — 5 is actually the number of nodes in the input layer.
If you are wondering how I separated x and y, it is simple slicing. For example, for ‘x’ I am taking all tensors along axis = 0, the first 5 tensors along axis = 1, and all scalars along axis = 2.
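You can run a quick check in a separate cell, outside the loop, to see what the slices contain:
data.shape                                          # (500, 9, 1) -- x and y stacked along axis = 1
data[:, :5, :].shape                                # (500, 5, 1) -- the inputs x
data[:, 5:, :].shape                                # (500, 4, 1) -- the outputs y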
Step 3.3 — Going through each batch
Step 3.3.1 — Initializing loss and gradient variables in which we will store the values for the current batch
    for batch in range(number_batches):             # Step 3.3

        loss = 0                                    # Step 3.3.1
        sample = batch * batch_size

        grad_w3 = 0
        grad_b3 = 0
        grad_w2 = 0
        grad_b2 = 0
        grad_w1 = 0
        grad_b1 = 0
Step 3.3.1.1 - Forward feed for the sample in the current batch
        for iteration in range(batch_size):         # Step 3.3.1.1

            #-------------Forward propagation in batch--------------
            in_hidden_1 = w1.dot(x[sample]) + b1
            out_hidden_1 = relu(in_hidden_1)

            in_hidden_2 = w2.dot(out_hidden_1) + b2
            out_hidden_2 = relu(in_hidden_2)

            in_output_layer = w3.dot(out_hidden_2) + b3
            y_hat = softmax(in_output_layer)
Step 3.3.1.2 - Collecting loss and gradients
Note — We are not storing the average loss in this variable, but the summed loss of the batch. We will divide it by the batch size at the end of the loop to get the average loss.
            #-----------Collecting loss and gradients---------------
            loss += cross_E(y[sample], y_hat)       # Step 3.3.1.2

            #--------------------------------------------------------
            error_upto_softmax = np.sum(cross_E_grad(y[sample], y_hat) * softmax_dash(in_output_layer), axis = 0).reshape((-1, 1))

            grad_w3 += error_upto_softmax.dot(out_hidden_2.T)
            grad_b3 += error_upto_softmax

            #--------------------------------------------------------
            error_grad_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))

            grad_w2 += error_grad_H2 * relu_dash(in_hidden_2).dot(out_hidden_1.T)
            grad_b2 += error_grad_H2 * relu_dash(in_hidden_2)

            #--------------------------------------------------------
            error_grad_H1 = np.sum(error_grad_H2 * relu_dash(in_hidden_2) * w2, axis = 0).reshape((-1, 1))

            grad_w1 += error_grad_H1 * relu_dash(in_hidden_1).dot(x[sample].T)
            grad_b1 += error_grad_H1 * relu_dash(in_hidden_1)

            sample += 1
Step 3.3.2 - Updating weights and biases via RMSprop Optimizer with the mean of gradients collected
        #---------Updating weights and biases with RMSprop----------
        # Step 3.3.2

        accumulator_w1 = rho * accumulator_w1 + (1 - rho) * (grad_w1/batch_size)**2
        update_w1 = - learning_rate * (grad_w1/batch_size) / (accumulator_w1**0.5 + epsilon)
        w1 += update_w1                             # w1

        accumulator_b1 = rho * accumulator_b1 + (1 - rho) * (grad_b1/batch_size)**2
        update_b1 = - learning_rate * (grad_b1/batch_size) / (accumulator_b1**0.5 + epsilon)
        b1 += update_b1                             # b1

        accumulator_w2 = rho * accumulator_w2 + (1 - rho) * (grad_w2/batch_size)**2
        update_w2 = - learning_rate * (grad_w2/batch_size) / (accumulator_w2**0.5 + epsilon)
        w2 += update_w2                             # w2

        accumulator_b2 = rho * accumulator_b2 + (1 - rho) * (grad_b2/batch_size)**2
        update_b2 = - learning_rate * (grad_b2/batch_size) / (accumulator_b2**0.5 + epsilon)
        b2 += update_b2                             # b2

        accumulator_w3 = rho * accumulator_w3 + (1 - rho) * (grad_w3/batch_size)**2
        update_w3 = - learning_rate * (grad_w3/batch_size) / (accumulator_w3**0.5 + epsilon)
        w3 += update_w3                             # w3

        accumulator_b3 = rho * accumulator_b3 + (1 - rho) * (grad_b3/batch_size)**2
        update_b3 = - learning_rate * (grad_b3/batch_size) / (accumulator_b3**0.5 + epsilon)
        b3 += update_b3                             # b3
Step 3.3.3 - Printing average loss of the batch
        #-----------------------------------------------------------
        avg_loss += loss/batch_size

        print(f'average loss before training of batch number {batch + 1} is {loss/batch_size} -- epoch number {epoch + 1}')
        print('\n')                                 # Step 3.3.3
Step 3.4 - Printing average loss of the batches for the current training loop
    print('-------------------------')
    print(f'average loss of batches is {avg_loss/number_batches} -- epoch number {epoch + 1}')
    print('-------------------------')
    print('\n')                                     # Step 3.4
The training loop will run 10 times.
This is a small screenshot after the training.
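As an aside, the six nearly identical update blocks in Step 3.3.2 can be collapsed into a small helper. This is only a sketch of a possible refactor (the name rmsprop_update is mine, not from the notebook); the arithmetic is exactly the same as above:
def rmsprop_update(param, grad_sum, accumulator):
    g = grad_sum / batch_size                       # mean gradient of the mini-batch
    accumulator = rho * accumulator + (1 - rho) * g**2
    param += - learning_rate * g / (accumulator**0.5 + epsilon)
    return param, accumulator

# inside the batch loop, for example:
# w1, accumulator_w1 = rmsprop_update(w1, grad_w1, accumulator_w1)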
Step 4 - Forward feed for any 1 sample to see that loss has been reduced
in_hidden_1 = w1.dot(x[0]) + b1                     # forward feed
out_hidden_1 = relu(in_hidden_1)                    # for any 1 sample

in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                               # predicted value
y[0]                                                # true value
cross_E(y[0], y_hat)                                # CE loss
A few things you must note
First, the data has been shuffled, so the first sample now is not the same sample that was first before training and shuffling.
Second, since the data was randomly generated, there is no pattern, and you can say that the Neural Network has given a roughly equal probability of 0.25 to all 4 outputs. Things will be different in the last post of this chapter, where we will see the UCI White Wine Quality Dataset.
Third, the loss was lower for the very first sample before training because the randomly generated value at the first index for that sample happened to be the maximum. After training, the loss increases because there is no pattern to learn and all outputs are close to 0.25, giving an equal probability to each index (a quick check of what that means for the CE loss follows below). This will not be the case with the UCI White Wine dataset, because there the Neural Network will catch the pattern, or general information.
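To see why predictions close to 0.25 pin the loss near one value, compute the Categorical Cross-Entropy of a one-hot target against a uniform prediction:
import numpy as np

y_true = np.array([[1], [0], [0], [0]])             # any one-hot target
y_pred = np.full((4, 1), 0.25)                      # all four outputs equal
-np.sum(y_true * np.log(y_pred))                    # -log(0.25), about 1.386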
I hope now you understand how Batch training works with mini-batches.
If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.
I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.