Batch Training — Mean of gradients of a mini-batch used in Optimizers
Step by step implementation in Python
In this post, we will see how batch training is done. The idea is to break the whole dataset into mini-batches and, for each mini-batch, update the weights using the mean of the gradients collected over its samples.
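To make the idea concrete before we build the full network, here is a tiny self-contained sketch (a toy linear model with a squared-error loss, not the network from this post) of one update made from the mean of the per-sample gradients of a mini-batch:
import numpy as np

np.random.seed(0)
w = np.zeros(3)                                     # toy weight vector
lr = 0.1                                            # toy learning rate
xb = np.random.random((25, 3))                      # one mini-batch of 25 samples
yb = xb @ np.array([1.0, 2.0, 3.0])                 # toy targets

grads = [2 * (xb[i] @ w - yb[i]) * xb[i] for i in range(len(xb))]   # one gradient per sample
w = w - lr * np.mean(grads, axis = 0)               # one update per mini-batch, using the mean gradient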
You can download the Jupyter Notebook from here.
Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.
5.6 Batch Training
For demonstration purposes, we will have a dataset of 500 samples, which we will break into 20 mini-batches of 25 samples each.
One thing to note is that in each training loop we will shuffle the data so that the neural network doesn’t memorize the repetitive pattern in ‘y’. You will see this repetitive pattern when we generate it.
Note — The architecture of the Neural Network is the same as it was in the previous post, i.e., 4 layers with 5, 3, 5, and 4 nodes.
A few things
First, the activation function for the hidden layers is the ReLU function.
Second, the activation function for the output layer is the Softmax function.
Third, the loss function used is Categorical Cross-Entropy loss (CE).
Fourth, we will use the RMSprop Optimizer with learning rate = 0.01, rho = 0.9, and epsilon = 10**-8. Its update rule is recapped just below.
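As a quick recap of the RMSprop update from the previous post (here g stands for the mean gradient of a mini-batch, and the toy shape is only for illustration):
import numpy as np

rho, learning_rate, epsilon = 0.9, 0.01, 10**-8
w = np.random.random((3, 5))                        # a toy weight matrix
accumulator = np.zeros(w.shape)                     # accumulator starts at 0

g = np.random.random(w.shape)                       # stand-in for the mean mini-batch gradient
accumulator = rho * accumulator + (1 - rho) * g**2  # running average of squared gradients
w += - learning_rate * g / (accumulator**0.5 + epsilon)   # RMSprop update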
Now, let us take a look at the steps
Step 1 - Generating 500 random samples and a forward feed for any 1 sample
Step 2 - Initializing RMSprop Optimizer
Step 3 - Entering the training loop
    Step 3.1 - Initializing avg_loss variable for the current training loop
    Step 3.2 - Shuffling of the dataset
    Step 3.3 - Going through each batch
        Step 3.3.1 - Initializing loss and gradient variables in which we will store the values for the current batch
            Step 3.3.1.1 - Forward feed for the sample in the current batch
            Step 3.3.1.2 - Collecting loss and gradients
        Step 3.3.2 - Updating weights and biases via RMSprop Optimizer with the mean of gradients collected
        Step 3.3.3 - Printing average loss of the batch
    Step 3.4 - Printing average loss of the batches for the current training loop
Step 4 - Forward feed for any 1 sample to see that loss has been reduced
Step 1 — Generating 500 random samples and a forward feed for any 1 sample
import numpy as np                                  # importing NumPy
np.random.seed(42)

input_nodes = 5                                     # nodes in each layer
hidden_1_nodes = 3
hidden_2_nodes = 5
output_nodes = 4

N = 500                                             # number of samples
batch_size = 25                                     # batch size
number_batches = int(N/batch_size)                  # number of batches

x = np.random.randint(100, 500, size = (N, input_nodes, 1)) / 1000
x                                                   # Inputs

y = []
for i in range(int(N/4)):
    y.append([[1], [0], [0], [0]])
    y.append([[0], [1], [0], [0]])
    y.append([[0], [0], [1], [0]])
    y.append([[0], [0], [0], [1]])

y = np.array(y).reshape((N, output_nodes, 1))
y                                                   # Outputs
Note — ‘x’ and ‘y’ are 3-D tensors, as they are collections, or batches, of 2-D tensors. We already discussed that we are keeping the layers vertical, i.e., in shape (-1, 1), which is a 2-D tensor.
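A quick shape check (the numbers follow from the code above) makes this concrete:
x.shape                                             # (500, 5, 1) -- 500 input column vectors
y.shape                                             # (500, 4, 1) -- 500 one-hot column vectors
x[0].shape                                          # (5, 1) -- a single sample kept vertical, i.e., shape (-1, 1)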
def relu(x, leak = 0):                              # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                         # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x):                                     # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                                # Softmax derivative
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def cross_E(y_true, y_pred):                        # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):                   # CE derivative
    return -y_true/(y_pred + 10**-100)
w1 = np.random.random(size = (hidden_1_nodes, input_nodes))      # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                       # b1

w2 = np.random.random(size = (hidden_2_nodes, hidden_1_nodes))   # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                       # b2

w3 = np.random.random(size = (output_nodes, hidden_2_nodes))     # w3
b3 = np.zeros(shape = (output_nodes, 1))                         # b3

in_hidden_1 = w1.dot(x[0]) + b1                     # forward feed
out_hidden_1 = relu(in_hidden_1)                    # for any 1 sample

in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                               # predicted value
y[0]                                                # true value
cross_E(y[0], y_hat)                                # CE loss
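If you want to verify the forward feed dimensions, the shapes chain like this (all of them follow from the definitions above):
w1.dot(x[0]).shape                                  # (3, 5) dot (5, 1) -> (3, 1), matches b1
w2.dot(out_hidden_1).shape                          # (5, 3) dot (3, 1) -> (5, 1), matches b2
w3.dot(out_hidden_2).shape                          # (4, 5) dot (5, 1) -> (4, 1), matches b3 and y[0]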
Step 2 — Initializing RMSprop Optimizer
learning_rate = 0.01                                # learning rate
rho = 0.9                                           # rho
epsilon = 10**-8                                    # epsilon

accumulator_w1 = np.zeros(w1.shape)                 # initializing accumulators with 0
accumulator_b1 = np.zeros(b1.shape)
accumulator_w2 = np.zeros(w2.shape)
accumulator_b2 = np.zeros(b2.shape)
accumulator_w3 = np.zeros(w3.shape)
accumulator_b3 = np.zeros(b3.shape)
Step 3 — Entering the training loop
data = np.concatenate((x, y), axis = 1)
data.shape
epochs = 10
Step 3.1 — Initializing avg_loss variable for current training loop
Step 3.2 — Shuffling of the dataset
for epoch in range(epochs):

    avg_loss = 0                                    # Step 3.1

    np.random.shuffle(data)                         # Step 3.2
    x = data[:, :5, :]
    y = data[:, 5:, :]
Note — 5 is actually the number of nodes in the input layer.
If you are wondering how I separated x and y, it is simple slicing. For example, for ‘x’ I am taking all tensors along axis = 0, the first 5 tensors along axis = 1, and all scalars along axis = 2.
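You can run a quick check in a separate cell, outside the loop, to see what the slices contain:
data.shape                                          # (500, 9, 1) -- x and y stacked along axis = 1
data[:, :5, :].shape                                # (500, 5, 1) -- the inputs x
data[:, 5:, :].shape                                # (500, 4, 1) -- the outputs y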
Step 3.3 — Going through each batch
Step 3.3.1 — Initializing loss and gradient variables in which we will store the values for the current batch
    for batch in range(number_batches):             # Step 3.3

        loss = 0                                    # Step 3.3.1
        sample = batch * batch_size

        grad_w3 = 0
        grad_b3 = 0
        grad_w2 = 0
        grad_b2 = 0
        grad_w1 = 0
        grad_b1 = 0
Step 3.3.1.1 - Forward feed for the sample in the current batch
        for iteration in range(batch_size):         # Step 3.3.1.1

            #-------------Forward propagation in batch--------------
            in_hidden_1 = w1.dot(x[sample]) + b1
            out_hidden_1 = relu(in_hidden_1)

            in_hidden_2 = w2.dot(out_hidden_1) + b2
            out_hidden_2 = relu(in_hidden_2)

            in_output_layer = w3.dot(out_hidden_2) + b3
            y_hat = softmax(in_output_layer)
Step 3.3.1.2 - Collecting loss and gradients
Note — We are not storing the average loss in this variable, but the summed loss of the batch. We will divide it by the batch size at the end of the loop to get the average loss.
            #-----------Collecting loss and gradients---------------
            loss += cross_E(y[sample], y_hat)       # Step 3.3.1.2

            #--------------------------------------------------------
            error_upto_softmax = np.sum(cross_E_grad(y[sample], y_hat) * softmax_dash(in_output_layer), axis = 0).reshape((-1, 1))

            grad_w3 += error_upto_softmax.dot(out_hidden_2.T)
            grad_b3 += error_upto_softmax

            #--------------------------------------------------------
            error_grad_H2 = np.sum(error_upto_softmax * w3, axis = 0).reshape((-1, 1))

            grad_w2 += error_grad_H2 * relu_dash(in_hidden_2).dot(out_hidden_1.T)
            grad_b2 += error_grad_H2 * relu_dash(in_hidden_2)

            #--------------------------------------------------------
            error_grad_H1 = np.sum(error_grad_H2 * relu_dash(in_hidden_2) * w2, axis = 0).reshape((-1, 1))

            grad_w1 += error_grad_H1 * relu_dash(in_hidden_1).dot(x[sample].T)
            grad_b1 += error_grad_H1 * relu_dash(in_hidden_1)

            sample += 1
Step 3.3.2 - Updating weights and biases via RMSprop Optimizer with the mean of gradients collected
        #---------Updating weights and biases with RMSprop----------
        # Step 3.3.2

        accumulator_w1 = rho * accumulator_w1 + (1 - rho) * (grad_w1/batch_size)**2
        update_w1 = - learning_rate * (grad_w1/batch_size) / (accumulator_w1**0.5 + epsilon)
        w1 += update_w1                             # w1

        accumulator_b1 = rho * accumulator_b1 + (1 - rho) * (grad_b1/batch_size)**2
        update_b1 = - learning_rate * (grad_b1/batch_size) / (accumulator_b1**0.5 + epsilon)
        b1 += update_b1                             # b1

        accumulator_w2 = rho * accumulator_w2 + (1 - rho) * (grad_w2/batch_size)**2
        update_w2 = - learning_rate * (grad_w2/batch_size) / (accumulator_w2**0.5 + epsilon)
        w2 += update_w2                             # w2

        accumulator_b2 = rho * accumulator_b2 + (1 - rho) * (grad_b2/batch_size)**2
        update_b2 = - learning_rate * (grad_b2/batch_size) / (accumulator_b2**0.5 + epsilon)
        b2 += update_b2                             # b2

        accumulator_w3 = rho * accumulator_w3 + (1 - rho) * (grad_w3/batch_size)**2
        update_w3 = - learning_rate * (grad_w3/batch_size) / (accumulator_w3**0.5 + epsilon)
        w3 += update_w3                             # w3

        accumulator_b3 = rho * accumulator_b3 + (1 - rho) * (grad_b3/batch_size)**2
        update_b3 = - learning_rate * (grad_b3/batch_size) / (accumulator_b3**0.5 + epsilon)
        b3 += update_b3                             # b3
Step 3.3.3 - Printing average loss of the batch
        #-----------------------------------------------------------
        avg_loss += loss/batch_size

        print(f'average loss before training of batch number {batch + 1} is {loss/batch_size} -- epoch number {epoch + 1}')
        print('\n')                                 # Step 3.3.3
Step 3.4 - Printing average loss of the batches for the current training loop
    print('-------------------------')
    print(f'average loss of batches is {avg_loss/number_batches} -- epoch number {epoch + 1}')
    print('-------------------------')
    print('\n')                                     # Step 3.4
The training loop will run 10 times.
This is a small screenshot after the training.
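As an aside, the six nearly identical update blocks in Step 3.3.2 can be collapsed into a small helper. This is only a sketch of a possible refactor (the name rmsprop_update is mine, not from the notebook); the arithmetic is exactly the same as above:
def rmsprop_update(param, grad_sum, accumulator):
    g = grad_sum / batch_size                       # mean gradient of the mini-batch
    accumulator = rho * accumulator + (1 - rho) * g**2
    param += - learning_rate * g / (accumulator**0.5 + epsilon)
    return param, accumulator

# inside the batch loop, for example:
# w1, accumulator_w1 = rmsprop_update(w1, grad_w1, accumulator_w1)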
Step 4 - Forward feed for any 1 sample to see that loss has been reduced
in_hidden_1 = w1.dot(x[0]) + b1                     # forward feed
out_hidden_1 = relu(in_hidden_1)                    # for any 1 sample

in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)

in_output_layer = w3.dot(out_hidden_2) + b3
y_hat = softmax(in_output_layer)

y_hat                                               # predicted value
y[0]                                                # true value
cross_E(y[0], y_hat)                                # CE loss
A few things you must note
First, the data has been shuffled, so the first sample now is not the same sample that was first before training and shuffling.
Second, since the data was randomly generated, there is no pattern, and you can say that the Neural Network has given a roughly equal probability of 0.25 to all 4 outputs. Things will be different in the last post of this chapter, where we will see the UCI White Wine Quality Dataset.
Third, the loss was lower for the very first sample before training because the randomly generated value at the first index for that sample happened to be the maximum. After training, the loss increases because there is no pattern to learn and all outputs are close to 0.25, giving an equal probability to each index (a quick check of what that means for the CE loss follows below). This will not be the case with the UCI White Wine dataset, because there the Neural Network will catch the pattern, or general information.
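To see why predictions close to 0.25 pin the loss near one value, compute the Categorical Cross-Entropy of a one-hot target against a uniform prediction:
import numpy as np

y_true = np.array([[1], [0], [0], [0]])             # any one-hot target
y_pred = np.full((4, 1), 0.25)                      # all four outputs equal
-np.sum(y_true * np.log(y_pred))                    # -log(0.25), about 1.386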
I hope now you understand how Batch training works with mini-batches.
If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.
I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.