Multiple Inputs & Multiple Outputs in a Neural Network
Step by step implementation in Python
In this post, we will see how to apply Backpropagation to train a neural network that has multiple inputs and multiple outputs. This is equivalent to the functional API of Keras; a rough Keras sketch of the architecture is given below.
You can download the Jupyter Notebook from here.
Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.
5.8 Multiple Inputs & Multiple Outputs
We will use a different NN architecture for this post. Let us have a look at the architecture.
We can see that, first, we have the first input layer, followed by 2 parallel hidden layers. The outputs of these 2 hidden layers are then concatenated with the second input layer; we will call the result the ‘Concatenated Layer’. Then comes the third hidden layer, and finally the 2 output layers.
A few things to note,
First, the activation function for the hidden layers is the ReLU function
Second, the activation function for the first output layer is the Sigmoid function, and for the second output layer it is the Softmax function
Third, we will use Binary Cross-Entropy (BCE) as the loss function for the first output layer and Mean Absolute Error (MAE) as the loss function for the second output layer
Fourth, we will use the SGD Optimizer with a learning rate = 0.01
Note — You can use a different architecture if you want; all the steps are the same. It is the type of problem that should make you think about which architecture to use.
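For reference, a rough sketch of this architecture in the Keras functional API might look like the following. This is only a sketch and not code from the original notebook; the layer sizes are taken from the numbers used below.

from tensorflow.keras import layers, Model, optimizers

input_1 = layers.Input(shape = (5,))
input_2 = layers.Input(shape = (3,))

hidden_1 = layers.Dense(4, activation = 'relu')(input_1)    # first parallel hidden layer
hidden_2 = layers.Dense(3, activation = 'relu')(input_1)    # second parallel hidden layer

concatenated = layers.Concatenate()([input_2, hidden_1, hidden_2])
hidden_3 = layers.Dense(4, activation = 'relu')(concatenated)

output_1 = layers.Dense(5, activation = 'sigmoid')(hidden_3)
output_2 = layers.Dense(4, activation = 'softmax')(hidden_3)

model = Model(inputs = [input_1, input_2], outputs = [output_1, output_2])
model.compile(optimizer = optimizers.SGD(learning_rate = 0.01),
              loss = ['binary_crossentropy', 'mae'])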
Now, let us take a look at the steps
Step 1 - A forward feed to calculate loss
Step 2 - Initializing SGD Optimizer
Step 3 - Entering training loop
Step 3.1 - Forward feed to calculate losses
Step 3.2 - Calculate gradients using Backpropagation
Step 3.3 - Using SGD Optimizer to update weights and biases
Step 4 - A forward feed to verify that loss has been reduced and to see how close predicted values are to true values
Step 1 - A forward feed to calculate loss
import numpy as np    # importing NumPy
np.random.seed(42)

input_1_nodes = 5     # nodes in each layer
input_2_nodes = 3
hidden_1_nodes = 4
hidden_2_nodes = 3
concatenated_nodes = input_2_nodes + hidden_1_nodes + hidden_2_nodes
hidden_3_nodes = 4
output_1_nodes = 5
output_2_nodes = 4

x1 = np.random.randint(1, 100, size = (input_1_nodes, 1)) / 100
x1    # Input 1

x2 = np.random.randint(1, 100, size = (input_2_nodes, 1)) / 100
x2    # Input 2

y1 = np.array([[1], [1], [0], [0], [1]])
y1    # Output 1

y2 = np.array([[0], [1], [0], [0]])
y2    # Output 2
def relu(x, leak = 0):          # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):     # ReLU derivative
    return np.where(x <= 0, leak, 1)

def sig(x):                     # Sigmoid
    return 1/(1 + np.exp(-x))

def sig_dash(x):                # Sigmoid derivative
    return sig(x) * (1 - sig(x))

def softmax(x):                 # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):            # Softmax derivative
    I = np.eye(x.shape[0])
    return softmax(x) * (I - softmax(x).T)

def B_cross_E(y_true, y_pred):        # BCE
    return -np.mean(y_true * np.log(y_pred + 10**-100) + (1 - y_true) * np.log(1 - y_pred + 10**-100))

def B_cross_E_grad(y_true, y_pred):   # BCE derivative
    N = y_true.shape[0]
    return -(y_true/(y_pred + 10**-100) - (1 - y_true)/(1 - y_pred + 10**-100))/N

def mae(y_true, y_pred):              # MAE
    return np.mean(abs(y_true - y_pred))

def mae_grad(y_true, y_pred):         # MAE derivative
    N = y_true.shape[0]
    return -((y_true - y_pred) / (abs(y_true - y_pred) + 10**-100))/N
w1 = np.random.random(size = (hidden_1_nodes, input_1_nodes))       # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1))                          # b1

w2 = np.random.random(size = (hidden_2_nodes, input_1_nodes))       # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1))                          # b2

w3 = np.random.random(size = (hidden_3_nodes, concatenated_nodes))  # w3
b3 = np.zeros(shape = (hidden_3_nodes, 1))                          # b3

w4 = np.random.random(size = (output_1_nodes, hidden_3_nodes))      # w4
b4 = np.zeros(shape = (output_1_nodes, 1))                          # b4

w5 = np.random.random(size = (output_2_nodes, hidden_3_nodes))      # w5
b5 = np.zeros(shape = (output_2_nodes, 1))                          # b5
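As an optional check (an addition of mine, not in the original notebook), you can print the shapes of the weight matrices to confirm they match the wiring described above:

for name, w in [('w1', w1), ('w2', w2), ('w3', w3), ('w4', w4), ('w5', w5)]:
    print(name, w.shape)    # expect (4, 5), (3, 5), (4, 10), (5, 4), (4, 4)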
Now let us take a look at the outputs of each layer
in_hidden_1 = w1.dot(x1) + b1
out_hidden_1 = relu(in_hidden_1)

print('out_hidden_1')
print(out_hidden_1, out_hidden_1.shape)
print('\n')

in_hidden_2 = w2.dot(x1) + b2
out_hidden_2 = relu(in_hidden_2)

print('out_hidden_2')
print(out_hidden_2, out_hidden_2.shape)
print('\n')

concatenated_layer = np.concatenate((x2, out_hidden_1, out_hidden_2))

print('concatenated_layer')
print(concatenated_layer, concatenated_layer.shape)
print('\n')

in_hidden_3 = w3.dot(concatenated_layer) + b3
out_hidden_3 = relu(in_hidden_3)

print('out_hidden_3')
print(out_hidden_3, out_hidden_3.shape)
print('\n')

in_output_layer_1 = w4.dot(out_hidden_3) + b4
y1_hat = sig(in_output_layer_1)

print('y1_hat')
print(y1_hat, y1_hat.shape)
print('\n')

in_output_layer_2 = w5.dot(out_hidden_3) + b5
y2_hat = softmax(in_output_layer_2)

print('y2_hat')
print(y2_hat, y2_hat.shape)
print('\n')
y1_hat    # y_hat 1
y2_hat    # y_hat 2

B_cross_E(y1, y1_hat) + mae(y2, y2_hat)    # total loss
B_cross_E(y1, y1_hat)                      # loss 1
mae(y2, y2_hat)                            # loss 2
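As another optional sanity check (again my addition, not in the original notebook), you can confirm that the Sigmoid head produces values strictly between 0 and 1 and that the Softmax head sums to 1:

print(np.all((y1_hat > 0) & (y1_hat < 1)))    # True — sigmoid outputs lie in (0, 1)
print(np.sum(y2_hat))                         # 1.0 — softmax outputs sum to 1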
Step 2 — Initializing SGD Optimizer
learning_rate = 0.01
Step 3 — Entering the training loop
epochs = 1000
Step 3.1 — Forward feed to calculate loss before training
for epoch in range(epochs):

    #------------------------Forward Propagation--------------------

    in_hidden_1 = w1.dot(x1) + b1
    out_hidden_1 = relu(in_hidden_1)

    in_hidden_2 = w2.dot(x1) + b2
    out_hidden_2 = relu(in_hidden_2)

    concatenated_layer = np.concatenate((x2, out_hidden_1, out_hidden_2))

    in_hidden_3 = w3.dot(concatenated_layer) + b3
    out_hidden_3 = relu(in_hidden_3)

    in_output_layer_1 = w4.dot(out_hidden_3) + b4
    y1_hat = sig(in_output_layer_1)

    in_output_layer_2 = w5.dot(out_hidden_3) + b5
    y2_hat = softmax(in_output_layer_2)

    loss = B_cross_E(y1, y1_hat) + mae(y2, y2_hat)
    print(f'loss before training is {loss}')
    print(f'output_1 loss is {B_cross_E(y1, y1_hat)}')
    print(f'output_2 loss is {mae(y2, y2_hat)} -- epoch number {epoch + 1}')
    print('\n')
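If printing the losses at every one of the 1,000 epochs is too noisy for you, you could replace the print statements above with something like the following (a hypothetical tweak of mine, not in the original notebook):

    if (epoch + 1) % 100 == 0:    # print only every 100th epoch
        print(f'epoch {epoch + 1}: total loss is {loss}')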
Step 3.2 — Calculate gradients using Backpropagation
We will use the rules of ‘Jumping Back’ to calculate the gradients, but this time with a few additional steps.
Let us begin with the losses.
We calculate the gradients for (w4, b4) and (w5, b5) separately because the two pairs are independent of each other.
    #-----------Gradient Calculations via Back Propagation----------

    grad_w4 = B_cross_E_grad(y1, y1_hat) * sig_dash(in_output_layer_1) .dot( out_hidden_3.T )    # grad_w4

    grad_b4 = B_cross_E_grad(y1, y1_hat) * sig_dash(in_output_layer_1)                           # grad_b4

    #-------------------------------------------

    error_upto_softmax = np.sum(mae_grad(y2, y2_hat) * softmax_dash(in_output_layer_2),
                                axis = 0).reshape((-1, 1))

    grad_w5 = error_upto_softmax .dot( out_hidden_3.T )    # grad_w5

    grad_b5 = error_upto_softmax                           # grad_b5
Now, when we jump back across w4 and w5, we add the two gradients after reshaping them to (-1, 1) and then proceed further.
    error_grad_H3 = ( np.sum(B_cross_E_grad(y1, y1_hat) * sig_dash(in_output_layer_1) * w4,
                             axis = 0).reshape((-1, 1))
                      +
                      np.sum(error_upto_softmax * w5, axis = 0).reshape((-1, 1)) )

    grad_w3 = error_grad_H3 * relu_dash(in_hidden_3) .dot( concatenated_layer.T )    # grad_w3

    grad_b3 = error_grad_H3 * relu_dash(in_hidden_3)                                 # grad_b3
Now, how do we calculate the gradients for (w1, b1) and (w2, b2)?
If you have noticed, we have nothing more to do with ‘x2’. Why? Because ‘x2’ is connected to hidden layer 3 only through the w3 and b3 that sit between the concatenated layer and hidden layer 3.
The shape of w3 is (hidden_3_nodes, concatenated_nodes), i.e., (4, 10)
The weights matrix w3 is
These weights hold the connections between the components of the concatenated layer and the input of hidden layer 3.
The weights in the blue box above connect input 2, i.e., ‘x2’, with the input of hidden layer 3, ‘I_H3’. Shape: (hidden_3_nodes, input_2_nodes), or (4, 3).
The weights in the blue box above connect the output of hidden layer 1, i.e., ‘O_H1’, with ‘I_H3’. Shape: (hidden_3_nodes, hidden_1_nodes), or (4, 4).
The weights in the blue box above connect the output of hidden layer 2, i.e., ‘O_H2’, with ‘I_H3’. Shape: (hidden_3_nodes, hidden_2_nodes), or (4, 3).
To calculate the gradients for (w1, b1) and (w2, b2), we will use only the second and third blue-box sub-matrices, i.e., the column blocks of w3 sliced out in the sketch below.
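To make these blocks concrete, here is a small illustrative slice of w3 (the block names are mine, not from the original notebook); the gradient code that follows uses exactly these column ranges:

w3_x2_block = w3[:, : input_2_nodes]                                   # (4, 3), connects x2 to I_H3
w3_h1_block = w3[:, input_2_nodes : input_2_nodes + hidden_1_nodes]    # (4, 4), connects O_H1 to I_H3
w3_h2_block = w3[:, input_2_nodes + hidden_1_nodes :]                  # (4, 3), connects O_H2 to I_H3
print(w3_x2_block.shape, w3_h1_block.shape, w3_h2_block.shape)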
    error_grad_H1 = np.sum(error_grad_H3 * relu_dash(in_hidden_3) *
                           w3[:, input_2_nodes: input_2_nodes + hidden_1_nodes],
                           axis = 0).reshape((-1, 1))

    grad_w1 = error_grad_H1 * relu_dash(in_hidden_1) .dot( x1.T )    # grad_w1

    grad_b1 = error_grad_H1 * relu_dash(in_hidden_1)                 # grad_b1

    #-------------------------------------------

    error_grad_H2 = np.sum(error_grad_H3 * relu_dash(in_hidden_3) *
                           w3[:, input_2_nodes + hidden_1_nodes: input_2_nodes + hidden_1_nodes + hidden_2_nodes],
                           axis = 0).reshape((-1, 1))

    grad_w2 = error_grad_H2 * relu_dash(in_hidden_2) .dot( x1.T )    # grad_w2

    grad_b2 = error_grad_H2 * relu_dash(in_hidden_2)                 # grad_b2
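Optionally, just before the update step below, you can check that every gradient has the same shape as the parameter it will update. These assertions are an addition of mine (not in the original notebook) and can be dropped inside the loop:

    assert grad_w1.shape == w1.shape and grad_b1.shape == b1.shape
    assert grad_w2.shape == w2.shape and grad_b2.shape == b2.shape
    assert grad_w3.shape == w3.shape and grad_b3.shape == b3.shape
    assert grad_w4.shape == w4.shape and grad_b4.shape == b4.shape
    assert grad_w5.shape == w5.shape and grad_b5.shape == b5.shape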
Step 3.3 — Using SGD Optimizer to update weights and biases
    #-------------Updating weights and biases with SGD--------------

    update_w1 = - learning_rate * grad_w1
    w1 += update_w1                          # w1

    update_b1 = - learning_rate * grad_b1
    b1 += update_b1                          # b1

    update_w2 = - learning_rate * grad_w2
    w2 += update_w2                          # w2

    update_b2 = - learning_rate * grad_b2
    b2 += update_b2                          # b2

    update_w3 = - learning_rate * grad_w3
    w3 += update_w3                          # w3

    update_b3 = - learning_rate * grad_b3
    b3 += update_b3                          # b3

    update_w4 = - learning_rate * grad_w4
    w4 += update_w4                          # w4

    update_b4 = - learning_rate * grad_b4
    b4 += update_b4                          # b4

    update_w5 = - learning_rate * grad_w5
    w5 += update_w5                          # w5

    update_b5 = - learning_rate * grad_b5
    b5 += update_b5                          # b5
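If you find the ten explicit updates above verbose, an equivalent and more compact form (a sketch of mine using the same SGD rule, meant to replace the block above inside the same loop) would be:

    params = [w1, b1, w2, b2, w3, b3, w4, b4, w5, b5]
    grads = [grad_w1, grad_b1, grad_w2, grad_b2, grad_w3, grad_b3,
             grad_w4, grad_b4, grad_w5, grad_b5]
    for p, g in zip(params, grads):
        p -= learning_rate * g    # in-place update, so w1, b1, ... themselves change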
The training loop will run 1,000 times.
This is a small screenshot after the training.
Step 4 — A forward feed to verify that loss has been reduced and to see how close predicted values are to true values
in_hidden_1 = w1.dot(x1) + b1    # forward feed
out_hidden_1 = relu(in_hidden_1)

in_hidden_2 = w2.dot(x1) + b2
out_hidden_2 = relu(in_hidden_2)

concatenated_layer = np.concatenate((x2, out_hidden_1, out_hidden_2))

in_hidden_3 = w3.dot(concatenated_layer) + b3
out_hidden_3 = relu(in_hidden_3)

in_output_layer_1 = w4.dot(out_hidden_3) + b4
y1_hat = sig(in_output_layer_1)

in_output_layer_2 = w5.dot(out_hidden_3) + b5
y2_hat = softmax(in_output_layer_2)

y1_hat    # first predicted output
y1        # first true output

y2_hat    # second predicted output
y2        # second true output

B_cross_E(y1, y1_hat) + mae(y2, y2_hat)    # total loss
B_cross_E(y1, y1_hat)                      # BCE loss
mae(y2, y2_hat)                            # MAE loss
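To judge the predictions at a glance, you could also turn the raw outputs into hard predictions. This step is an addition of mine and is not in the original notebook:

print((y1_hat > 0.5).astype(int).ravel())    # thresholded sigmoid outputs for output 1
print(y1.ravel())                            # true labels for output 1
print(np.argmax(y2_hat), np.argmax(y2))      # predicted class vs. true class for output 2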
I hope you now understand how such a neural network works.
In the upcoming post, which will be the last of this Chapter and of simple ANNs, we will fit a Neural Network on the UCI White Wine Quality dataset. And after that, we will begin Chapter 6 — CNNs.
If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.
I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubts on Reddit.
Many thanks for your support and feedback.
If you like this course, then you can support me at
It would mean a lot to me.
Continue to the next post — 5.9 UCI White Wine Quality dataset, Accuracy, and Confusion Matrix.