UCI White Wine Quality dataset, Accuracy & Confusion Matrix

Step by step implementation in Python

neuralthreads
16 min read · Dec 26, 2021

In this last post of Chapter 5 — ANNs, we will do an exercise in which we will try to fit the UCI White Wine Quality dataset into a Neural Network.

At the end of this post, we will try to interpret the result and discuss how to improve the model.

You can download the Jupyter Notebook and the dataset from here.

Note — This post uses many things from the previous chapters. It is recommended that you have a look at the previous posts.

Back to the previous post

Back to the first post

5.9 UCI White Wine Quality dataset, Accuracy & Confusion Matrix

For this problem, we will use 3 hidden layers in the network, i.e., 5 layers with 11, 15, 15, 15, and 11 nodes.

But first, let us understand why we have 11 nodes in the input and the output layer.

We will follow these steps to fit the data into the network.

Step 1 - Convert the dataset into a NumPy array
Step 2 - Define the architecture of the NN for forward feed
Step 3 - Entering the training loop
Step 3.1 - Forward feed to calculate average loss of mini-batches
Step 3.2 - Use Backpropagation to calculate gradients
Step 3.3 - Using Adam Optimizer to update weights and biases
Step 4 - Making a Confusion Matrix for the test set

Step 1 - Convert the dataset into a NumPy array

We will first extract the dataset from a ‘CSV’ file and then convert it into a NumPy array.

import numpy as np                               # NumPy
np.random.seed(42)
from numpy import genfromtxt

import pandas as pd                              # Pandas

df = pd.read_csv('winequality-white.csv', sep=';')    # dataset
df.head()

data = genfromtxt('winequality-white.csv', delimiter=';')[1:]
data                                             # data into a NumPy array

data.shape

np.random.shuffle(data)                          # data shuffled

We can see that we have a total of 4898 samples. Looking at the data, we can say that the last variable ‘quality’ is the dependent variable and it depends on the 11 variables, i.e., fixed acidity, volatile acidity, …, alcohol. This is the reason we have 11 nodes in the input layer because we have 11 independent variables.

We have shuffled the data.

Now we have to split the dataset into the training set, validation set, and test set. Generally, the ratio is 60:20:20 or 75:15:15, but we will split the dataset as 3600/900/398 samples (you may try different numbers).

training_set = data[:3600]
training_set.shape # training set
validation_set = data[3600: 3600 + 900]
validation_set.shape # validation set
test_set = data[3600 + 900:]
test_set.shape # test set

Now let us separate the dependent and independent variables and reshape them into 3-D tensors, because each layer is a 2-D tensor, i.e., a vertical column vector.

train_x = training_set[:, :-1]
train_x = train_x.reshape((3600, 11, 1))
train_x
val_x = validation_set[:, :-1]
val_x = val_x.reshape((900, 11, 1))
val_x
test_x = test_set[:, :-1]
test_x = test_x.reshape((398, 11, 1))
test_x

We will convert the dependent variables into 3-D tensors after one-hot encoding.

train_y = training_set[:, -1]
train_y
val_y = validation_set[:, -1]
val_y
test_y = test_set[:, -1]
test_y

What to do with the output variable, i.e., ‘quality’

You must have noticed that the quality is an integer. Let us assume that the dataset creator allowed 11 quality levels, from 0 to 10, with 0 being poor and 10 being excellent.

We could treat this as a regression problem, but we will instead treat it as a classification problem. We have a total of 11 classes, 0 to 10, where each class is a quality level.

This is the reason we have 11 nodes in the output layer.

We can convert quality level numbers into an array like this

We have a total of 11 classes so the first class ‘0 quality level’ can be written as

[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

or, we can write every integer quality level as an array of 11 scalars with a 1 at the index equal to the quality level and 0s everywhere else.

This is called one-hot encoding.

Now let us convert the dependent variables into 3-D tensors.

def one_hot(x):                                  # one-hot encoding
    one_hot_encod = []
    one_hot = [0] * 11
    for i in range(len(x)):
        one_hot[int(x[i])] = 1
        one_hot_encod.append(one_hot)
        one_hot = [0] * 11
    return one_hot_encod
train_y = np.array(one_hot(train_y))
train_y = train_y.reshape((3600, 11, 1))
train_y
val_y = np.array(one_hot(val_y))
val_y = val_y.reshape((900, 11, 1))
val_y
test_y = np.array(one_hot(test_y))
test_y = test_y.reshape((398, 11, 1))
test_y

Step 2 — Define the architecture of the NN for forward feed

input_nodes = 11                         # Nodes in each layer
hidden_1_nodes = 15
hidden_2_nodes = 15
hidden_3_nodes = 15
output_nodes = 11
N = 3600
batch_size = 25
number_batches = int(N/batch_size)

A few things:

First, the activation function used in all hidden layers is the ReLU function
Second, the activation function for the output layer is the Softmax function
Third, the loss function used is Categorical cross-entropy (CE)
Fourth, we will use Adam Optimizer with a learning rate = 0.01, beta1 = 0.9, beta2 = 0.999 and epsilon = 10**-8

def relu(x, leak = 0):                          # ReLU
    return np.where(x <= 0, leak * x, x)

def relu_dash(x, leak = 0):                     # ReLU derivative
    return np.where(x <= 0, leak, 1)

def softmax(x):                                 # Softmax
    return np.exp(x) / np.sum(np.exp(x))

def softmax_dash(x):                            # Softmax derivative

    I = np.eye(x.shape[0])

    return softmax(x) * (I - softmax(x).T)

def cross_E(y_true, y_pred):                    # CE
    return -np.sum(y_true * np.log(y_pred + 10**-100))

def cross_E_grad(y_true, y_pred):               # CE derivative
    return -y_true/(y_pred + 10**-100)
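A side note, not part of the original code: the softmax above exponentiates the raw inputs directly, which can overflow for very large logits. A common remedy is to subtract the maximum logit before exponentiating; the result is mathematically identical. A minimal sketch:

def softmax_stable(x):                          # numerically stable variant (optional)
    shifted = x - np.max(x)                     # subtracting a constant leaves softmax unchanged
    exps = np.exp(shifted)
    return exps / np.sum(exps)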

In this exercise, we will generate random normal weights with mean = 0 and standard deviation = 1, and then rescale them so that they lie between -1 and 1.

w1 = np.random.normal(size = (hidden_1_nodes, input_nodes))
w1 /= np.max(abs(w1)) * 1.01 # w1
b1 = np.zeros(shape = (hidden_1_nodes, 1)) # b1
w2 = np.random.normal(size = (hidden_2_nodes, hidden_1_nodes))
w2 /= np.max(abs(w2)) * 1.01 # w2
b2 = np.zeros(shape = (hidden_2_nodes, 1)) # b2
w3 = np.random.normal(size = (hidden_3_nodes, hidden_2_nodes))
w3 /= np.max(abs(w3)) * 1.01 # w3
b3 = np.zeros(shape = (hidden_3_nodes, 1)) # b3
w4 = np.random.normal(size = (output_nodes, hidden_3_nodes))
w4 /= np.max(abs(w4)) * 1.01 # w4
b4 = np.zeros(shape = (output_nodes, 1)) # b4

np.random.normal doesn’t give us scalars between -1 and 1, so we divided each array by its maximum magnitude to bring all numbers into that range. We also divided by 1.01, so the scalar with the largest magnitude is now roughly 0.99 (or -0.99).
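As a quick sanity check (not in the original post), we can verify that every rescaled weight now has magnitude strictly below 1:

# optional sanity check on the rescaled weights
for name, w in [('w1', w1), ('w2', w2), ('w3', w3), ('w4', w4)]:
    print(name, np.max(np.abs(w)) < 1)          # expected: True for each matrix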

Now let us calculate the output for any one sample (here, the first sample of the test set).

in_hidden_1 = w1.dot(test_x[0]) + b1
out_hidden_1 = relu(in_hidden_1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)
in_hidden_3 = w3.dot(out_hidden_2) + b3
out_hidden_3 = relu(in_hidden_3)
in_output_layer = w4.dot(out_hidden_3) + b4
y_hat = softmax(in_output_layer)
y_hat

np.argmax(y_hat)            # predicted quality level

test_y[0]

np.argmax(test_y[0])        # true quality level

cross_E(test_y[0], y_hat)   # CE loss

np.argmax gives the index of the largest scalar in the array. In this case, that index is the predicted quality level.
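For instance (an illustrative example, not taken from the post’s output), if the softmax output for a sample were the vector below, np.argmax would return 6, i.e., quality level 6:

probs = np.array([0.01, 0.01, 0.02, 0.05, 0.10, 0.30, 0.45, 0.04, 0.01, 0.005, 0.005])
np.argmax(probs)            # -> 6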

Let us initialize Adam Optimizer

learning_rate = 0.01                         # learning rate
beta1 = 0.9 # beta1
beta2 = 0.999 # beta2
epsilon = 10**-8 # epsilon
moment1_w1 = np.zeros(w1.shape) # first moment
moment1_b1 = np.zeros(b1.shape)
moment1_w2 = np.zeros(w2.shape)
moment1_b2 = np.zeros(b2.shape)
moment1_w3 = np.zeros(w3.shape)
moment1_b3 = np.zeros(b3.shape)
moment1_w4 = np.zeros(w4.shape)
moment1_b4 = np.zeros(b4.shape)
moment2_w1 = np.zeros(w1.shape) # second moment
moment2_b1 = np.zeros(b1.shape)
moment2_w2 = np.zeros(w2.shape)
moment2_b2 = np.zeros(b2.shape)
moment2_w3 = np.zeros(w3.shape)
moment2_b3 = np.zeros(b3.shape)
moment2_w4 = np.zeros(w4.shape)
moment2_b4 = np.zeros(b4.shape)
adam_iter = 1 # power index for loop
train_data = np.concatenate((train_x, train_y), axis = 1)
train_data.shape

Now before entering the training loop, let us understand how we will calculate the accuracy.

We have to calculate the accuracy of each batch. We can initialize a variable ‘count’ and increase it by 1 every time np.argmax of the predicted array equals np.argmax of the true array. Finally, (count / batch_size) gives us the batch accuracy, i.e., correct predictions divided by the total.

We can then take the average of the accuracies of the batches to get the average accuracy for the training epoch, and we will store it in an accumulator which we will later use for graph plotting.

We will also store validation set accuracy for each training epoch.

Now you must be wondering what to do if we have a regression problem or a multi-label classification problem.

For regression, we can define a margin around each true value. Suppose 3 out of 4 output scalars fall within that margin of their true values; then the accuracy of the sample is 0.75.

We can take the average of the accuracy of each sample in a batch to get the batch accuracy, and then the average of the batch accuracies for the average accuracy over the batches. We can do the same for a multi-label classification problem by setting a threshold and saying that anything equal to or above it is 1 and everything else is 0. A short sketch of both ideas follows.
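Here is a minimal sketch of the two accuracy variants, assuming a margin of 0.5 for the regression case and a threshold of 0.5 for the multi-label case (both values are illustrative, not from the post):

def regression_accuracy(y_true, y_pred, margin = 0.5):
    # fraction of output scalars lying within +/- margin of their true values
    return np.mean(np.abs(y_true - y_pred) <= margin)

def multi_label_accuracy(y_true, y_pred, threshold = 0.5):
    # binarise the predictions at the threshold, then compare element-wise
    return np.mean((y_pred >= threshold).astype(int) == y_true)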

Step 3 — Entering the training loop

epochs = 50

loss_accumulator = []                  # loss accumulator
val_loss_accumulator = []              # validation loss accumulator
acc_accumulator = []                   # accuracy accumulator
val_acc_accumulator = []               # validation accuracy accumulator

for epoch in range(epochs):

    avg_loss = 0
    avg_acc = 0

    np.random.shuffle(train_data)      # data shuffling
    train_x = train_data[:, :11, :]
    train_y = train_data[:, 11:, :]

Now let us enter a batch and initialize the gradient accumulators and the ‘count’ variable for accuracy.

    for batch in range(number_batches):

        loss = 0

        sample = batch * batch_size

        grad_w4 = 0
        grad_b4 = 0

        grad_w3 = 0
        grad_b3 = 0

        grad_w2 = 0
        grad_b2 = 0

        grad_w1 = 0
        grad_b1 = 0

        count = 0                      # for accuracy

Now, the forward feed for the current sample and the accuracy ‘count’.

        for iteration in range(batch_size):

            #------------Forward propagation in batch---------------

            in_hidden_1 = w1.dot(train_x[sample]) + b1
            out_hidden_1 = relu(in_hidden_1)
            in_hidden_2 = w2.dot(out_hidden_1) + b2
            out_hidden_2 = relu(in_hidden_2)
            in_hidden_3 = w3.dot(out_hidden_2) + b3
            out_hidden_3 = relu(in_hidden_3)
            in_output_layer = w4.dot(out_hidden_3) + b4
            y_hat = softmax(in_output_layer)

            #--------------------Accuracy---------------------------

            if np.argmax(y_hat) == np.argmax(train_y[sample]):
                count += 1

Collecting loss and gradients

            #-----------Collecting loss and gradients---------------

            loss += cross_E(train_y[sample], y_hat)

            #--------------------------------------------------------

            error_upto_softmax_H4 = np.sum(cross_E_grad(train_y[sample], y_hat) * softmax_dash(in_output_layer),
                                           axis = 0).reshape((-1, 1))

            grad_w4 += error_upto_softmax_H4 .dot( out_hidden_3.T )
            grad_b4 += error_upto_softmax_H4

            #--------------------------------------------------------

            error_grad_H3 = np.sum(error_upto_softmax_H4 * w4, axis = 0).reshape((-1, 1))

            grad_w3 += error_grad_H3 * relu_dash(in_hidden_3) .dot( out_hidden_2.T )
            grad_b3 += error_grad_H3 * relu_dash(in_hidden_3)

            #--------------------------------------------------------

            error_grad_H2 = np.sum(error_grad_H3 * relu_dash(in_hidden_3) * w3, axis = 0).reshape((-1, 1))

            grad_w2 += error_grad_H2 * relu_dash(in_hidden_2) .dot( out_hidden_1.T )
            grad_b2 += error_grad_H2 * relu_dash(in_hidden_2)

            #--------------------------------------------------------

            error_grad_H1 = np.sum(error_grad_H2 * relu_dash(in_hidden_2) * w2, axis = 0).reshape((-1, 1))

            grad_w1 += error_grad_H1 * relu_dash(in_hidden_1) .dot( train_x[sample].T )
            grad_b1 += error_grad_H1 * relu_dash(in_hidden_1)

            sample += 1

Using Adam Optimizer to update weights and biases

        #---------Updating weights and biases with Adam-------------

        learning_rate_hat = learning_rate * np.sqrt(1 - beta2**(adam_iter)) / (1 - beta1**(adam_iter))
        adam_iter += 1

        moment1_w1 = beta1 * moment1_w1 + (1 - beta1) * (grad_w1/batch_size)
        moment2_w1 = beta2 * moment2_w1 + (1 - beta2) * (grad_w1/batch_size)**2
        update_w1 = - learning_rate_hat * moment1_w1 / (np.sqrt(moment2_w1) + epsilon)
        w1 += update_w1

        moment1_b1 = beta1 * moment1_b1 + (1 - beta1) * (grad_b1/batch_size)
        moment2_b1 = beta2 * moment2_b1 + (1 - beta2) * (grad_b1/batch_size)**2
        update_b1 = - learning_rate_hat * moment1_b1 / (np.sqrt(moment2_b1) + epsilon)
        b1 += update_b1

        moment1_w2 = beta1 * moment1_w2 + (1 - beta1) * (grad_w2/batch_size)
        moment2_w2 = beta2 * moment2_w2 + (1 - beta2) * (grad_w2/batch_size)**2
        update_w2 = - learning_rate_hat * moment1_w2 / (np.sqrt(moment2_w2) + epsilon)
        w2 += update_w2

        moment1_b2 = beta1 * moment1_b2 + (1 - beta1) * (grad_b2/batch_size)
        moment2_b2 = beta2 * moment2_b2 + (1 - beta2) * (grad_b2/batch_size)**2
        update_b2 = - learning_rate_hat * moment1_b2 / (np.sqrt(moment2_b2) + epsilon)
        b2 += update_b2

        moment1_w3 = beta1 * moment1_w3 + (1 - beta1) * (grad_w3/batch_size)
        moment2_w3 = beta2 * moment2_w3 + (1 - beta2) * (grad_w3/batch_size)**2
        update_w3 = - learning_rate_hat * moment1_w3 / (np.sqrt(moment2_w3) + epsilon)
        w3 += update_w3

        moment1_b3 = beta1 * moment1_b3 + (1 - beta1) * (grad_b3/batch_size)
        moment2_b3 = beta2 * moment2_b3 + (1 - beta2) * (grad_b3/batch_size)**2
        update_b3 = - learning_rate_hat * moment1_b3 / (np.sqrt(moment2_b3) + epsilon)
        b3 += update_b3

        moment1_w4 = beta1 * moment1_w4 + (1 - beta1) * (grad_w4/batch_size)
        moment2_w4 = beta2 * moment2_w4 + (1 - beta2) * (grad_w4/batch_size)**2
        update_w4 = - learning_rate_hat * moment1_w4 / (np.sqrt(moment2_w4) + epsilon)
        w4 += update_w4

        moment1_b4 = beta1 * moment1_b4 + (1 - beta1) * (grad_b4/batch_size)
        moment2_b4 = beta2 * moment2_b4 + (1 - beta2) * (grad_b4/batch_size)**2
        update_b4 = - learning_rate_hat * moment1_b4 / (np.sqrt(moment2_b4) + epsilon)
        b4 += update_b4

Printing loss and accuracy of the current batch.

        avg_loss += loss/batch_size
        avg_acc += count/batch_size

        print(f'average loss before training of batch number {batch + 1} is {loss/batch_size} -- epoch number {epoch + 1}')
        print(f'average accuracy before training of batch number {batch + 1} is {count/batch_size} -- epoch number {epoch + 1}')
        print('\n')

Printing average loss and average accuracy for batches in the current training epoch and also storing them in the list.

    print('-------------------------')
    print(f'average loss of batches is {avg_loss/number_batches} -- epoch number {epoch + 1}')
    print(f'average accuracy of batches is {avg_acc/number_batches} -- epoch number {epoch + 1}')

    loss_accumulator.append(avg_loss/number_batches)
    acc_accumulator.append(avg_acc/number_batches)

Calculating loss and accuracy for the validation set for the current training epoch

    print('-------------------------')

    val_loss = 0
    val_count = 0

    for val in range(900):

        in_hidden_1 = w1.dot(val_x[val]) + b1
        out_hidden_1 = relu(in_hidden_1)
        in_hidden_2 = w2.dot(out_hidden_1) + b2
        out_hidden_2 = relu(in_hidden_2)
        in_hidden_3 = w3.dot(out_hidden_2) + b3
        out_hidden_3 = relu(in_hidden_3)
        in_output_layer = w4.dot(out_hidden_3) + b4
        y_hat = softmax(in_output_layer)

        #-----------------Validation set accuracy-------------------

        if np.argmax(y_hat) == np.argmax(val_y[val]):
            val_count += 1

        #--------------------Collecting validation loss-------------

        val_loss += cross_E(val_y[val], y_hat)

Printing loss and accuracy for the validation set for the current training epoch and storing them in the lists.

    print(f'average loss of validation set is {val_loss/900} -- epoch number {epoch + 1}')
    print(f'accuracy of validation set is {val_count/900} -- epoch number {epoch + 1}')

    val_loss_accumulator.append(val_loss/900)
    val_acc_accumulator.append(val_count/900)

    print('-------------------------')
    print('\n')

print('-------------------------')
print('\n')

The training loop will run 50 times.

This is a small screenshot after the training.

Step 4 — Making a Confusion Matrix for the test set

Before going to Confusion Matrix let us perform a forward feed for any 1 test sample and plot graphs between accuracy, loss, and epochs.

in_hidden_1 = w1.dot(test_x[0]) + b1
out_hidden_1 = relu(in_hidden_1)
in_hidden_2 = w2.dot(out_hidden_1) + b2
out_hidden_2 = relu(in_hidden_2)
in_hidden_3 = w3.dot(out_hidden_2) + b3
out_hidden_3 = relu(in_hidden_3)
in_output_layer = w4.dot(out_hidden_3) + b4
y_hat = softmax(in_output_layer)
y_hat

np.argmax(y_hat)            # predicted quality level

test_y[0]

np.argmax(test_y[0])        # true quality level

cross_E(test_y[0], y_hat)   # CE loss

It looks like the Neural Network predicted the wrong quality level. But you must have noticed one thing: the probabilities for quality levels 5 and 6 are close to each other. We will explain this in a bit.

Plotting the graphs

import matplotlib.pyplot as plt

plt.rcParams.update({'font.size': 22})

EPOCH = [i for i in range(1, epochs + 1)]

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

ax = plt.axes()
ax.plot(EPOCH, loss_accumulator, label='Training loss')
ax.plot(EPOCH, val_loss_accumulator, label='Validation loss')
ax.set_title('Training and Validation loss')
ax.legend()
plt.show()

fig = plt.figure(dpi = 100)
fig.set_figheight(10.80)
fig.set_figwidth(19.20)

ax = plt.axes()
ax.plot(EPOCH, acc_accumulator, label='Training accuracy')
ax.plot(EPOCH, val_acc_accumulator, label='Validation accuracy')
ax.set_title('Training and Validation accuracy')
ax.legend()
plt.show()

A Confusion Matrix tells us how many true values are the same as the predicted values and how many are not.

Take the above forward feed example after training: the correct label was ‘6’ but it was predicted as ‘5’. We can write this in matrix form, with a row for each true label and a column for each predicted label.

For this single post-training example, the Confusion Matrix would hold a 1 in the row for true label ‘6’ and the column for predicted label ‘5’.

That 1 tells us that there is 1 sample in the test set whose true label was ‘6’ but which was predicted as ‘5’.

We can do it for every sample in the test set. First, we have to convert one-hot encoding to integers.

true = np.argmax(test_y, axis = 1).reshape((-1,))
true

pred = []

for test in test_x:

    in_hidden_1 = w1.dot(test) + b1
    out_hidden_1 = relu(in_hidden_1)
    in_hidden_2 = w2.dot(out_hidden_1) + b2
    out_hidden_2 = relu(in_hidden_2)
    in_hidden_3 = w3.dot(out_hidden_2) + b3
    out_hidden_3 = relu(in_hidden_3)
    in_output_layer = w4.dot(out_hidden_3) + b4
    y_hat = softmax(in_output_layer)

    pred.append(np.argmax(y_hat))

pred = np.array(pred)
pred
cm = [[999, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]

for i in range(11):
    cm.append([i, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

np.array(cm)

Note — Ignore the ‘999’; it is just a filler for the top-left corner cell, since the first row and first column hold the class labels.

And we will have this Confusion Matrix

for i in range(len(true)):
    cm[true[i] + 1][pred[i] + 1] += 1

np.array(cm)
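If scikit-learn happens to be installed, one can cross-check the hand-built matrix against the library version (an optional aside, not part of the original post):

# optional cross-check, assuming scikit-learn is available
from sklearn.metrics import confusion_matrix
confusion_matrix(true, pred, labels = list(range(11)))   # should match cm without the label row/column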

We can calculate accuracy by dividing the sum of the diagonal by the total number of samples in the test set.

(51+116+38) * 100 / 398

There are some terms like Precision, Recall, and many more. You may refer to the literature available on the internet.
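As a rough sketch of how such metrics are computed (not part of the original post), per-class precision and recall can be read directly off the confusion matrix once the label row and column are stripped away; divisions by zero for classes that never appear are silenced here:

cm_body = np.array(cm)[1:, 1:]                            # drop the label row and column

with np.errstate(divide = 'ignore', invalid = 'ignore'):
    precision = np.diag(cm_body) / cm_body.sum(axis = 0)  # correct predictions / all predictions per class
    recall    = np.diag(cm_body) / cm_body.sum(axis = 1)  # correct predictions / all true samples per class

accuracy = np.trace(cm_body) / cm_body.sum()              # same as the diagonal-sum calculation above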

Now you must be asking: why are the non-zero numbers in the Confusion Matrix clustered around 5, 6, and 7?

Because the dataset is flawed.

This dataset is a perfect example of ‘Class Imbalance’, i.e., there are far more samples of a few classes than the others.

If you plot a histogram of the last column of array ‘data’

from matplotlib import pyplot as plt

fig, ax = plt.subplots(figsize = (10, 7))
ax.hist(data[:, -1], bins = 7)

plt.show()

We can see that approximately 3,700 samples are from quality levels 5 and 6 out of 4898 and approximately 900 samples are from quality level 7.

This explains why the probabilities of quality levels 5 and 6 were the highest, and close to each other, for the test sample after training. We can also see that the majority of wrong predictions are 5s and 6s, because the NN has seen far more examples of quality levels 5 and 6 than of the other quality levels.

We must ask ourselves: is the Deep Learning approach a good choice for such a dataset? A good dataset is the answer to any DL/ML problem.

You will easily find solutions using other ML techniques on the internet.

Can we improve the accuracy with regularization techniques, by generating copies of samples for the under-represented classes to balance them, or by using a different number of layers and nodes? Give it a try; one naive balancing sketch is shown below.
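One naive way to balance the classes, offered here only as a sketch (the copies are exact duplicates, so this encourages overfitting, and in practice you would oversample only the training split), is to oversample every quality level up to the size of the most common one before shuffling and splitting:

# naive random oversampling of the whole dataset (illustrative only)
labels = data[:, -1].astype(int)
largest = max(np.sum(labels == q) for q in np.unique(labels))   # size of the biggest class

balanced_rows = []
for q in np.unique(labels):
    rows = data[labels == q]
    picks = np.random.choice(len(rows), size = largest, replace = True)
    balanced_rows.append(rows[picks])

balanced_data = np.concatenate(balanced_rows, axis = 0)
np.random.shuffle(balanced_data)        # reshuffle before splitting into train/val/test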

I wanted to use tabular data for the ANNs tutorial rather than image data, because many tutorials jump straight to the MNIST handwriting dataset, and I want you to first understand what images really are, which we will see in the next chapter.

With this post, chapter 5 is finished. We will be starting Chapter 6 — CNNs with the next post. If you like ANNs then you will love CNNs.

If you like this post, then please subscribe to my YouTube channel neuralthreads and join me on Reddit.

I will be uploading new interactive videos soon on the YouTube channel, and I will be happy to help you with any doubt on Reddit.

Many thanks for your support and feedback.

If you like this course, then you can support me at

It would mean a lot to me.

Continue to the next post (I will attach the link to the next post asap when it is ready) 6.1 Images as NumPy arrays.
